added READMEs

author: Patrick Simianer <p@simianer.de> 2011-09-05 20:26:59 +0200
committer: Patrick Simianer <p@simianer.de> 2011-09-23 19:13:58 +0200
commit: cbbee18e49d3ae60e0fbb0f308694b8426620695 (patch)
tree: ad937607dc4514992433b94b4a210f8a2efe6767
parent: ed3df1e4e7a2b4e7d6a5ebc4e47d2b0231dc5f21 (diff)
2 files changed, 44 insertions, 0 deletions
diff --git a/dtrain/README b/dtrain/README
new file mode 100644
index 00000000..74bac6a0
--- /dev/null
+++ b/dtrain/README
@@ -0,0 +1,39 @@
+NOTES
+ learner gets all used features (binary! and dense (logprob is sum of logprobs!))
+ weights: see decoder/decoder.cc line 548
+ 40k sents, k=100 = ~400M mem, 1 iteration 45min
+ utils/weights.cc: why wv_?
+ FD, Weights::wv_ grow too large, see utils/weights.cc;
+     decoder/hg.h; decoder/scfg_translator.cc; utils/fdict.cc
+
+TODO
+ enable kbest FILTERING (no filter vs unique)
+ MULTIPARTITE ranking (108010, 1 vs all, cluster modelscore;score)
+ what about RESCORING?
+ REMEMBER kbest (merge) weights?
+ SELECT iteration with highest (real) BLEU?
+ GENERATED data? (multi-task, ability to learn, perfect translation in nbest, at first all modelscore 1)
+ CACHING (ngrams for scoring)
+ hadoop PIPES implementation
+ SHARED LM?
+ ITERATION variants
+  once -> average
+  shuffle resulting weights
+ weights AVERAGING in reducer (global Ngram counts)
+ BATCH implementation (no update after each Kbest list)
+ SOFIA --eta_type explicit
+ set REFERENCE for cdec (rescoring)?
+ MORE THAN ONE reference for BLEU?
+ kbest NICER (do not iterate twice)!? -> shared_ptr?
+ DO NOT USE Decoder::Decode (input caching as WordID)!?
+  sparse vector instead of vector<double> for weights in Decoder(::SetWeights)?
+ reactivate DTEST and tests
+ non-deterministic, high variance, RANDOM RESTARTS
+ use separate TEST SET
+
+KNOWN BUGS, PROBLEMS
+ probably OVERFITS
+ cdec kbest vs 1best (no -k param) fishy!
+ sh: error while loading shared libraries: libreadline.so.6: cannot open shared object file: Error 24
+ PhraseModel_* features (0..99 seem to be generated, default?)
+
diff --git a/dtrain/test/EXAMPLE/README b/dtrain/test/EXAMPLE/README
new file mode 100644
index 00000000..5ce7cd67
--- /dev/null
+++ b/dtrain/test/EXAMPLE/README
@@ -0,0 +1,5 @@
+run with (from cdec/dtrain)
+./dtrain -c test/EXAMPLE/dtrain.ini
+(download http://hadoop.cl.uni-heidelberg.de/mtm/dtrain.nc-1k.gz and gunzip into test/EXAMPLE)
+Note: sofia-ml binary needs to be in cdec/dtrain/
+