author | Patrick Simianer <p@simianer.de> | 2011-09-05 20:26:59 +0200 |
---|---|---|
committer | Patrick Simianer <p@simianer.de> | 2011-09-23 19:13:58 +0200 |
commit | bcf45fc73bd855a3003dee7a8a0b7551eeb0523b (patch) | |
tree | 7da0fb24817408bdd5c16d9705487918a1a0bfd9 | |
parent | 974e485f9231b3d11edbd9e538ffa20b45e7435a (diff) |
added READMEs
-rw-r--r-- | dtrain/README | 39 |
-rw-r--r-- | dtrain/test/EXAMPLE/README | 5 |
2 files changed, 44 insertions, 0 deletions
diff --git a/dtrain/README b/dtrain/README
new file mode 100644
index 00000000..74bac6a0
--- /dev/null
+++ b/dtrain/README
@@ -0,0 +1,39 @@
+NOTES
+ learner gets all used features (binary! and dense (logprob is sum of logprobs!))
+ weights: see decoder/decoder.cc line 548
+ 40k sents, k=100 = ~400M mem, 1 iteration 45min
+ utils/weights.cc: why wv_?
+ FD, Weights::wv_ grow too large, see utils/weights.cc;
+ decoder/hg.h; decoder/scfg_translator.cc; utils/fdict.cc
+
+TODO
+ enable kbest FILTERING (nofiler vs unique)
+ MULTIPARTITE ranking (108010, 1 vs all, cluster modelscore;score)
+ what about RESCORING?
+ REMEMBER kbest (merge) weights?
+ SELECT iteration with highest (real) BLEU?
+ GENERATED data? (multi-task, ability to learn, perfect translation in nbest, at first all modelscore 1)
+ CACHING (ngrams for scoring)
+ hadoop PIPES imlementation
+ SHARED LM?
+ ITERATION variants
+ once -> average
+ shuffle resulting weights
+ weights AVERAGING in reducer (global Ngram counts)
+ BATCH implementation (no update after each Kbest list)
+ SOFIA --eta_type explicit
+ set REFERENCE for cdec (rescoring)?
+ MORE THAN ONE reference for BLEU?
+ kbest NICER (do not iterate twice)!? -> shared_ptr?
+ DO NOT USE Decoder::Decode (input caching as WordID)!?
+ sparse vector instead of vector<double> for weights in Decoder(::SetWeights)?
+ reactivate DTEST and tests
+ non deterministic, high variance, RANDOWM RESTARTS
+ use separate TEST SET
+
+KNOWN BUGS PROBLEMS
+ does probably OVERFIT
+ cdec kbest vs 1best (no -k param) fishy!
+ sh: error while loading shared libraries: libreadline.so.6: cannot open shared object file: Error 24
+ PhraseModel_* features (0..99 seem to be generated, default?)
+
diff --git a/dtrain/test/EXAMPLE/README b/dtrain/test/EXAMPLE/README
new file mode 100644
index 00000000..5ce7cd67
--- /dev/null
+++ b/dtrain/test/EXAMPLE/README
@@ -0,0 +1,5 @@
+run with (from cdec/dtrain)
+./dtrain -c test/EXAMPLE/dtrain.ini
+(dowload http://hadoop.cl.uni-heidelberg.de/mtm/dtrain.nc-1k.gz and ungzip into test/EXAMPLE)
+Note: sofia-ml binary needs to be in cdec/dtrain/
+
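The setup steps in dtrain/test/EXAMPLE/README (ungzip the data into test/EXAMPLE, then run dtrain with the example config) could be scripted roughly as below. This is a minimal sketch and not part of the commit. The helper names `ungzip_into` and `dtrain_command` are hypothetical. It assumes the archive has already been downloaded by hand and that the decompressed file should keep the archive's name without the `.gz` suffix.

```python
import gzip
import shutil
from pathlib import Path


def ungzip_into(archive: Path, dest_dir: Path) -> Path:
    """Decompress a .gz archive into dest_dir and return the output path.

    Assumption: the output keeps the archive name minus ".gz",
    e.g. dtrain.nc-1k.gz -> dtrain.nc-1k.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / archive.stem  # stem drops only the last suffix
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def dtrain_command(config: str = "test/EXAMPLE/dtrain.ini") -> list:
    """Command line quoted in the README; meant to be run from cdec/dtrain."""
    return ["./dtrain", "-c", config]
```

From cdec/dtrain, one would call `ungzip_into(Path("dtrain.nc-1k.gz"), Path("test/EXAMPLE"))` and then pass `dtrain_command()` to `subprocess.run(..., check=True)`. As the README notes, the sofia-ml binary needs to be in cdec/dtrain/ for the run to work.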