dtrain/README


NOTES
 learner gets all used features: binary (!) and dense (logprob is sum of logprobs!)
 weights: see decoder/decoder.cc line 548
 (40k sents, k=100 = ~400M mem, 1 iteration 45min)?
 utils/weights.cc: why wv_?
 FD, Weights::wv_ grow too large, see utils/weights.cc;
     decoder/hg.h; decoder/scfg_translator.cc; utils/fdict.cc

TODO
 enable kbest FILTERING (no filter vs unique)
 MULTIPARTITE ranking (108010, 1 vs all, cluster modelscore;score)
 what about RESCORING?
 REMEMBER kbest (merge) weights?
 SELECT iteration with highest (real) BLEU?
 GENERATED data? (multi-task, ability to learn, perfect translation in nbest, at first all modelscore 1)
 CACHING (ngrams for scoring)
 hadoop PIPES implementation
 SHARED LM (kenlm actually does this!)?
 ITERATION variants
  once -> average
  shuffle resulting weights
 weights AVERAGING in reducer (global Ngram counts)
 BATCH implementation (no update after each Kbest list)
 set REFERENCE for cdec (rescoring)?
 MORE THAN ONE reference for BLEU?
 kbest NICER (do not iterate twice)!? -> shared_ptr?
 DO NOT USE Decoder::Decode (input caching as WordID)!?
  sparse vector instead of vector<double> for weights in Decoder(::SetWeights)?
 reactivate DTEST and tests
 non-deterministic, high variance, RANDOM RESTARTS
 use separate TEST SET
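
 Sketch for the weights AVERAGING item above (and the "once -> average"
 iteration variant). This is only an illustration, not the dtrain or cdec
 implementation: SparseWeights and AverageWeights are hypothetical names, and
 a plain std::map stands in for whatever sparse vector type the reducer would
 really use. The assumption is that each shard hands in one weight vector and
 that a feature missing from a shard counts as 0 there.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a sparse weight vector: feature name -> weight.
typedef std::map<std::string, double> SparseWeights;

// Average the weight vectors produced by several shards.
// A feature that a shard never saw contributes 0 for that shard, so the
// divisor is always the number of shards, not the number of occurrences.
SparseWeights AverageWeights(const std::vector<SparseWeights>& shards) {
  SparseWeights avg;
  if (shards.empty()) return avg;

  // Sum up every feature over all shards.
  for (std::size_t i = 0; i < shards.size(); ++i) {
    for (SparseWeights::const_iterator it = shards[i].begin();
         it != shards[i].end(); ++it) {
      avg[it->first] += it->second;
    }
  }

  // Divide each sum by the shard count.
  const double n = static_cast<double>(shards.size());
  for (SparseWeights::iterator it = avg.begin(); it != avg.end(); ++it) {
    it->second /= n;
  }
  return avg;
}
```

 Dividing by the shard count keeps rarely seen features small; dividing by
 the number of shards that actually saw the feature would be a different
 (also plausible) choice.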

KNOWN BUGS, PROBLEMS
 cdec kbest vs 1best (no -k param), rescoring? => ok(?)
 no sparse vector in decoder => ok
 ? ok
 sh: error while loading shared libraries: libreadline.so.6: cannot open shared object file: Error 24
 PhraseModel_* features (0..99 seem to be generated, why 99?)
 flex scanner jams on malicious input, we could skip that