From 0c28b8dc375722c631486377217c6c8a6a362b5a Mon Sep 17 00:00:00 2001
From: Patrick Simianer
Date: Sat, 29 Oct 2011 19:56:55 +0200
Subject: README

---
 dtrain/README.md | 64 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 54 insertions(+), 10 deletions(-)

diff --git a/dtrain/README.md b/dtrain/README.md
index bc96ed18..3d09393c 100644
--- a/dtrain/README.md
+++ b/dtrain/README.md
@@ -45,7 +45,10 @@ FIXME
 -----
 merge dtrain part-* files
 mapred count shard sents
-
+250 forest sampling is real bad, bug?
+kenlm not portable (i7-2620M vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz)
+metric reporter of bleu for each shard
+mapred chaining? hamake?
 
 Data
 ----
@@ -66,16 +69,57 @@ p: prep, e: extract, g: grammar, d: dtrain
 
 Experiments
 -----------
-features
-  TODO
+grammar stats
+  oov on dev/devtest/test
+  size
+  #rules (uniq)
+  time for building
+    ep: 1.5 days on 278 slots (30 nodes)
+    nc: ~2 hours ^^^
+
+lm stats
+  oov on dev/devtest/test
+  perplex on train/dev/devtest/test
+
+which word alignment?
+  berkeleyaligner
+  giza++ as of Sep 24 2011, mgizapp 0.6.3
+  symgiza as of Oct 1 2011
+  ---
+  randomly sample 100 from train.loo
+  run mira/dtrain for 50/60 iterations
+  w/o lm, wp
+  measure ibm_bleu on exact same sents
+
+stability
+  mira: 100
+  pro: 100
+  vest: 100
+  dtrain: 100
+
+
+pro metaparam
+  (max) iter
+  regularization
+
+mira metaparam
+  (max) iter: 10 (nc???) vs 15 (ep???)
 
-"lm open better than lm closed when tuned"
+lm?
+  3-4-5
+  open
+  unk
+  nounk (-100 for unk)
+  --
+  tune or not???
+  lm oov weight pos?
 
-mira100-10
-mira100-17
+features to try
+  NgramFeatures -> target side ngrams
+  RuleIdentityFeatures
+  RuleNgramFeatures -> source side ngrams from rule
+  RuleShape -> relative orientation of X's and terminals
+  SpanFeatures -> http://www.cs.cmu.edu/~cdyer/wmt11-sysdesc.pdf
+  ArityPenalty -> Arity=0 Arity=1 and Arity=2
 
-baselines
-  mira
-  pro
-  vest
 
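The FIXME section above lists "merge dtrain part-* files". A minimal sketch of one way to do that merge, averaging feature weights across shards; it assumes each part-* file holds plain `feature<TAB>weight` lines, which may differ from the actual dtrain output format, and `merge_weight_files` is a hypothetical helper, not part of dtrain:

```python
# Hypothetical sketch: average per-shard weight files (part-*) produced by a
# parallelized dtrain run. Assumes one "feature<TAB>weight" pair per line.
import glob
from collections import defaultdict

def merge_weight_files(paths):
    sums = defaultdict(float)    # summed weight per feature
    counts = defaultdict(int)    # number of shards that emitted the feature
    for path in paths:
        with open(path) as f:
            for line in f:
                feat, w = line.rstrip("\n").split("\t")
                sums[feat] += float(w)
                counts[feat] += 1
    # average each feature over the shards that produced it
    return {feat: sums[feat] / counts[feat] for feat in sums}

if __name__ == "__main__":
    merged = merge_weight_files(sorted(glob.glob("part-*")))
    for feat, w in sorted(merged.items()):
        print("%s\t%f" % (feat, w))
```

Averaging only over the shards that actually fired a feature (rather than over all shards) is one design choice; dividing by the total shard count instead would shrink weights of rare features toward zero.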