From 628c4ecb641096c6526c7e6062460e627433f8fa Mon Sep 17 00:00:00 2001
From: Patrick Simianer
Date: Fri, 14 Oct 2011 15:40:23 +0200
Subject: test
---
dtrain/README | 36 ------------------------------------
dtrain/README.md | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 36 deletions(-)
delete mode 100644 dtrain/README
create mode 100644 dtrain/README.md
diff --git a/dtrain/README b/dtrain/README
deleted file mode 100644
index 997c5ff3..00000000
--- a/dtrain/README
+++ /dev/null
@@ -1,36 +0,0 @@
-TODO
- MULTIPARTITE ranking (108010, 1 vs all, cluster modelscore;score)
- what about RESCORING?
- REMEMBER kbest (merge) weights?
- SELECT iteration with highest (real) BLEU?
- GENERATED data? (multi-task, ability to learn, perfect translation in nbest, at first all modelscore 1)
- CACHING (ngrams for scoring)
- hadoop PIPES implementation
- SHARED LM (kenlm actually does this!)?
- ITERATION variants
- once -> average
- shuffle resulting weights
- weights AVERAGING in reducer (global Ngram counts)
- BATCH implementation (no update after each Kbest list)
- set REFERENCE for cdec (rescoring)?
- MORE THAN ONE reference for BLEU?
- kbest NICER (do not iterate twice)!? -> shared_ptr?
- DO NOT USE Decoder::Decode (input caching as WordID)!?
- sparse vector instead of vector for weights in Decoder(::SetWeights)?
- reactivate DTEST and tests
- non deterministic, high variance, RANDOM RESTARTS
- use separate TEST SET
-
-KNOWN BUGS, PROBLEMS
- doesn't select best iteration for weights
- if size of candidate < N => 0 score
- cdec kbest vs 1best (no -k param), rescoring? => ok(?)
- no sparse vector in decoder => ok
- ? ok
- sh: error while loading shared libraries: libreadline.so.6: cannot open shared object file: Error 24
- PhraseModel_* features (0..99 seem to be generated, why 99?)
- flex scanner jams on malicious input, we could skip that
-
-FIX
- merge
- ep data
diff --git a/dtrain/README.md b/dtrain/README.md
new file mode 100644
index 00000000..dc980faf
--- /dev/null
+++ b/dtrain/README.md
@@ -0,0 +1,50 @@
+IDEAS
+=====
+ MULTIPARTITE ranking (108010, 1 vs all, cluster modelscore;score)
+ what about RESCORING?
+ REMEMBER kbest (merge) weights?
+ SELECT iteration with highest (real) BLEU?
+ GENERATED data? (multi-task, ability to learn, perfect translation in nbest, at first all modelscore 1)
+ CACHING (ngrams for scoring)
+ hadoop PIPES implementation
+ SHARED LM (kenlm actually does this!)?
+ ITERATION variants
+ once -> average
+ shuffle resulting weights
+ weights AVERAGING in reducer (global Ngram counts)
+ BATCH implementation (no update after each Kbest list)
+ set REFERENCE for cdec (rescoring)?
+ MORE THAN ONE reference for BLEU?
+ kbest NICER (do not iterate twice)!? -> shared_ptr?
+ DO NOT USE Decoder::Decode (input caching as WordID)!?
+ sparse vector instead of vector for weights in Decoder(::SetWeights)?
+ reactivate DTEST and tests
+ non-deterministic, high variance, RANDOM RESTARTS
+ use separate TEST SET
+
+Uncertain, known bugs, problems
+===============================
+* cdec kbest vs 1best (no -k param), rescoring? => ok(?)
+* no sparse vector in decoder => ok/fixed
+* PhraseModel_* features (0..99 seem to be generated, why 99?)
+* flex scanner jams on malicious input, we could skip that
+
+FIXME
+=====
+* merge
+* ep data
+
+Data
+====
+
+nc-v6.de-en peg
+nc-v6.de-en.loo peg
+nc-v6.de-en.giza.loo peg
+nc-v6.de-en.symgiza.loo pe
+nc-v6.de-en.cs pe
+nc-v6.de-en.cs.loo pe
+--
+ep-v6.de-en.cs p
+ep-v6.de-en.cs.loo p
+
+