blob: 4ed31d43134e125ca53de7ea226c3095db773eab (
plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
|
dtrain
======
Build & run
-----------
build ..
<pre>
git clone git://github.com/qlt/cdec-dtrain.git
cd cdec_dtrain
autoreconf -ifv
./configure
make
</pre>
and run:
<pre>
cd dtrain/hstreaming/
(edit ini files)
edit hadoop-streaming-job.sh $IN and $OUT
./hadoop-streaming-job.sh
</pre>
Ideas
-----
* *MULTIPARTITE* ranking (1 vs all, cluster model/score)
* *REMEMBER* sampled translations (merge)
* *SELECT* iteration with highest (_real_) BLEU?
* *GENERATED* data? (perfect translation in kbest)
* *CACHING* (ngrams for scoring)
* hadoop *PIPES* imlementation
* *ITERATION* variants (shuffle resulting weights, re-iterate)
* *MORE THAN ONE* reference for BLEU?
* *RANDOM RESTARTS*
* use separate TEST SET for each shard
Uncertain, known bugs, problems
-------------------------------
* cdec kbest vs 1best (no -k param), rescoring (ref?)? => ok(?)
* no sparse vector in decoder => ok/fixed
* PhraseModel_* features (0..99 seem to be generated, why 99?)
* flex scanner jams on malicious input, we could skip that
* input/grammar caching (strings, files)
FIXME
-----
* merge with cdec master
Data
----
<pre>
nc-v6.de-en peg
nc-v6.de-en.loo peg
nc-v6.de-en.giza.loo peg
nc-v6.de-en.symgiza.loo pe
nv-v6.de-en.cs pe
nc-v6.de-en.cs.loo pe
--
ep-v6.de-en.cs p
ep-v6.de-en.cs.loo p
p: prep, e: extract, g: grammar, d: dtrain
</pre>
|