diff options
author | redpony <redpony@ec762483-ff6d-05da-a07a-a48fb63a330f> | 2010-06-22 05:12:27 +0000 |
---|---|---|
committer | redpony <redpony@ec762483-ff6d-05da-a07a-a48fb63a330f> | 2010-06-22 05:12:27 +0000 |
commit | 0172721855098ca02b207231a654dffa5e4eb1c9 (patch) | |
tree | 8069c3a62e2d72bd64a2cdeee9724b2679c8a56b /compound-split/README | |
parent | 37728b8be4d0b3df9da81fdda2198ff55b4b2d91 (diff) |
initial checkin
git-svn-id: https://ws10smt.googlecode.com/svn/trunk@2 ec762483-ff6d-05da-a07a-a48fb63a330f
Diffstat (limited to 'compound-split/README')
-rw-r--r-- | compound-split/README | 51 |
1 files changed, 51 insertions, 0 deletions
diff --git a/compound-split/README b/compound-split/README new file mode 100644 index 00000000..b7491007 --- /dev/null +++ b/compound-split/README @@ -0,0 +1,51 @@ +Instructions for running the compound splitter, which is a reimplementation +and extension (more features, larger non-word list) of the model described in + + C. Dyer. (2009) Using a maximum entropy model to build segmentation + lattices for MT. In Proceedings of NAACL HLT 2009, + Boulder, Colorado, June 2009 + +If you use this software, please cite this paper. + + +GENERATING 1-BEST SEGMENTATIONS AND LATTICES +------------------------------------------------------------------------------ + +Here are some sample invokations: + + ./compound-split.pl --output 1best < infile.txt > out.1best.txt + Segment infile.txt according to the 1-best segmentation file. + + ./compound-split.pl --output plf < infile.txt > out.plf + + ./compound-split.pl --output plf --beam 3.5 < infile.txt > out.plf + This generates denser lattices than usual (the default beam threshold + is 2.2, higher numbers do less pruning) + + +MODEL TRAINING (only for the adventuresome) +------------------------------------------------------------------------------ + +I've included some training data for training a German language lattice +segmentation model, and if you want to explore, you can or change the data. +If you're especially adventuresome, you can add features to cdec (the current +feature functions are found in ff_csplit.cc). The training/references are +in the file: + + dev.in-ref + +The format is the unsegmented form on the right and the reference lattice on +the left, separated by a triple pipe ( ||| ). Note that the segmentation +model inserts a # as the first word, so your segmentation references must +include this. + +To retrain the model (using MAP estimation of a conditional model), do the +following: + + cd de + ./TRAIN + +Note, the optimization objective is supposed to be non-convex, but i haven't +found much of an effect of where I initialize things. But I haven't looked +very hard- this might be something to explore. + |