initial checkin

git-svn-id: https://ws10smt.googlecode.com/svn/trunk@2 ec762483-ff6d-05da-a07a-a48fb63a330f
author: redpony <redpony@ec762483-ff6d-05da-a07a-a48fb63a330f> 2010-06-22 05:12:27 +0000
committer: redpony <redpony@ec762483-ff6d-05da-a07a-a48fb63a330f> 2010-06-22 05:12:27 +0000
commit: 7cc92b65a3185aa242088d830e166e495674efc9 (patch)
tree: 681fe5237612a4e96ce36fb9fabef00042c8ee61 /compound-split/README
parent: 37728b8be4d0b3df9da81fdda2198ff55b4b2d91 (diff)
1 files changed, 51 insertions, 0 deletions
diff --git a/compound-split/README b/compound-split/README
new file mode 100644
index 00000000..b7491007
--- /dev/null
+++ b/compound-split/README
@@ -0,0 +1,51 @@
+Instructions for running the compound splitter, which is a reimplementation
+and extension (more features, larger non-word list) of the model described in
+
+  C. Dyer. (2009)  Using a maximum entropy model to build segmentation
+            lattices for MT. In Proceedings of NAACL HLT 2009,
+            Boulder, Colorado, June 2009
+
+If you use this software, please cite this paper.
+
+
+GENERATING 1-BEST SEGMENTATIONS AND LATTICES
+------------------------------------------------------------------------------
+
+Here are some sample invokations:
+
+  ./compound-split.pl --output 1best < infile.txt > out.1best.txt
+      Segment infile.txt according to the 1-best segmentation file.
+
+  ./compound-split.pl --output plf < infile.txt > out.plf
+
+  ./compound-split.pl --output plf --beam 3.5 < infile.txt > out.plf
+      This generates denser lattices than usual (the default beam threshold
+      is 2.2, higher numbers do less pruning)
+
+
+MODEL TRAINING (only for the adventuresome)
+------------------------------------------------------------------------------
+
+I've included some training data for training a German language lattice
+segmentation model, and if you want to explore, you can or change the data.
+If you're especially adventuresome, you can add features to cdec (the current
+feature functions are found in ff_csplit.cc).  The training/references are
+in the file:
+
+               dev.in-ref
+
+The format is the unsegmented form on the right and the reference lattice on
+the left, separated by a triple pipe ( ||| ).  Note that the segmentation
+model inserts a # as the first word, so your segmentation references must
+include this.
+
+To retrain the model (using MAP estimation of a conditional model), do the
+following:
+
+  cd de
+  ./TRAIN
+
+Note, the optimization objective is supposed to be non-convex, but i haven't
+found much of an effect of where I initialize things.  But I haven't looked
+very hard- this might be something to explore.
+
author	redpony <redpony@ec762483-ff6d-05da-a07a-a48fb63a330f>	2010-06-22 05:12:27 +0000
committer	redpony <redpony@ec762483-ff6d-05da-a07a-a48fb63a330f>	2010-06-22 05:12:27 +0000
commit	7cc92b65a3185aa242088d830e166e495674efc9 (patch)
tree	681fe5237612a4e96ce36fb9fabef00042c8ee61 /compound-split/README
parent	37728b8be4d0b3df9da81fdda2198ff55b4b2d91 (diff)