compound-split/README.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51

Instructions for running the compound splitter, which is a reimplementation
and extension (more features, larger non-word list) of the model described in

  C. Dyer. (2009)  Using a maximum entropy model to build segmentation
            lattices for MT. In Proceedings of NAACL HLT 2009,
            Boulder, Colorado, June 2009

If you use this software, please cite this paper.


GENERATING 1-BEST SEGMENTATIONS AND LATTICES
------------------------------------------------------------------------------

Here are some sample invokations:

  ./compound-split.pl --output 1best < infile.txt > out.1best.txt
      Segment infile.txt according to the 1-best segmentation file.

  ./compound-split.pl --output plf < infile.txt > out.plf

  ./compound-split.pl --output plf --beam 3.5 < infile.txt > out.plf
      This generates denser lattices than usual (the default beam threshold
      is 2.2, higher numbers do less pruning)


MODEL TRAINING (only for the adventuresome)
------------------------------------------------------------------------------

I've included some training data for training a German language lattice
segmentation model, and if you want to explore, you can or change the data.
If you're especially adventuresome, you can add features to cdec (the current
feature functions are found in ff_csplit.cc).  The training/references are
in the file:

               dev.in-ref

The format is the unsegmented form on the right and the reference lattice on
the left, separated by a triple pipe ( ||| ).  Note that the segmentation
model inserts a # as the first word, so your segmentation references must
include this.

To retrain the model (using MAP estimation of a conditional model), do the
following:

  cd de
  ./TRAIN

Note, the optimization objective is supposed to be non-convex, but i haven't
found much of an effect of where I initialize things.  But I haven't looked
very hard- this might be something to explore.