diff options
author | Patrick Simianer <p@simianer.de> | 2012-03-13 09:24:47 +0100 |
---|---|---|
committer | Patrick Simianer <p@simianer.de> | 2012-03-13 09:24:47 +0100 |
commit | ef6085e558e26c8819f1735425761103021b6470 (patch) | |
tree | 5cf70e4c48c64d838e1326b5a505c8c4061bff4a /sa-extract/README | |
parent | 10a232656a0c882b3b955d2bcfac138ce11e8a2e (diff) | |
parent | dfbc278c1057555fda9312291c8024049e00b7d8 (diff) |
merge with upstream
Diffstat (limited to 'sa-extract/README')
-rw-r--r-- | sa-extract/README | 62 |
1 files changed, 62 insertions, 0 deletions
diff --git a/sa-extract/README b/sa-extract/README new file mode 100644 index 00000000..e4022c7e --- /dev/null +++ b/sa-extract/README @@ -0,0 +1,62 @@ +SUFFIX-ARRAY-EXTRACT README + Feb 1, 2012 + +Written by Adam Lopez, repackaged by Chris Dyer. + +Originally based on parts of Hiero, by David Chiang, but these dependencies +have been removed or rewritten. + + +BUILD INSTRUCTIONS +============================================================================== + +Requirements: + Python 2.7 or later (http://www.python.org) + Cython 0.14.1 or later (http://cython.org/) + +- Edit Makefile to set the location of Python/Cython then do: + + make + + +COMPILING A PARALLEL CORPUS AND WORD ALIGNMENT +============================================================================== +- Run sa-compile.pl to compile the training data and generate an extract.ini + file (which is written to STDOUT): + + sa-compile.pl -b bitext_name=source.fr,target.en \ + -a alignment_name=alignment.txt > extract.ini + + + The training data should be in two parallel text files (source.fr,source.en) + and the alignments are expected in "0-0 1-2 2-1 ..." format produced by + most alignment toolkits. The text files should NOT be escaped for non-XML + characters. + + +EXTRACTION OF PER-SENTENCE GRAMMARS +============================================================================== +The most common use-case we support is extraction of "per-sentence" grammars +for each segment in a testset. You may run the extractor on test set, but it +will try to interpret tags as SGML markup, so we provide a script that does +escaping: ./escape-testset.pl. + +- Example: + + cat test.fr | ./escape-testset.pl | ./extractor.py -c extract.ini + + +EXTRACTION OF COMPLETE TEST-SET GRAMMARS +============================================================================== +Edit the generated extract.ini file a change per_sentence_grammar +to False. Then, run extraction as normal. + +Note: extracting a single grammar for an entire test set will consume more +memory during extraction and (probably) during decoding. + + +EXAMPLE +============================================================================== +- See example/ and the README therein. + + |