author Chris Dyer <cdyer@cs.cmu.edu> 2012-02-03 18:03:49 -0500
committer Chris Dyer <cdyer@cs.cmu.edu> 2012-02-03 18:03:49 -0500
commit 3a2fc36378337147a956e439db31baf91bfb95c8
tree e096fa0d0628fe3d09bb8dc0dcc0d15f617eb32d /sa-extract/README
parent dbf367e0fc9d3faf906340d1f51f2dbda1892081
escaping tool for grammar extractor
Diffstat (limited to 'sa-extract/README')
-rw-r--r-- sa-extract/README 14
1 files changed, 13 insertions, 1 deletions
diff --git a/sa-extract/README b/sa-extract/README
index f43e58cc..e4022c7e 100644
--- a/sa-extract/README
+++ b/sa-extract/README
@@ -28,10 +28,22 @@ COMPILING A PARALLEL CORPUS AND WORD ALIGNMENT
-a alignment_name=alignment.txt > extract.ini
+ The training data should be in two parallel text files (source.fr, source.en),
+ and the alignments are expected in the "0-0 1-2 2-1 ..." format produced by
+ most alignment toolkits. The text files should NOT be escaped for non-XML
+ characters.
+
+
EXTRACTION OF PER-SENTENCE GRAMMARS
==============================================================================
+The most common use case we support is extraction of "per-sentence" grammars
+for each segment in a test set. You may run the extractor on the test set, but it
+will try to interpret tags as SGML markup, so we provide a script that does
+the escaping: ./escape-testset.pl.
+
- Example:
- cat test.fr | extractor.py -c extract.ini
+
+ cat test.fr | ./escape-testset.pl | ./extractor.py -c extract.ini
EXTRACTION OF COMPLETE TEST-SET GRAMMARS
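
The "0-0 1-2 2-1 ..." alignment format mentioned in the patch can be sanity-checked before compiling a corpus. The following is a minimal sketch, not part of sa-extract: the function names are hypothetical, and it assumes each link is "source_index-target_index", that indices are zero-based, and that sentences are tokenized on whitespace.

```python
def parse_alignment(line):
    """Parse one alignment line such as "0-0 1-2 2-1" into (src, tgt) index pairs.

    Assumption: each whitespace-separated token is "source_index-target_index".
    """
    pairs = []
    for token in line.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def check_alignment(source_sentence, target_sentence, alignment_line):
    """Return True if every link points at an existing token in both sentences.

    Assumption: indices are zero-based and tokens are separated by whitespace.
    """
    n_src = len(source_sentence.split())
    n_tgt = len(target_sentence.split())
    return all(
        0 <= s < n_src and 0 <= t < n_tgt
        for s, t in parse_alignment(alignment_line)
    )
```

Running check_alignment over each (source.fr, source.en, alignment.txt) line triple would flag links that fall outside a sentence, which usually indicates misaligned files or a different index convention.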