escaping tool for grammar extractor

author: Chris Dyer <cdyer@cs.cmu.edu> 2012-02-03 18:03:49 -0500
committer: Chris Dyer <cdyer@cs.cmu.edu> 2012-02-03 18:03:49 -0500
commit: f08ff03664ee7c9601c9daaa217cb032160f386f (patch)
tree: 5e93393df8bddb128a778f29ea86a0ea81ce7ebf /sa-extract/README
parent: 16d08eefddbecfefced16a0dd5a13d4c64c139b0 (diff)
1 files changed, 13 insertions, 1 deletions
diff --git a/sa-extract/README b/sa-extract/README
index f43e58cc..e4022c7e 100644
--- a/sa-extract/README
+++ b/sa-extract/README
@@ -28,10 +28,22 @@ COMPILING A PARALLEL CORPUS AND WORD ALIGNMENT
                 -a alignment_name=alignment.txt > extract.ini
 
 
+  The training data should be in two parallel text files (source.fr, source.en),
+  and the alignments are expected in the "0-0 1-2 2-1 ..." format produced by
+  most alignment toolkits. The text files should NOT be escaped for non-XML
+  characters.
+
+
 EXTRACTION OF PER-SENTENCE GRAMMARS
 ==============================================================================
+The most common use case we support is extraction of "per-sentence" grammars
+for each segment in a test set. You can run the extractor on a test set, but it
+will try to interpret tags as SGML markup, so we provide a script that does
+the escaping: ./escape-testset.pl.
+
 - Example:
-  cat test.fr | extractor.py -c extract.ini
+
+  cat test.fr | ./escape-testset.pl | ./extractor.py -c extract.ini
 
 
 EXTRACTION OF COMPLETE TEST-SET GRAMMARS
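
For readers unfamiliar with the "0-0 1-2 2-1 ..." alignment format mentioned in the patch, the following is a minimal sketch of how one such line could be read. It is not part of the commit or of sa-extract; the function name parse_alignment is hypothetical, and the sketch assumes the first index in each pair refers to the source sentence and the second to the target sentence, which the README does not state.

```python
def parse_alignment(line):
    """Parse one alignment line such as "0-0 1-2 2-1" into index pairs.

    Assumption: each token is "i-j", where i is a 0-based word position
    in the source sentence and j is one in the target sentence.
    """
    pairs = []
    for token in line.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


if __name__ == "__main__":
    print(parse_alignment("0-0 1-2 2-1"))
```

Each line of the alignment file would correspond to one sentence pair at the same line number in the two parallel text files.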