author Chris Dyer <prguest11@taipan.cs> 2012-02-02 06:29:50 +0000
committer Chris Dyer <prguest11@taipan.cs> 2012-02-02 06:29:50 +0000
commit 8e5fad9bcbadf36bbab3c1c5b053e3c8f7dddbce
tree 9c812b3f267aa1975cdf8b7af928c4b20eb36f93 /sa-extract/README
parent ff496d3089e84846c8562c574155d8df1e4d911c
lopez suffix array extractor with copyrighted david chiang code excised
Diffstat (limited to 'sa-extract/README')
-rw-r--r-- sa-extract/README 50
1 files changed, 50 insertions, 0 deletions
diff --git a/sa-extract/README b/sa-extract/README
new file mode 100644
index 00000000..f43e58cc
--- /dev/null
+++ b/sa-extract/README
@@ -0,0 +1,50 @@
+SUFFIX-ARRAY-EXTRACT README
+ Feb 1, 2012
+
+Written by Adam Lopez, repackaged by Chris Dyer.
+
+Originally based on parts of Hiero, by David Chiang, but these dependencies
+have been removed or rewritten.
+
+
+BUILD INSTRUCTIONS
+==============================================================================
+
+Requirements:
+ Python 2.7 or later (http://www.python.org)
+ Cython 0.14.1 or later (http://cython.org/)
+
+- Edit the Makefile to set the locations of Python and Cython, then run:
+
+ make
+
+
+COMPILING A PARALLEL CORPUS AND WORD ALIGNMENT
+==============================================================================
+- Run sa-compile.pl to compile the training data and generate an extract.ini
+ file (which is written to STDOUT):
+
+ sa-compile.pl -b bitext_name=source.fr,target.en \
+ -a alignment_name=alignment.txt > extract.ini
+
+
+EXTRACTION OF PER-SENTENCE GRAMMARS
+==============================================================================
+- Example:
+ cat test.fr | extractor.py -c extract.ini
+
+
+EXTRACTION OF COMPLETE TEST-SET GRAMMARS
+==============================================================================
+Edit the generated extract.ini file and change per_sentence_grammar
+to False. Then run extraction as normal.
+
+Note: extracting a single grammar for an entire test set will consume more
+memory during extraction and (probably) during decoding.
+
+
+EXAMPLE
+==============================================================================
+- See example/ and the README therein.
+
+
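The extract.ini edit described under EXTRACTION OF COMPLETE TEST-SET GRAMMARS
could be scripted rather than done by hand. The following is only a sketch and
is not part of this commit. It assumes that extract.ini stores the option on a
single line of the form "per_sentence_grammar=True" (optionally with spaces
around the "="), which the README does not confirm. The function names
disable_per_sentence_grammar and rewrite_config are hypothetical.

```python
import re

# Assumption: the option appears on its own line as
# "per_sentence_grammar=<value>", optionally with spaces around "=".
_OPTION = re.compile(
    r"^([ \t]*per_sentence_grammar[ \t]*=[ \t]*)\S+(.*)$",
    re.MULTILINE,
)


def disable_per_sentence_grammar(config_text):
    """Return config_text with per_sentence_grammar set to False."""
    new_text, count = _OPTION.subn(r"\g<1>False\g<2>", config_text)
    if count == 0:
        # Fail loudly instead of silently leaving the file unchanged.
        raise ValueError("per_sentence_grammar not found in config")
    return new_text


def rewrite_config(path):
    """Rewrite the config file at path in place (hypothetical helper)."""
    with open(path) as handle:
        text = handle.read()
    with open(path, "w") as handle:
        handle.write(disable_per_sentence_grammar(text))
```

Calling rewrite_config("extract.ini") and then running the extractor as in the
README example would, under the assumption above, produce a single grammar for
the whole test set.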