author | Patrick Simianer <simianer@cl.uni-heidelberg.de> | 2012-08-01 17:32:37 +0200 |
---|---|---|
committer | Patrick Simianer <simianer@cl.uni-heidelberg.de> | 2012-08-01 17:32:37 +0200 |
commit | 3f8e33cfe481a09c121a410e66a6074b5d05683e (patch) | |
tree | a41ecaf0bbb69fa91a581623abe89d41219c04f8 /sa-extract/README | |
parent | c139ce495861bb341e1b86a85ad4559f9ad53c14 (diff) | |
parent | 9fe0219562e5db25171cce8776381600ff9a5649 (diff) |
Merge remote-tracking branch 'upstream/master'
Diffstat (limited to 'sa-extract/README')
-rw-r--r-- | sa-extract/README | 62 |
1 files changed, 0 insertions, 62 deletions
diff --git a/sa-extract/README b/sa-extract/README
deleted file mode 100644
index e4022c7e..00000000
--- a/sa-extract/README
+++ /dev/null
@@ -1,62 +0,0 @@
-SUFFIX-ARRAY-EXTRACT README
-  Feb 1, 2012
-
-Written by Adam Lopez, repackaged by Chris Dyer.
-
-Originally based on parts of Hiero, by David Chiang, but these dependencies
-have been removed or rewritten.
-
-
-BUILD INSTRUCTIONS
-==============================================================================
-
-Requirements:
-  Python 2.7 or later (http://www.python.org)
-  Cython 0.14.1 or later (http://cython.org/)
-
-- Edit Makefile to set the location of Python/Cython then do:
-
-  make
-
-
-COMPILING A PARALLEL CORPUS AND WORD ALIGNMENT
-==============================================================================
-- Run sa-compile.pl to compile the training data and generate an extract.ini
-  file (which is written to STDOUT):
-
-  sa-compile.pl -b bitext_name=source.fr,target.en \
-                -a alignment_name=alignment.txt > extract.ini
-
-
-  The training data should be in two parallel text files (source.fr,source.en)
-  and the alignments are expected in "0-0 1-2 2-1 ..." format produced by
-  most alignment toolkits. The text files should NOT be escaped for non-XML
-  characters.
-
-
-EXTRACTION OF PER-SENTENCE GRAMMARS
-==============================================================================
-The most common use-case we support is extraction of "per-sentence" grammars
-for each segment in a testset. You may run the extractor on test set, but it
-will try to interpret tags as SGML markup, so we provide a script that does
-escaping: ./escape-testset.pl.
-
-- Example:
-
-  cat test.fr | ./escape-testset.pl | ./extractor.py -c extract.ini
-
-
-EXTRACTION OF COMPLETE TEST-SET GRAMMARS
-==============================================================================
-Edit the generated extract.ini file a change per_sentence_grammar
-to False. Then, run extraction as normal.
-
-Note: extracting a single grammar for an entire test set will consume more
-memory during extraction and (probably) during decoding.
-
-
-EXAMPLE
-==============================================================================
-- See example/ and the README therein.
-
-
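
The removed README expects word alignments in the "0-0 1-2 2-1 ..." format. As a rough illustration of that format only, the sketch below parses one alignment line into index pairs and checks them against a sentence pair. The function names are hypothetical and are not part of sa-extract. Reading the first index of each pair as the source position and the second as the target position is an assumption (the usual convention for this format); the README itself does not say which side is which.

```python
def parse_alignment_line(line):
    """Parse a line such as "0-0 1-2 2-1" into a list of (int, int) pairs."""
    pairs = []
    for token in line.split():
        # Each whitespace-separated token has the form "<index>-<index>".
        left, right = token.split("-")
        pairs.append((int(left), int(right)))
    return pairs


def alignment_in_range(pairs, source_tokens, target_tokens):
    """Return True if every pair points at an existing token on both sides.

    Assumption: the first index refers to the source sentence and the
    second to the target sentence, both 0-based.
    """
    return all(
        0 <= i < len(source_tokens) and 0 <= j < len(target_tokens)
        for i, j in pairs
    )


if __name__ == "__main__":
    source = "le chat noir".split()
    target = "the black cat".split()
    pairs = parse_alignment_line("0-0 1-2 2-1")
    print(pairs)
    print(alignment_in_range(pairs, source, target))
```

A check like this can be run over the bitext and alignment files before compiling them, since an out-of-range index usually means the two files are not line-aligned.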