fix

author: Chris Dyer <cdyer@cs.cmu.edu> 2012-11-16 00:24:52 -0500
committer: Chris Dyer <cdyer@cs.cmu.edu> 2012-11-16 00:24:52 -0500
commit: db9897bcafe5f732cee5c1c0fe5c9d9eaecdef0e (patch)
tree: 08666a65a28e0e03520f024c62e9518039a88070
parent: 0fcf21f26c77ccc22f14e66a15ef3c51080d12ef (diff)
1 files changed, 5 insertions, 0 deletions
diff --git a/corpus/README.md b/corpus/README.md
index 935d9a65..adc35b84 100644
--- a/corpus/README.md
+++ b/corpus/README.md
@@ -5,11 +5,13 @@ Many of these scripts assume that the input is [UTF-8 encoded](http://en.wikiped
 ## Paste parallel files together
 
 This script reads one line at a time from a set of files and concatenates them with a triple pipe separator (`|||`) in the output. This is useful for generating parallel corpora files for training or evaluation:
+
     ./paste-files.pl file.a file.b file.c [...]
 
 ## Punctuation Normalization and Tokenization
 
 This script tokenizes text in any language (well, it does a good job in most languages, and in some it will completely go crazy):
+
     ./tokenize-anything.sh < input.txt > output.txt
 
 It also normalizes a lot of unicode symbols and even corrects some common encoding errors. It can be applied to monolingual and parallel corpora directly.
@@ -17,16 +19,19 @@ It also normalizes a lot of unicode symbols and even corrects some common encodi
 ## Text lowercasing
 
 This script also does what it says, provided your input is in UTF8:
+
     ./lowercase.pl < input.txt > output.txt
 
 ## Length ratio filtering (for parallel corpora)
 
 This script computes statistics about sentence length ratios in a parallel corpus and removes sentences that are statistical outliers. This tends to remove extremely poorly aligned sentence pairs or sentence pairs that would otherwise be difficult to align:
+
     ./filter-length.pl input.src-trg > output.src-trg
 
 ## Add infrequent self-translations to a parallel corpus
 
 This script identifies rare words (those that occur fewer than 2 times in the corpus) that have the same orthographic form in both the source and target language. Several copies of these words are then appended to the end of the output corpus, which improves alignment quality.
+
     ./add-self-translations.pl input.src-trg > output.src-trg
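
The pasting step described in the README above (read one line from each file, join them with `|||`) can be sketched in a few lines of Python. This is only an illustration, not the code in `paste-files.pl`: the function name `paste_lines`, the spaces around the separator, and stopping at the shortest input are all assumptions.

```python
def paste_lines(*streams, sep=" ||| "):
    """Join the i-th line of every input stream into one output line.

    Assumptions (not taken from paste-files.pl): the separator is padded
    with single spaces, and iteration stops at the shortest input.
    """
    for row in zip(*streams):
        # Strip only the trailing newline so inner whitespace is preserved.
        yield sep.join(line.rstrip("\n") for line in row)


if __name__ == "__main__":
    source = ["a b\n", "c\n"]
    target = ["x\n", "y z\n"]
    for joined in paste_lines(source, target):
        print(joined)
```

Open file handles work as inputs too, since they iterate line by line.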
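
The self-translation step described in the README can be sketched roughly as follows. This is a hypothetical reimplementation, not the logic of `add-self-translations.pl`: whitespace tokenization, counting each side separately, and the `copies=3` default are assumptions made for illustration.

```python
from collections import Counter


def add_self_translations(pairs, min_count=2, copies=3):
    """Append (word, word) pairs for rare words shared by both sides.

    Assumptions (not taken from the original script): tokens are split on
    whitespace, a word is rare if it occurs fewer than `min_count` times
    on each side, and `copies` identical pairs are appended per word.
    """
    pairs = list(pairs)
    src_counts, tgt_counts = Counter(), Counter()
    for src, tgt in pairs:
        src_counts.update(src.split())
        tgt_counts.update(tgt.split())

    # Words with the same spelling on both sides that are rare on both.
    rare_shared = sorted(
        w for w in src_counts
        if w in tgt_counts
        and src_counts[w] < min_count
        and tgt_counts[w] < min_count
    )

    # The extra pairs go at the end of the corpus, as the README describes.
    extra = [(w, w) for w in rare_shared for _ in range(copies)]
    return pairs + extra


if __name__ == "__main__":
    corpus = [("the Xbox is here", "die Xbox ist hier"), ("the cat", "die Katze")]
    for src, tgt in add_self_translations(corpus):
        print(f"{src} ||| {tgt}")
```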