path: root/corpus
Age | Commit message | Author
2016-01-03 | corpus stats script | Chris Dyer
2015-06-06 | small fixes | Chris Dyer
2015-05-21 | deal with curly quotes | Chris Dyer
2015-04-14 | Parallel tokenization | mjdenkowski
2015-04-13 | Moses compatibility for tokenizer | Michael Denkowski
2015-01-08 | Stop BOMbs before they decrease quality | Kenneth Heafield
2014-12-29 | deal with eur symbol | Chris Dyer
2014-12-29 | finnish case markings | Chris Dyer
2014-12-29 | foo | Chris Dyer
2014-12-29 | foo | Chris Dyer
2014-12-29 | finnish abbrevs | Chris Dyer
2014-12-20 | Generalize to sample any number of dev sets | mjdenkowski
2014-12-19 | Sample dev and test sets with pseudo-documents | mjdenkowski
2014-10-25 | bit more info | Chris Dyer
2014-10-24 | conll2cdec conversion | Chris Dyer
2014-09-28 | add error message | Chris Dyer
2014-09-15 | migrate to new Cython version | Chris Dyer
2014-06-03 | fix for nonjoining chars | Chris Dyer
2014-04-02 | moses conversion script | Chris Dyer
2014-03-18 | chris edits | Chris Dyer
2014-03-12 | XML file tokenization for all your WMT needs. | mjdenkowski
2014-03-10 | few tokenization bugs | Chris Dyer
2014-02-27 | ptb to normal | Chris Dyer
2014-02-20 | Merge branch 'master' of https://github.com/redpony/cdec | armatthews
2014-02-20 | slight beautification and more sane ordering | armatthews
2014-02-15 | fix for missing angle quote form | Chris Dyer
2014-01-28 | smarter script for adding <s> and </s> markers | Chris Dyer
2014-01-23 | Reordered HTML entity blocks | armatthews
2014-01-23 | Merged quote-norm with Greg's WMT normalization script | armatthews
2014-01-20 | hindi months | Chris Dyer
2014-01-20 | deal with acronyms in hindi | Chris Dyer
2014-01-20 | hindi edits | Chris Dyer
2014-01-16 | moar hindi | Chris Dyer
2014-01-15 | deal with hindi | Chris Dyer
2013-12-12 | Restore unbuffered functionality as option | mjdenkowski
2013-11-11 | error on new macs | Chris Dyer
2013-09-11 | Use bash instead of sh | mjdenkowski
2013-09-05 | Slower but correct (wrt buffered) unbuffered version. | Michael Denkowski
2013-09-05 | Unbuffered mode, flush after each line where possible, skip otherwise | Michael Denkowski
2013-09-04 | Detokenizer | Michael Denkowski
2013-04-19 | Merge branch 'master' of https://github.com/redpony/cdec | Chris Dyer
2013-04-19 | hindi | Chris Dyer
2013-03-26 | swahili abbreviations | Chris Dyer
2013-03-17 | fix possible utf8 bug | Chris Dyer
2013-03-08 | Merge branch 'master' of https://github.com/redpony/cdec | Chris Dyer
2013-03-08 | few preproc fixes | Chris Dyer
2013-02-27 | quick fix | Chris Dyer
2013-02-23 | one missing quote type | Chris Dyer
2013-01-22 | russian abbrevs | Chris Dyer
2013-01-21 | tokenizer support for utf8 patterns | Chris Dyer