path: root/corpus
| Date | Commit message | Author |
|---|---|---|
| 2014-12-29 | foo | Chris Dyer |
| 2014-12-29 | foo | Chris Dyer |
| 2014-12-29 | finnish abbrevs | Chris Dyer |
| 2014-12-20 | Generalize to sample any number of dev sets | mjdenkowski |
| 2014-12-19 | Sample dev and test sets with pseudo-documents | mjdenkowski |
| 2014-10-25 | bit more info | Chris Dyer |
| 2014-10-24 | conll2cdec conversion | Chris Dyer |
| 2014-09-28 | add error message | Chris Dyer |
| 2014-09-15 | migrate to new Cython version | Chris Dyer |
| 2014-06-03 | fix for nonjoining chars | Chris Dyer |
| 2014-04-02 | moses conversion script | Chris Dyer |
| 2014-03-18 | chris edits | Chris Dyer |
| 2014-03-12 | XML file tokenization for all your WMT needs. | mjdenkowski |
| 2014-03-10 | few tokenization bugs | Chris Dyer |
| 2014-02-27 | ptb to normal | Chris Dyer |
| 2014-02-20 | Merge branch 'master' of https://github.com/redpony/cdec | armatthews |
| 2014-02-20 | slight beautification and more sane ordering | armatthews |
| 2014-02-15 | fix for missing angle quote form | Chris Dyer |
| 2014-01-28 | smarter script for adding <s> and </s> markers | Chris Dyer |
| 2014-01-23 | Reordered HTML entity blocks | armatthews |
| 2014-01-23 | Merged quote-norm with Greg's WMT normalization script | armatthews |
| 2014-01-20 | hindi months | Chris Dyer |
| 2014-01-20 | deal with acronyms in hindi | Chris Dyer |
| 2014-01-20 | hindi edits | Chris Dyer |
| 2014-01-16 | moar hindi | Chris Dyer |
| 2014-01-15 | deal with hindi | Chris Dyer |
| 2013-12-12 | Restore unbuffered functionality as option | mjdenkowski |
| 2013-11-11 | error on new macs | Chris Dyer |
| 2013-09-11 | Use bash instead of sh | mjdenkowski |
| 2013-09-05 | Slower but correct (wrt buffered) unbuffered version. | Michael Denkowski |
| 2013-09-05 | Unbuffered mode, flush after each line where possible, skip otherwise | Michael Denkowski |
| 2013-09-04 | Detokenizer | Michael Denkowski |
| 2013-04-19 | Merge branch 'master' of https://github.com/redpony/cdec | Chris Dyer |
| 2013-04-19 | hindi | Chris Dyer |
| 2013-03-26 | swahili abbreviations | Chris Dyer |
| 2013-03-17 | fix possible utf8 bug | Chris Dyer |
| 2013-03-08 | Merge branch 'master' of https://github.com/redpony/cdec | Chris Dyer |
| 2013-03-08 | few preproc fixes | Chris Dyer |
| 2013-02-27 | quick fix | Chris Dyer |
| 2013-02-23 | one missing quote type | Chris Dyer |
| 2013-01-22 | russian abbrevs | Chris Dyer |
| 2013-01-21 | tokenizer support for utf8 patterns | Chris Dyer |
| 2013-01-21 | a little bit of cleanup | Chris Dyer |
| 2013-01-20 | control max len | Chris Dyer |
| 2013-01-19 | updated version of boost.m4 and automatically build kenneth's LM builder | Chris Dyer |
| 2013-01-15 | corpus files | Chris Dyer |
| 2012-12-05 | slight tokenization bug fix | Chris Dyer |
| 2012-12-05 | remove logging, you should be using pv | Chris Dyer |
| 2012-12-04 | more flexible corpus cutting | Chris Dyer |
| 2012-11-16 | fix | Chris Dyer |