Run ./tokenize.sh to tokenize the input text.
Edit eng_token_patterns and eng_token_list to add rules for strings that should not be segmented (i.e. tokens that must be kept whole).
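The idea behind a "do not segment" list can be sketched as follows. This is a hypothetical illustration only: the actual formats of eng_token_patterns and eng_token_list are defined by tokenize.sh, and the regexes, function name, and tokenization rule below are assumptions made for the example.

```python
import re

# Hypothetical "do not segment" entries: one regex per string that
# must survive as a single token (not the real eng_token_patterns format).
protected_patterns = [
    r"U\.S\.A\.",     # abbreviation kept whole instead of U . S . A .
    r"\bdon't\b",     # contraction kept whole instead of don ' t
]

def tokenize(text):
    # Protect matches of the patterns above; split everything else
    # into word tokens and individual punctuation marks.
    combined = "|".join(f"({p})" for p in protected_patterns)
    tokens = []
    pos = 0
    for m in re.finditer(combined, text):
        tokens.extend(re.findall(r"\w+|[^\w\s]", text[pos:m.start()]))
        tokens.append(m.group(0))
        pos = m.end()
    tokens.extend(re.findall(r"\w+|[^\w\s]", text[pos:]))
    return tokens
```

Without the protected patterns, a naive splitter would break "U.S.A." and "don't" apart at every period and apostrophe; with them, each matched span is emitted as one token.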