
When I want to segment words in some language, I usually check what Apache Lucene does. In this case, the Thai tokenizer [1] simply uses java.text.BreakIterator [2] and hopes that Thai is supported.

[1] https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=...

[2] https://docs.oracle.com/javase/10/docs/api/java/text/BreakIt...
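For anyone who wants to try the same idea outside Lucene, here is a minimal sketch that calls java.text.BreakIterator directly. The class name ThaiSegmenter and the segment helper are my own illustration, not Lucene's code, and the sample string is just an example. Whether the Thai text actually splits at real word boundaries depends on the locale data shipped with your JDK, so treat the output as something to check rather than assume.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration; not part of Lucene or the JDK.
class ThaiSegmenter {

    // Splits text at the word boundaries reported by BreakIterator
    // and drops segments that contain only whitespace.
    static List<String> segment(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                words.add(piece);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Locale thai = Locale.forLanguageTag("th");
        // Prints whatever segments this JDK's Thai support produces.
        System.out.println(segment("สวัสดีครับ", thai));
    }
}
```

If the JDK has no Thai dictionary available, the iterator may return the whole string as one segment, which is the "hopes that Thai is supported" part.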




Fantastic tip, thanks



