yorwba on Sept 5, 2018 | on: “Prestudy”: Learning Chinese Through Reading

When I want to segment words in some language, I usually check what Apache Lucene does. In this case, the Thai tokenizer [1] simply uses java.text.BreakIterator [2] and hopes that Thai is supported.

[1] https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=...

[2] https://docs.oracle.com/javase/10/docs/api/java/text/BreakIt...
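
For anyone who wants to try the same approach outside Lucene, here is a minimal sketch using only java.text.BreakIterator. The class name and the `segment` helper are made up for illustration and this is not Lucene's actual code. Whether the Thai text is split into words depends on the Thai support in the JDK you run it on, so I am not promising any particular output for that line.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorSketch {
    // Hypothetical helper: returns the word-like spans BreakIterator finds in text.
    static List<String> segment(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // Keep spans that begin with a letter or digit; skip spaces and punctuation.
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Sample Thai text ("Thai language"); the result depends on the JDK's Thai support.
        System.out.println(segment("ภาษาไทย", Locale.forLanguageTag("th")));
        // Space-delimited text works with any locale data.
        System.out.println(segment("hello world", Locale.ENGLISH));
    }
}
```

The filter on the first character is just a simple way to drop whitespace and punctuation spans, which BreakIterator reports as tokens of their own.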

peteretep on Sept 6, 2018

Fantastic tip, thanks