
I use NFKC form for scripts that seem to require it, such as Arabic, and NFC for others. If I used NFKC for English, for example, then encountering a brand name with a trademark sign on it would append the letters "TM" to the end of the word.
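As a rough illustration of that difference, here is a minimal sketch using Python's standard `unicodedata` module. The `NFKC_LANGUAGES` set and the `normalize_for_language` helper are hypothetical names made up for this example, not wordfreq's actual API, and the choice of "ar" as the only NFKC language is an assumption based on the sentence above.

```python
import unicodedata

# Hypothetical: language codes whose text gets compatibility normalization.
NFKC_LANGUAGES = {"ar"}

def normalize_for_language(text, lang):
    """Apply NFKC for languages listed above, NFC for everything else."""
    form = "NFKC" if lang in NFKC_LANGUAGES else "NFC"
    return unicodedata.normalize(form, text)

brand = "Acme\u2122"  # "Acme" followed by the trademark sign U+2122

# NFC leaves the trademark sign alone.
print(normalize_for_language(brand, "en") == brand)

# NFKC expands it to the letters "TM" (uppercase, before any case folding).
print(unicodedata.normalize("NFKC", brand))

# An Arabic presentation-form ligature (U+FEFB) becomes plain lam + alef.
print([hex(ord(c)) for c in normalize_for_language("\ufefb", "ar")])
```

The Arabic case shows why NFKC can be useful there: presentation forms collapse to the ordinary letters, so the same word is not counted under several spellings.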

In general I use tokenization rules that follow the Unicode standards in [UAX 29], with language-specific external libraries for Chinese, Japanese, and Korean, and with some language-specific tweaks to handle cases the Unicode Consortium didn't go into. [0]

I use Python 3 strings, and it's a peculiar bit of abstraction-busting to worry about what they look like inside the Python interpreter. CPython only stores a string as 4 bytes per code point (UTF-32, in effect) when it contains code points above U+FFFF; otherwise it uses 1 or 2 bytes per code point. See [PEP 393], "Flexible String Representation".
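If you do want to peek behind the abstraction, something like the following sketch shows the storage width changing with the highest code point in the string. It relies on `sys.getsizeof`, so it is a CPython implementation detail (3.3 and later) rather than anything the language guarantees; `bytes_per_code_point` is just a name invented for this example.

```python
import sys

def bytes_per_code_point(ch):
    """Estimate how many bytes CPython spends per code point for strings
    made only of `ch`, by comparing two strings of different lengths so
    the fixed object header cancels out."""
    return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000

for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    print("U+%04X" % ord(ch), bytes_per_code_point(ch))
```

On CPython this should report 1 byte for the ASCII and Latin-1 characters, 2 for the euro sign, and 4 for the emoji.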

I don't think there is such a thing as "multi-character code points". At no point do I use UTF-16 (which encodes each code point above U+FFFF as a pair of surrogate code units, which are not characters on their own), if that's what you're asking about.
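To make the code point versus code unit distinction concrete, here is a small sketch. `utf16_code_units` is a helper written for this example; it only encodes to UTF-16 to inspect the result, which is not something the pipeline described above does.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of `text` when encoded as UTF-16."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

emoji = "\U0001F600"

# A Python 3 string is a sequence of code points: this is one of them.
print(len(emoji))

# In UTF-16 the same code point takes two surrogate code units.
print([hex(u) for u in utf16_code_units(emoji)])
```

The two values printed in the last line fall in the surrogate ranges (0xD800 to 0xDBFF and 0xDC00 to 0xDFFF), and neither is a character by itself.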

[0] https://github.com/LuminosoInsight/wordfreq/blob/master/word...

[UAX 29] http://unicode.org/reports/tr29/ (fixed link)

[PEP 393] https://www.python.org/dev/peps/pep-0393/




Thanks for the info. I'm looking at this from the perspective of designing a backend datastore and query engine for a knowledge system. The idea is to encode a spatial data structure (similar to Google's S2 Geometry Library [0]) that enables content-based addressing of non-spatial data types for data fusion.

One idea is to make a lattice of Unicode characters that builds up to combinations of words, a la Formal Concept Analysis [1]. On one level, the characters compose into words that represent properties (key/value pairs), and the KV pairs then compose into higher-level objects. Each property and higher-level object is encoded as an integer derived from its constituent objects/properties, in such a way that those constituents can be determined algorithmically from the integer without having to traverse the structure [2]. ANS encoding [3] embedded into a space with a VI metric (https://en.wikipedia.org/wiki/Variation_of_information) [4] might make this work. Have you played with this type of design?
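To show the "recover the constituents from the integer" property in the simplest possible form, here is a toy sketch. It is not ANS and not necessarily the scheme in [2]: it swaps in a plain prime-product encoding, where each property gets its own prime and an object is the product of its properties' primes. `PropertyCodec` and all of its method names are invented for this example.

```python
from itertools import count
from math import gcd

def primes():
    """Yield primes in order by trial division (fine for a toy example)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

class PropertyCodec:
    """Toy encoder: one prime per property, one product per object."""

    def __init__(self):
        self._primes = primes()
        self._code = {}

    def code(self, prop):
        # Assign the next unused prime the first time a property is seen.
        if prop not in self._code:
            self._code[prop] = next(self._primes)
        return self._code[prop]

    def encode(self, props):
        n = 1
        for prop in sorted(set(props)):
            n *= self.code(prop)
        return n

    def contains(self, obj_code, prop):
        # Membership is a divisibility check; no structure is traversed.
        c = self._code.get(prop)
        return c is not None and obj_code % c == 0

    def decode(self, obj_code):
        return {p for p, c in self._code.items() if obj_code % c == 0}

codec = PropertyCodec()
apple = codec.encode([("color", "red"), ("shape", "round")])
truck = codec.encode([("color", "red"), ("size", "large")])

print(codec.decode(apple))
# The gcd of two object codes encodes their shared properties,
# which is the lattice "meet" in Formal Concept Analysis terms.
print(codec.decode(gcd(apple, truck)))
```

The integers grow quickly as properties accumulate, which is presumably part of the appeal of a more compact entropy-coded scheme such as ANS.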

[0] http://blog.christianperone.com/2015/08/googles-s2-geometry-...

[1] https://www.youtube.com/watch?v=Xuxm929tIRY

[2] http://www09.sigmod.org/sigmod/record/issues/0506/p47-articl...

[3] https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems

[4] http://www.sciencedirect.com/science/article/pii/S0047259X06...



