I use NFKC form for scripts that seem to require it, such as Arabic, and NFC for others. If I used NFKC for English, for example, then encountering a brand name with a trademark sign on it would add the letters "tm" to the end of the word.
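For example, with Python's unicodedata (using a made-up brand name; after the tokenizer's case folding, that "TM" is what ends up as "tm"):

    import unicodedata

    word = "Spam\u2122"  # "Spam™" -- a made-up brand name ending in U+2122 TRADE MARK SIGN

    print(unicodedata.normalize("NFC", word))   # Spam™   (the sign is left alone)
    print(unicodedata.normalize("NFKC", word))  # SpamTM  (the compatibility mapping folds ™ into the letters "TM")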
In general I use tokenization rules that follow the Unicode standards in [UAX 29], with language-specific external libraries for Chinese, Japanese, and Korean, and with some language-specific tweaks to handle cases the Unicode Consortium didn't go into. [0]
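This isn't wordfreq's actual code path, just a sketch of what UAX #29 word segmentation looks like, assuming PyICU is installed:

    from icu import BreakIterator, Locale  # PyICU

    def uax29_words(text, locale="en"):
        """Yield the word-like segments of text according to UAX #29."""
        bi = BreakIterator.createWordInstance(Locale(locale))
        bi.setText(text)
        start = bi.first()
        for end in bi:  # iterating yields successive break offsets
            segment = text[start:end]
            start = end
            # UAX #29 also emits spaces and punctuation as segments; keep only word-like ones
            if any(ch.isalnum() for ch in segment):
                yield segment

    print(list(uax29_words("Don't panic, it's fine.")))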
I use Python 3 strings, and it's a peculiar bit of abstraction-busting to worry about what they look like inside the Python interpreter. It's only UTF-32 for strings that contain high codepoints. See [PEP 393], "Flexible String Representation".
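A quick way to see PEP 393 in action is sys.getsizeof (the fixed overhead varies across CPython versions; the point is the per-character width):

    import sys

    # CPython stores each string in the narrowest form that fits its widest code point:
    # 1, 2, or 4 bytes per character.
    for s in ("a" * 100,            # ASCII only: 1 byte per character
              "\u4e2d" * 100,       # BMP code point U+4E2D: 2 bytes per character
              "\U0001F600" * 100):  # astral code point U+1F600: 4 bytes per character
        print(len(s), sys.getsizeof(s))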
I don't think there is such a thing as "multi-character code points". At no point do I use UTF-16 (which encodes some code points as pairs of surrogate code units, and the surrogates are not characters), if that's what you're asking about.
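For example, a code point above the BMP is still a single item in a Python 3 str, even though UTF-16 would need a surrogate pair to represent it:

    s = "\U0001F600"                    # one code point (U+1F600)
    print(len(s))                       # 1 -- Python 3 strings are indexed by code point
    print(s.encode("utf-16-be").hex())  # d83dde00 -- two surrogate code units in UTF-16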
Thanks for the info. I'm looking at this from the perspective of designing a backend datastore and query engine for a knowledge system. The idea is to encode non-spatial data types into a spatial data structure (similar to Google's S2 Geometry Library [0]) that enables content-based addressing, for data fusion.
One idea is to make a lattice of Unicode characters that builds up to combinations of words a la Formal Concept Analysis [1]: on one level, the characters compose into words that represent properties (key/value pairs), and then the KV pairs compose into higher-level objects. Each property and higher-level object is encoded as an integer derived from its constituents, in such a way that the constituent objects/properties can be recovered algorithmically from the integer without having to traverse the structure [2] -- ANS encoding [3] embedded into a space with a VI metric (https://en.wikipedia.org/wiki/Variation_of_information) [4] might make this work. Have you played with this type of design?
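For concreteness, here's a toy sketch of the "recover the constituents from the integer without traversing the structure" part, using naive prime products as a stand-in for ANS (the property names and the use of sympy are purely illustrative):

    from sympy import prime, factorint

    # Assign each atomic property a distinct prime; a composite object is the
    # product of its constituents' codes, so the constituents can be recovered
    # by factoring the integer, with no structure traversal needed.
    PROPERTY_CODES = {"color=red": prime(1), "shape=cube": prime(2), "size=large": prime(3)}

    def encode(properties):
        code = 1
        for p in properties:
            code *= PROPERTY_CODES[p]
        return code

    def decode(code):
        inverse = {v: k for k, v in PROPERTY_CODES.items()}
        return [inverse[p] for p in sorted(factorint(code))]

    obj = encode(["color=red", "size=large"])
    print(obj, decode(obj))  # 10 ['color=red', 'size=large']

Higher-level objects could get their own codes and nest the same way; a real design would swap the prime products for ANS and embed the codes in a VI-metric space, as described above.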
[0] https://github.com/LuminosoInsight/wordfreq/blob/master/word...
[UAX 29] http://unicode.org/reports/tr29/ (fixed link)
[PEP 393] https://www.python.org/dev/peps/pep-0393/