
Lazily converting UTF-8 (or latin1) to UTF-16 as needed is indeed an old trick employed by many string classes.

It's even a bit surprising that a codebase as popular and performance-critical as SpiderMonkey hadn't picked up such a simple and high-yield optimization years ago.

By the way, other implementations are even lazier: the string is kept in its original encoding (UTF-8 or 7-bit ASCII) until someone calls a method requiring indexed access to characters, such as the subscript operator. Only at that point is it converted to UTF-16 for O(1) random access.
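For illustration, here is a minimal Python sketch of that idea (the class and method names are my own, not SpiderMonkey's): the original UTF-8 bytes are kept untouched, and the UTF-16 form is built and cached only on the first indexed access.

    class LazyString:
        """Keeps the original UTF-8 bytes; converts to UTF-16 only when indexed."""

        def __init__(self, utf8_bytes):
            self._utf8 = utf8_bytes   # original encoding, kept as-is
            self._utf16 = None        # filled in lazily on first indexed access

        def _utf16_units(self):
            if self._utf16 is None:
                # one-time conversion; every index after this is O(1)
                self._utf16 = self._utf8.decode('utf-8').encode('utf-16-le')
            return self._utf16

        def concat(self, other):
            # concatenation never needs the UTF-16 form
            return LazyString(self._utf8 + other._utf8)

        def __len__(self):
            # length in UTF-16 code units, matching JS semantics
            return len(self._utf16_units()) // 2

        def __getitem__(self, i):
            units = self._utf16_units()
            return units[2 * i:2 * i + 2].decode('utf-16-le', errors='surrogatepass')

    s = LazyString("héllo".encode('utf-8'))
    t = s.concat(s)   # still no UTF-16 buffer allocated
    print(t[1])       # first indexed access triggers the conversion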

Indexing characters in a localized string is rarely useful to applications and often denotes a bug (did they want the N-th grapheme, glyph or code-point?). It's best to use higher-level primitives for collating, concatenating and splitting localized text.
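A quick Python example of why "the N-th character" is ambiguous (Python, like most languages, indexes code points, not graphemes):

    import unicodedata

    # 'é' written as a base letter plus a combining accent: one grapheme, two code points
    s = "e\u0301"
    print(len(s))       # 2 code points
    print(s[0], s[1])   # 'e' and the bare combining acute accent
    # normalization-aware comparison, not indexing, is usually what you want:
    print(unicodedata.normalize('NFC', s) == "\u00e9")   # True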

Granted, a JavaScript interpreter must remain bug-by-bug compatible with existing code, thus precluding some of the most aggressive optimizations.




What do the lazily converting string classes do for characters that don't fit in UTF-16? Would they convert to UTF-32, or just fall back to an O(n) index?

Example: ☃


There are, by definition, no Unicode characters that don't fit in UTF-16.

UTF-16 has surrogate pairs; it's an extension of UCS-2, which doesn't.

Incidentally, this is why UTF-16 is a poor choice of character encoding: you pay twice the memory, yet you don't actually get O(1) indexing of code points; you only think you do, and then your program breaks badly when someone inputs a character that needs a surrogate pair.
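A small Python illustration of the problem, using U+1F600 (an emoji outside the BMP) as the example character:

    s = "a\U0001F600b"                 # 'a', the emoji, 'b': three code points
    units = s.encode('utf-16-le')
    print(len(s))                      # 3 code points
    print(len(units) // 2)             # 4 UTF-16 code units: the emoji needs a surrogate pair
    # A language that indexes UTF-16 code units (e.g. JavaScript) sees a lone
    # surrogate at index 1 instead of the full character.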

See also elsewhere in the thread: https://news.ycombinator.com/item?id=8066284


String classes rarely use UTF-16 because it doesn't have a fixed-length code point representation. UCS-2 is often used instead: it uses two bytes to represent every code point in the Basic Multilingual Plane (BMP), which is enough for 99.99% of use cases.

One example of this is Python, which used UCS-2 until version 3.3. There was a compile-time option to use UCS-4 instead, but UCS-2 was enough for most cases because the BMP contains the characters of essentially every language in current use.
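One way to see which variant a given CPython build uses (a small check, not part of the thread): sys.maxunicode is 0xFFFF on a narrow (UCS-2) build and 0x10FFFF on a wide (UCS-4) build or on 3.3+.

    import sys

    # 0xffff   -> narrow build (UCS-2; astral characters stored as surrogate pairs)
    # 0x10ffff -> wide build (UCS-4) or CPython >= 3.3 (PEP 393)
    print(hex(sys.maxunicode))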


Which encoding does Python use now?


PEP 393 introduced a flexible string representation that uses 1, 2 or 4 bytes per character, depending on the widest code point the string contains: http://legacy.python.org/dev/peps/pep-0393/
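You can observe the flexible representation with sys.getsizeof (exact byte counts include per-object overhead and vary by build, so only the growth pattern matters):

    import sys

    print(sys.getsizeof("aaaa"))           # 1 byte per character (ASCII/Latin-1)
    print(sys.getsizeof("aaa\u0151"))      # 2 bytes per character (BMP, beyond Latin-1)
    print(sys.getsizeof("aaa\U0001F600"))  # 4 bytes per character (astral code point)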




