
Lazily converting UTF-8 (or latin1) to UTF-16 as needed is indeed an old trick employed by many string classes.

It's even a bit surprising that a codebase as popular and performance-critical as SpiderMonkey hadn't picked up such a simple and high-yield optimization years ago.

By the way, other implementations are even lazier: the string is kept in its original encoding (UTF-8 or 7-bit ASCII) until someone calls a method requiring indexed access to characters, such as the subscript operator. Only at that point is it converted to UTF-16 for O(1) random access.
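For illustration, here is a minimal Python sketch of that idea (the class and method names are my own, not SpiderMonkey's): the original UTF-8 bytes are kept untouched, and the UTF-16 form is built and cached only on the first indexed access.

    class LazyString:
        """Keeps the original UTF-8 bytes; converts to UTF-16 only when indexed."""

        def __init__(self, utf8_bytes):
            self._utf8 = utf8_bytes   # original encoding, kept as-is
            self._utf16 = None        # filled in lazily on first indexed access

        def _utf16_units(self):
            if self._utf16 is None:
                # one-time conversion; every index after this is O(1)
                self._utf16 = self._utf8.decode('utf-8').encode('utf-16-le')
            return self._utf16

        def concat(self, other):
            # concatenation never needs the UTF-16 form
            return LazyString(self._utf8 + other._utf8)

        def __len__(self):
            # length in UTF-16 code units, matching JS semantics
            return len(self._utf16_units()) // 2

        def __getitem__(self, i):
            units = self._utf16_units()
            return units[2 * i:2 * i + 2].decode('utf-16-le', errors='surrogatepass')

    s = LazyString("héllo".encode('utf-8'))
    t = s.concat(s)   # still no UTF-16 buffer allocated
    print(t[1])       # first indexed access triggers the conversion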

Indexing characters in a localized string is rarely useful to applications and often denotes a bug (did they want the N-th grapheme, glyph or code-point?). It's best to use higher-level primitives for collating, concatenating and splitting localized text.
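A quick Python example of why "the N-th character" is ambiguous (Python, like most languages, indexes code points, not graphemes):

    import unicodedata

    # 'é' written as a base letter plus a combining accent: one grapheme, two code points
    s = "e\u0301"
    print(len(s))       # 2 code points
    print(s[0], s[1])   # 'e' and the bare combining acute accent
    # normalization-aware comparison, not indexing, is usually what you want:
    print(unicodedata.normalize('NFC', s) == "\u00e9")   # True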

Granted, a JavaScript interpreter must remain bug-by-bug compatible with existing code, thus precluding some of the most aggressive optimizations.




What do the lazily converting string classes do for characters that don't fit in UTF-16? Would they convert to UTF-32, or just fall back to an O(n) index?

Example: ☃


There are, by definition, no Unicode characters that don't fit in UTF-16.

UTF-16 has surrogate pairs; it's an extension of UCS-2, which doesn't.

Incidentally, this is why UTF-16 is a poor choice of character encoding: you pay twice the memory, yet you don't actually get O(1) indexing of code points; you only think you do, and then your program breaks badly when someone inputs a character that needs a surrogate pair.
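A small Python illustration of the problem, using U+1F600 (an emoji outside the BMP) as the example character:

    s = "a\U0001F600b"                 # 'a', the emoji, 'b': three code points
    units = s.encode('utf-16-le')
    print(len(s))                      # 3 code points
    print(len(units) // 2)             # 4 UTF-16 code units: the emoji needs a surrogate pair
    # A language that indexes UTF-16 code units (e.g. JavaScript) sees a lone
    # surrogate at index 1 instead of the full character.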

See also elsewhere in the thread: https://news.ycombinator.com/item?id=8066284


String classes rarely use UTF-16 because it doesn't have a fixed-length code point representation. UCS-2 is often used instead: it uses two bytes to represent every code point in the Basic Multilingual Plane (BMP), which is enough for 99.99% of use cases.

One example of this is Python, which used UCS-2 until version 3.3. There was a compile-time option to use UCS-4 instead, but UCS-2 was enough for most cases because the BMP contains the characters of essentially every language in current use.
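One way to see which variant a given CPython build uses (a small check, not part of the thread): sys.maxunicode is 0xFFFF on a narrow (UCS-2) build and 0x10FFFF on a wide (UCS-4) build or on 3.3+.

    import sys

    # 0xffff   -> narrow build (UCS-2; astral characters stored as surrogate pairs)
    # 0x10ffff -> wide build (UCS-4) or CPython >= 3.3 (PEP 393)
    print(hex(sys.maxunicode))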


Which encoding does Python use now?


PEP 393 introduced a flexible string representation that uses 1, 2 or 4 bytes per character, depending on the widest code point the string contains: http://legacy.python.org/dev/peps/pep-0393/
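You can observe the flexible representation with sys.getsizeof (exact byte counts include per-object overhead and vary by build, so only the growth pattern matters):

    import sys

    print(sys.getsizeof("aaaa"))           # 1 byte per character (ASCII/Latin-1)
    print(sys.getsizeof("aaa\u0151"))      # 2 bytes per character (BMP, beyond Latin-1)
    print(sys.getsizeof("aaa\U0001F600"))  # 4 bytes per character (astral code point)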




