Lazily converting UTF-8 (or latin1) to UTF-16 as needed is indeed an old trick employed by many string classes.
It's even a bit surprising that a codebase as popular and performance-critical as SpiderMonkey hadn't picked up such a simple, high-yield optimization years ago.
By the way, other implementations are even lazier: the string is kept in its original encoding (UTF-8 or 7-bit ASCII) until someone calls a method requiring indexed access to characters, such as the subscript operator. Only at that point is it converted to UTF-16 for O(1) random access.
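A minimal sketch of that trick in TypeScript (the class and method names here are illustrative, not SpiderMonkey's actual internals): the raw UTF-8 bytes are kept around, and the first indexed access pays for a one-time decode.

```ts
// Sketch of a lazily-converted string: store UTF-8, decode to a UTF-16
// backed JS string only when someone asks for indexed access.
class LazyString {
  private utf16: string | null = null; // filled in on first indexed access

  constructor(private readonly utf8: Uint8Array) {}

  // Cheap queries can be answered from the raw bytes.
  byteLength(): number {
    return this.utf8.length;
  }

  // The first call pays the O(n) decode; later calls are O(1).
  charAt(i: number): string {
    if (this.utf16 === null) {
      this.utf16 = new TextDecoder("utf-8").decode(this.utf8);
    }
    return this.utf16.charAt(i);
  }
}

const s = new LazyString(new TextEncoder().encode("héllo"));
console.log(s.byteLength()); // 6 -- no decode has happened yet
console.log(s.charAt(1));    // "é" -- triggers the one-time conversion
```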
Indexing characters in a localized string is rarely useful to applications and often indicates a bug (did they want the N-th grapheme, glyph, or code point?). It's best to use higher-level primitives for collating, concatenating, and splitting localized text.
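To make the ambiguity concrete, the same short text has three different "lengths" depending on what you count (Intl.Segmenter is available in Node 16+ and current browsers):

```ts
// "é" written as e + combining acute accent, followed by a flag emoji
// (two regional-indicator code points rendered as one symbol).
const text = "e\u0301\u{1F1EB}\u{1F1F7}";

console.log(text.length);       // 6 -- UTF-16 code units
console.log([...text].length);  // 4 -- Unicode code points

const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...seg.segment(text)].length); // 2 -- grapheme clusters
```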
Granted, a JavaScript interpreter must remain bug-by-bug compatible with existing code, thus precluding some of the most aggressive optimizations.
What do the lazily converting string classes do for characters that don't fit in UTF-16? Would they convert to UTF-32, or just fall back to an O(n) index?
There are, by definition, no Unicode characters that don't fit in UTF-16.
UTF-16 has surrogate pairs; it's an extension of UCS-2, which doesn't.
Incidentally, this is why UTF-16 is a poor choice for a character encoding: you pay twice the memory but you don't actually get O(1) indexing. You only think you do, and then your program breaks badly the first time someone inputs a character outside the BMP and it arrives as a surrogate pair.
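A quick illustration of that failure mode in TypeScript, using a code point outside the BMP:

```ts
// One astral code point: U+1D11E MUSICAL SYMBOL G CLEF.
const clef = "\u{1D11E}";

console.log(clef.length);                     // 2 -- code units, not characters
console.log(clef.charCodeAt(0).toString(16)); // "d834" -- a lone high surrogate
console.log(clef.slice(0, 1));                // garbage: half a surrogate pair
console.log([...clef].length);                // 1 -- counting by code point
```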
String classes rarely use UTF-16 because it doesn't have a fixed-length code point representation. UCS-2 is often used instead: it represents every code point in the Basic Multilingual Plane (BMP) in two bytes, which is enough for 99.99% of use cases.
One example of this is Python, which used UCS-2 until version 3.3. There was a compile-time option to use UCS-4 instead, but UCS-2 was enough for most cases because the BMP contains the characters of essentially every language in current use.
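As a sketch of what "fits in UCS-2" means in practice, here's a hypothetical TypeScript helper that tests whether every code point of a string lies in the BMP (i.e. at or below U+FFFF):

```ts
// True if the string can be stored losslessly with one 16-bit unit per
// code point, i.e. no surrogate pairs would be needed.
function fitsInUcs2(s: string): boolean {
  for (const ch of s) {                        // iterates by code point
    if (ch.codePointAt(0)! > 0xffff) return false;
  }
  return true;
}

console.log(fitsInUcs2("héllo"));     // true  -- all BMP
console.log(fitsInUcs2("\u{1D11E}")); // false -- outside the BMP
```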