UTF-8 has no upper limit on the number of possible characters / emoji's, now or ...

doubleunplussed · on Feb 13, 2023

That's not quite right. UTF-8 is not arbitrary length.

Officially, it's at most four bytes, of which 21 bits are usable for encoding codepoints, so the format itself tops out at 2^21 values (and Unicode further caps codepoints at U+10FFFF).
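A quick sketch of that arithmetic in Python, assuming the usual layout: a single byte carries 7 bits, the lead byte of a longer sequence carries 7 - n bits, and each trailing byte carries 6. The function name `standard_payload_bits` is made up for illustration.

```python
def standard_payload_bits(n: int) -> int:
    """Data bits in an n-byte standard UTF-8 sequence (n = 1..4)."""
    if not 1 <= n <= 4:
        raise ValueError("standard UTF-8 sequences are 1 to 4 bytes long")
    if n == 1:
        return 7                    # 0xxxxxxx
    lead_bits = 7 - n               # 110xxxxx -> 5, 1110xxxx -> 4, 11110xxx -> 3
    return lead_bits + 6 * (n - 1)  # each 10xxxxxx trailer adds 6 bits


for n in range(1, 5):
    print(n, standard_payload_bits(n))

# U+10FFFF is the highest codepoint Unicode allows; it needs all four bytes.
encoded = chr(0x10FFFF).encode("utf-8")
print(len(encoded), encoded.hex())
```

The four-byte form has room for 21 bits, but the highest valid codepoint, U+10FFFF, sits below 2^21, so not every 21-bit value is a legal codepoint.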

There is an initial byte encoding the length as a series of ones, so if you went ahead and extended the standard to simply allow more bytes, you could get up to 8 bytes, of which 48 bits would be usable.

I can see that a six-byte version with 31 data bits was previously standardised before they settled on four.

I guess you could extend it further by allowing more than one initial byte to encode the length, which would make it arbitrary length. But then I'm not sure whether it keeps its self-synchronising ability, and in any case it would be a different standard.

bhaney · on Feb 13, 2023

> if you went ahead and extended the standard to simply allow more bytes, you could get up to 8 bytes

I think you'd only be able to go up to 7, since 10xxxxxx is still reserved for trailing octets. And even with 7, the entire first octet is consumed by the length indicator alone.

So you get 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx, 111110xx, 1111110x, and 11111110 as the 7 different length-indicating head octets. In the last case, you'd have 36 usable bits for encoding a codepoint.
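The head-octet list and the 36-bit figure can be reproduced with a short Python sketch. It models only the hypothetical extension discussed here (not any real standard), and the helper names `head_octet_pattern` and `payload_bits` are invented for the example.

```python
def head_octet_pattern(n: int) -> str:
    """Bit pattern of the head octet for an n-octet sequence (n = 1..7)."""
    if not 1 <= n <= 7:
        raise ValueError("this sketch covers 1 to 7 octets")
    if n == 1:
        return "0" + "x" * 7
    prefix = "1" * n + "0"  # n ones, then a terminating zero
    return prefix + "x" * (8 - len(prefix))


def payload_bits(n: int) -> int:
    """Usable bits: the x's in the head octet plus 6 per trailing octet."""
    return head_octet_pattern(n).count("x") + 6 * (n - 1)


for n in range(1, 8):
    print(n, head_octet_pattern(n), payload_bits(n))
```

For n = 7 the head octet has no x's left, so all 36 bits come from the six trailing octets.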

doubleunplussed · on Feb 14, 2023

Ah, I forgot 10xxxxxx was not usable, but I also forgot 0xxxxxxx was. What about 11111111? If that's valid then it's 8, if I'm thinking straight.

bhaney · on Feb 14, 2023

11111111 is technically possible to use, but it would cause some problems. Sending it over the wire would break telnet, for example. Also, since we already introduced 11111110 for 7-byte encodings, we're getting dangerously close to making the UTF-16 BOM character (11111111 11111110) accidentally show up in UTF-8 (this is also why 11111110 wasn't in the original maximum-6-byte UTF-8 spec). Even so, I don't think it's possible to have the UTF-16 BOM show up in our hypothetical extended UTF-8, since 11111111 could never be immediately followed by 11111110 (or vice versa) in a well-formed UTF-8 stream.
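The adjacency argument can be sketched in Python. This is a necessary-condition check for the hypothetical extended scheme only, not a full validator: it ignores sequence lengths and overlong forms, and the names `kind` and `can_be_adjacent` are made up here.

```python
def kind(byte: int) -> str:
    """Classify a byte in the hypothetical extended scheme."""
    if byte < 0x80:
        return "ascii"  # 0xxxxxxx, a complete one-octet sequence
    if byte < 0xC0:
        return "trail"  # 10xxxxxx
    return "head"       # 11xxxxxx, including 0xFE and 0xFF


def can_be_adjacent(a: int, b: int) -> bool:
    """Could byte b directly follow byte a in a well-formed stream?"""
    ka, kb = kind(a), kind(b)
    if ka == "head":
        return kb == "trail"  # every multi-octet head needs a trailer next
    if ka == "ascii":
        return kb != "trail"  # a trailer cannot start a new sequence
    return True               # after a trailer: more trailers or a new sequence


print(can_be_adjacent(0xFF, 0xFE))
print(can_be_adjacent(0xFE, 0xFF))
```

Both 0xFE and 0xFF would be head octets, and a head octet must be followed by a trailing octet, so neither order of the pair passes the check.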

Also note that if you did add 11111111 as a valid head octet representing an 8-octet-long encoding, you'd still only have 42 usable bits (since the first byte is still entirely consumed by the length indicator).
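To make the 42-bit figure concrete, here is a minimal Python sketch of what such an 8-octet form might look like: 0xFF followed by seven 10xxxxxx octets. This encoding is purely hypothetical, and `encode_8_octet` / `decode_8_octet` are illustrative names, not part of any library.

```python
def encode_8_octet(value: int) -> bytes:
    """Encode value as 0xFF followed by seven 10xxxxxx octets (hypothetical)."""
    if not 0 <= value < (1 << 42):
        raise ValueError("value does not fit in 42 bits")
    trailers = []
    for _ in range(7):
        trailers.append(0x80 | (value & 0x3F))  # low 6 bits -> 10xxxxxx
        value >>= 6
    return bytes([0xFF] + trailers[::-1])       # most significant group first


def decode_8_octet(data: bytes) -> int:
    """Inverse of encode_8_octet; rejects anything not in that exact shape."""
    if len(data) != 8 or data[0] != 0xFF:
        raise ValueError("expected 0xFF plus seven trailing octets")
    value = 0
    for octet in data[1:]:
        if (octet & 0xC0) != 0x80:
            raise ValueError("trailing octet must look like 10xxxxxx")
        value = (value << 6) | (octet & 0x3F)
    return value


largest = (1 << 42) - 1
encoded = encode_8_octet(largest)
print(encoded.hex(), decode_8_octet(encoded) == largest)
```

The head octet contributes no data bits at all, so the capacity is exactly 7 × 6 = 42 bits, and `encode_8_octet(1 << 42)` raises `ValueError`.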