It is you who does not know what you're talking about:
> UTF-8 has never been a 3 byte encoding
I never claimed that
> There's nothing recent about it
The non-BMP characters are "recent" because ten years ago the non-BMP planes were barely allocated, apart from a few small areas. Also, I said it "became popular recently", due to emoji; before that, non-BMP codepoints were rarely used.
> I don't think you can name a single file system that restricted UTF-8 to three bytes.
WAFL[1] (unless you "format" it as utf8mb4, which was only implemented a few years ago...)
Okay, perhaps a more detailed breakdown of the false claims will help, so that nobody is misled by revisionist/apologist history.
> Unicode was assumed to be 64k of codepoints
Not in 2002, when MySQL restricted their utf8 to three bytes [1]. Already before 1997, Unicode clearly specified that 21 bits was the limit [2]. By 2002, there were 94,205 characters (already more than the 65,536 code points that 16 bits can address), including CJK characters beyond the 16-bit range, and clearly more to come. [3]
> so a 3-byte UTF-8 sequence was considered "long enough"
Not by many. The MySQL developers chose very badly here. I and plenty of other developers managed to implement UTF-8 properly around that time. It wasn't hard, as the specs are very straightforward.
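To show how little is involved, here is a minimal Python sketch (purely illustrative, not MySQL's actual code): covering the full 21-bit range is just one extra branch beyond the three-byte case.

    # Minimal full-range UTF-8 encoder (illustrative sketch only).
    def utf8_encode(cp: int) -> bytes:
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value")
        if cp < 0x80:        # 1 byte:  U+0000..U+007F
            return bytes([cp])
        if cp < 0x800:       # 2 bytes: U+0080..U+07FF
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp < 0x10000:     # 3 bytes: U+0800..U+FFFF (where MySQL's "utf8" stops)
            return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
        # 4 bytes: U+10000..U+10FFFF (the supplementary planes, e.g. emoji)
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

    assert utf8_encode(0x1F600) == "\U0001F600".encode("utf-8")  # b'\xf0\x9f\x98\x80'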
> especially since there were surrogate pairs for the rare cases where you have to encode higher code points
Surrogate pairs have never been supported in UTF-8. The RFCs are explicit about that. [4] [5] (search for D800)
Maybe you're thinking of CESU-8, though that's not intended for interchange. [6]
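To make the difference concrete, a small Python illustration (mine, not from any spec): a non-BMP character is a single four-byte sequence in UTF-8, whereas the CESU-8-style encoding of its surrogate pair is six bytes that a conforming UTF-8 decoder must reject.

    # U+1F600 in genuine UTF-8 vs CESU-8-style surrogate bytes (illustration).
    s = "\U0001F600"                     # a non-BMP code point (emoji)
    print(s.encode("utf-8"))             # b'\xf0\x9f\x98\x80' -- one 4-byte sequence
    cesu8 = b"\xed\xa0\xbd\xed\xb8\x80"  # surrogates D83D/DE00, each encoded as 3 bytes
    try:
        cesu8.decode("utf-8")
    except UnicodeDecodeError as err:
        print("rejected:", err)          # strict UTF-8 decoders refuse surrogate halves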
> Only "recently" have longer UTF-8 sequences (aka. emojis) become widespread enough that this became a problem.
Not supporting Unicode properly has always been problematic; it's just that bug reports from affected users rarely reached the right people. Emojis have done the world a favour in making less competent developers actually notice their bugs in basic text handling.
> Yes, it could have been avoided
And was, by most developers.
> they probably just wanted to optimize a bit.
They apparently altered a config number [1], so it wasn't an optimization decision; the code at the time still had support for 6-byte utf8 [7]. I would guess that they found a bug in their handling or conversion of longer UTF-8 sequences and took the hacky way out.
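For anyone puzzled by the 6-byte remark: the original UTF-8 definition (RFC 2279) allowed sequences of up to six bytes, covering 31-bit code points; RFC 3629 (2003) later restricted it to four bytes and U+10FFFF. Roughly:

    bytes  lead byte   code points
    1      0xxxxxxx    U+0000    .. U+007F
    2      110xxxxx    U+0080    .. U+07FF
    3      1110xxxx    U+0800    .. U+FFFF
    4      11110xxx    U+10000   .. U+1FFFFF   (RFC 3629 caps this at U+10FFFF)
    5      111110xx    U+200000  .. U+3FFFFFF  (RFC 2279 only; invalid today)
    6      1111110x    U+4000000 .. U+7FFFFFFF (RFC 2279 only; invalid today)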
[1] https://docs.netapp.com/ontap-9/topic/com.netapp.doc.cdot-fa...