
It is you who doesn't know what you're talking about:

> UTF-8 has never been a 3 byte encoding

I never claimed that

> There's nothing recent about it

Non-BMP characters are "recent" because 10 years ago the planes beyond the BMP were unallocated except for a few small areas. I also said it "became popular recently", due to emoji. Before that, non-BMP code points were rarely used.

> I don't think you can name a single file system that restricted UTF-8 to three bytes.

WAFL[1] (unless you "format" it as utf8mb4, which was only implemented a few years ago...)

[1] https://docs.netapp.com/ontap-9/topic/com.netapp.doc.cdot-fa...



Okay, perhaps a more detailed breakdown of the false claims will help, so that nobody is misled by revisionist/apologist history.

> Unicode was assumed to be 64k of codepoints

Not in 2002, when MySQL restricted their utf8 to three bytes [1]. Unicode had clearly specified a 21-bit limit since before 1997 [2]. By 2002, there were 94,205 characters, including CJK characters beyond the 16-bit range, and clearly more to come [3].
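To make the size argument concrete, here is a minimal sketch of how many bytes UTF-8 needs per code point under the RFC 3629 rules. The function name `utf8_length` is my own, purely for illustration; the loop at the end cross-checks it against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed to encode one code point in UTF-8 (RFC 3629 rules).

    utf8_length is an illustrative name, not part of any library.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable in UTF-8")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3  # everything in the BMP fits in three bytes
    return 4      # anything beyond the 16-bit range needs four


# Cross-check against Python's own UTF-8 encoder.
samples = ["A", "\u00e9", "\u4e2d", "\U00020000", "\U0001F600"]
for ch in samples:
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
```

U+20000 is a CJK character outside the 16-bit range, so a three-byte cap cannot store it, emoji or no emoji.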

> so a 3-byte UTF-8 sequence was considered "long enough"

Not by many. The MySQL developers chose very badly here. I and plenty of other developers managed to implement UTF-8 more correctly around that time. It wasn't hard, as the specs are very straightforward.

> especially since there were surrogate pairs for the rare cases where you have to encode higher code points

Surrogate pairs have never been supported in UTF-8. The RFCs are explicit about that. [4] [5] (search for D800)

Maybe you're thinking of CESU-8, though that's not intended for interchange. [6]
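A quick way to see the difference, as a rough sketch: Python's strict UTF-8 codec refuses surrogates in both directions, while a CESU-8 style encoding writes each UTF-16 code unit as its own three-byte sequence. `cesu8_encode` is a hypothetical helper written for this comment, not a standard library function, and it is only meant for illustration.

```python
def cesu8_encode(text: str) -> bytes:
    """Illustrative CESU-8 style encoder (hypothetical helper).

    Each UTF-16 code unit, surrogates included, is encoded as if it
    were a code point of its own.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets the codec emit bytes for a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


emoji = "\U0001F600"  # a non-BMP code point

# Real UTF-8: one four-byte sequence.
assert emoji.encode("utf-8") == b"\xf0\x9f\x98\x80"

# CESU-8 style: two three-byte sequences, one per surrogate.
cesu = cesu8_encode(emoji)
assert cesu == b"\xed\xa0\xbd\xed\xb8\x80"

# A strict UTF-8 decoder rejects those bytes.
try:
    cesu.decode("utf-8")
except UnicodeDecodeError:
    print("surrogate bytes rejected by the UTF-8 decoder")

# And a strict encoder rejects a lone surrogate code point.
try:
    "\ud83d".encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected by the UTF-8 encoder")
```

So a three-byte limit plus surrogates gives you six bytes of something that is not UTF-8, and a conforming decoder on the other end will not accept it.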

> Only "recently" have longer UTF-8 sequences (aka. emojis) become widespread enough that this became a problem.

Not supporting Unicode properly has always been problematic; it's just that bug reports from affected users rarely reached the right people. Emojis have done the world a favour in making less competent developers actually notice their bugs in basic text handling.

> Yes, it could have been avoided

And was, by most developers.

> they probably just wanted to optimize a bit.

They apparently altered a config number [1], so it wasn't an optimization decision; the code at the time still had support for 6-byte utf8 [7]. I would guess that they found a bug in their support for longer UTF-8 sequences or their conversion, and took the hacky way out.

[1] https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e...

[2] https://unicode.org/faq/utf_bom.html

[3] https://en.wikibooks.org/wiki/Unicode/Versions

[4] https://datatracker.ietf.org/doc/html/rfc2044

[5] https://datatracker.ietf.org/doc/html/rfc3629

[6] https://www.unicode.org/reports/tr26/tr26-4.html

[7] https://github.com/mysql/mysql-server/blob/43a506c0ced0e6ea1...



