Does this even fully fix the problem? It looks like utf8mb4 is limited to 4 byte...

jerf · on Jan 12, 2022

Flags are not single code points. UTF-8 refers to how code points are stored. If you look in your link at "Hex Code Point(s)", it is that first one that would be a problem with "utf8" in MySQL, because of the 1 in the 0x10000 position. The other six code points required would fit in fine.

Unicode is developing more and more things that require code points. I'm not sure what the longest legal non-redundant series of code points that can validly represent a glyph somewhere is, but it's getting up there with all the emoji skin modifiers and such.

jeroenhd · on Jan 12, 2022

Emoji modifiers, both for gender and skintone, do not produce that many extra code points. Combinations are made in the same way ¨ + e can combine into ë: a code point, followed by a combination code point, followed by a modifier. During text rendering, these code points are converted back into a single glyph.
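A minimal sketch of the ¨ + e case, assuming Python's standard `unicodedata` module (not something the comment itself uses): the decomposed form is two code points, and NFC normalization composes them into one.

```python
import unicodedata

# "e" followed by U+0308 COMBINING DIAERESIS: two code points, one visible glyph
decomposed = "e\u0308"

# NFC normalization composes the pair into the single code point U+00EB
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(f"U+{ord(composed):04X}")
```

Emoji sequences generally have no precomposed form like this, so they stay as several code points in storage and are only merged into one glyph at render time.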

All Unicode code points in use today can be encoded in four bytes of UTF-8. Theoretically the UTF-8 scheme could be extended to 6-byte sequences if that ever becomes necessary, but it won't be for a while. Crossing the 4-byte boundary would also introduce compatibility issues with UTF-16, so I'm sure the Unicode Consortium will do their best to prevent this from happening as long as they can.

jerf · on Jan 12, 2022

I missed a word, sorry. I was idly musing about the longest legal code point sequences.

jeroenhd · on Jan 13, 2022

Aaaah, that makes sense. I think the flag of Scotland (🏴󠁧󠁢󠁳󠁣󠁴󠁿) is the longest usable one I've seen, but you could stack a near-infinite number of items in Zalgo form on top of normal letters if you count those. I don't think Unicode has any restrictions on the number of combining characters, though most text parsers will probably enforce some kind of limit.

jeroenhd · on Jan 12, 2022

Is this a problem? Flags in unicode are defined by several special characters. The flag of Scotland isn't really a single character, it's "<waving flag><tag g><tag b><tag s><tag c><tag t><cancel>".

All of these characters are multi byte combinations. The hex for the flag is not a single, super wide character, it's 0xF09F8FB4 0xF3A081A7 0xF3A081A2 0xF3A081B3 0xF3A081A3 0xF3A081B4 0xF3A081BF. You might get some weird results if you take substrings from that, but it won't be a problem for the backing database store; each separate "binary character" is a four byte sequence (as denoted by the 0xF at the front of the number).
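The byte sequences above can be checked with a short sketch. The code point values are the ones for the sequence described in the comment (waving black flag, five tag letters, cancel tag); the helper name `utf8_units` is made up for illustration.

```python
# Flag of Scotland spelled out as explicit code points:
# waving black flag, tag g, tag b, tag s, tag c, tag t, cancel tag
SCOTLAND_FLAG = (
    "\U0001F3F4\U000E0067\U000E0062\U000E0073"
    "\U000E0063\U000E0074\U000E007F"
)

def utf8_units(text):
    """Return one (code point label, UTF-8 bytes as hex) pair per code point."""
    return [
        (f"U+{ord(ch):04X}", ch.encode("utf-8").hex().upper())
        for ch in text
    ]

units = utf8_units(SCOTLAND_FLAG)
for label, hex_bytes in units:
    print(label, hex_bytes)

# 7 code points, 4 bytes each, 28 bytes in total
print(len(SCOTLAND_FLAG), len(SCOTLAND_FLAG.encode("utf-8")))
```

Every code point here sits above U+FFFF, so each one needs a 4-byte UTF-8 sequence, but none needs more than that.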

warpspin · on Jan 12, 2022

The biggest code point in Unicode fits into 4 bytes of UTF-8. The original UTF-8 design would allow up to 6 bytes, but those code points are not in use currently. If they ever come into use, yes, you'd probably need a new character set again. But then a lot more things will break, as higher code points would be incompatible with UTF-16 as well.
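To make the UTF-16 constraint concrete, here is a rough sketch of the surrogate pair arithmetic. The function name `surrogate_pair` is hypothetical; the constants are the standard surrogate ranges.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not representable as a surrogate pair")
    offset = code_point - 0x10000        # fits in 20 bits
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x10FFFF)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder
encoded = chr(0x10FFFF).encode("utf-16-be").hex()
print(encoded)
```

Two surrogates carry 10 bits each, so the scheme tops out at 0x10000 + 0xFFFFF = 0x10FFFF. Anything higher has no UTF-16 representation, which is why going past 4-byte UTF-8 would break UTF-16 too.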

ghusbands · on Jan 12, 2022

UTF-8 only allows 4 bytes, since 2003: https://datatracker.ietf.org/doc/html/rfc3629

fredoralive · on Jan 12, 2022

Legal UTF8 is limited to 4 bytes, as Unicode only uses ranges that fit the limits of UTF16.

AFAIK the flags are a weird multi-code point encoding of the ISO country codes, and each individual code point is no more than 4 bytes.

loeg · on Jan 12, 2022

UTF-8 is variable width. The biggest valid codepoint is U+10FFFF, which has a 4-byte encoding in UTF-8. Other codepoints have 1-, 2-, or 3-byte encodings.
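A quick sketch of the variable widths, with sample code points picked only for illustration (`utf8_len` is a made-up helper):

```python
def utf8_len(code_point):
    """Number of bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))

# One sample per width, plus the largest valid code point
samples = [0x41, 0xEB, 0x20AC, 0x1F3F4, 0x10FFFF]
widths = {f"U+{cp:04X}": utf8_len(cp) for cp in samples}
print(widths)

# Python refuses to build a code point past U+10FFFF at all
try:
    chr(0x110000)
    beyond_limit_ok = True
except ValueError:
    beyond_limit_ok = False
print(beyond_limit_ok)
```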

zhte415 · on Jan 12, 2022

Variable width is unlikely to be a problem. At 3:5 the Scottish flag does not have an unusual aspect ratio. This is unlike the flag of Qatar, with a ratio of 11:28, or Nepal with both a 3:4 (approximate, not exact) aspect ratio plus an irregular shape.