For older programmers like me, who pretty much only deal in English, the very simplicity of ASCII can make unicode difficult to grasp.
In ASCII, 1 byte (always 1 byte) of RAM = 1 character. And the encoding called ASCII matches the mapping called ASCII.
Along comes Unicode (a mapping), which has multiple different encodings. A difficult distinction leading to statements like "that's a unicode string/file/field".
Along comes unicode which has variable bytes per character. (Yes, even for utf-32, which is why no-one uses utf-32). I still regularly come across folk who think unicode means 2 bytes per character.
Along comes unicode which asks you to consider if the functions LEN, SUB, SLICE etc are counting Bytes or Characters.
Along comes unicode that breaks the idea of setting field length in databases as "number of characters", made worse by a cohort brought up on limited storage, who want to be "efficient" in their declarations.
Along comes unicode with code units, code points, characters - all of which come into play, all with different, or variable, lengths.
So while utf-21 might be a "toy", even making something like that yourself is a learning exercise well worth the endeavour. It should be a mandatory teaching exercise. Two thumbs up.
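A minimal Python illustration of that LEN/bytes-versus-characters ambiguity (my own example string, nothing from the article):
>>> s = "naïve"              # 'ï' is a single code point, U+00EF
>>> len(s)                   # Python's len counts code points
5
>>> len(s.encode("utf-8"))   # the byte length is different
6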
Older programmers like you are not old enough. (-:
ASCII had variable numbers of bytes per glyph back in the 1960s, and the Teletype Model 37 semantics of combining-backspace and combining-underscore still exist in modern operating systems, like the various BSDs and Linux-based ones, to this day. It's a combining-characters multibyte encoding understood by programs from less(1) through ul(1) to nroff(1).
ASCII had multiple transfer encodings, from 8N1 to 7E2, many of which were neither 7-bit nor 8-bit; and 1 byte of RAM was not 1 ASCII character. 1 byte of RAM not being 1 ASCII character is, after all, what brought about "extended ASCII".
Along with that came ECMA-35, with its GR and GL, and code pages. All of those broke theretofore held truths such as that one could just strip the parity bit, actually introduced the false idea (that many people had to keep correcting) that ASCII was 8-bit, and required a full-blown state machine for parsing ECMA-35 escape sequences.
Then there are the tales that old programmers can tell about how they couldn't actually deal in English because ASCII didn't have a pound symbol, and what was "English" was actually American (which was, after all, the "A" in "ASCII").
There is a whole generation of older programmers who laugh hollowly at the idea of "the simplicity of ASCII": the amount of time they spent fiddling with serial protocol settings, transcoding between code pages, handling language variant characters in Teletext, wondering why an 8-bit-clean transport protocol such as TCP didn't mean that SMTP could be 8-bit clean, and doing other such stuff probably came to years in total spent on all of this "simplicity".
As a beginner programmer back in the day I'd have agreed that unicode is weird and "length in characters" is the right metric for the database entry.
However as a senior dev, who has read and messed around enough with binary, UTF-X, compression/GZIP, etc., I'd say that "character length" for database field size is a weird concept and that "size in bytes" would make more sense since that maps better to what you have in the HDD/SSD/network.
Everyone replying here seems to be quoting "business reasons" which seem very English-centric. Try to talk to a Japanese or Korean person about their "business reasons".
I am suffering the opposite problem. My name has 36+ letters and I live in Japan, where the average full name has "3-4 characters" (Kanji) so they tend to be safe and allow for 10-15 characters, if I'm lucky maybe even 10 for given and 10 for family names (20 in total), where I don't have to totally murder my name and just need to amputate it.
Typically there should be a maximum size in bytes for performance reasons (e.g. how many rows fit in a page) and/or a minimum size in "characters" (very ill defined but usually approximated as code points, with some slack to allow a reasonable amount of combining marks in addition to the desired number of letters), with no guarantee that the two sizes are compatible.
I've often found in practice short UTF-8 columns with a byte length that "accidentally" truncated text with catastrophic effects after the application checked value lengths in characters.
If you want a constraint for such a column, then due to the nature of our complex writing, it has to be separate from the actual field type and physical storage allocation - it's the equivalent of an integer field having a constraint that it must be between 1 and 100.
Is there ever a genuine, non-arbitrary business requirement to limit any string to a certain number of unicode characters? And by characters here we probably mean extended grapheme clusters.
If the data is going to get printed in a monospace font, on a passport or credit card or something, then I can understand a limit on the number of characters. But then it's not full unicode either - you want to constrain the length in some specific character set.
Otherwise, I think limits are always arbitrary. I suspect there is a strong cultural holdover from the days of punched cards here.
>> If the data is going to get printed in a monospace font, on a passport or credit card
In the case of limited space, you absolutely can count characters and limit them in the UI. For data storage though you would provision "enough bytes". Think approximately 4 bytes per character, plus a bit extra just in case.
But I agree in most cases the length limits are completely arbitrary and simply made up by the programmer at the time the database is designed. (Hint: They are _always_ too short, especially if the programmer lacks experience.)
Business constraints should be at the server layer (eg api constraints), not the db layer.
You want the name to be max 200 characters, the birth year to be minimum 1900, and the email address to not contain the domain name hotmail.com? Don’t do all this at the db layer! What if you want some users to have different constraints, eg selectively disable some of them?
Nobody cares about them, I am afraid. Most people _really_ can't accept that you can't work on Unicode like you did with ASCII, and do not want to let go of their old C habits (like iterating char by char and doing something).
It makes it less intuitive for the user if you have input fields limited to n bytes instead of to n characters (complications caused by combining characters notwithstanding).
Twitter used(?) to have this problem: emojis, not being in the BMP, would count as two "characters" because they were encoded as UTF-16 surrogate pairs.
That ship sailed long before Unicode. ISO/IEC 8859-1 included several accents which combine with a preceding character, and that was standardised in 1987. If you only deal in English then remove all accents and map everything down to 7 bit ASCII, if you don’t just deal with English then accept that and deal with it. :-)
Having had to do some internationalisation of code written by Americans that couldn’t even handle 8 bit characters I am fine with spreading this bit of pain a little more evenly round the world.
Are you sure about that designation? I don't remember 8859-1 having combining characters and other sources (https://en.wikipedia.org/wiki/ISO/IEC_8859-1 for one) seem to agree and also list it as being officially ratified later than '87 ('98).
In ASCII, the concept of a character is overloaded with many properties. Unicode breaks "character" out to separate concepts like code unit, code point, grapheme cluster, etc.
Because of this, some of your statements are invalid. For example, "Along comes unicode which has variable bytes per character. (Yes, even for utf-32, which is why no-one uses utf-32).".
UTF-32 has a fixed number of bytes per code point. It doesn't have a fixed number of bytes per grapheme cluster.
ASCII is a multi-byte encoding system. Yes, really, it is. The character set was designed specifically so that overstrike can be used to express a variety of characters from Latin scripts: lower-case ones with a cedilla or accents. E.g., to encode 'á' you'd output "a<BS>'", where BS is the ASCII backspace control character. This comes from the days of typewriters, where this is how people would print accented characters and cedillas. The lower-case restriction is why Spanish makes accents optional on upper-case letters.
Most of this overstrike functionality was lost long ago except for underline and bold, which remain in use in terminals (and `nroff`, and...).
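For the curious, here is a rough Python sketch (my own simplification, not the actual ul(1)/less(1) logic) of stripping that combining-backspace markup, where "_<BS>x" means underlined x and "x<BS>x" means bold x:

def strip_overstrikes(raw: bytes) -> str:
    # Keep the character that "wins" each char-BS-char overstrike triple,
    # dropping the underline/bold markup.
    text = raw.decode('ascii')
    out = []
    i = 0
    while i < len(text):
        if i + 2 < len(text) and text[i + 1] == '\b':
            out.append(text[i + 2])
            i += 3
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

print(strip_overstrikes(b"_\x08H_\x08i"))  # underlined "Hi" -> "Hi"
print(strip_overstrikes(b"B\x08Bold"))     # bold "B" + "old" -> "Bold"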
UTF-8 is 30 years old, and 20 years ago it was already in common use on the web. I know because I was the doofus who pushed PHP projects to adopt it because monolinguals kept fucking up character encodings.
Being an older programmer is a very poor excuse at this point to not understand Unicode.
True. And for those of us around who "learned" unicode in the 90s, when unicode was UCS2, it can be hard to let go of those "truths that are no longer true." This complicates the learning process now.
> In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. This increased the Unicode codespace to over a million code points,
So that seems like a fair analysis. I know we still did latin-1 DBs here into 2005, but I think you had to choose charsets at that point. Everyone knew you should do Unicode but there were so many excuses not to. Unicode still is harder than ascii because of that mindset.
const char *const end = str + strlen(str);
for (const char *it = str; it < end; ++it) { .. }
is almost always wrong unless your locale is C and you explicitly say you only work with ASCII. It's wrong with UTF-32 and char32_t too, because as the post suggests Unicode codepoints are not glyphs - you can have glyphs spanning over multiple codepoints, with multiple normalization forms. This is particularly tricky, because characters printed the same can be represented in multiple ways and it's hard to discern them, i.e. è can be a single codepoint or a combination of 'e' and the combining grave accent '◌̀'.
In general the trivial C assumption "strings are just char arrays" most people are accustomed to is dangerous and broken in Unicode. str[x] in Unicode is wrong 99% of the time, thus UTF16 and 32 are nonsense¹ and either you treat strings as bytes you just pass around, or you need a full Unicode library such as ICU to operate on Unicode text safely.
¹: in my experience they are almost never used deliberately; UTF-16 is actually two incompatible encodings (LE and BE), and most of its users treat it as a way to "UTF-wash" the broken UCS-2 they so eagerly adopted in the '90s (including Java, Windows NT, Cocoa, Qt, Unreal, JavaScript, [partially] Python, ...).
UTF-16 is the internal encoding of ICU to this day. If you're using ICU, you're using UTF-16. The library treats UTF-8 as a conversion target rather than a native representation. If you ever see a new project pick UTF-16 and you don't know why, it's because of ICU; any other choice forces a round trip conversion on every ICU call. If you pick UTF-16 you can just use icu::UnicodeString as your string representation and life is easy.
If str is utf-8, as it likely is nowadays outside certain OS APIs, then that code is fine. You can't grab an arbitrary character from the str and move it around and have it retain its meaning. But you couldn't necessarily do that before either, as you would be potentially breaking up a word, for example. To do unicode while pretending it's ascii in C, you look for ascii characters you recognize, like punctuation and the like, which you split the string on. You then treat every other substring as a black box of characters that can only be moved around as a unit.
So, as someone who has had to briefly dabble in Unicode muck some years ago: how would I properly iterate through Unicode? How do I know how many bytes to iterate before I get a proper glyph instead of part of a glyph, or worse: a glyph and some extra codepoints?
Is the concept so esoteric that it's best to just find a blackbox library and not worry about the finer details?
Either you use a (big, complex) library like ICU or you just don't.
People have been taught for too long that strings are arrays of characters - they are not, they are arrays of bytes, and bytes may mean lots of things.
If you need to iterate or randomly access a string you should ask yourself why are you doing it. Any code that attempts to operate on codepoints will inevitably only ever work properly for ASCII. If that's fine for you (for instance, you want to match all text between a pair of <>) then you can probably work on raw bytes and UTF-8. If that's not the case, you need a library to handle the million different cases related to supporting RTL, invisible characters, pictograms, combinations, ...
Never use wchar_t, and even things like char16_t and char32_t have very few correct use cases to be honest (like computing sizes, etc.).
Even counting characters is broken, `wcslen(L"Salò")` may still be 4 or 5 depending on the current normalization form. The correct solution is to take strings as black boxes, array of bytes you only access through APIs (like binary data).
It is also high time we get rid of obsolete concepts like case sensitivity - it just doesn't work at all outside of regional code pages (try doing `"ciao".toUpperCase().equals("CIAO")` in Java with locale tr_TR.UTF-8 and look what happens).
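The normalization-dependent length is easy to reproduce in Python (a quick sketch of the "Salò" point above):
>>> import unicodedata
>>> len(unicodedata.normalize('NFC', 'Salò'))   # ò as one precomposed code point
4
>>> len(unicodedata.normalize('NFD', 'Salò'))   # ò as 'o' plus a combining grave accent
5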
To do it correctly, you decode the unicode. Then you apply whatever algorithms you need. Decide things like "are the diacritics on combined characters extra glyphs or not" and "how many characters is '﷽'" and "ௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌ is technically a single glyph built around several character combinations but it's as long as a sentence". Actually, you don't want glyphs, you're probably interested in grapheme clusters.
You can try to be smart and measure the width of the rendered string, but then you find out that Mongolian is written top-to-bottom and that Zalgo allows for freaky combinations that break the line boundaries.
The real answer is the answer to the question "how do I deal with timezones in my code": you use a library by people who already put the effort in. Take something like ICU and let that deal with the text. If you need to ask the question, you're probably not a programmer with a minor in linguistics so letting the people who do know do the work makes everyone's lives a lot easier. Just make sure to read the documentation right.
If you want a "glyph" you are trolling yourself because selecting and assembling glyphs in fonts is an additional layer of punishing complication on top of Unicode itself. Correct text rendering requires passing as much text as possible (even multiple lines) to a "black box" library that figures out layout, ligatures, glyph replacement etc.
If instead you want to find, as is more likely, whole grapheme clusters that can be treated as editing units (e.g. select/copy/delete "the character after the cursor", move the cursor N "characters", count "letters" in a string) you can go forward through your string, codepoint by codepoint, and decide depending on how those codepoints are classified where a "character" ends. Your ideas and requirements about the criteria might or might not match those of your libraries and of the Unicode standard itself.
You use "icu::BreakIterator::createCharacterInstance" from the ICU library to get a character-walker. In .NET, you have "StringInfo.GetTextElementEnumerator". You must use a Unicode library to walk "grapheme clusters" (what users recognize as individual characters). There's no other way to do it correctly.
You don't go far enough. Let's do some language reform. The entire alphabet is always just 5 bits expressed in different mediums. Let's make the written characters all based on 5 sticks coming out of a central point, literally representing your hand. Sign language, pretty obvious. Morse code, the up / down fingers directly translate to holding the key down or not. Braille, same as Morse. Digital displays, 5 segments instead of 7. Flag signals, mildly tricky since you only have two arms, but it can be done by mirroring the symbols so that there's 10 positions (top and bottom half), and then the bottom flag positions represent "not here". Alphabetical order is just interpreting the bits as ints. Literally every way we have ever historically held and communicated natural language, unified into a single general purpose encoding.
Why settle at simplifying computer codes, when you can simplify the content at the source.
None of your complaints make any sense. If you limit your usage to ASCII characters, then all of your complaints of Unicode no longer apply. E.g. character length matches your intuition of “number of characters”.
Firstly we live in a world where limiting yourself to ASCII characters is increasingly untenable. There are plenty of non-ascii characters in Spanish or European names.
So your argument about limiting to ASCII is like saying it's also not a problem if we don't use computers. Sure, you're right, but it's not terribly practical.
Plus your argument uses the word Unicode (a mapping) where it only applies to utf-8 (a specific encoding). This inaccurate use of the terms is the very issue I'm alluding to.
Incidentally I'm not complaining about Unicode. Unicode is necessary and good. My comment is that (mostly older, but not exclusively) programmers find it hard to grasp the nuances of unicode and think of it as an encoding, not a mapping.
> Firstly we live in a world where limiting yourself to ASCII characters is increasingly untenable.
I agree. But your original comment bemoaned the difference between using ASCII and Unicode encodings. I was just pointing out that any Unicode encoding is not variable length if you limit yourself to ASCII. This applies to all Unicode encodings, not just utf-8.
I might not have been clear in the original posting. I wasn't intending to bemoan the difference, I was bemoaning old programmers (like me) not being familiar with the most basic unicode concepts, which leads to lots of issues.
Yes, if you limit your strings to ASCII then each character is always 1 byte (utf-8), 2 bytes (utf-16) or 4 bytes (utf-32). So you still need to know which encoding is in play. But you can then randomly access "the 4th character".
My follow up points out though that this is meaningless. Because in the real world you cannot limit your strings to just ASCII in the first place. So your point is moot from that perspective.
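For what it's worth, the fixed per-character widths for ASCII-only text are easy to check in Python:
>>> len('abc'.encode('utf-8'))       # 1 byte per ASCII character
3
>>> len('abc'.encode('utf-16-le'))   # 2 bytes each ('-le' avoids the BOM)
6
>>> len('abc'.encode('utf-32-le'))   # 4 bytes each
12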
As an older programmer who deals with multiple languages, the very complexity of ASCII in a multi-language context does make unicode seem like a breath of fresh air.
Seriously, multi-language in the 90s was a hell of half-baked incomplete and incompatible solutions.
wchar_t is a massive mistake that came from an era (the '90s) where people wanted to delude themselves they could keep ASCII-like code ergonomics (iterating, random access, ...) by just changing character types.
It was a stupid mistake we are still paying for to this day, when OSes and libraries made in the '90s are forced to convert from the UCS2 they use internally to UTF-8 constantly (just look at Qt).
Isn't this similar to what modern languages did? They abstracted away the underlying encoding so that the programmer deals with characters instead of bytes. Two examples are Python and Javascript strings. They kept ASCII-like code ergonomics (iterating and random access).
That of course required that they separated strings and byte-like objects -- like uint8_t and wchar_t.
Isn't signed char actually the culprit in modern C and therefore useless? Its main use is ASCII and that's obsolete.
> so that the programmer deals with characters instead of bytes
This is a broken approach and a broken mindset. Modern languages like Rust (Python is not modern anymore, it is ATM 33 years old which is older than C was in 1990) reverted back to "a string is an array of uint8" because that's the only sane way to operate on them. Naive iteration and random access are broken unless they are performed on the underlying bytes, because iterating on Unicode characters is a *broken concept*.
Python strings are also arguably somewhat broken, because they still allow random access into a string at "character" (not byte) indexes which causes all sorts of issues when slicing. This means that code that works perfectly with English text will malfunction when handling other languages.
The hard truth is that slicing a Unicode string is a non-trivial and (somewhat) expensive operation, while Python slicing was designed in Python < 2.7 with the assumption char == byte, which now is always broken in every encoding except 8 bit, single codepage ones.
For instance:
>>> import unicodedata
>>> s = unicodedata.normalize('NFD', 'Crêpe')
>>> s
'Crêpe'
>>> [unicodedata.name(c) for c in s]
['LATIN CAPITAL LETTER C', 'LATIN SMALL LETTER R', 'LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT', 'LATIN SMALL LETTER P', 'LATIN SMALL LETTER E']
>>> len(s)
6
In this case, the string "Crêpe" is in NFD form (all decomposable characters are decomposed; in particular `ê` is not U+00EA 'LATIN SMALL LETTER E WITH CIRCUMFLEX' but '\u0065\u0302', which is U+0065 'LATIN SMALL LETTER E' plus U+0302 'COMBINING CIRCUMFLEX ACCENT' ( ◌̂ )).
Rust, which is more modern and enforces UTF-8 (and not some broken version of UCS-2 or worse), ALWAYS does slicing on bytes because it doesn't make sense to perform it on "codepoints". Asking for a slice with a byte range that falls in the middle of a UTF-8 multibyte sequence will cause a panic; this also means that if you index with s[2] you get the _byte_ at that position, not the "char". If you want the second character, you are forced to go through a Unicode library, as you always should.
Python will instead happily comply, basically returning only the codepoints you asked for, because it sees a string composed of 6 Unicode codepoints despite the fact the user only sees 5 rendered on screen:
>>> s[0:3]
'Cre'
>>> [unicodedata.name(c) for c in s[0:3]]
['LATIN CAPITAL LETTER C', 'LATIN SMALL LETTER R', 'LATIN SMALL LETTER E']
This makes string slicing basically useless, because even if you normalise all strings using `unicodedata.normalize('NFC', s)` before performing slicing on them, there are still several printable characters which are not represented with a single Unicode codepoint.
For instance,
>>> eu = '🇪🇺'
>>> len(eu)
2
>>> eu[0:1]
'🇪'
because all flag emojis are represented with two Unicode codepoints, each one representing a letter.
TL;DR: do not use wchar_t, UTF-16, UTF-32, ...; only use UTF-8 if possible, and under all circumstances treat strings as black boxes specialised for text that you can only access byte by byte (like binary data). If you need to do text operations, use a library like ICU or whatever your language/repository provides.
Thanks for the examples, your point makes sense to me now. Combining characters or modifier characters and all other weird aspects of unicode really call for a separate library for parsing unicode, since even relying on 1 codepoint == 1 text unit doesn't buy you much, because 1 character isn't always 1 codepoint.
> the very simplicity of ASCII can make unicode difficult to grasp.
Somewhat related: I hate "smart" "quotes". I find myself editing them into plain old << " >> and << ' >>.
If you want to use "smart" quotes to add semantic meaning to text - as in "this is a quote" - then you should be using markup to do the job. Stop abusing non-ASCII typography so you can pretend to write "semantic" material that you think will impress all the cool kids that know structured documentation.
Who else remembers that great Perl script, "The Demoronizer" ?
I too hate “smart quotes”, meaning anything that automatically converts from ASCII, but only for myself. That’s because I’m a weirdo that types exactly what I mean, things like em dashes and en dashes and curly quotes and such, using my Compose key.
Opening and closing quotation marks are different from one another (simple fact, true from the very early days of quotation marks), so they should be represented differently in Unicode (simple fact). Anything along the lines of “smart quotes” is just a pragmatic approach to let normal people get the correct quotation marks, and I’m mostly glad they can have it (I just wish it’d handle things like “’cos” and “’90s” properly, which use an apostrophe which is equivalent to a right single quotation mark; but other than that, I’m glad when people use curly quotes rather than straight quotes), so long as I don’t have to have it (and I’m glad to say I don’t).
These are all somewhat legacy, having never seen very wide adoption:
Compression schemes such as SCSU and BOCU-1 turned out in practice to add a lot of complexity for minimal real-world benefit. If you really need to save storage, then a standard compression algorithm (e.g. deflate or zstd) is usually a better option.
CESU-8 is a backward compatibility hack for old UTF-8 implementations which used encoded UTF-16 surrogates for codepoints outside the BMP.
UTF-EBCDIC is basically UTF-8 but for EBCDIC. It was invented by someone at IBM, but IBM themselves ended up standardising on UTF-16 instead. And if IBM ended up not using UTF-EBCDIC, who else was going to? Well, there's a few surviving non-IBM mainframe vendors who also use EBCDIC (such as Unisys and Fujitsu BS2000), but I'm not aware any of them ever expressed any interest in it either.
Almost nothing uses UTF-EBCDIC by default; DB2 (and the underlying VSAM persistent storage) mostly ends up writing UTF-16 (but you can set it up to write UTF-8), except for internal catalog data and network traffic, which are usually UTF-8 no matter what.
Do any IBM products have UTF-EBCDIC as a supported configuration option? Does anyone use it?
Oracle RDBMS supports "UTFE" on EBCDIC platforms, which is to UTF-EBCDIC what CESU-8 is to UTF-8 – i.e. UTF-EBCDIC but with codepoints outside the BMP encoded using surrogates. Over the last couple of decades, the only EBCDIC platforms Oracle RDBMS has been supported on have been z/OS and BS2000/OSD – once upon a time it was also available on VM/CMS (not sure when that was discontinued), and possibly more besides (in the 80s thru early 90s, Oracle ported their DB to just about everything under the sun, they later became much more selective in what they'd support). The z/OS port was discontinued with version 10gR2 (initial release 2005, final patchset 2010) – although, while Oracle normally puts a time limit on their pay-extra support for patches for old versions ("extended support"), exceptionally they've said for the z/OS port they'll continue offering that as long as customers want it. So the only remaining EBCDIC port of Oracle RDBMS is Fujitsu BS2000/OSD – I think the only reason that port survives is because Fujitsu pays Oracle to keep on producing it.
mksh (MirBSD's ksh) supports some custom EBCDIC encoding of UTF-8 (so-called "nega-UTF-8"), instead of proper UTF-EBCDIC. One difference is that UTF-EBCDIC avoids using the C1 controls (bytes 0x80 thru 0x9F), so EBCDIC control characters can be left intact, whereas "nega-UTF-8" just converts UTF-8 to EBCDIC using an arbitrary EBCDIC code page, hence failing to fully preserve the full complement of EBCDIC control characters. Also, while UTF-EBCDIC is based on a fixed EBCDIC code page (1047), "nega-UTF-8" isn't, so it really isn't a single encoding, rather a family of them.
Has anyone else ever implemented UTF-EBCDIC, or something similar (given neither Oracle UTFE nor mksh's nega-UTF-8 are actually UTF-EBCDIC)? Years ago, I wrote an implementation of it myself (in Java), which I've never released – not for any work-related purpose, just as a private exercise in recreational programming. Actually, I have an FTP server I wrote which serves up UTF-EBCDIC as "TYPE E", but I could never test it properly, because (at the time) I couldn't find an FTP client which actually implemented TYPE E. I think I later found one, but by then I'd lost interest in pursuing it any further.
If you look at the structure of UTF-8, it is based on a prefix code: each byte starts with a string of 1s followed by a 0, where the total number of bytes in the sequence is the number of initial 1 bits in the first byte – except for single-byte sequences, which start with a 0 bit, and the 10 prefix instead of being used to introduce single byte sequences is instead used to mark continuation bytes. There is no reason why UTF-9 could not have adopted the same structure – 0xxxxxxxx for U+0000 thru U+00FF, 110xxxxxx 10yyyyyyy for U+0100 thru U+1FFF, etc, but instead the UTF-9 author decided to go with a much more simplistic scheme of "initial bit is 0 for initial byte, 1 for continuation byte".
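For reference, that prefix-code structure is small enough to sketch in Python (an illustrative re-implementation of today's 4-byte-maximum UTF-8; real code should just call str.encode('utf-8')):

def utf8_encode(cp: int) -> bytes:
    # The lead byte's run of initial 1 bits gives the sequence length;
    # every continuation byte starts with the bits 10.
    if cp < 0x80:
        return bytes([cp])                                    # 0xxxxxxx
    elif cp < 0x800:
        return bytes([0xC0 | (cp >> 6),                       # 110xxxxx
                      0x80 | (cp & 0x3F)])                    # 10xxxxxx
    elif cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),                      # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:
        return bytes([0xF0 | (cp >> 18),                      # 11110xxx
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

assert utf8_encode(ord('é')) == 'é'.encode('utf-8')           # b'\xc3\xa9'
assert utf8_encode(0x1F600) == '\U0001F600'.encode('utf-8')   # 4-byte emoji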
UTF-18 is also flawed, because it has no concept of surrogates, hence it cannot represent the full 21-bit repertoire of Unicode characters. Once again, there is no technical reason why it could not have supported the use of surrogates to achieve this.
I suppose it doesn't really matter because it was just an April Fool's joke not a serious proposal – but if one was to make a serious proposal for a UTF-9 and UTF-18, I think it would have to fix these two flaws in RFC4042.
One aspect the article is omitting is that UTF-16 and UTF-32 aren’t strictly speaking encodings, but UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE are. Furthermore, with a 21-bit packed encoding like UTF-21, not only byte-level endianness is relevant, but inner-byte bit-level endianness becomes relevant as well. While the encoding examples given imply a particular endianness, this is something that would have to be specified more explicitly.
Endianness is such a pain. I wish everything had always stayed big-endian, the generally superior option (though I will acknowledge it’s not quite uniformly superior). Implement any bit-level algorithms and endianness tends to rear its ugly head. I implemented the Speck cipher last year, and it basically mixed little- and big-endianness (byte/bit expression), but the paper never mentioned this at all, so trial and error on the sample vectors, or emailing the authors for clarification, was the only way you’d find that out!
The article gives examples that assume big-endianness, but in practice, UTF-16 is almost always UTF-16LE (no idea about UTF-32 since it’s not used so much), so the bytes within each code unit are, for practical purposes, back to front.
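A quick Python illustration of the byte-order point (the characters are arbitrary):
>>> 'é'.encode('utf-16-be').hex()   # big-endian: high byte first
'00e9'
>>> 'é'.encode('utf-16-le').hex()   # little-endian: what you'll usually see in the wild
'e900'
>>> 'é'.encode('utf-32-le').hex()
'e9000000'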
The Unicode space used to be 31-bit. UTF-8 defined 5 and 6-byte encoding sequences for this reason. When UCS-2 had to be extended to create UTF-16, the solution chosen effectively limited things to a 21-bit space. Unicode was then redefined to be 21-bit, and the 5 and 6-byte encodings of UTF-8 were removed from the standard, though old systems might still accept them.
It is worth noting that if we ever run out of space in 21-bit Unicode, one could conceivably re-extend it back to the original 31 bits by throwing UTF-16 under the bus. The only reason Unicode is limited to 21 bits is because of UTF-16.
By throwing UTF-16 under the bus, or by extending it, in the same way that UTF-16 was created by extending UCS-2.
Just like a number of UCS-2-representable codepoints in the basic plane were given up, in order to allow that number squared of non-UCS-2-representable codepoints to be representable in UTF-16, a number of UTF-16-representable codepoints in a supplemental plane could be given up, in order to allow N^2 non-UTF-16-representable codepoints in not-yet-existing supplemental planes to be representable in UTF-16-but-worse.
Yes, I'm talking about putting surrogates in the surrogates.
There's also no reason to assume that if you are throwing out UTF-16, you won't at that point also be planning to extend past 31 bits and out into the next range (out to 63 bits).
Assuming UTF-32 is UCS-4 and will never address future planes is how Unicode got into the situation that UTF-16 isn't infinitely expandable and is stuck at the 21-bit boundary.
Look, we’ve currently got about a million code points, and a trajectory and principles that should leave that easily enough for at least five thousand years, by my simplistic estimate <https://news.ycombinator.com/item?id=32286193>. There’s no point in forecasting that far in advance.
And you’re concerning yourself with filling a space of two billion code points? A Unicode that would be in any danger of doing that would not be in the slightest bit recognisable as the Unicode of today.
(Yeah, I know they once thought they could squeeze Unicode into 65,536 code points. I honestly don’t understand how they ever thought that would be sufficient even with Han unification, and I wish they hadn’t made that misstep that gave us the abomination that is UTF-16. But they knew running out was a hazard from the start, they just miscalculated the risk. No one that examines the matter is concerned about even the slightest risk of the current million code points being insufficient in the usefully-predictable future.)
I'm just saying it is better to make no assumptions here than to make the same mistakes that led to UTF-16's accidental brokenness.
There are certainly jokes that, at emoji's current rate of expansion, we could be in the billions by the end of the decade.
Every time we think we have a handle on "all" of human languages we make new interesting discoveries of past languages or hidden present languages, or an indigenous populace declares a need for a script they can truly call their own rather than whatever ugly mix of Latin letters some missionaries thought appropriate in the culturally insensitive past, or a populace has an uprising and declares a revolutionary new start or…
That's just human languages. I'm sure we'd have a ton of problems if we ever started trying to encode non-human languages. I don't think we're in danger yet of discovering that dolphins have a written script for whale languages (or, wilder, some sort of first contact event with non-terrestrial aliens), but our science fiction loves to posit such events happening at any time and as a huge surprise. It would be great if we were at least somewhat prepared for such wild hypotheticals.
Yeah, that's outside the "usefully-predictable future", but also I think Unicode has always existed in a place where the future isn't "usefully-predictable". (How many 90s Unicode architects would have predicted emoji? As only one example.)
"UTF-16 native" doesn't mean your "UTF-16 unit" (i.e. what would `charCodeAt` return) should not exceed 16 bits. JS implementations already do not use UTF-16 as a sole native representation anyway, so the migration should be relatively easy.
> JS implementations already do not use UTF-16 as a sole native representation anyway
But they still expose UTF-16 code unit semantics: U+10000–U+10FFFF come through as two UTF-16 code units, a surrogate pair.
> "UTF-16 native" doesn't mean your "UTF-16 unit" (i.e. what would `charCodeAt` return) should not exceed 16 bits.
True enough. You could redefine it to have UTF-16 code unit semantics until U+10FFFF, then code point semantics beyond, maybe call this “UTF-16+ code unit semantics”. It’d definitely break a few things here and there, but then, anything would break a few things here and there, regardless of encoding type, since the range U+0000–U+10FFFF is hard-coded in so many places. Well, anything except doing surrogates again. And I’m pretty sure that’s what would be done.
Exactly this. In fact I first thought of https://ucsx.org/ then realized that JS strings do allow lone surrogates, and then further realized that none of this matters at all because the external storage format is entirely disconnected from the JS semantics. So as long as we can agree on which storage format to standardize and upgrade everything accordingly (which would really be a drama), JS wouldn't make an additional dent.
> I think UTF-32 is the simplest. Each number is put into a 32-bit integer, or 4 bytes. This is called a “fixed-width” encoding.
This is wrong. The 4-byte fixed encoding is UCS-4, and while UCS-4 currently encodes all of Unicode, there's no guarantee it will "forever". Just as people who assumed UCS-2 was "all" of Unicode have since been proven (very) wrong.
UTF-32 is a variable width encoding even if right now there's no reason to use surrogates out to the next plane (particularly because the next plane is entirely inaccessible to UTF-16 and only UTF-16, but UTF-16 is still one of the world's most common encodings).
UTF-32 is a fixed-width encoding. To quote directly from the Unicode standard, chapter 3.9, definition 90:
> UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value.
UCS-4 is defined in ISO 10646. Before the 2011 revision of the standard UCS-4 had a 31-bit codespace, since then the definitions of the ISO and Unicode standards have converged and UCS-4 and UTF-32 are now synonymous.
> while UCS-4 currently encodes all of Unicode, there's no guarantee it will "forever".
There is. Unicode has a codespace of U+0000..U+10FFFF. There are 2^16 + 2^20 − 2^11 assignable codepoints, in Unicode 15 a total of 149,186 (13.4%) have been assigned.
Even if the Unicode codespace were to ever be extended again (it won't), the only encoding that would become incompatible is UTF-16. In fact, both UTF-8 and UTF-32 are trivially extensible and used to be wider encodings, but were restricted to 0x10FFFF arbitrarily to match UTF-16 limitations.
> Even if the Unicode codespace were to ever be extended again (it won't), the only encoding that would become incompatible is UTF-16. In fact, both UTF-8 and UTF-32 are trivially extensible and used to be wider encodings, but were restricted to 0x10FFFF arbitrarily to match UTF-16 limitations.
Mind you, it wouldn’t be this easy, because things should perform Unicode validation, and many do, so every piece of software would need to be updated for the new, enlarged version of Unicode, UTF-8 and UTF-32, and old software that validated would baulk or convert anything from the new ranges into REPLACEMENT CHARACTER.
True, another extension would be disastrous, we may never recover from the fallout of malformed UCS2-esque UTF-16. Let's just hope no one fixes all the broken, accidentally forward-compatible decoders in the practically infinite amount of time it will take to completely fill the Unicode codespace.
I really don't see any advantage to UTF-32 when not-officially-standardised UTF-24 has the same constant 3-byte-sized codepoints (and multiplying by 3 is not hard - it's n + 2n); in UTF-32, the highest byte will never be anything other than 0, so it's essentially permanent waste.
Also, according to https://en.wikipedia.org/wiki/List_of_Unicode_characters there's currently less than 150k codepoints defined, so even 21 bits is several times larger than necessary --- 18 bits will contain all the currently assigned codepoints, and be sufficient until 256k codepoints is reached.
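A sketch of what such a (non-standard, purely hypothetical) "UTF-24" could look like, 3 bytes per code point, big-endian:

def utf24_encode(text: str) -> bytes:
    # 3 bytes per code point, most significant byte first
    return b''.join(ord(c).to_bytes(3, 'big') for c in text)

def utf24_decode(data: bytes) -> str:
    return ''.join(chr(int.from_bytes(data[i:i + 3], 'big'))
                   for i in range(0, len(data), 3))

s = 'héllo 😀'
assert utf24_decode(utf24_encode(s)) == s
assert len(utf24_encode(s)) == 3 * len(s)   # vs 4 * len(s) for UTF-32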
UTF-32 is easy to understand for educational purposes but it's probably a mistake to use it as a real string representation, and almost nobody does. Code units and code points are the same thing in UTF-32 but they're different in UTF-16 and UTF-8; you can teach someone UTF-32 before they understand the distinction. Obviously, UTF-24 isn't used because it isn't a standard encoding, and if you really wanted to save memory, you'd use UTF-8 instead which is even more compact yet.
As for UTF-16, today the only reason people choose UTF-16 for new projects is because it's the native internal encoding of the ICU library. If you're not using ICU, it's pretty hard to defend anything but UTF-8.
> I really don't see any advantage to UTF-32 when not-officially-standardised UTF-24 has the same constant 3-byte-sized codepoints (and multiplying by 3 is not hard - it's n + 2n); in UTF-32, the highest byte will never be anything other than 0, so it's essentially permanent waste.
Wouldn’t the difference in alignment (4 byte versus 3 bytes) make UTF-32 faster than UTF-24 in certain cases, on certain CPU architectures? So the always zero byte would be wasting space to gain greater performance.
People use UTF-32 sometimes because computers have 32-bit ints but not 24-bit ints. If you want a single primitive type to represent a code point, that's gonna be a 32-bit int. If you make an array of those, that's UTF-32.
To fit all currently assigned code points within 18 bits is easy: you would only have to move one range.
Above 32FFF, all assigned code points are within E0000 to E01EF, which fits between 32FFF and 3FFFF with room to spare.
Those code points are used for flag emojis and for selecting uncommon CJK variants. If you don't support those, you could just strip out anything that doesn't fit in 18 bits to begin with.
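A purely hypothetical sketch of that remapping (my own made-up scheme following the comment above, ignoring the private-use planes entirely):

def to_18bit(cp: int) -> int:
    # Move U+E0000..U+E01EF (tag characters and variation selectors) down into
    # the unused space above 0x32FFF, so every assigned code point fits in 18 bits.
    if 0xE0000 <= cp <= 0xE01EF:
        return 0x33000 + (cp - 0xE0000)
    if cp > 0x3FFFF:
        raise ValueError('does not fit in 18 bits under this scheme')
    return cp

assert to_18bit(0xE01EF) == 0x331EF   # still below 2**18 == 0x40000
assert to_18bit(0x1F600) == 0x1F600   # emoji and everything below U+40000 pass through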
It's funny: for all I knew when reading the post, UTF-21 was something I'd never seen before. But when I saw your comment, I remembered that I must have been quite aware of it back in 2015, when I golfed a Funciton Hello World program [0]. Seven years is a long time for such a silly thing to circle around again.
Out of curiosity, is there a more compact alternative to Unicode that excludes all precomposed chars like ñ, but also for all the Asian languages, so you'd basically have some set of "strokes" that can compose anything else? And if you need to add semantic meaning to identically looking results, that would be another non-visual composable indicator
I'd think that such an encoding would waste encoding space on stroke combinations that are not valid, and therefore not be more compact than Unicode in practice.
I think that the TRON encoding supports vector graphics in addition to all Chinese, Japanese and Korean Han/Hanja/Kanji characters as separate code points.
The vector encoding is for spelling of names (family names and places) that you wouldn't find in dictionaries.
I should first point out that you probably meant Han characters (so called "CJK(V) ideographs", an unfortunate misnomer), and many Asian languages do not use them at all if we are talking about the number of languages.
But yeah, let's talk about strokes first. Almost all Han characters do consist of a small finite number of strokes and the stroke order itself is reasonably canonical... only for a single point of time in one area. There are lots of examples where the stroke order or even the stroke composition itself differs from time to time and among countries. And you won't be able to correctly encode an archaic character whose canonical stroke order is not known---you have to guess, and you will be frequently wrong but wouldn't be able to fix that.
So the next possibility is something like character description languages [1], and people tried a lot of them, so much that we even have a standardized description sublanguage in Unicode [2]. And again, there are so many possibilities that we can't have a single canonical description. Andrew West has maintained a comprehensive IDS database for all Unicode Han characters [3], and while sometimes the first level of decomposition looks clear (字 is clearly 宀 + 子) subsequent decompositions are much more ambiguous (is 子 atomic or 了 + 一?). Ultimately this approach takes too much work to be worthwhile, so most standards including Unicode have treated them as atomic characters with room for glyphic variations.
[2] But those ideographic description sequences (IDS) are meant to describe characters (like, ⿱丶王 means 丶 over 王) and are not equivalent to the characters themselves (⿱丶王 is not the same as 主).
If you know exactly what text you're going to render, you can create your own font, relocate all the characters, and fit everything very compactly into UTF-8. You'll need to convert external input to your encoding scheme, but it can save a lot of bytes for some corpuses.
If you try to standardise against "what things look like", you'll inevitably run into trouble like the famous Turkic I (I (Iı) vs İ (İi) vs I (Ii) where "İ".toLower().toUpper() != "İ").
Sure, this means characters like é can theoretically be stored as either 0xE9 or 0xC3A9, but if you can pick what encoding you use, you can also optimise for the smallest length.
And there's a lot of ways to put together a character, so I worry you'd end up with something closer to a vector format than a normal character encoding.
Much easier said than done. Try to actually do that. A good start would be IDS.TXT I've linked above; this gives about 500 approximately atomic characters to start with [1]. Now extend this to the entirety of Ideographic Variation Database [2], which tries to solve the most cited problems with Han unification. And then add some kind of semantic annotation as you've suggested (I have no idea, maybe you have a better idea).
[1] 123 components that are already encoded, 121 components ({01}, ...) that are partial characters, ~123 "unpresentable" components (?), and ~115 "minor" variations (〾) that may require additional components or two.
> Much easier said than done. Try to actually do that
Ha-ha, thanks for the offer, but I'll pass - all that complexity is precisely why I'm not even going to try, it requires a lot of dedicated team effort put into it, a simple "try" is doomed to fail.
I worry that "1 step before that" requires a database of information about how to actually put the pieces together for a huge fraction of characters. And that without the database it doesn't work right, and with the database it's like the current system with extra steps and extra layers of complicated abstraction. You'd be able to encode some novel characters but that's a high cost.
The original Droid Sans Fallback is exactly that. Caveat is it still has the Han Unification problem (all Asian characters with "similar" shapes are each comingled into single code points across languages, making a "unitary Asian" font impossible).
I feel like the obvious 21 bit encoding is to pack 3 codepoints into 8 bytes.
But uh I guess this works.
If you want to do something more in depth and make real decisions about encoding, a good space to explore is 1-3 byte variable width encodings. You have lots of different tradeoffs to consider and you can make something surprisingly good.
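A rough sketch of that 3-code-points-per-8-bytes packing (mine, not anything standard; it pads with U+0000, so it assumes the text itself contains no NULs):

def pack3(text: str) -> bytes:
    cps = [ord(c) for c in text]
    cps += [0] * (-len(cps) % 3)               # pad to a multiple of 3
    out = bytearray()
    for a, b, c in zip(cps[0::3], cps[1::3], cps[2::3]):
        word = (a << 42) | (b << 21) | c       # 3 x 21 bits = 63 bits, top bit spare
        out += word.to_bytes(8, 'big')
    return bytes(out)

def unpack3(data: bytes) -> str:
    chars = []
    for i in range(0, len(data), 8):
        word = int.from_bytes(data[i:i + 8], 'big')
        for shift in (42, 21, 0):
            cp = (word >> shift) & 0x1FFFFF
            if cp:                              # drop the zero padding
                chars.append(chr(cp))
    return ''.join(chars)

s = 'héllo 😀!'
assert unpack3(pack3(s)) == s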
I'd appreciate an analysis of how it compresses! The encoding looks highly compressible, so I'd expect it to be competitive with UTF-8 for English text, and seems like it would beat it for East Asian languages.
Arithmetic coding goes one token/symbol at a time, just like most kinds of compression. The fractional bits come after token selection, and aren't really relevant here.
You can split the input into tokens that aren't a multiple of 8 bits, sure. But that's its own decision. 7 or 21 or whatever bit tokens could be fed into a Huffman tree just as easily.
Arithmetic compression uses whatever on the output. Of course you can retokenize weird input but you can usually do so for any algo if you can modify it. But UTF21 can not have a substantial advantage if you compress. It will usually be worse.
With UTF-7 being a requirement for certain (outdated) email protocols and programs intentionally sabotaging their UTF-7 code paths (most importantly .NET 5), UTF-7 will remain relevant for anyone dealing with email software for a while. UTF-7 is just one of the many ways in which email can screw you over.
SMTP can't reliably handle UTF-8 so your options quickly devolve into base64 encoding messages or falling back to UTF-7 encoding. Base64 is a lot more wasteful than the alternative, as cursed as UTF-7 may be. For full interoperability, some mail servers even require you to be able to deal with UTF-7 wrapped inside another transcoding format!
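Python still ships a UTF-7 codec, which makes the quirks easy to poke at (a quick sketch; the strings are arbitrary):
>>> '€'.encode('utf-7')    # U+20AC becomes modified base64 of its UTF-16 code unit
b'+IKw-'
>>> '+'.encode('utf-7')    # even a literal plus sign has to be escaped
b'+-'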
If you exclude surrogates, I think you can fit the rest of the codepoints in 20 bits. There would be no need for surrogates; all Unicode scalar values would be representable directly. Looks like an easy size win.
In particular, UTF-16 surrogates use 2048 code points in the basic multilingual plane (BMP), but surrogates allow another 2^20 code points to be encodable in addition to the non-surrogate 65536−2048 code points of the BMP.