> People think that Unicode support is just a matter of implementing multi-byte characters, but it's so much more: you've got collation rules, ligatures, rendering, line-breaking, punctuation, reading direction, and so on. Any technical standard that aims to cover all known human languages is going to be a little bit complex.
Half of those are irrelevant to MySQL or any other database. Those are front end problems. Even reading punctuation, text direction, and the like will only be important in more advanced collation orders (as opposed to just binary ordering, which he was using).
It's true that glyph rendering won't matter to a database store, and that binary collation is "good enough" for most people (but then again, "good enough" is how you get to a UTF-8 implementation that doesn't support 4-byte characters). That said, it's also true that Unicode characters outside of the BMP are still pretty exotic/specialized:
Is it "excusable" that MySQL's implementation of UTF-8 isn't standard? That's a judgment call (they are up-front about it in the docs). But given that most Unicode characters "in the wild" lie in the BMP, I can see how they'd make that trade-off. There might well be a technical limitation lurking somewhere in the database internals that made 4-byte characters a problem.
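To make the BMP/supplementary-plane distinction concrete, here's a quick Python sketch (the sample characters are arbitrary choices, not from the thread) of why a 3-byte-per-character "utf8" can't hold astral-plane text:

```python
# UTF-8 byte lengths: BMP characters need at most 3 bytes, while
# supplementary-plane ("astral") characters always need 4.
# MySQL's legacy "utf8" (utf8mb3) caps storage at 3 bytes per
# character, so anything beyond U+FFFF simply cannot be stored in it.
bmp_char = "é"      # U+00E9, inside the Basic Multilingual Plane
astral_char = "𝄞"   # U+1D11E (musical G clef), outside the BMP

print(len(bmp_char.encode("utf-8")))     # 2 bytes: fits in utf8mb3
print(len(astral_char.encode("utf-8")))  # 4 bytes: needs utf8mb4
```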
No, it's not excusable - the MySQL project had a trivial alternative: don't call it UTF8!
The options are simple: implement UTF-8 correctly and call your implementation UTF-8, or implement just the BMP and name accordingly. They did neither: effectively, they lied to end-users. That's deeply, deeply problematic.
Bingo. That's why I used the "effectively" qualifier - if my toolchain says "oh this is UTF-8," then I should be able to trust that it's for-real, honest-to-goodness, spec-compliant UTF-8. If it's "oh this is the part of UTF-8 that was easy to implement" instead, then that tool has lied to me. I shouldn't have to read the documentation to find out that something is not what it claims to be.
Bonus points for the documentation brazenly glossing over the fact that they're implementing something that's not spec-compliant while naming it as if it were.
Well, to be fair, MySQL has a storied history of implementing 95% of a feature, calling it good enough, and shipping it.
And while, as a Postgres user, my tone here may be a little snide, I also say this with grudging respect: I think there is a point at which implementing n% of a feature X and calling it X (rather than MaybeX or MostlyX) does give you some momentum and practical compatibility that you wouldn't have otherwise. Is it dishonest to hide the limitations regarding the edge cases in some documentation no one will read? Maybe. But will providing the feature solve more problems than it causes? Quite possibly.
I don't agree with MySQL's decision with respect to UTF-8, but I do understand it.
That's an important piece of context, thank you for pointing it out. Engineering decisions occur in a cultural context of mere humans making decisions, and we do well to remember that.
While I don’t know the history of MySQL, it seems to me that when they implemented it, their implementation was indeed in compliance with the standard (Unicode 3).
The standard has since grown from 16-bit to 32-bit code points.
Why MySQL had to introduce a new name for UTF-8-encoded tables that can contain 32-bit code points is strange, but I assume there is a technical explanation (probably having to do with binary compatibility with existing tables, MySQL drivers, or similar).
A data-loss bug is far more serious than anything about sorting. MySQL has an encoding called "utf8" -- if you can't count on it to round-trip UTF-8 data without data loss, that is a serious problem IMO, documented or not.
The real problem is that MySQL accepts and silently drops this data that it cannot handle. It has a nasty tendency to silently drop data rather than error out.
Really? I've done a lot of data importing into MySQL, and it always gives me a warning that tells me the invalid Unicode code point. It doesn't generate an error, but it definitely generates a warning. You need to make sure whatever client/library you're using is passing that on to you, though.
It doesn't generate an error, but it definitely generates a warning. You need to make sure whatever client/library you're using is passing that on to you, though.
That right there is the problem.
Other DBs, when you do something they can't handle, bail with a full-fledged error that stops what you're doing. MySQL doesn't do that for quite a few data-losing cases, and that's incredibly dangerous.
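One defensive option is to refuse such strings client-side rather than trust server warnings. A minimal sketch (the function name is mine, not from any driver):

```python
def fits_utf8mb3(s: str) -> bool:
    """True if every character is in the BMP, i.e. storable in
    MySQL's legacy 'utf8' (utf8mb3) character set."""
    return all(ord(c) <= 0xFFFF for c in s)

# Reject (or transcode) before INSERT rather than discovering
# silently truncated rows later:
assert fits_utf8mb3("plain text")
assert not fits_utf8mb3("smile 😊")  # U+1F60A is outside the BMP
```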
Incidentally, that page doesn't mention Unicode emoji (😊), which are probably more likely to get into your average database now that OS X supports them and such.
The one thing that boggles my mind is that Unicode is, as a matter of fact, not a universal encoding. It's a cluster fuck of multiple encodings (of the same code points). We really should call it Multicode; it only solves problems by introducing more. You got problems with Unicode? You are not using the right Unicode. Try another Unicode.
> The latest published version of the standard (5.0) runs to almost 1500 pages:
Most of those pages are just the repertoire: a printout of a long, uneventful table that lists all the available code points. For each code point it gives you a reference pre-rendered glyph, the languages where you can find it, the kind of character it is (numeric, alphabetic, symbol), and so on.
From an implementer point of view there are about 200 pages of interesting and extremely detailed stuff. The rest of the pages can be downloaded as text tables from the Unicode site and its companion sites.
The thing is, the vast majority of this isn't needed by most applications. For instance, in a database, all they need is to encode and decode strings properly when going to or coming from external sources. That's all that is being asked for here; that when MySQL says "UTF-8", they really mean "UTF-8", not some broken subset of it.
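The property being asked for is just a lossless round-trip. In Python terms (the string is illustrative):

```python
# Bytes in, same bytes out: the minimal contract a store claiming
# "UTF-8" should honor for any Unicode string, astral planes included.
original = "naïve café 😊 𝄞"
stored = original.encode("utf-8")   # what crosses the wire to the DB
retrieved = stored.decode("utf-8")  # what comes back out
assert retrieved == original        # lossless round-trip
```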
No, it's engineering.
Here are two versions of the software:
1. This one is faster and better tested, and string handling (it's a database!) is much faster, but it only handles the 65,000 most common characters.
2. This one can handle upside-down characters from a 1930s paper on formal logic in Turkish, but it's slower in all other cases and we haven't really tested it as much.
Do you have a redundant, self-powered, asteroid-impact-proof internet connection? No? Pathetic!
Sure, in some cases it makes sense to make the tradeoff of not handling more obscure characters. But if the tradeoff is made, the encoding should not be called UTF-8.
"UTF-8 (UCS Transformation Format—8-bit[1]) is a variable-width encoding that can represent every character in the Unicode character set," says Wikipedia. The UTF-8 implementation in MySQL does not meet this definition because it cannot represent every character in the Unicode character set.
When MySQL first implemented UTF-8 they probably did support every Unicode character... because there were less than 64K Unicode characters. Then Unicode/UTF-8 was redefined out from under them.
> there were less than 64K Unicode characters. Then Unicode/UTF-8 was redefined out from under them.
Unicode 2.0 introduced multiple planes, i.e. more than 65,536 characters. That was in 1996, which means MySQL has had more than a decade and a half to support the additional planes, and it seems to have done so less than a year ago. I disagree with 'redefined out from under them': the change came a year after MySQL started, at a time when it probably didn't even have Unicode support yet anyway.
Interesting to note is that MySQL was first released in 1995. Which means that only for one year of its existence were there less than 65536 characters.
Yes, but at the time, UTF-8 could encode up to 31 bits per character using six-byte sequences. It has since been restricted to four-byte sequences at the longest.
> But when did people really start using more than the 16bit unicode chars?
1996.
China even made it a legal requirement for computer systems in 2000, through mandating GB 18030.
There's the Private Use Area if nothing else. There is NO excuse for not supporting characters outside the BMP. Adding support is trivial unless you have been using UTF-16 in the erroneous belief that it's always two bytes per character (in which case you've really been using UCS-2).
According to Wikipedia, the original version of UTF-8 supported >4 byte characters, and was later restricted to 4 bytes by RFC 3629 in November 2003, seven months before MySQL 4.1 was released with Unicode support. (There were 96,447 Unicode characters at that time.)
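The two length regimes can be sketched as a small table of thresholds (a simplification; boundary values are from RFC 2279 and RFC 3629 respectively):

```python
def utf8_len(cp: int, rfc3629: bool = True) -> int:
    """Bytes needed to UTF-8-encode code point cp.
    With rfc3629=False, use the original scheme (up to 6-byte
    sequences covering 31-bit values); otherwise cap at 4 bytes
    and U+10FFFF, per RFC 3629."""
    if rfc3629 and cp > 0x10FFFF:
        raise ValueError("beyond U+10FFFF (RFC 3629 limit)")
    for limit, n in [(0x80, 1), (0x800, 2), (0x10000, 3),
                     (0x200000, 4), (0x4000000, 5), (0x80000000, 6)]:
        if cp < limit:
            return n
    raise ValueError("beyond 31 bits")

print(utf8_len(0x1F60A))                    # 4: an emoji, either regime
print(utf8_len(0x7FFFFFFF, rfc3629=False))  # 6: the old-style maximum
```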
There's no good reason for one to be faster than the other, though! They're both UTF-8-encoded sequences of bytes, and there's no good reason not to stream through them as UTF-8.
Does anybody really think that UCS-2 is a good idea anymore? Or that random indexability by code point is all that valuable, in a world with combining glyphs and bidirectional characters and whatever other crazy stuff Unicode has? If you just want an upper bound on the number of bytes needed to store n code points, then (a) that's probably not a particularly useful question to ask, and (b) if you assume that 32 bits is enough for any code point, then the space taken by properly-formed UTF-8 is bounded.
So, why would they want to store things internally as UCS-2? Or rather, why should they?
Which is in many cases not faster than UTF-8 since UTF-16 is often more bytes than UTF-8. This matters especially since we are talking about a database which means IO and RAM usage probably are more important than the CPU.
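A quick comparison bears this out (sample strings are mine): UTF-16 doubles ASCII-heavy text, roughly ties on Cyrillic, and ties exactly on emoji:

```python
samples = {
    "ASCII-heavy": "SELECT * FROM users WHERE id = 42;",
    "Cyrillic":    "Привет, мир",
    "Emoji":       "😊😊😊",
}
for name, s in samples.items():
    u8, u16 = len(s.encode("utf-8")), len(s.encode("utf-16-le"))
    print(f"{name}: {u8} bytes as UTF-8, {u16} bytes as UTF-16")
```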
I think one of the main points that the article touches upon is: "It's 2012, Unicode was invented 20 years ago; why the hell has no one tested this yet, let alone gotten it to work?"
It looks like we're still seeing fallout from the "16 bits are all you need" thing. Maybe telling people that they'd never need to worry about this stuff after adopting (BMP subset of) UTF-8 wasn't a great idea.
The hell of it is, UTF-8 expands gracefully to the astral planes; it's UTF-16 that you need to worry about, either because the people designing the software never heard of surrogate pairs, in which case they didn't give you UTF-16 but UCS-2, or implemented surrogate pairs incorrectly.
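For the record, a surrogate pair is just simple arithmetic; here's a Python sketch that pulls apart the UTF-16 encoding of one astral character:

```python
import struct

ch = "𝄞"  # U+1D11E, outside the BMP
hi, lo = struct.unpack(">2H", ch.encode("utf-16-be"))

# A high surrogate (D800-DBFF) followed by a low one (DC00-DFFF):
assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF

# Recombining the two 10-bit halves recovers the code point:
cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
assert cp == ord(ch) == 0x1D11E
```

Code that forgets this step (or assumes one code unit per character) is the UCS-2-in-disguise problem described above.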
This particular limitation of MySQL hit me HARD HARD HARD circa 2008 when I tried to use some of that upside down text you find online as test data, and just couldn't work out why I was getting data corruption.
Luckily Perl's Unicode support is fantastic, and saved my ass
At my previous^2 company we worked around MySQL's limitations by just storing our data in VARBINARY columns, encoded as utf8 on the client-side. Worked like a charm. (I hate MySQL)
I worked on a product for a couple of months geared at minority languages in developing countries - doing linguistics work etc. It was a pain to support Unicode, because there are lots of code points and lots of weird cases (what's the capital form?); ICU is a good lib, but it's not up to date with the latest Unicode version and it's C/C++ (and thus a pain from C#). Oh, and there's the Private Use Area, where characters go while Unicode decides whether to include them or not...
It's almost scandalous how poorly Unicode has been implemented in the most popular development platforms.
Java, for instance, implemented 2-byte encoding and uses surrogates for the higher planes, which means you get the worst of both worlds... You double the size of ASCII text (that is, half the speed of bandwidth-limited operations on text) and you've still got a variable length encoding... but you've got lots of methods and user-written code that assume that the text is fixed length encoded. what a mess
That's because Unicode is STILL a pain to use. I read the articles that come along about Unicode and still don't understand why handling it is so hard. Until it's transparent for a programmer to use, it won't be as widely used as it should be. My apps (and I'm ashamed to admit it) aren't Unicode-friendly. But it's currently too much work for too little reward to go through and make all that code Unicode-friendly.
I've had very few issues making programs I've written unicode friendly, but then I gave up on MySQL about a decade ago.
The only real issue is handling bad input, since you never get a decoding error with e.g. ISO-8859-1. For a lot of applications you need to handle potentially malicious input anyway, so you can deal with it there; but even for trusted input, there are a lot of really broken external programs that output "UTF-8" or "UTF-16" (scare quotes intentional).
I really think a lot of the problems with Unicode come from languages/libraries trying to handle it transparently, and that just doesn't work; encoding/decoding is part of dealing with external formats, and trying to do it transparently means it will fail unexpectedly.
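That failure-mode difference is easy to demonstrate (the bytes are illustrative):

```python
bad = b"abc\xff\xfedef"  # not valid UTF-8 (0xFF can't start a sequence)

# ISO-8859-1 maps every byte to some character, so decoding always
# "succeeds", even when the data was never Latin-1 to begin with:
silently_wrong = bad.decode("iso-8859-1")

# UTF-8 surfaces the problem instead of hiding it:
try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)

# Or degrade explicitly, staying in control of the failure mode
# (the two bad bytes become U+FFFD replacement characters):
print(bad.decode("utf-8", errors="replace"))
```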
Implementing the display, collation or suchlike manipulation of Unicode text isn't feasible for the nonspecialist, to be sure (in practice that means you call an appropriate library/framework/API and hope they got it right).
But storing a stream of UTF-8 and retrieving it on command is not remotely difficult. You almost have to go out of your way to screw that up.
Specifically the talk titled "Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly"
It's a year old now, but it's still relevant. It gives a very detailed look at Unicode support across JavaScript, PHP, Go, Ruby, Python, Java, and Perl.
Tom Christiansen's talk Unicode: The good, the bad, & the (mostly) ugly covers a lot of Unicode wrinkles and how JavaScript, PHP, Go, Python, Ruby, Java, and Perl handle them.
The root of it is that most things still assume ASCII, etc by default. I griped about this a while ago, as a whole, we should be moving all our tools to being UTF-8 by default, and ASCII being the 'odd' case instead of the other way around!
Apple's "textedit" can't open files with these characters in them. It reports "The document “test.txt” could not be opened. Text encoding Unicode (UTF-8) isn’t applicable."
Yeah, looks like Textmate is simply running the UTF-8 algorithm over UTF-16 code units, so each surrogate is being turned into a single UTF-8 code unit (which decodes to an invalid character).
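That bug is reproducible in a few lines of Python (`surrogatepass` is used purely to simulate the broken encoder):

```python
ch = "😊"  # U+1F60A: one code point, but a surrogate pair in UTF-16
correct = ch.encode("utf-8")  # 4 bytes of valid UTF-8

# The bug described above: UTF-8-encode each 16-bit code unit as if
# it were a code point of its own (this yields CESU-8-style output).
units = ch.encode("utf-16-be")
broken = b"".join(
    chr(int.from_bytes(units[i:i + 2], "big")).encode("utf-8", "surrogatepass")
    for i in (0, 2)
)

print(len(correct), len(broken))  # 4 6: two 3-byte surrogate encodings

# The 6-byte form is not valid UTF-8, so strict decoders reject it:
try:
    broken.decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8")
```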
That's exactly why I use the default ISO-8859-1 encoding for all my MySQL tables, try to only stick ASCII into it, and store any Unicode text as binary UTF-8, encoded client-side. It's stupid and shouldn't be necessary, but at least MySQL can't screw it up.
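The workaround amounts to moving the encode/decode boundary into the client (a sketch; the helper names are mine):

```python
# Column type is VARBINARY, so MySQL treats the value as opaque bytes
# and can't mangle it; the client owns the text<->bytes conversion.
def to_db(text: str) -> bytes:
    return text.encode("utf-8")

def from_db(blob: bytes) -> str:
    return blob.decode("utf-8")

row = to_db("astral chars survive: 😊 𝄞")
assert from_db(row) == "astral chars survive: 😊 𝄞"
```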
These 16-bit, UCS, wide-char ideas are just plain wrong!
Just use UTF-8 for files and for communication, and 32-bit code points internally when needed.
Yeah, so what? UTF-8 is the Linux standard. I use it, and probably most of the web uses it (where needed). Why don't I see problems with it in the software I use and on the web? I can't remember the last time I had any problem with UTF-8.
This is a MySQL limitation that they have fixed in recent releases (as the OP notes). It's not fair to blame Unicode at large for MySQL problems.
However, it's true that Unicode is (relatively speaking) very new for such a fundamental technology. Support in applications still varies widely. I wouldn't characterize it as cutting edge though, since we have many mainstream programming languages built using Unicode internally.
> This is a MySQL limitation that they have fixed in recent releases (as the OP notes)
TFA notes that this is not fixed: the `utf8` MySQL encoding still isn't UTF-8. And as TFA also notes, related technologies (a.k.a. drivers) may not be compatible with the fix (the example he uses, mysql2 for Ruby, still hasn't had an official release supporting utf8mb4[0]).
> it's true that Unicode is (relatively speaking) very new for such a fundamental technology
That's a hard argument to swallow when you're encountering astral-plane issues in 2012, given that Unicode 2.0 introduced those planes in 1996.
> > it's true that Unicode is (relatively speaking) very new for such a fundamental technology
> That's a hard argument to swallow when you're encountering astral-plane issues in 2012, given that Unicode 2.0 introduced those planes in 1996.
I don't get your argument. MySQL was also released around that time and we don't call it "cutting edge" because we found a bug. There are bugs in old stuff all the time but (most) people don't throw a fit.
Unicode is being called 'cutting edge' because it's not yet 'old hat.' Lots of things claim support for Unicode, but few (or none) support it well. Unicode isn't a software project, it's a spec/idea. It's like calling a Star Trek tricorder "cutting edge" because no one has implemented a fully functional version. Sure, the idea has been around for a while, but at this point there are no acceptable manifestations of it.
Honestly it's probably better they don't change the behavior of an existing MySQL character set. Who knows what software out there depends on it breaking on 4-byte characters, or whatnot.
Creating a new character set `utf8mb4` was the right thing to do, as annoying as it is. Just clearly label the `utf8` character set as 'deprecated' in the docs or something.
http://www.unicode.org/versions/Unicode6.1.0/
The latest published version of the standard (5.0) runs to almost 1500 pages:
http://www.unicode.org/book/aboutbook.html
(Best quote from that page: “Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years.” -- Donald Knuth)
People think that Unicode support is just a matter of implementing multi-byte characters, but it's so much more: you've got collation rules, ligatures, rendering, line-breaking, punctuation, reading direction, and so on. Any technical standard that aims to cover all known human languages is going to be a little bit complex.