It's true that glyph rendering won't matter to a database store, and that binary collation is "good enough" for most people (but then again, "good enough" is how you get to a UTF-8 implementation that doesn't support 4-byte characters). That said, it's also true that Unicode characters outside of the BMP are still pretty exotic/specialized:
Is it "excusable" that MySQL's implementation of UTF-8 isn't standard? That's a judgment call (they are up-front about it in the docs). But given that most Unicode characters "in the wild" lie in the BMP, I can see how they'd make that trade-off. There might well be a technical limitation lurking somewhere in the database internals that made 4-byte characters a problem.
No, it's not excusable - the MySQL project had a trivial alternative: don't call it UTF8!
The options are simple: implement UTF-8 correctly and call your implementation UTF-8, or implement just the BMP and name accordingly. They did neither: effectively, they lied to end-users. That's deeply, deeply problematic.
Bingo. That's why I used the "effectively" qualifier - if my toolchain says "oh this is UTF-8," then I should be able to trust that it's for-real, honest-to-goodness, spec-compliant UTF-8. If it's "oh this is the part of UTF-8 that was easy to implement" instead, then that tool has lied to me. I shouldn't have to read the documentation to find out that something is not what it claims to be.
Bonus points for the documentation brazenly ignoring that they're implementing something that's not spec-compliant and naming it like it is.
Well, to be fair, MySQL has a storied history of implementing 95% of a feature, calling it good enough, and shipping it.
And while, as a Postgres user, my tone here may be a little snide, I also say this with grudging respect: I think there is a point at which implementing n% of a feature X and calling it X (rather than MaybeX or MostlyX) does give you some momentum and practical compatibility that you wouldn't have otherwise. Is it dishonest to hide the limitations regarding the edge cases in some documentation no one will read? Maybe. But will providing the feature solve more problems than it causes? Quite possibly.
I don't agree with MySQL's decision with respect to UTF-8, but I do understand it.
That's an important piece of context, thank you for pointing it out. Engineering decisions occur in a cultural context of mere humans making decisions, and we do well to remember that.
While I don’t know the history of MySQL, it seems to me that when they implemented it, their implementation was indeed in compliance with the standard (Unicode 3).
The standard has since grown from 16-bit to 32-bit code points.
It is strange that MySQL had to introduce a new name for UTF-8 encoded tables that can contain 32-bit code points, but I assume there is a technical explanation (probably having to do with binary compatibility with existing tables, MySQL drivers, or similar).
A data-loss bug is far more serious than anything about sorting. MySQL has an encoding called "utf8" -- if you can't count on it to round-trip UTF-8 data without data loss, that is a serious problem IMO, documented or not.
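To make the round-trip concern concrete, here is a minimal Python sketch of a pre-insert check that flags text containing code points above U+FFFF, i.e. the ones that need 4-byte UTF-8 sequences. The function names are hypothetical, and nothing here talks to MySQL; it only shows how an application might guard itself before writing to a 3-byte-limited column.

```python
def needs_four_byte_utf8(text: str) -> bool:
    """Return True if any code point lies outside the BMP (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def first_non_bmp_index(text: str) -> int:
    """Return the index of the first code point above U+FFFF, or -1 if none."""
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return i
    return -1


if __name__ == "__main__":
    # Hypothetical sample values, chosen only for illustration.
    for value in ["plain ascii", "caf\u00e9 \u20ac", "smile \U0001F60A here"]:
        print(repr(value), needs_four_byte_utf8(value), first_non_bmp_index(value))
```

A check like this lets the application raise its own error up front instead of relying on the database or driver to report the problem.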
The real problem is that MySQL accepts and silently drops this data that it cannot handle. It has a nasty tendency to silently drop data rather than error out.
Really? I've done a lot of data importing into MySQL, and it always gives me a warning that tells me the invalid Unicode code point. It doesn't generate an error, but it definitely generates a warning. You need to make sure whatever client/library you're using is passing that on to you, though.
It doesn't generate an error, but it definitely generates a warning. You need to make sure whatever client/library you're using is passing that on to you, though.
That right there is the problem.
Other DBs, when you do something they can't handle, bail with a full-fledged error that stops what you're doing. MySQL doesn't do that for quite a few data-losing cases, and that's incredibly dangerous.
Incidentally, that page doesn't mention Unicode emoji (😊), which are probably more likely to get into your average database now that OS X supports them and such.
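For anyone who wants to see why that emoji is a problem for a 3-byte-limited encoding, a quick Python sketch (the helper name is made up for illustration) prints the code point and UTF-8 byte length of a few characters, including the one above:

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the single character `ch` occupies in UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    # A (ASCII), e-acute (Latin-1 range), euro sign (BMP), smiling emoji (outside the BMP).
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F60A"]:
        in_bmp = ord(ch) <= 0xFFFF
        print(f"U+{ord(ch):04X} bytes={utf8_length(ch)} in_bmp={in_bmp}")
```

The first three characters sit in the BMP and take 1, 2, and 3 bytes respectively; the emoji is U+1F60A, which lies outside the BMP and takes 4 bytes.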
http://stackoverflow.com/questions/5567249/what-are-the-most...