It's true that glyph rendering won't matter to a database store, and that binary...

sedev · on July 10, 2012

No, it's not excusable - the MySQL project had a trivial alternative: don't call it UTF8!

The options are simple: implement UTF-8 correctly and call your implementation UTF-8, or implement just the BMP and name accordingly. They did neither: effectively, they lied to end-users. That's deeply, deeply problematic.
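For a concrete sense of what "just the BMP" leaves out, here is a minimal Python sketch. The helper names `utf8_len` and `is_bmp_only` are made up for illustration and nothing here is MySQL code; it only shows that code points above U+FFFF need four bytes in UTF-8, while everything in the BMP fits in three.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character takes when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def is_bmp_only(text: str) -> bool:
    """True if every code point lies in the Basic Multilingual Plane (<= U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# 'A', e-acute, the euro sign, and one emoji from outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F4A9"]
for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s), BMP only: {is_bmp_only(ch)}")
```

An encoding capped at three bytes per character can hold the first three samples but not the fourth, which is the gap being argued about here.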

timr · on July 10, 2012

"they lied to end-users, effectively"

Yeah, that's not exaggeration at all. Because it's not as if they very clearly document exactly what they support:

http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html

pepve · on July 10, 2012

Sure, so let's call this button that erases your data "list files", and just very clearly document it. It's not lying if we change the definition.

sedev · on July 10, 2012

Bingo. That's why I used the "effectively" qualifier - if my toolchain says "oh this is UTF-8," then I should be able to trust that it's for-real, honest-to-goodness, spec-compliant UTF-8. If it's "oh this is the part of UTF-8 that was easy to implement" instead, then that tool has lied to me. I shouldn't have to read the documentation to find out that something is not what it claims to be.

Bonus points for the documentation brazenly ignoring that they're implementing something that's not spec-compliant and naming it like it is.

deafbybeheading · on July 11, 2012

Well, to be fair, MySQL has a storied history of implementing 95% of a feature, calling it good enough, and shipping it.

And while, as a Postgres user, my tone here may be a little snide, I also say this with grudging respect: I think there is a point at which implementing n% of a feature X and calling it X (rather than MaybeX or MostlyX) does give you some momentum and practical compatibility that you wouldn't have otherwise. Is it dishonest to hide the limitations regarding the edge cases in some documentation no one will read? Maybe. But will providing the feature solve more problems than it causes? Quite possibly.

I don't agree with MySQL's decision with respect to UTF-8, but I do understand it.

sedev · on July 11, 2012

That's an important piece of context, thank you for pointing it out. Engineering decisions occur in a cultural context of mere humans making decisions, and we do well to remember that.

sorbits · on July 11, 2012

"don't call it UTF8!"

While I don’t know the history of MySQL, it seems to me that when they implemented it, their implementation was indeed in compliance with the standard (Unicode 3).

The standard has since grown from 16-bit to 32-bit code points.

It seems strange that MySQL had to introduce a new name for the UTF-8 encoded tables that can contain 32-bit code points, but I assume there is a technical explanation (probably having to do with binary compatibility with existing tables, MySQL drivers, or similar).

haberman · on July 10, 2012

A data-loss bug is far more serious than anything about sorting. MySQL has an encoding called "utf8" -- if you can't count on it to round-trip UTF-8 data without data loss, that is a serious problem IMO, documented or not.
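A round-trip check makes the point concrete. The sketch below is a toy model in Python, not MySQL code: `toy_bmp_store` is a hypothetical stand-in for a store that only handles BMP characters, and the choice to cut the string at the first unsupported character is an assumption made for illustration.

```python
def toy_bmp_store(text: str) -> str:
    """Toy model (an assumption, not MySQL code): keep text up to the first non-BMP code point."""
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return text[:i]
    return text


def round_trips(text: str) -> bool:
    """True if the text comes back from the toy store unchanged."""
    return toy_bmp_store(text) == text


original = "status: \U0001F60A all good"
stored = toy_bmp_store(original)
print(repr(stored))           # 'status: '
print(round_trips(original))  # False
```

Whatever the exact failure mode of a real store is, a check like `round_trips` is a cheap way to find out before the data is gone.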

koenigdavidmj · on July 10, 2012

The real problem is that MySQL accepts and silently drops this data that it cannot handle. It has a nasty tendency to silently drop data rather than error out.

crazygringo · on July 11, 2012

Really? I've done a lot of data importing into MySQL, and it always gives me a warning that tells me the invalid Unicode code point. It doesn't generate an error, but it definitely generates a warning. You need to make sure whatever client/library you're using is passing that on to you, though.

ubernostrum · on July 11, 2012

"It doesn't generate an error, but it definitely generates a warning. You need to make sure whatever client/library you're using is passing that on to you, though."

That right there is the problem.

Other DBs, when you do something they can't handle, bail with a full-fledged error that stops what you're doing. MySQL doesn't do that for quite a few data-losing cases, and that's incredibly dangerous.
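The difference between "warns" and "bails" can be sketched with Python's standard `warnings` module. Both `lenient_store` and `strict_store` are hypothetical names for a toy store, not any real database API. The lenient version warns and drops data; the strict wrapper escalates that same warning into an exception, so the caller cannot miss it.

```python
import warnings


def lenient_store(text: str) -> str:
    """Hypothetical lenient store: warn, then truncate at the first non-BMP code point."""
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            warnings.warn(f"dropping data from U+{ord(ch):04X} onward", UserWarning)
            return text[:i]
    return text


def strict_store(text: str) -> str:
    """Same store, but any warning raised inside it becomes an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        return lenient_store(text)


try:
    strict_store("hello \U0001F4A9")
except UserWarning as exc:
    print(f"rejected: {exc}")
```

With the lenient version, whether anyone ever sees the warning depends entirely on the calling code, which is the complaint being made above.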

comex · on July 10, 2012

Incidentally, that page doesn't mention Unicode emoji (😊), which are probably more likely to get into your average database now that OS X supports them and such.

X-Istence · on July 10, 2012

Speaking of Unicode emoji ... Openfire, the open source Jabber server, will drop your connection if you send it a 💩 ...

ars · on July 10, 2012

Seems like either my browser or HN doesn't support such characters since I'm seeing a box with "01F 489" in it.

comex · on July 10, 2012

It's your browser; the characters are standard but not widely supported outside of Apple platforms yet.

Edit: you can try Symbola at http://users.teilar.gr/~g1951d/ (and yes, I keep editing this post :p)

voltagex_ · on July 10, 2012

Any idea if these are the characters GoSMS uses as "emoji"? i.e. is it a de facto standard?

unconed · on July 11, 2012

The emoji unicode range was created to standardize the various characters used on Japanese mobiles. Hence why it contains e.g. the Love Hotel icon.

comex · on July 10, 2012

Dunno; I only know emoji encoding has historically been a crapfest.

darkstalker · on July 11, 2012

after installing the "symbola" font i can see the 💩, thanks

ars · on July 11, 2012

On Debian, install ttf-ancient-fonts version 2.56-1 or later (currently in testing - the one in stable does not include the symbol).

I installed it and now I see it as well.

vacri · on July 11, 2012

I'm not quite sure why that symbol would be in an ancient font.

ars · on July 12, 2012

It's a repackaging of http://users.teilar.gr/~g1951d/ which is titled "Unicode Fonts for Ancient Scripts" so I guess they used that name.

aidenn0 · on July 10, 2012

Then call it utf8-bmp, not utf8!