
It's true that glyph rendering won't matter to a database store, and that binary collation is "good enough" for most people (but then again, "good enough" is how you get to a UTF-8 implementation that doesn't support 4-byte characters). That said, it's also true that Unicode characters outside of the BMP are still pretty exotic/specialized:

http://stackoverflow.com/questions/5567249/what-are-the-most...

Is it "excusable" that MySQL's implementation of UTF-8 isn't standard? That's a judgment call (they are up-front about it in the docs). But given that most Unicode characters "in the wild" lie in the BMP, I can see how they'd make that trade-off. There might well be a technical limitation lurking somewhere in the database internals that made 4-byte characters a problem.
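To make "outside the BMP" concrete, here is a minimal Python sketch (the helper names `utf8_len` and `outside_bmp` are made up for illustration, not from any library). It shows that code points above U+FFFF are exactly the ones that need 4 bytes in UTF-8, which is the case a 3-byte-only implementation cannot store:

```python
from typing import List


def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def outside_bmp(text: str) -> List[str]:
    """Characters in text whose code point lies above U+FFFF (outside the BMP)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


if __name__ == "__main__":
    # ASCII, Latin-1 supplement, a BMP symbol, and an emoji outside the BMP.
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F4A9"]:
        print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
```

Anything `outside_bmp` returns is a character that would not fit in a column limited to 3 bytes per character.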




No, it's not excusable - the MySQL project had a trivial alternative: don't call it UTF8!

The options are simple: implement UTF-8 correctly and call your implementation UTF-8, or implement just the BMP and name accordingly. They did neither: effectively, they lied to end-users. That's deeply, deeply problematic.


"they lied to end-users, effectively"

Yeah, that's not exaggeration at all. Because it's not as if they very clearly document exactly what they support:

http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html


Sure, so let's call this button that erases your data "list files", and just very clearly document it. It's not lying if we change the definition.


Bingo. That's why I used the "effectively" qualifier - if my toolchain says "oh this is UTF-8," then I should be able to trust that it's for-real, honest-to-goodness, spec-compliant UTF-8. If it's "oh this is the part of UTF-8 that was easy to implement" instead, then that tool has lied to me. I shouldn't have to read the documentation to find out that something is not what it claims to be.

Bonus points for the documentation brazenly ignoring that they're implementing something that's not spec-compliant and naming it like it is.


Well, to be fair, MySQL has a storied history of implementing 95% of a feature, calling it good enough, and shipping it.

And while, as a Postgres user, my tone here may be a little snide, I also say this with grudging respect: I think there is a point at which implementing n% of a feature X and calling it X (rather than MaybeX or MostlyX) does give you some momentum and practical compatibility that you wouldn't have otherwise. Is it dishonest to hide the limitations regarding the edge cases in some documentation no one will read? Maybe. But will providing the feature solve more problems than it causes? Quite possibly.

I don't agree with MySQL's decision with respect to UTF-8, but I do understand it.


That's an important piece of context, thank you for pointing it out. Engineering decisions occur in a cultural context of mere humans making decisions, and we do well to remember that.


don't call it UTF8!

While I don’t know the history of MySQL, it seems to me that when they implemented it, their implementation was indeed in compliance with the standard (Unicode 3).

The standard has since grown beyond 16-bit code points (the code space now runs up to U+10FFFF).

It is strange that MySQL had to introduce a new name for the UTF-8 encoded tables that can contain code points beyond 16 bits, but I assume there is a technical explanation (probably having to do with binary compatibility with existing tables / MySQL drivers or similar).


A data-loss bug is far more serious than anything about sorting. MySQL has an encoding called "utf8" -- if you can't count on it to round-trip UTF-8 data without data loss, that is a serious problem IMO, documented or not.


The real problem is that MySQL accepts and silently drops this data that it cannot handle. It has a nasty tendency to silently drop data rather than error out.


Really? I've done a lot of data importing into MySQL, and it always gives me a warning that tells me the invalid Unicode code point. It doesn't generate an error, but it definitely generates a warning. You need to make sure whatever client/library you're using is passing that on to you, though.


It doesn't generate an error, but it definitely generates a warning. You need to make sure whatever client/library you're using is passing that on to you, though.

That right there is the problem.

Other DBs, when you do something they can't handle, bail with a full-fledged error that stops what you're doing. MySQL doesn't do that for quite a few data-losing cases, and that's incredibly dangerous.
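If you can't rely on the server or the client library to surface the warning, one option is to fail loudly on the application side before the data ever reaches the database. A rough Python sketch, assuming the column can only hold BMP characters (the function name `reject_non_bmp` is hypothetical, not part of any MySQL driver):

```python
def reject_non_bmp(text: str) -> str:
    """Return text unchanged, or raise if it contains a code point above U+FFFF.

    Assumption: the target column can only store characters that need at most
    3 bytes in UTF-8, i.e. code points inside the BMP.
    """
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise ValueError(
                f"U+{ord(ch):04X} at index {index} is outside the BMP"
            )
    return text


if __name__ == "__main__":
    print(reject_non_bmp("caf\u00e9 \u20ac5"))  # passes through unchanged
    try:
        reject_non_bmp("hi \U0001F4A9")
    except ValueError as exc:
        print("refused:", exc)
```

This turns a potential silent data loss into a hard error you control, at the cost of rejecting input the database might otherwise have partially stored.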


Incidentally, that page doesn't mention Unicode emoji (😊), which are probably more likely to get into your average database now that OS X supports them and such.


Speaking of Unicode emoji ... OpenFire, the open source Jabber server, will drop your connection if you send it a 💩 ...


Seems like either my browser or HN doesn't support such characters since I'm seeing a box with "01F 489" in it.


It's your browser; the characters are standard but not widely supported outside of Apple platforms yet.

Edit: you can try Symbola at http://users.teilar.gr/~g1951d/ (and yes, I keep editing this post :p)


Any idea if these are the characters GoSMS uses as "emoji"? i.e. is it a de facto standard?


The emoji Unicode range was created to standardize the various characters used on Japanese mobiles. Hence why it contains e.g. the Love Hotel icon.


Dunno; I only know emoji encoding has historically been a crapfest.


After installing the "Symbola" font I can see the 💩, thanks.


On Debian, install ttf-ancient-fonts version 2.56-1 or later (currently in testing - the one in stable does not include the symbol).

I installed it and now I see it as well.


I'm not quite sure why that symbol would be in an ancient font.


It's a repackaging of http://users.teilar.gr/~g1951d/ which is titled "Unicode Fonts for Ancient Scripts" so I guess they used that name.


Then call it utf8-bmp, not utf8!



