Unicode is, in 2012, still “cutting edge.” (utexas.edu)
115 points by masklinn on July 10, 2012 | 87 comments



Yeah, well, Unicode is hard. If it looks simple, it's only because you haven't looked carefully enough:

http://www.unicode.org/versions/Unicode6.1.0/

The latest version published in book form (5.0) runs to almost 1500 pages:

http://www.unicode.org/book/aboutbook.html

(Best quote from that page: “Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years.” -- Donald Knuth)

People think that Unicode support is just a matter of implementing multi-byte characters, but it's so much more: you've got collation rules, ligatures, rendering, line-breaking, punctuation, reading direction, and so on. Any technical standard that aims to cover all known human languages is going to be a little bit complex.


> People think that Unicode support is just a matter of implementing multi-byte characters, but it's so much more: you've got collation rules, ligatures, rendering, line-breaking, punctuation, reading direction, and so on. Any technical standard that aims to cover all known human languages is going to be a little bit complex.

Half of those are irrelevant to MySQL or any other database; they're front-end problems. Even punctuation, reading direction, and the like will only matter for more advanced collation orders (as opposed to just binary ordering, which he was using).


It's true that glyph rendering won't matter to a database store, and that binary collation is "good enough" for most people (but then again, "good enough" is how you get to a UTF-8 implementation that doesn't support 4-byte characters). That said, it's also true that Unicode characters outside of the BMP are still pretty exotic/specialized:

http://stackoverflow.com/questions/5567249/what-are-the-most...

Is it "excusable" that MySQLs implementation of UTF-8 isn't standard? That's a judgment call (they are up-front about it in the docs). But given that most unicode characters "in the wild" lie in the BMP, I can see how they'd make that trade-off. There might well be a technical limitation lurking somewhere in the database internals that made 4-byte characters a problem.


No, it's not excusable - the MySQL project had a trivial alternative: don't call it UTF8!

The options are simple: implement UTF-8 correctly and call your implementation UTF-8, or implement just the BMP and name accordingly. They did neither: effectively, they lied to end-users. That's deeply, deeply problematic.


"they lied to end-users, effectively"

Yeah, that's not exaggeration at all. Because it's not as if they very clearly document exactly what they support:

http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html


Sure, so let's call this button that erases your data "list files", and just very clearly document it. It's not lying if we change the definition.


Bingo. That's why I used the "effectively" qualifier - if my toolchain says "oh this is UTF-8," then I should be able to trust that it's for-real, honest-to-goodness, spec-compliant UTF-8. If it's "oh this is the part of UTF-8 that was easy to implement " instead, then that tool has lied to me. I shouldn't have to read the documentation to find out that something is not what it claims to be.

Bonus points for the documentation brazenly ignoring that they're implementing something that's not spec-compliant and naming it like it is.


Well, to be fair, MySQL has a storied history of implementing 95% of a feature, calling it good enough, and shipping it.

And while, as a Postgres user, my tone here may be a little snide, I also say this with grudging respect: I think there is a point at which implementing n% of a feature X and calling it X (rather than MaybeX or MostlyX) does give you some momentum and practical compatibility that you wouldn't have otherwise. Is it dishonest to hide the limitations regarding the edge cases in some documentation no one will read? Maybe. But will providing the feature solve more problems than it causes? Quite possibly.

I don't agree with MySQL's decision with respect to UTF-8, but I do understand it.


That's an important piece of context, thank you for pointing it out. Engineering decisions occur in a cultural context of mere humans making decisions, and we do well to remember that.


don't call it UTF8!

While I don’t know the history of MySQL, it seems to me that when they implemented it, their implementation was indeed in compliance with the standard (Unicode 3).

The standard has since grown from 16-bit to 21-bit code points.

Why MySQL had to introduce a new name for UTF-8 encoded tables that can contain the larger code points is strange, but I assume there is a technical explanation (probably having to do with binary compatibility with existing tables / MySQL drivers or similar).


A data-loss bug is far more serious than anything about sorting. MySQL has an encoding called "utf8" -- if you can't count on it to round-trip UTF-8 data without data loss, that is a serious problem IMO, documented or not.


The real problem is that MySQL accepts and silently drops this data that it cannot handle. It has a nasty tendency to silently drop data rather than error out.


Really? I've done a lot of data importing into MySQL, and it always gives me a warning that tells me the invalid Unicode code point. It doesn't generate an error, but it definitely generates a warning. You need to make sure whatever client/library you're using is passing that on to you, though.


It doesn't generate an error, but it definitely generates a warning. You need to make sure whatever client/library you're using is passing that on to you, though.

That right there is the problem.

Other DBs, when you do something they can't handle, bail with a full-fledged error that stops what you're doing. MySQL doesn't do that for quite a few data-losing cases, and that's incredibly dangerous.


Incidentally, that page doesn't mention Unicode emoji (😊), which are probably more likely to get into your average database now that OS X supports them and such.


Speaking of Unicode emoji... OpenFire, the open-source Jabber server, will drop your connection if you send it a 💩 ...


Seems like either my browser or HN doesn't support such characters since I'm seeing a box with "01F 489" in it.


It's your browser; the characters are standard but not widely supported outside of Apple platforms yet.

Edit: you can try Symbola at http://users.teilar.gr/~g1951d/ (and yes, I keep editing this post :p)


Any idea if these are the characters GoSMS uses as "emoji"? I.e., is it a de facto standard?


The emoji Unicode range was created to standardize the various characters used on Japanese mobiles, which is why it contains e.g. the Love Hotel icon.


Dunno; I only know emoji encoding has historically been a crapfest.


After installing the "Symbola" font I can see the 💩, thanks.


On debian install ttf-ancient-fonts version 2.56-1 or later (currently in testing - the one in stable does not include the symbol).

I installed it and now I see it as well.


I'm not quite sure why that symbol would be in an ancient font.


It's a repackaging of http://users.teilar.gr/~g1951d/ which is titled "Unicode Fonts for Ancient Scripts" so I guess they used that name.


Then call it utf8-bmp, not utf8!


The one thing that boggles my mind is that Unicode is, as a matter of fact, not a universal encoding. It's a cluster fuck of multiple encodings (of the same code points). We really should call it multicode; it only solves problems by introducing more. You got problems with Unicode? You're not using the right Unicode. Try another Unicode.


> The latest version published in book form (5.0) runs to almost 1500 pages:

Most of those pages are just the repertoire: a printout of a long, uneventful table that lists all the available code points. For each code point it gives you a reference pre-rendered glyph, the languages where you can find it, the kind of character it is (numeric, alphabetic, symbol) and so on.

From an implementer's point of view there are about 200 pages of interesting and extremely detailed stuff. The rest of the pages can be downloaded as text tables from the Unicode site and its companion sites.


The thing is, the vast majority of this isn't needed by most applications. For instance, in a database, all they need is to encode and decode strings properly when going to or coming from external sources. That's all that is being asked for here; that when MySQL says "UTF-8", they really mean "UTF-8", not some broken subset of it.


> MySQL’s utf8 encoding only covers the BMP. It can’t handle 4-byte characters at all.

Wow. That is pathetic.


No, it's engineering. Here are two versions of the software:

1. This one is faster and better tested, and string handling (it's a database!) is much faster, but it only handles the ~65,000 most common characters.

2. This one can handle upside-down characters from a 1930s paper on formal logic in Turkish, but it's slower for all other cases and we haven't really tested it as much.

Do you have a redundant, self-powered, asteroid-impact-proof internet connection? No? Pathetic!


Sure, in some cases it makes sense to make the tradeoff of not handling more obscure characters. But if the tradeoff is made, the encoding should not be called UTF-8.

"UTF-8 (UCS Transformation Format—8-bit[1]) is a variable-width encoding that can represent every character in the Unicode character set," says Wikipedia. The UTF-8 implementation in MySQL does not meet this definition because it cannot represent every character in the Unicode character set.


When MySQL first implemented UTF-8 they probably did support every Unicode character... because there were less than 64K Unicode characters. Then Unicode/UTF-8 was redefined out from under them.


> there were less than 64K Unicode characters. Then Unicode/UTF-8 was redefined out from under them.

Unicode 2.0 introduced multiple planes, i.e. more than 65,536 characters. That was in 1996. That being the case, MySQL has had more than a decade and a half to support multiple planes, and seems to have done so less than a year ago. I disagree that it was 'redefined out from under them' when the change came only a year after MySQL started, at a time when it probably didn't even have Unicode support yet anyway.


Interesting to note: MySQL was first released in 1995, which means that for only one year of its existence were there fewer than 65,536 Unicode characters.


Yes, but at the time, UTF-8 could encode up to 31 bits per character using six-byte sequences. It has since been restricted to four-byte sequences at the longest.


> seems to have done so less than a year ago

More than 2 years ago. March 2010.


But when did people really start using more than the 16bit unicode chars?


> But when did people really start using more than the 16bit unicode chars?

1996.

China even made it a legal requirement for computer systems in 2000, through mandating GB 18030.

There's the Private Use Area if nothing else. There is NO excuse for supporting nothing beyond the BMP. Adding support is trivial unless you have been using UTF-16 in the erroneous belief that every character is always two bytes long (in which case you've really been using UCS-2).
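
To make the UCS-2 trap concrete, here's a small Python sketch (illustrative only): a non-BMP character takes two 16-bit code units in UTF-16, so anything that assumes one unit per character is really doing UCS-2.

  # A non-BMP character becomes a surrogate pair in UTF-16.
  poo = "\U0001F4A9"                 # U+1F4A9, outside the BMP
  utf16 = poo.encode("utf-16-be")
  print(len(utf16))                  # 4 bytes = two 16-bit code units
  print(utf16.hex(" "))              # d8 3d dc a9 -> surrogates D83D, DCA9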


This is just the original argument restated. It's not a rebuttal.


According to Wikipedia, the original version of UTF-8 supported >4 byte characters, and was later restricted to 4 bytes by RFC 3629 in November 2003, seven months before MySQL 4.1 was released with Unicode support. (There were 96,447 Unicode characters at that time.)
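
For reference, here's a rough sketch of the lead-byte ranges as originally defined versus what RFC 3629 kept; the 5- and 6-byte forms are what got removed.

  # Original UTF-8 (RFC 2279) lead bytes and sequence lengths; RFC 3629
  # later forbade the 5- and 6-byte forms, capping UTF-8 at U+10FFFF.
  original_lengths = {
      (0x00, 0x7F): 1,   # 7 bits
      (0xC0, 0xDF): 2,   # 11 bits
      (0xE0, 0xEF): 3,   # 16 bits (the BMP)
      (0xF0, 0xF7): 4,   # 21 bits (all of Unicode today)
      (0xF8, 0xFB): 5,   # 26 bits (no longer legal)
      (0xFC, 0xFD): 6,   # 31 bits (no longer legal)
  }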


There's no good reason for one to be faster than the other, though! They're both utf8 encoded sequences of bytes, and there's no good reason to not stream through them as utf8.


> There's no good reason for one to be faster than the other, though!

Not exactly. A varchar will store it as is, but a char column will allocate a fixed 3 (or 4) bytes for each character.

All data held in memory (for sorting and such) is always stored as char, even if it started as varchar.

So by allowing 4 bytes per character they use more memory.
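
Rough arithmetic for that point (illustrative numbers only): a CHAR(255) value, or any value padded to CHAR for an in-memory sort, reserves the worst case per character.

  # Worst-case bytes reserved per CHAR(255) value under each encoding.
  n = 255
  print(n * 3)   # 765 bytes with 3-byte "utf8" (BMP only)
  print(n * 4)   # 1020 bytes with "utf8mb4"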


Internally it can store them as 16-bit code units so strings have a fixed length.


Does anybody really think that UCS-2 is a good idea anymore? Or that random indexability by code point is all that valuable, in a world with combining glyphs and bidirectional characters and whatever other crazy stuff Unicode has? If you just want an upper bound on the number of bytes needed to store n code points, then (a) that's probably not a particularly useful question to ask, and (b) if you assume that 32 bits is enough for any code point, then the space taken by properly-formed UTF-8 is bounded.

So, why would they want to store things internally as UCS-2? Or rather, why should they?
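
A quick Python illustration (example only) of why code-point indexing is less useful than it sounds: "é" written as a base letter plus a combining accent is two code points but one user-perceived character.

  import unicodedata

  s = "e\u0301"                                # 'e' + U+0301 COMBINING ACUTE ACCENT
  print(len(s))                                # 2 code points
  print(s[:1])                                 # slicing by code point splits the grapheme
  print(len(unicodedata.normalize("NFC", s)))  # 1 after composing to U+00E9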


Which is in many cases not faster than UTF-8 since UTF-16 is often more bytes than UTF-8. This matters especially since we are talking about a database which means IO and RAM usage probably are more important than the CPU.


Except it didn't. It used up to three bytes per character. It had the drawbacks of variable width but not the easy-to-add benefits.

Or do you mean in-memory being different from the file format AND different from the I/O format? That doesn't sound terribly efficient.


I think I'll skip both and install postgres


I think one of the main points that the article touches upon is: "It's 2012, unicode was invented 20 years ago, why the hell has no one tested this yet, let alone gotten it to work?"


Because trans-BMP characters are very rare (think fewer than 1 in 10,000,000) and unless you work with specific corpora you may never come across them.


It looks like we're still seeing fallout from the "16 bits are all you need" thing. Maybe telling people that they'd never need to worry about this stuff after adopting (BMP subset of) UTF-8 wasn't a great idea.


> (BMP subset of) UTF-8

The hell of it is, UTF-8 expands gracefully to the astral planes; it's UTF-16 that you need to worry about, either because the people designing the software never heard of surrogate pairs (in which case they didn't give you UTF-16 but UCS-2), or because they implemented surrogate pairs incorrectly.


This particular limitation of MySQL hit me HARD HARD HARD circa 2008 when I tried to use some of that upside down text you find online as test data, and just couldn't work out why I was getting data corruption.

Luckily Perl's Unicode support is fantastic, and saved my ass


How did you solve MySQL's unicode limitations with Perl?


At my previous^2 company we worked around MySQL's limitations by just storing our data in VARBINARY columns, encoded as utf8 on the client-side. Worked like a charm. (I hate MySQL)
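
A hedged sketch of what that workaround can look like (Python, with made-up table and column names; `cursor` is assumed to be a MySQLdb-style DB-API cursor using the %s paramstyle):

  # Keep the column VARBINARY; do all UTF-8 encoding/decoding client-side,
  # so the server never gets a chance to mangle non-BMP characters.
  def save_comment(cursor, text):
      cursor.execute("INSERT INTO comments (body) VALUES (%s)",
                     (text.encode("utf-8"),))        # raw UTF-8 bytes in

  def load_comments(cursor):
      cursor.execute("SELECT body FROM comments")
      return [row[0].decode("utf-8") for row in cursor.fetchall()]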


Unicode is just plain hard to do right...

I worked on a product for a couple of months geared at minority languages in developing countries, doing linguistics work etc. It was a pain to support Unicode: there are lots of code points, lots of weird cases (what counts as capital?), and while ICU is a good lib, it's not up to date with the latest Unicode version and it's C/C++ (and thus a pain in C#). Oh, and there's the Private Use Area, where characters go while Unicode decides whether or not to include them...


If we're using PostgreSQL is everything going to be pretty much dandy?


Yes, the PostgreSQL developers would never allow a commit that silently truncates perfectly valid text.


As long as your clusters use the right locale/encoding settings, yes.


Only use a locale that does unicode collation if you need to, as it is a big performance hit.


It's almost scandalous how poorly Unicode has been implemented in the most popular development platforms.

Java, for instance, implemented a 2-byte encoding and uses surrogates for the higher planes, which means you get the worst of both worlds... You double the size of ASCII text (that is, halve the speed of bandwidth-limited operations on text) and you've still got a variable-length encoding... but you've got lots of methods and user-written code that assume the text is fixed-length encoded. What a mess.
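
Rough numbers for the size point (Python used just for illustration): ASCII text doubles in UTF-16 relative to UTF-8, and astral characters still need surrogate pairs, so it isn't fixed-width either.

  text = "hello world" * 100
  print(len(text.encode("utf-8")))              # 1100 bytes
  print(len(text.encode("utf-16-be")))          # 2200 bytes
  print(len("\U0001F4A9".encode("utf-16-be")))  # 4 bytes: still variable-width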


That's because Unicode is STILL a pain to use. I read the articles that come along about Unicode and still don't understand why handling it is so impossible. Until it's transparent for a programmer to use, it won't be as widely used as it should be. My apps (and I'm ashamed to admit it) aren't Unicode friendly. But it's currently too much work for too little reward to go through and make all that code Unicode friendly.


I've had very few issues making programs I've written unicode friendly, but then I gave up on MySQL about a decade ago.

The only real issue is handling bad input, as you never get an error when decoding e.g. ISO-8859-1. For a lot of applications you need to handle potentially malicious input anyway, so you can do it there, but even for trusted input there are a lot of really-broken external programs that output "UTF-8" or "UTF-16" (scare quotes intentional).

I really think a lot of the problems with unicode is that a lot of languages/libraries try to handle it transparently, and that just doesn't work; encoding/decoding is part of dealing with external formats, and trying to do it transparently means that it will fail unexpectedly.
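
A small sketch of that point (illustration only): ISO-8859-1 maps every byte to some character, so decoding it can never fail, whereas strict UTF-8 decoding surfaces bad input at the boundary where you can actually handle it.

  data = b"\xed\xa0\xb5"               # a lone UTF-16 surrogate, invalid in UTF-8
  print(data.decode("iso-8859-1"))     # "succeeds", producing mojibake
  try:
      data.decode("utf-8")             # strict by default
  except UnicodeDecodeError as e:
      print("rejected:", e)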


Who even makes a UTF-8 implementation that can't handle UTF-8? Unfortunately, having dealt with mysql libs lately, I don't doubt that at all.

That said, if you think Unicode is a pain, try storing and retrieving "𝒜" in any other encoding. I'll stick to Unicode, thanks. :)


Implementing the display, collation or suchlike manipulation of Unicode text isn't feasible for the nonspecialist, to be sure (in practice that means you call an appropriate library/framework/API and hope they got it right).

But storing a stream of UTF-8 and retrieving it on command is not remotely difficult. You almost have to go out of your way to screw that up.


> That's because Unicode is STILL a pain to use.

I think that depends heavily on what programming language and frameworks you are using.


Pointers to a comparison across languages would be welcomed. TIA.


http://training.perl.com/OSCON2011/index.html

Specifically the talk titled "Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly"

It's a year old now, but it's still relevant. It gives a very detailed look at unicode support across JavaScript, PHP, Go, Ruby, Python, Java, and Perl.


Tom Christiansen's talk Unicode: The good, the bad, & the (mostly) ugly covers a lot of Unicode wrinkles and how Javascript, PHP, Go, Python, Ruby, Java, and Perl handles them.

Slides for that talk and two other Perl-related Unicode talks are at http://training.perl.com/OSCON2011/index.html


The root of it is that most things still assume ASCII, etc. by default. I griped about this a while ago: as a whole, we should be moving all our tools to UTF-8 by default, with ASCII being the 'odd' case instead of the other way around!


Apple's "textedit" can't open files with these characters in them. It reports "The document “test.txt” could not be opened. Text encoding Unicode (UTF-8) isn’t applicable."


Works fine for me. What version of Mac OS X/TextEdit are you using? Are you sure you are saving it (and opening it) as UTF-8?


Sorry, my mistake. I was saving a file from textmate with the string "𝖙𝖊𝖘𝖙" in it and then opening with "open -a textedit eg.txt".

The same experiment with cat in place of textmate works fine, so it's textmate that is buggy.

According to "od -x1", textmate is writing:

  0000000    ed  a0  b5  ed  b6  99  ed  a0  b5  ed  b6  8a  ed  a0  b5  ed
  0000020    b6  98  ed  a0  b5  ed  b6  99                                
So textedit is right to complain.


Yeah, looks like TextMate is simply running the UTF-8 algorithm over UTF-16 code units, so each surrogate is being encoded as its own three-byte sequence (which decodes to an invalid code point).

It turns out that this is such a common mistake that there's even a name for this encoding, CESU-8: http://en.wikipedia.org/wiki/CESU-8
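
For the curious, here's a rough Python reproduction of that mistake (a sketch, not TextMate's actual code): run each UTF-16 code unit, surrogates included, through the UTF-8 algorithm on its own and you get exactly the bytes in the od dump above.

  def cesu8(text):
      out = bytearray()
      units = text.encode("utf-16-be")
      for i in range(0, len(units), 2):
          u = int.from_bytes(units[i:i+2], "big")   # one 16-bit code unit
          if u < 0x80:
              out.append(u)
          elif u < 0x800:
              out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
          else:                                     # surrogates land here
              out += bytes([0xE0 | (u >> 12),
                            0x80 | ((u >> 6) & 0x3F),
                            0x80 | (u & 0x3F)])
      return bytes(out)

  # First character of the test string above:
  print(cesu8("\U0001D599").hex(" "))               # ed a0 b5 ed b6 99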


"Isn't applicable?" What kind of an error message is that?


You don't even want to know the hack I had to implement to get non-BMP characters stored in a MySQL 5.1 'utf8' column.


That's exactly why I use the default ISO-8859-1 encoding for all my MySQL tables, try to only stick ASCII into it, and store any Unicode text as binary UTF-8, encoded client-side. It's stupid and shouldn't be necessary, but at least MySQL can't screw it up.


Man, I still find myself reaching for

  dos2unix
all the time. How sad.


These 16-bit, UCS, wide-char ideas are just plain wrong! Just use UTF-8 for files and for communication, and 32-bit code points internally when needed.


Yeah, so what? UTF-8 is the Linux standard. I use it, and probably most of the web uses it (where needed). Why don't I see problems with it in the software I use and on the web? I can't remember the last time I had any problem with UTF-8.


This is a MySQL limitation that they have fixed in recent releases (as the OP notes). It's not fair to blame Unicode at large for MySQL problems.

However, it's true that Unicode is (relatively speaking) very new for such a fundamental technology. Support in applications still varies widely. I wouldn't characterize it as cutting edge though, since we have many mainstream programming languages built using Unicode internally.


> This is a MySQL limitation that they have fixed in recent releases (as the OP notes)

TFA notes that this is not fixed: the `utf8` MySQL encoding still isn't UTF-8. And as TFA also notes, related technologies (aka drivers) may not be compatible with the fix (the example he uses, mysql2 for Ruby, still hasn't had an official release supporting utf8mb4 [0])

> it's true that Unicode is (relatively speaking) very new for such a fundamental technology

That's quite a hard argument to swallow when encountering astral-plane issues in 2012, given that Unicode 2.0 was introduced in 1996.

[0] https://github.com/brianmario/mysql2/issues/249


> > it's true that Unicode is (relatively speaking) very new for such a fundamental technology

> That's quite a hard argument to swallow when encountering astral-plane issues in 2012, given that Unicode 2.0 was introduced in 1996.

I don't get your argument. MySQL was also released around that time and we don't call it "cutting edge" because we found a bug. There are bugs in old stuff all the time but (most) people don't throw a fit.


Unicode is being called 'cutting edge' because it still isn't 'old hat.' Lots of things claim support for Unicode, but few (or none) support it well. Unicode isn't a software project, it's a spec/idea. It's like calling a Star Trek tricorder "cutting edge" because no one has implemented a fully-functional version. Sure, the idea has been around for a while, but at this point there are no acceptable manifestations of that idea.


> I don't get your argument. MySQL was also released around that time and we don't call it "cutting edge" because we found a bug.

Not supporting the astral planes while saying you support UTF-8 is not a bug, it's a lie.


Honestly it's probably better they don't change the behavior of an existing MySQL character set. Who knows what software out there depends on it breaking on 4-byte characters, or whatnot.

Creating a new character set `utf8mb4` was the right thing to do, as annoying as it is. Just clearly label the `utf8` character set as 'deprecated' in the docs or something.


> Honestly it's probably better they don't change the behavior of an existing MySQL character set.

Or they could just have implemented it correctly to start with, considering unicode "support" was introduced in mysql 4.1.

In 2005.

> Who knows what software out there depends on it breaking on 4-byte characters, or whatnot.

Then again, mysql routinely drops and corrupts data anyway, I'm sure its "users" could have dealt with it corrupting data slightly less than before.



