Unicode is, in 2012, still “cutting edge.” (utexas.edu)
115 points by masklinn on July 10, 2012 | 87 comments



Yeah, well, Unicode is hard. If it looks simple, it's only because you haven't looked carefully enough:

http://www.unicode.org/versions/Unicode6.1.0/

The latest version published in book form (5.0) runs to almost 1500 pages:

http://www.unicode.org/book/aboutbook.html

(Best quote from that page: “Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years.” -- Donald Knuth)

People think that Unicode support is just a matter of implementing multi-byte characters, but it's so much more: you've got collation rules, ligatures, rendering, line-breaking, punctuation, reading direction, and so on. Any technical standard that aims to cover all known human languages is going to be a little bit complex.


> People think that Unicode support is just a matter of implementing multi-byte characters, but it's so much more: you've got collation rules, ligatures, rendering, line-breaking, punctuation, reading direction, and so on. Any technical standard that aims to cover all known human languages is going to be a little bit complex.

Half of those are irrelevant to MySQL or any other database; they're front-end problems. Even punctuation, reading direction, and the like will only matter for more advanced collation orders (as opposed to just binary ordering, which he was using).


It's true that glyph rendering won't matter to a database store, and that binary collation is "good enough" for most people (but then again, "good enough" is how you get to a UTF-8 implementation that doesn't support 4-byte characters). That said, it's also true that Unicode characters outside of the BMP are still pretty exotic/specialized:

http://stackoverflow.com/questions/5567249/what-are-the-most...

Is it "excusable" that MySQLs implementation of UTF-8 isn't standard? That's a judgment call (they are up-front about it in the docs). But given that most unicode characters "in the wild" lie in the BMP, I can see how they'd make that trade-off. There might well be a technical limitation lurking somewhere in the database internals that made 4-byte characters a problem.


No, it's not excusable - the MySQL project had a trivial alternative: don't call it UTF8!

The options are simple: implement UTF-8 correctly and call your implementation UTF-8, or implement just the BMP and name accordingly. They did neither: effectively, they lied to end-users. That's deeply, deeply problematic.


"they lied to end-users, effectively"

Yeah, that's not exaggeration at all. Because it's not as if they very clearly document exactly what they support:

http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html


Sure, so let's call this button that erases your data "list files", and just very clearly document it. It's not lying if we change the definition.


Bingo. That's why I used the "effectively" qualifier - if my toolchain says "oh this is UTF-8," then I should be able to trust that it's for-real, honest-to-goodness, spec-compliant UTF-8. If it's "oh this is the part of UTF-8 that was easy to implement " instead, then that tool has lied to me. I shouldn't have to read the documentation to find out that something is not what it claims to be.

Bonus points for the documentation brazenly ignoring that they're implementing something that's not spec-compliant and naming it like it is.


Well, to be fair, MySQL has a storied history of implementing 95% of a feature, calling it good enough, and shipping it.

And while, as a Postgres user, my tone here may be a little snide, I also say this with grudging respect: I think there is a point at which implementing n% of a feature X and calling it X (rather than MaybeX or MostlyX) does give you some momentum and practical compatibility that you wouldn't have otherwise. Is it dishonest to hide the limitations regarding the edge cases in some documentation no one will read? Maybe. But will providing the feature solve more problems than it causes? Quite possibly.

I don't agree with MySQL's decision with respect to UTF-8, but I do understand it.


That's an important piece of context, thank you for pointing it out. Engineering decisions occur in a cultural context of mere humans making decisions, and we do well to remember that.


don't call it UTF8!

While I don’t know the history of MySQL, it seems to me that when they implemented it, their implementation was indeed in compliance with the standard (Unicode 3).

The standard has since grown from 16-bit to 21-bit code points.

Why MySQL had to introduce a new name for UTF-8 encoded tables that can contain the larger code points is strange, but I assume there is a technical explanation (probably having to do with binary compatibility with existing tables / MySQL drivers or similar).


A data-loss bug is far more serious than anything about sorting. MySQL has an encoding called "utf8" -- if you can't count on it to round-trip UTF-8 data without data loss, that is a serious problem IMO, documented or not.


The real problem is that MySQL accepts and silently drops this data that it cannot handle. It has a nasty tendency to silently drop data rather than error out.


Really? I've done a lot of data importing into MySQL, and it always gives me a warning that tells me the invalid Unicode code point. It doesn't generate an error, but it definitely generates a warning. You need to make sure whatever client/library you're using is passing that on to you, though.


It doesn't generate an error, but it definitely generates a warning. You need to make sure whatever client/library you're using is passing that on to you, though.

That right there is the problem.

Other DBs, when you do something they can't handle, bail with a full-fledged error that stops what you're doing. MySQL doesn't do that for quite a few data-losing cases, and that's incredibly dangerous.


Incidentally, that page doesn't mention Unicode emoji (😊), which are probably more likely to get into your average database now that OS X supports them and such.


Speaking of Unicode emoji... OpenFire, the open-source Jabber server, will drop your connection if you send it a 💩 ...


Seems like either my browser or HN doesn't support such characters since I'm seeing a box with "01F 489" in it.


It's your browser; the characters are standard but not widely supported outside of Apple platforms yet.

Edit: you can try Symbola at http://users.teilar.gr/~g1951d/ (and yes, I keep editing this post :p)


Any idea if these are the characters GoSMS uses as "emoji"? I.e., is it a de facto standard?


The emoji Unicode range was created to standardize the various characters used on Japanese mobiles, which is why it contains e.g. the Love Hotel icon.


Dunno; I only know emoji encoding has historically been a crapfest.


After installing the "Symbola" font I can see the 💩, thanks.


On debian install ttf-ancient-fonts version 2.56-1 or later (currently in testing - the one in stable does not include the symbol).

I installed it and now I see it as well.


I'm not quite sure why that symbol would be in an ancient font.


It's a repackaging of http://users.teilar.gr/~g1951d/ which is titled "Unicode Fonts for Ancient Scripts" so I guess they used that name.


Then call it utf8-bmp, not utf8!


The one thing that boggles my mind is that Unicode is, as a matter of fact, not a universal encoding. It's a cluster fuck of multiple encodings (of the same code points). We really should call it multicode; it only solves problems by introducing more. You got problems with Unicode? You're not using the right Unicode. Try another Unicode.


> The latest version published in book form (5.0) runs to almost 1500 pages:

Most of those pages are just the repertoire: a printout of a long, uneventful table that lists all the available code points. For each code point it gives you a reference pre-rendered glyph, the languages where you can find it, the kind of character it is (numeric, alphabetic, symbol) and so on.

From an implementer's point of view there are about 200 pages of interesting and extremely detailed stuff. The rest of the pages can be downloaded as text tables from the Unicode site and its companion sites.


The thing is, the vast majority of this isn't needed by most applications. For instance, in a database, all they need is to encode and decode strings properly when going to or coming from external sources. That's all that is being asked for here; that when MySQL says "UTF-8", they really mean "UTF-8", not some broken subset of it.


> MySQL’s utf8 encoding only covers the BMP. It can’t handle 4-byte characters at all.

Wow. That is pathetic.


No, it's engineering. Here are two versions of the software:

1. This one is faster and better tested, and string handling (it's a database!) is much faster, but it only handles the ~65,000 most common characters.

2. This one can handle upside-down characters from a 1930s paper on formal logic in Turkish, but it's slower for all other cases and we haven't really tested it as much.

Do you have a redundant, self-powered, asteroid-impact-proof internet connection? No? Pathetic!


Sure, in some cases it makes sense to make the tradeoff of not handling more obscure characters. But if the tradeoff is made, the encoding should not be called UTF-8.

"UTF-8 (UCS Transformation Format—8-bit[1]) is a variable-width encoding that can represent every character in the Unicode character set," says Wikipedia. The UTF-8 implementation in MySQL does not meet this definition because it cannot represent every character in the Unicode character set.


When MySQL first implemented UTF-8 they probably did support every Unicode character... because there were less than 64K Unicode characters. Then Unicode/UTF-8 was redefined out from under them.


> there were less than 64K Unicode characters. Then Unicode/UTF-8 was redefined out from under them.

Unicode 2.0 introduced multiple planes, i.e. more than 65,536 characters. That was in 1996. That being the case, MySQL has had more than a decade and a half to support multiple planes, and seems to have done so less than a year ago. I disagree that it was 'redefined out from under them' when the change came only a year after MySQL started, at a time when it probably didn't even have Unicode support yet anyway.


Interesting to note: MySQL was first released in 1995, which means that for only one year of its existence were there fewer than 65,536 Unicode characters.


Yes, but at the time, UTF-8 could encode up to 31 bits per character using six-byte sequences. It has since been restricted to four-byte sequences at the longest.


> seems to have done so less than a year ago

More than 2 years ago. March 2010.


But when did people really start using more than the 16bit unicode chars?


> But when did people really start using more than the 16bit unicode chars?

1996.

China even made it a legal requirement for computer systems in 2000, through mandating GB 18030.

There's the Private Use Area if nothing else. There is NO excuse for supporting nothing beyond the BMP. Adding support is trivial unless you have been using UTF-16 in the erroneous belief that every character is always two bytes long (in which case you've really been using UCS-2).
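
To make the UCS-2 trap concrete, here's a small Python sketch (illustrative only): a non-BMP character takes two 16-bit code units in UTF-16, so anything that assumes one unit per character is really doing UCS-2.

  # A non-BMP character becomes a surrogate pair in UTF-16.
  poo = "\U0001F4A9"                 # U+1F4A9, outside the BMP
  utf16 = poo.encode("utf-16-be")
  print(len(utf16))                  # 4 bytes = two 16-bit code units
  print(utf16.hex(" "))              # d8 3d dc a9 -> surrogates D83D, DCA9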


This is just the original argument restated. It's not a rebuttal.


According to Wikipedia, the original version of UTF-8 supported >4 byte characters, and was later restricted to 4 bytes by RFC 3629 in November 2003, seven months before MySQL 4.1 was released with Unicode support. (There were 96,447 Unicode characters at that time.)
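
For reference, here's a rough sketch of the lead-byte ranges as originally defined versus what RFC 3629 kept; the 5- and 6-byte forms are what got removed.

  # Original UTF-8 (RFC 2279) lead bytes and sequence lengths; RFC 3629
  # later forbade the 5- and 6-byte forms, capping UTF-8 at U+10FFFF.
  original_lengths = {
      (0x00, 0x7F): 1,   # 7 bits
      (0xC0, 0xDF): 2,   # 11 bits
      (0xE0, 0xEF): 3,   # 16 bits (the BMP)
      (0xF0, 0xF7): 4,   # 21 bits (all of Unicode today)
      (0xF8, 0xFB): 5,   # 26 bits (no longer legal)
      (0xFC, 0xFD): 6,   # 31 bits (no longer legal)
  }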


There's no good reason for one to be faster than the other, though! They're both utf8 encoded sequences of bytes, and there's no good reason to not stream through them as utf8.


> There's no good reason for one to be faster than the other, though!

Not exactly. A varchar will store it as is, but a char column will allocate a fixed 3 (or 4) bytes for each character.

All data held in memory (for sorting and such) is always stored as char, even if it started as varchar.

So by allowing 4 bytes per character they use more memory.
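
Rough arithmetic for that point (illustrative numbers only): a CHAR(255) value, or any value padded to CHAR for an in-memory sort, reserves the worst case per character.

  # Worst-case bytes reserved per CHAR(255) value under each encoding.
  n = 255
  print(n * 3)   # 765 bytes with 3-byte "utf8" (BMP only)
  print(n * 4)   # 1020 bytes with "utf8mb4"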


Internally it can store them as 16-bit code units so strings have a fixed length.


Does anybody really think that UCS-2 is a good idea anymore? Or that random indexability by code point is all that valuable, in a world with combining glyphs and bidirectional characters and whatever other crazy stuff Unicode has? If you just want an upper bound on the number of bytes needed to store n code points, then (a) that's probably not a particularly useful question to ask, and (b) if you assume that 32 bits is enough for any code point, then the space taken by properly-formed UTF-8 is bounded.

So, why would they want to store things internally as UCS-2? Or rather, why should they?
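
A quick Python illustration (example only) of why code-point indexing is less useful than it sounds: "é" written as a base letter plus a combining accent is two code points but one user-perceived character.

  import unicodedata

  s = "e\u0301"                                # 'e' + U+0301 COMBINING ACUTE ACCENT
  print(len(s))                                # 2 code points
  print(s[:1])                                 # slicing by code point splits the grapheme
  print(len(unicodedata.normalize("NFC", s)))  # 1 after composing to U+00E9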


Which is in many cases not faster than UTF-8 since UTF-16 is often more bytes than UTF-8. This matters especially since we are talking about a database which means IO and RAM usage probably are more important than the CPU.


Except it didn't. It used up to three bytes per character. It had the drawbacks of variable width but not the easy-to-add benefits.

Or do you mean in-memory being different from the file format AND different from the I/O format? That doesn't sound terribly efficient.


I think I'll skip both and install postgres


I think one of the main points that the article touches upon is: "It's 2012, unicode was invented 20 years ago, why the hell has no one tested this yet, let alone gotten it to work?"


Because trans-BMP characters are very rare (think fewer than 1 in 10,000,000) and unless you work with specific corpora you may never come across them.


It looks like we're still seeing fallout from the "16 bits are all you need" thing. Maybe telling people that they'd never need to worry about this stuff after adopting (BMP subset of) UTF-8 wasn't a great idea.


> (BMP subset of) UTF-8

The hell of it is, UTF-8 expands gracefully to the astral planes; it's UTF-16 that you need to worry about, either because the people designing the software never heard of surrogate pairs (in which case they didn't give you UTF-16 but UCS-2), or because they implemented surrogate pairs incorrectly.


This particular limitation of MySQL hit me HARD HARD HARD circa 2008 when I tried to use some of that upside down text you find online as test data, and just couldn't work out why I was getting data corruption.

Luckily Perl's Unicode support is fantastic, and saved my ass


How did you solve MySQL's unicode limitations with Perl?


At my previous^2 company we worked around MySQL's limitations by just storing our data in VARBINARY columns, encoded as utf8 on the client-side. Worked like a charm. (I hate MySQL)
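
A hedged sketch of what that workaround can look like (Python, with made-up table and column names; `cursor` is assumed to be a MySQLdb-style DB-API cursor using the %s paramstyle):

  # Keep the column VARBINARY; do all UTF-8 encoding/decoding client-side,
  # so the server never gets a chance to mangle non-BMP characters.
  def save_comment(cursor, text):
      cursor.execute("INSERT INTO comments (body) VALUES (%s)",
                     (text.encode("utf-8"),))        # raw UTF-8 bytes in

  def load_comments(cursor):
      cursor.execute("SELECT body FROM comments")
      return [row[0].decode("utf-8") for row in cursor.fetchall()]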


Unicode is just plain hard to do right...

I worked on a product for a couple of months geared at minority languages in developing countries, doing linguistics work etc. It was a pain to support Unicode: there are lots of code points, lots of weird cases (what counts as capital?), and while ICU is a good lib, it's not up to date with the latest Unicode version and it's C/C++ (and thus a pain in C#). Oh, and there's the Private Use Area, where characters go while Unicode decides whether or not to include them...


If we're using PostgreSQL is everything going to be pretty much dandy?


Yes, the PostgreSQL developers would never allow a commit that silently truncates perfectly valid text.


As long as your clusters use the right locale/encoding settings, yes.


Only use a locale that does unicode collation if you need to, as it is a big performance hit.


It's almost scandalous how poorly Unicode has been implemented in the most popular development platforms.

Java, for instance, implemented a 2-byte encoding and uses surrogates for the higher planes, which means you get the worst of both worlds... You double the size of ASCII text (that is, halve the speed of bandwidth-limited operations on text) and you've still got a variable-length encoding... but you've got lots of methods and user-written code that assume the text is fixed-length encoded. What a mess.
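
Rough numbers for the size point (Python used just for illustration): ASCII text doubles in UTF-16 relative to UTF-8, and astral characters still need surrogate pairs, so it isn't fixed-width either.

  text = "hello world" * 100
  print(len(text.encode("utf-8")))              # 1100 bytes
  print(len(text.encode("utf-16-be")))          # 2200 bytes
  print(len("\U0001F4A9".encode("utf-16-be")))  # 4 bytes: still variable-width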


That's because Unicode is STILL a pain to use. I read the articles that come along about Unicode and still don't understand why handling it is so impossible. Until it's transparent for a programmer to use, it won't be as widely used as it should be. My apps (and I'm ashamed to admit it) aren't Unicode friendly. But it's currently too much work for too little reward to go through and make all that code Unicode friendly.


I've had very few issues making programs I've written unicode friendly, but then I gave up on MySQL about a decade ago.

The only real issue is handling bad input, as you never get an error when decoding e.g. ISO-8859-1. For a lot of applications you need to handle potentially malicious input anyway, so you can do it there, but even for trusted input there are a lot of really-broken external programs that output "UTF-8" or "UTF-16" (scare quotes intentional).

I really think a lot of the problems with unicode is that a lot of languages/libraries try to handle it transparently, and that just doesn't work; encoding/decoding is part of dealing with external formats, and trying to do it transparently means that it will fail unexpectedly.
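
A small sketch of that point (illustration only): ISO-8859-1 maps every byte to some character, so decoding it can never fail, whereas strict UTF-8 decoding surfaces bad input at the boundary where you can actually handle it.

  data = b"\xed\xa0\xb5"               # a lone UTF-16 surrogate, invalid in UTF-8
  print(data.decode("iso-8859-1"))     # "succeeds", producing mojibake
  try:
      data.decode("utf-8")             # strict by default
  except UnicodeDecodeError as e:
      print("rejected:", e)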


Who even makes a UTF-8 implementation that can't handle UTF-8? Unfortunately, having dealt with mysql libs lately, I don't doubt that at all.

That said, if you think Unicode is a pain, try storing and retrieving "𝒜" in any other encoding. I'll stick to Unicode, thanks. :)


Implementing the display, collation or suchlike manipulation of Unicode text isn't feasible for the nonspecialist, to be sure (in practice that means you call an appropriate library/framework/API and hope they got it right).

But storing a stream of UTF-8 and retrieving it on command is not remotely difficult. You almost have to go out of your way to screw that up.


> That's because Unicode is STILL a pain to use.

I think that depends heavily on what programming language and frameworks you are using.


Pointers to a comparison across languages would be welcomed. TIA.


http://training.perl.com/OSCON2011/index.html

Specifically the talk titled "Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly"

It's a year old now, but it's still relevant. It gives a very detailed look at unicode support across JavaScript, PHP, Go, Ruby, Python, Java, and Perl.


Tom Christiansen's talk Unicode: The good, the bad, & the (mostly) ugly covers a lot of Unicode wrinkles and how Javascript, PHP, Go, Python, Ruby, Java, and Perl handles them.

Slides for that talk and two other Perl-related Unicode talks are at http://training.perl.com/OSCON2011/index.html


The root of it is that most things still assume ASCII, etc. by default. I griped about this a while ago: as a whole, we should be moving all our tools to UTF-8 by default, with ASCII being the 'odd' case instead of the other way around!


Apple's "textedit" can't open files with these characters in them. It reports "The document “test.txt” could not be opened. Text encoding Unicode (UTF-8) isn’t applicable."


Works fine for me. What version of Mac OS X/TextEdit are you using? Are you sure you are saving it (and opening it) as UTF-8?


Sorry, my mistake. I was saving a file from textmate with the string "𝖙𝖊𝖘𝖙" in it and then opening with "open -a textedit eg.txt".

The same experiment with cat in place of textmate works fine, so it's textmate that is buggy.

According to "od -x1", textmate is writing:

  0000000    ed  a0  b5  ed  b6  99  ed  a0  b5  ed  b6  8a  ed  a0  b5  ed
  0000020    b6  98  ed  a0  b5  ed  b6  99                                
So textedit is right to complain.


Yeah, looks like TextMate is simply running the UTF-8 algorithm over UTF-16 code units, so each surrogate is being encoded as its own three-byte sequence (which decodes to an invalid code point).

It turns out that this is such a common mistake that there's even a name for this encoding, CESU-8: http://en.wikipedia.org/wiki/CESU-8
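
For the curious, here's a rough Python reproduction of that mistake (a sketch, not TextMate's actual code): run each UTF-16 code unit, surrogates included, through the UTF-8 algorithm on its own and you get exactly the bytes in the od dump above.

  def cesu8(text):
      out = bytearray()
      units = text.encode("utf-16-be")
      for i in range(0, len(units), 2):
          u = int.from_bytes(units[i:i+2], "big")   # one 16-bit code unit
          if u < 0x80:
              out.append(u)
          elif u < 0x800:
              out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
          else:                                     # surrogates land here
              out += bytes([0xE0 | (u >> 12),
                            0x80 | ((u >> 6) & 0x3F),
                            0x80 | (u & 0x3F)])
      return bytes(out)

  # First character of the test string above:
  print(cesu8("\U0001D599").hex(" "))               # ed a0 b5 ed b6 99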


"Isn't applicable?" What kind of an error message is that?


You don't even want to know the hack I had to implement to get non-BMP characters stored in a MySQL 5.1 'utf8' column.


That's exactly why I use the default ISO-8859-1 encoding for all my MySQL tables, try to only stick ASCII into it, and store any Unicode text as binary UTF-8, encoded client-side. It's stupid and shouldn't be necessary, but at least MySQL can't screw it up.


Man, I still find myself reaching for

  dos2unix
all the time. How sad.


These 16-bit, UCS, wide-char ideas are just plain wrong! Just use UTF-8 for files and for communication, and 32-bit code points internally when needed.


Yeah, so what? UTF-8 is the Linux standard. I use it, and probably most of the web uses it (where needed). Why don't I see problems with it in the software I use and on the web? I can't remember the last time I had any problem with UTF-8.


This is a MySQL limitation that they have fixed in recent releases (as the OP notes). It's not fair to blame Unicode at large for MySQL problems.

However, it's true that Unicode is (relatively speaking) very new for such a fundamental technology. Support in applications still varies widely. I wouldn't characterize it as cutting edge though, since we have many mainstream programming languages built using Unicode internally.


> This is a MySQL limitation that they have fixed in recent releases (as the OP notes)

TFA notes that this is not fixed: the `utf8` MySQL encoding still isn't UTF-8. And as TFA also notes, related technologies (aka drivers) may not be compatible with the fix (the example he uses, mysql2 for Ruby, still hasn't had an official release supporting utf8mb4 [0])

> it's true that Unicode is (relatively speaking) very new for such a fundamental technology

That's quite a hard argument to swallow when encountering astral-plane issues in 2012, given that Unicode 2.0 was introduced in 1996.

[0] https://github.com/brianmario/mysql2/issues/249


> > it's true that Unicode is (relatively speaking) very new for such a fundamental technology

> That's quite a hard argument to swallow when encountering astral-plane issues in 2012, given that Unicode 2.0 was introduced in 1996.

I don't get your argument. MySQL was also released around that time and we don't call it "cutting edge" because we found a bug. There are bugs in old stuff all the time but (most) people don't throw a fit.


Unicode is being called 'cutting edge' because it still isn't 'old hat.' Lots of things claim support for Unicode, but few (or none) support it well. Unicode isn't a software project, it's a spec/idea. It's like calling a Star Trek tricorder "cutting edge" because no one has implemented a fully-functional version. Sure, the idea has been around for a while, but at this point there are no acceptable manifestations of that idea.


> I don't get your argument. MySQL was also released around that time and we don't call it "cutting edge" because we found a bug.

Not supporting the astral planes while saying you support UTF-8 is not a bug, it's a lie.


Honestly it's probably better they don't change the behavior of an existing MySQL character set. Who knows what software out there depends on it breaking on 4-byte characters, or whatnot.

Creating a new character set `utf8mb4` was the right thing to do, as annoying as it is. Just clearly label the `utf8` character set as 'deprecated' in the docs or something.


> Honestly it's probably better they don't change the behavior of an existing MySQL character set.

Or they could just have implemented it correctly to start with, considering unicode "support" was introduced in mysql 4.1.

In 2005.

> Who knows what software out there depends on it breaking on 4-byte characters, or whatnot.

Then again, mysql routinely drops and corrupts data anyway, I'm sure its "users" could have dealt with it corrupting data slightly less than before.



