Technically, for characters whose code point exceeds 0xFFFF, JavaScript treats them as two "characters" (two UTF-16 code units). To see this, consider the sushi character "🍣" (U+1F363):

    "🍣".length // 2
    "🍣".charCodeAt(0) // 55356
    "🍣".charCodeAt(1) // 57187



That's a bad interface: it lets you split strings in the middle of a surrogate pair and get ill-formed UTF-16 as the result.
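
For example, a code-unit-based slice cuts the surrogate pair in half (a quick sketch):

    // slice() counts code units, so it can strand a lone high surrogate
    "🍣".slice(0, 1)               // "\uD83C" (not valid UTF-16 on its own)
    "🍣".slice(0, 1).charCodeAt(0) // 55356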


It's the historical interface which websites now rely on; changing it would be like writing a libc whose strcmp operates on Pascal strings.

In any case, a JavaScript String is not actually designed to be UTF-16; it is essentially just a `uint16_t[]`. Even textual strings store only UTF-16 code units, not well-formed UTF-16 data. Relevant snippets from the standard:

The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values ("elements").

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. [...] All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
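
A quick sketch of that "no normalisation" point: two canonically equivalent spellings of "é" are simply different code unit sequences, and comparison does nothing to reconcile them:

    "\u00E9" === "e\u0301" // false (precomposed é vs. e + combining acute)
    "\u00E9".length        // 1
    "e\u0301".length       // 2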

See also:

- Section 8.4 http://www.ecma-international.org/publications/files/ECMA-ST...

- http://mathiasbynens.be/notes/javascript-encoding


> Although the standard does state that Strings with textual data are supposed to be UTF-16.

No, it doesn't. It states that the elements are UTF-16 code units, a term defined in Unicode (see D77; essentially an unsigned 16-bit integer), which is not the same thing as UTF-16. A sequence of 16-bit code units can therefore include lone surrogates, which a well-formed UTF-16 string could not.
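
You can observe this directly: a lone surrogate is a perfectly legal String value, even though no UTF-16 encoder could produce it. A sketch:

    var lone = "\uD83C";     // high surrogate with no trailing low surrogate
    lone.length              // 1
    encodeURIComponent(lone) // throws URIError: can't be encoded as UTF-8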


Oh, yes; I just skimmed the 'code unit' bit without actually reading it. (I've now removed the misinformation from my previous comment.)


I think JS may be from the time when UCS-2 was all there was and there were only 65,536 possible Unicode characters.


It's needed for compatibility with the Web, unfortunately.


Definitely. Thankfully, ES6 will introduce

    "🍣".codePointAt(0)
and iterators will iterate code points, not code units.

    for (var c of "🍣"){}
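
Other ES6 features built on the same iteration protocol get this for free (a sketch, assuming an ES6 environment):

    [..."🍣"].length        // 1 (spread iterates by code point)
    Array.from("🍣").length // 1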


Funny how this Sushi character appears as nigiri or as maki depending on the font.

I can't say I'm really satisfied with the current state of emojis in Unicode.


The name of the character is simply ‘SUSHI’ (http://www.unicode.org/charts/PDF/U1F300.pdf). Any pictographic representation of sushi would fulfil that.


Oh, I'm not saying it's wrong, just too imprecise (actually, since in France "sushi" is often synonymous with nigiri, when I posted the character earlier in a chatroom, someone remarked that they were "maki, not sushi").

Also, what about "🏤" which is "U+1F3E4 EUROPEAN POST OFFICE"? I see it here as a box with some kind of horn, Deutsche Post's logo as far as I know. Is this supposed to be localized in the future so that I can see the French Post's bird instead?

What is not satisfying is that the emojis feel both too incomplete (great, there's an eggplant and a tomato, now where's the bell pepper?) and too imprecise (okay, I have this nice maki emoji to show what I'm eating... oh wait, am I sure my friend will actually see maki?).

And sometimes they're just plain weird. What about "😤 U+1F624 FACE WITH LOOK OF TRIUMPH"? In every font I can find, it looks like someone who's mightily pissed off, maybe fuming because they spent so much time looking for the perfect emoji, only for their friend to see something completely different. That doesn't look like triumph to me.


A stylized bugle is a fairly universal symbol for the postal services in Europe, at least historically. I can't find a complete overview, but it looks like France is one of the very few exceptions.



