Technically, for characters whose code point exceeds 0xFFFF, JavaScript treats them as two "characters" (two UTF-16 code units). To see this, consider the sushi character "🍣" (U+1F363):

    "🍣".length // 2
    "🍣".charCodeAt(0) // 55356
    "🍣".charCodeAt(1) // 57187



That's a bad interface: it lets you split strings in the middle of a surrogate pair and get ill-formed UTF-16 as the result.
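
For example, a code-unit-based slice cuts the surrogate pair in half (a quick sketch):

    // slice() counts code units, so it can strand a lone high surrogate
    "🍣".slice(0, 1)               // "\uD83C" (not valid UTF-16 on its own)
    "🍣".slice(0, 1).charCodeAt(0) // 55356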


It's the historical interface which websites now rely on; changing it would be like writing a libc whose strcmp operates on Pascal strings.

In any case, a JavaScript String is not actually designed to be UTF-16; it is essentially just a `uint16_t[]`. Even textual strings store only UTF-16 code units, not well-formed UTF-16 data. Relevant snippets from the standard:

The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values ("elements").

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. [...] All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
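
A quick sketch of that "no normalisation" point: two canonically equivalent spellings of "é" are simply different code unit sequences, and comparison does nothing to reconcile them:

    "\u00E9" === "e\u0301" // false (precomposed é vs. e + combining acute)
    "\u00E9".length        // 1
    "e\u0301".length       // 2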

See also:

- Section 8.4 http://www.ecma-international.org/publications/files/ECMA-ST...

- http://mathiasbynens.be/notes/javascript-encoding


> Although the standard does state that Strings with textual data are supposed to be UTF-16.

No, it doesn't. It states that the elements are UTF-16 code units, a term defined in Unicode (see D77; essentially an unsigned 16-bit integer), which is not the same thing as UTF-16. A sequence of 16-bit code units can therefore include lone surrogates, which a well-formed UTF-16 string could not.
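
You can observe this directly: a lone surrogate is a perfectly legal String value, even though no UTF-16 encoder could produce it. A sketch:

    var lone = "\uD83C";     // high surrogate with no trailing low surrogate
    lone.length              // 1
    encodeURIComponent(lone) // throws URIError: can't be encoded as UTF-8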


Oh, yes; I just skimmed the 'code unit' bit without actually reading it. (I've now removed the misinformation from my previous comment.)


I think JS may be from the time when UCS-2 was all there was and there were only 65,536 possible Unicode characters.


It's needed for compatibility with the Web, unfortunately.


Definitely. Thankfully, ES6 will introduce

    "🍣".codePointAt(0)
and iterators will iterate code points, not code units.

    for (var c of "🍣"){}
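
Other ES6 features built on the same iteration protocol get this for free (a sketch, assuming an ES6 environment):

    [..."🍣"].length        // 1 (spread iterates by code point)
    Array.from("🍣").length // 1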


Funny how this Sushi character appears as nigiri or as maki depending on the font.

I can't say I'm really satisfied with the current state of emojis in Unicode.


The name of the character is simply ‘SUSHI’ (http://www.unicode.org/charts/PDF/U1F300.pdf). Any pictographic representation of sushi would fulfil that.


Oh, I'm not saying it's wrong, just too imprecise (actually, since in France "sushi" is often synonymous with nigiri, when I posted the character earlier in a chatroom, someone remarked that they were "maki, not sushi").

Also, what about "🏤" which is "U+1F3E4 EUROPEAN POST OFFICE"? I see it here as a box with some kind of horn, Deutsche Post's logo as far as I know. Is this supposed to be localized in the future so that I can see the French Post's bird instead?

What is not satisfying is that the emojis feel both too incomplete (great, there's an eggplant and a tomato, now where's the bell pepper?) and too imprecise (okay, I have this nice maki emoji to show what I'm eating... oh wait, am I sure my friend will actually see maki?).

And sometimes they're just plain weird. What about "😤 U+1F624 FACE WITH LOOK OF TRIUMPH"? In every font I can find, it looks like someone who's mightily pissed off, maybe fuming because they spent so much time looking for the perfect emoji, only for their friend to see something completely different. That doesn't look like triumph to me.


A stylized bugle is a fairly universal symbol for the postal services in Europe, at least historically. I can't find a complete overview, but it looks like France is one of the very few exceptions.



