If you've been handling Unicode properly in other languages, then Rust strings s...

ajross · on April 14, 2021

Alternatively: if you have been handling Unicode and using wide characters, you have not been handling Unicode properly.

Obviously the world is a big place and there is room for lots of paradigms and worldviews and we aren't supposed to judge too much.

But come on. If new code isn't working naturally in UTF-8 in 2021 then it's wrong, period.

estebank · on April 14, 2021

> if you have been handling Unicode and using wide characters, you have not been handling Unicode properly.

Paradoxically, trying to do "the right thing" and being an "early adopter" of (the now called) UCS-2 was a "mistake", as both Java and Windows can attest, by getting "stuck" supporting the worst possible Unicode encoding ad-infinitum. UTF-8 is the "obviously correct" choice (from the hindsight afforded by us talking about this in 2021).

I still find it funny that emojis of all things are what actually got the anglosphere to actually write software that isn't completely broken for the other 5.5 billion people out there.

mjevans · on April 15, 2021

Correction to the date and timeline. UTF-8 has been the correct choice since:

" >From ken Tue Sep 8 03:22:07 EDT 1992 "

As discussed last week in the quoted history https://news.ycombinator.com/item?id=26735958 of UTF-8's mail messages: https://doc.cat-v.org/bell_labs/utf-8_history

It appears that Unicode arose between 1990 and 1991 for the initial versions https://en.wikipedia.org/wiki/Universal_Coded_Character_Set#... with the first published version of the specification in 1993:

"ISO/IEC 10646-1:1993 = Unicode 1.1"

As discussed another time on Hacker News https://news.ycombinator.com/item?id=20600195 https://unascribed.com/b/2019-08-02-the-tragedy-of-ucs2.html

It was around 1996 when it became clearly obvious (software got shipped to end users who cared and they complained back) that UCS-2 (16 bit characters) would be insufficient.

+++

Pragmatically this is forever, as long as backwards compatibility must be maintained, the exiting APIs which are built around the crazy 16 bit standard need to exist; but there's little reason they have to be native, rather than wrappers for UTF-8 compatible APIs.

It would even be a good time to standardize on a single user space programming API and have implementations on every operating system. Preferably including basic drawing and font layout functions. So that finally, most programs could be written once, compiled on a platform of choice, and work.

RcouF1uZ4gsC · on April 14, 2021

> actually write software that isn't completely broken for the other 5.5 billion people out there

I thought that Chinese and Japanese are the only languages that UCS-2 has trouble fully representing. I believe all the other living languages can actually be represented by UCS-2.

So using UCS-2 would actually work for almost everyone except maybe 1.5 billion people.

estebank · on April 15, 2021

Code that treats UCS-2 as "wide ASCII" can be subtly wrong even for western european languages. For multiple reasons, Unicode has multiple representations for the same glyph[1], so if you have ü, it can either be a the two bytes 00FC[2], or u followed by the ¨ diacritic U+0308. If you don't account for this things like "reversing a string" or "give me a substring" or "how long is the string on screen" will be subtly buggy. If you handle UCS-2 correctly, then it's fine and a reasonable technical limitation, but the emphasis of that sentence is on correctly.

[1]: https://en.wikipedia.org/wiki/Unicode_equivalence

[2]: https://www.compart.com/en/unicode/U+00FC

Dylan16807 · on April 15, 2021

> Code that treats UCS-2 as "wide ASCII" can be subtly wrong even for western european languages.

It can be. But critically for this conversation, fixing your code to support emoji and other non-BMP characters doesn't necessarily fix those problems.

account42 · on April 15, 2021

Except it does since a single emoji can also be made up of multiple code points.

Dylan16807 · on April 15, 2021

But that only causes subtle wrongness, which can continue to go unnoticed.

rectang · on April 15, 2021

> So using UCS-2 would actually work for almost everyone except maybe 1.5 billion people.

One of the great things about using Rust is that I don't have to have this argument. There doesn't have to be a debate about whether we should invest in fixing subtly broken code. Rust string-handling which generally works for Europeans will also work for Chinese!

This isn't the fault of the languages which were designed for UCS-2 (then known just as "Unicode"). But the fact that Rust emerged after UTF-8's ascendance means that Rust's users mostly get to avoid the UCS-2/UTF-16 legacy tarpit.

tsimionescu · on April 15, 2021

> Rust string-handling which generally works for Europeans will also work for Chinese!

I doubt that is any more true than Java. You can easily write code in Rust that assumes you can split a String anywhere and get two valid strings, that you can compute the length of a String and get information about how long the printed representation will be, that you can find a substring in that string by simply iterating through UTF-8 code points etc. All of these assumptions are about as wrong in UTF-8 Rust as they are in UTF-16 Java.

Quekid5 · on April 14, 2021

Do we not consider emoji a language unto itself?

I was actually quite bemused to discover that some code review software I was using allowed me to "cursor" halfway through a smiley face emoji and enter a space (typing too fast to pay attention)... causing the infamous "box characters" because I'd accidentally split the smiley down the middle.

I get the need for extreme backward compat in browsers, but... this seems like one of those things that just might be worth fixing. Maybe a "use utf8" directive? :)

tsimionescu · on April 15, 2021

Even with proper UTF-8, you'll get situations where you can insert a space in the middle of an emoji and split it into other emoji + unprintable characters. The encoding is irrelevant, you need proper Unicode support to avoid these problems.

account42 · on April 15, 2021

Unfortunately proper unicode support for many operations means carrying around tons of data since important properties of code points cannot be derived from the codepoints themselves.

tsimionescu · on April 15, 2021

Exactly. UTF-8 doesn't and cant fix this problem, you need a full Unicode library if you want to correctly handle human text. If you don't, why bother with UTF-8 instead of something simpler, like ASCII?

account42 · on April 16, 2021

> If you don't, why bother with UTF-8 instead of something simpler, like ASCII?

Because unicode support isn't binary. Being able to pass along and not mange blobs of unicode is already a lot better than ASCII-only.

lenkite · on April 15, 2021

Java hasn't been UCS-2 since Java 5.0 - nearly a decade.

TwoBit · on April 15, 2021

Well if memory cost wasn't an issue then UCS-4 would be nicest. But 4x the memory for most strings is currently unacceptable.

lokedhs · on April 15, 2021

That is true, but the benefits of UTF-32 is minor compared to UTF-8, and might not be worth the cost.

And I say that as someone who is developing a language which only has UTF-32 support.

The problem is that even with UTF-32, doing things like splitting strings is inherently unsafe, so you are still going to need a Unicode library to do proper splitting by grapheme cluster. In practice, almost all string splitting works on ASCII text, and assumes everything else is data that should not be manipulated. For this, UTF-8 is perfectly acceptable.

fanf2 · on April 15, 2021

UCS-4 is still a variable-length encoding, because various accented characters and emoji use multiple code points. One advantage of UTF-8 is that it makes you confront variable-length characters head on.

ohazi · on April 15, 2021

There are a lot of instances where I wouldn't mind the memory cost, but would very much mind not automatically being compatible with ascii strings.

samatman · on April 15, 2021

I wouldn't go that far, despite having made an early decision to support only UTF-8 in my own software.

UCS-2 is in fact broken, but UTF-16 is a valid encoding of Unicode, which can be implemented correctly. So is UTF-32, although I can't imagine why anyone would want to use that one.

I can imagine why someone would want to use UTF-16, though: interoperability with Windows, where it's the native encoding. It isn't "wrong, period" to do Unicode in a way which is more convenient for the platform.

There is, of course, a ton of work to really implement Unicode correctly, and UTF-16 and UTF-32 can make it tempting to do the wrong thing, instead of biting the bullet and implementing all of the many ways in which codepoints coalesce into grapheme clusters, and making sure all functions for working with strings can recognize the distinction.

But it certainly can be done in any of the full encodings.

Joker_vD · on April 15, 2021

> interoperability with Windows

Even on Windows, it's much easier to just use UTF-8 internally and perform conversion to/from UTF-16 when you're about to do/return from a WinAPI call.

diroussel · on April 14, 2021

My understanding is that UTF-8 is not a good representation for non-european alphabets.

So do you think UTF-8 is always the best internal string representation? Or just for English speakers?

For Mandarine what would be optimal?

rectang · on April 14, 2021

Mandarin is an interesting case. Most of the Han characters used by Mandarin fall within the basic multilingual plane and thus occupy 2 bytes in UTF-16 but 3 bytes in UTF-8. However, for web documents, most markup is ASCII which is only one byte. So for Mandarin web documents, the space requirements for UTF-8 and UTF-16 are about a wash.

When you add in interoperability concerns, since so much text these days is UTF-8, for Mandarin at least UTF-8 is a perfectly defensible choice.

(A harder problem is Japanese — Japan really got screwed over with Han unification, so choosing Shift-JIS over any Unicode encoding is often best.)

FWIW I covered the space requirements of various encodings and various languages in this talk for Papers We Love Seattle:

https://www.youtube.com/watch?v=mhvaeHoIE24&t=39m14s

amelius · on April 14, 2021

I genuinely wonder: is the space requirement of text encodings really an important issue in this age of large photo and video content?

jcranmer · on April 15, 2021

Looking at my browser's memory reporting, strings take up about ~2-3% of total memory usage, most of which is probably ASCII. If it were UCS-2, that would make it ~5% of total memory usage, and UCS-4 ~10%. That's small numbers, but as a whole-program impact, it's significant enough to motivate performance engineers to actually try to compress those strings down a bit.

amelius · on April 15, 2021

It depends on how strings are counted. If every String object is atomized/interned then e.g. the string "div" is stored once, but on a 64bit system you have 8 bytes for a pointer and another 8 bytes for bookkeeping things such as length.

astrange · on April 15, 2021

Memory for strings is often more important because there's a lot more of them, and image memory can be file-backed more often but strings need to be swapped to disk.

FridgeSeal · on April 15, 2021

If you ever want to do something stuff with text efficiently (full text search and associated processing) I’d argue that it’s quite important.

Dylan16807 · on April 15, 2021

If you want to do search, you need collation, and you can't use any standard encoding for that data.

rectang · on April 15, 2021

It's important to distinguish between sorting in code point order and sorting according what a user would expect for their language. However, sorting in code point order (which is actually equivalent to sorting by memory comparison for UTF-8) is enough to build an inverted index data structure, commonly used for fulltext search. And just like FridgeSeal asserted, the memory footprint of the text representation has performance implications for such an application.

(Source: I wrote a search engine library.)

tsimionescu · on April 15, 2021

Will that handle things like matching accented and unaccented characters? If I search for 'stefan' in a text that makes frequent references to 'Ștefan', will it correctly find those matches?

coldtea · on April 15, 2021

Typically you normalize both to a non-accented format, which is what you index...

rectang · on April 15, 2021

Exactly. And a full-featured fulltext search library will allow for customizable normalization and tokenization.

catblast01 · on April 15, 2021

> (A harder problem is Japanese — Japan really got screwed over with Han unification, so choosing Shift-JIS over any Unicode encoding is often best.)

This statement needs more support. I think “screwed over” is a bit harsh, since I’m not aware the impact on Japanese was anymore than the rest of CJK. Despite the Han unification controversy, Unicode has been heavily adopted in Japan. The space requirements are basically the same as all CJK. Half-width kana is heavier since they are one byte in shift jis but they’re relatively uncommon.

ChrisSD · on April 15, 2021

As far as I'm aware one problem is displaying text. Japanese readers generally need a Japanese font to correctly display Japanese text if it's Unicode. This becomes a problem when you potentially have text that can come from different languages. E.g. a Japanese font will display Chinese incorrectly.

On the web you can work around this using the lang attribute to tell the browser how text should be interpreted.

It's notable that, for example, traditional and simplified Chinese does not have this problem because they are encoded separately.

Another problem is missing characters. Some people have complained of not being able to write their own name. I'm not sure to what extent this has been solved through Unicode updates.

majewsky · on April 15, 2021

I have been learning Japanese for about a year now, so I don't have that much experience reading Japanese text yet [1]. I'm aware of some of the visual differences between Chinese and Japanese fonts, but I have not yet had trouble reading Japanese text set in a Chinese font. If you have any specific examples for Kanji that are difficult to recognize in a Chinese font, I'd be interested.

[1] Although on the other hand you could argue that I'm spending more conscious effort reading Japanese text than a native speaker would.

ChrisSD · on April 15, 2021

Sorry, I can't speak confidently on this because I can't read Japanese (or Chinese or Korean). I can only report what I've been told by users.

Including language metadata with text was felt to be especially important for Japanese and Korean users. I was told the difference was like having "595 kg" displayed as "5P5 kg". That is, it's possible to decipher the intended meaning but it looks wrong and it takes a moment to work out what was meant. Depending on the language some glyphs can be mirror images, have extra strokes, strokes missing or in different places or at different angles.

klodolph · on April 14, 2021

So, the advantage of UTF-16 is that CJK text will use 33% less space.

Does this mean that “UTF-8 is not a good representation for non-European alphabets?” It may be less efficient but the difference does not seem shocking to me, considering that for most applications, the storage required for text is not a major concern—and when it is, you can use compression.

spamizbad · on April 14, 2021

It's the edge-cases that get you.

magicalhippo · on April 14, 2021

> if you have been handling Unicode and using wide characters, you have not been handling Unicode properly.

How so? Delphi for example has wide character-based strings as default, what's wrong with that?

josephg · on April 14, 2021

Wide character based strings have a .length field which is easy to reach for and never what you want, because it’s value is meaningless:

- It isn’t the number of bytes, unless your string only contains ASCII characters. Works in testing, fails in production.

- It isn’t the number of characters because 16 bits isn’t enough space to store the newer Unicode characters. And even if it could, many code sequences (eg emoji) turn multiple code points into a single glyph.

I know all this, and I still get tripped up on a regular basis because .length is right there and works with simple strings I type. I have muscle memory. But no, in javascript at least the correct approaches require thought and sometimes pulling in libraries from npm to just make simple string operations be correct.

Rust does the right thing here. Strings are UTF-8 internally. They check the encoding is valid when they’re created (so you always know if you have a string, it is valid). You have string.chars().count() and other standard ways to figure out byte length and codepoint length and all the other things you want to know, all right there, built into the standard library.

tsimionescu · on April 15, 2021

What could string.chars().count() possibly be used for?

At least .length tells ypu how much memory the string will occupy - not likely to be very important in a memory managed language, but at least it has one potential use. I don't see a use for the number of code points in a string.

josephg · on April 15, 2021

In collaborative editing systems, we usually treat strings semantically as if they were as arrays of codepoints. Its the only sensible way to do it, because the other options are bad:

- We don't use arrays of bytes because different languages have different native string encodings. Converting a UTF8 byte offset to a language that uses UCS2 is slow and complicated, and it opens the door to data corruption.

- And we don't use grapheme clusters because what counts as a grapheme cluster keeps changing with each unicode version. (And libraries are big and complicated).

So inserting an emoji into a document is treated the same way we would handle inserting a small list at some offset into a larger list.

> At least .length tells you how much memory the string will occupy

How much memory the string occupies isn't something I've ever wanted to know. I do sometimes want to know how many bytes the string will take up when I store it or send it over the network - but in those cases you pretty much always want UTF-8. And string.length doesn't help you at all with that.

tsimionescu · on April 15, 2021

If you don't handle grapheme clusters, do you allow users to position the caret between parts of a grapheme cluster (say, between the flag emoji and the country emoji in a country flag emoji)? Is this useful, or just an acceptable limitation? Would it be significantly different if you allowed them to place the caret between the parts of a UTF-16 surrogate pair?

I would have expected that your input method would normally insert grapheme clusters, and that these clusters would be treated as indivisible if they were inserted with a single 'keystroke'. I would also expect that you anyway need to count clusters when presenting something like a 'character count' to the user, as I don't think they would be very happy if a program reported that 'année' was 6 characters long.

> How much memory the string occupies isn't something I've ever wanted to know. I do sometimes want to know how many bytes the string will take up when I store it or send it over the network - but in those cases you pretty much always want UTF-8. And string.length doesn't help you at all with that.

This is what I was referring to, should have called it number of bytes instead of memory. Also note that probably the most used application protocol on the internet, HTTP, doesn't support UTF-8 and defaults to ISO-8859-1 (extended ASCII). For example, HTTP headers are not UTF-8 strings and you can't treat them as such. And even when sending a UTF-8 body the Content-Length header needs to be set to the number of bytes, not the number of unique codepoints, so again you'd use .length.

josephg · on April 15, 2021

> do you allow users to position the caret between parts of a grapheme cluster

Where the carot can go is dependent on the editor, not the underlying protocol.

> I don't think they would be very happy if a program reported that 'année' was 6 characters long.

CRDT / OT edits don't make a lot of sense to the user in their raw form no matter what format you use for offsets. "Insert x at position 5043" is equally meaningless if 5043 stores a byte offset, a grapheme cluster index or codepoint index.

> Would it be significantly different if you allowed them to place the caret between the parts of a UTF-16 surrogate pair?

Yes - if you managed to insert something in the middle of a UTF-16 surrogate pair, the string contents would become invalid - and that causes weird language dependant problems. Rust will panic(). In comparison, a broken country flag just renders weirdly - which isn't ideal, but its fine. Mind you, in both cases you'd need one of the editors to do something weird to insert those characters in the document. As you say, the input method will normally treat grapheme clusters as indivisible anyway. But I much prefer invalid states to be impossible to represent in the first place when I can. I don't want a lack of input validation to allow wonky edits to crash my rust server. Invalid grapheme clusters are a much smaller problem in comparison.

And this all skips over how difficult it is to efficiently convert a UCS-2 offset position in javascript into a position in a UTF-8 string in rust. Its much easier to just count in codepoints everywhere.

> And even when sending a UTF-8 body the Content-Length header needs to be set to the number of bytes, not the number of unique codepoints, so again you'd use .length.

No, that'll break as soon as you insert non ASCII characters into your document. string.length will not tell you the number of bytes your string takes up. It will tell you the number of UCS-2 elements in your string, which is the number of codepoints + 1 for each UTF-16 surrogate pair. Which again, I've never wanted to know. You can't use that to calculate the UTF-8 byte length, where each codepoint takes somewhere between 1 and 4 bytes depending on its unicode table value.

Your example 'année' has a string.length of 5, but a UTF-8 byte length of 6. (From new TextEncoder().encode('année').length). If you set Content-Length to 5, bad things happen. Ask me how I know :/

tsimionescu · on April 15, 2021

> and that causes weird language dependant problems. Rust will panic(). In comparison, a broken country flag just renders weirdly - which isn't ideal, but its fine.

My point was exactly about whether languages should enforce string encoding. I maintain that strings should just be arbitrary byte arrays, and only text processing methods should enforce the appropriate encodings - that would mean that accidentally splitting a UTF-16 code point would be as painful as accidentally splitting a grapheme cluster: anything that wants to render the resulting string will have a problem, but anything in between won't.

> It will tell you the number of UCS-2 elements in your string, which is the number of codepoints + 1 for each UTF-16 surrogate pair.

Oops, here you are completely right. I was under the mistaken assumption that Java String.length() would return the number of 16—byte chars in the string, when in fact it returns the same useless number as the chars().count() method. Sorry about that!

josephg · on April 15, 2021

The problem with languages treating strings as arbitrary byte arrays is that those byte arrays are different in each language. Java uses UCS-2 internally while rust uses UTF-8. Conversions happen somewhere, and converters don’t have a lot of good options when they see invalid data. For collaborative editing, if I send a byte level patch I made in rust for for a UTF8 string, it won’t make much sense in Java.

And a point of clarity - Java’s String.length() does not return the same value as rust’s chars().count(). The former returns the useless UCS-2 count. The latter returns the number of Unicode codepoints. Java’s length() will count many single codepoint emoji as having length 2 (same as javascript, C#) while rust will correctly, usefully count one codepoint as one character. (As will swift and go, depending on which methods you call.)

tsimionescu · on April 15, 2021

*16—bit

barrkel · on April 14, 2021

I was part of that. Delphi has all the string types you want, since you can declare your preferred code page. String is an alias for UnicodeString (to distinguish from COM WideString) and is UTF-16 for compatibility with Win32 API more than anything. UTF-8 would have meant a lot more temporaries and awkward memory management.

magicalhippo · on April 14, 2021

All in all, while the Unicode transition took its time, I must admit it's was very smooth when it did happen.

At work we have a codebase that does a lot of string handling. Both in reading and writing all kinds of text files, as well as doing string operations on entered data. Several hundred kLOC of code across the project.

We had one guy who spent less than week wall-time to move the whole project, and the only issue we've had since is when other people send us crappy data... if I got a dollar for each XML file with encoding="utf-8" in the header and Windows-1252 encoded data we've received I'd have a fair fortune.

estebank · on April 14, 2021

The reasoning behind using UTF-16/UCS-2 is that then you can plug your ears and treat 1 char == 1 user-visible glyph on the screen, so programmers that acted as if ASCII was the only encoding in existence could continue treating strings in the same way (using their length to calculate their user-visible length, indexing directly on specific characters to change them, etc).

All of those practices are immediately wrong once UTF-32 came in existence and UTF-16 became a variable length encoding. But even if that hadn't happened, what you want to be operating on is not characters, but grapheme clusters, which are equivalent to a vector of chars. Otherwise you won't handle the distinction between ë and ë or emojis correctly.

magicalhippo · on April 14, 2021

But how is that different from the underlying encoding being UTF-8?

edit:

For example, we do a lot of string manipulation in Delphi. We might split a string in multiple pieces and glue them together again somehow. But our separators are fixed, say a tab character, or a semicolon. So this stiching and joining is oblivious to whatever emojis and other funky stuff that might be inbetween.

How is this doing it wrong?

I mean yea sure you CAN screw it up by individually manipulating characters. But I don't see how an UTF-8 encoded string in itself prevents you from doing the same kind of mistakes.

josephg · on April 14, 2021

Splitting and glueing is fine. But imagine 3 systems: system A is obviously wrong. It crashes on any input. System B is subtly wrong. It works most of the time, but you’re getting reports that it crashes if you input Korean characters and you don’t know Korean or how to type those characters. System C is correct.

Obviously C is better than A or B, because you want people to have a good experience with your software. But weirdly, system A (broken always) is usually better than system B (broken in weird hard to test ways). The reason is that code that’s broken can be easily debugged and fixed, and will not be shipped to customers until it works. Code that is broken in subtle ways will get shipped and cause user frustration, churn, support calls, and so on.

The problem with UCS-2 is it falls into system B. It works most of the time, for all the languages I can type. It breaks with some inputs I can’t type on my keyboard. So the bugs make it through to production.

UTF-8 is more like system A than system B. You get multibyte code sequences as soon as you leave ASCII, so it’s easier to break. (Though it really took emoji for people to be serious about making everything work.)

magicalhippo · on April 14, 2021

Right, but the claim was this:

if you have been handling Unicode and using wide characters, you have not been handling Unicode properly

I agree that UTF-8 is a better encoding overall for the majority of cases. I don't think that means UTF-16, which for example Delphi UnicodeStrings are[1], is not proper.

edit: maybe this is a language confusion thing. For historically tragic reasons, we're stuck with "char" as the basic element of string types in lots of languages. In Delphi a "widechar" is technically a code unit[2], and may or may not represent a code point. This is how I interpreted the OP. Maybe he meant wide characters as code points, in which I would agree.

[1]: http://docwiki.embarcadero.com/RADStudio/Sydney/en/Unicode_i...

[2]: https://en.wikipedia.org/wiki/Character_encoding#Terminology

josephg · on April 15, 2021

Yeah I hear you. Its definitely possible to write correct code using UCS-2 (where each "char" sometimes represents only half of a codepoint). But its easy to end up with subtly broken code, that only breaks for non-english speakers who don't know enough english to file a bug report.

The ergonomics of the language guide you in that direction when, as you say, a "char" doesn't actually represent a character. Or even an atomic unicode codepoint. And when string.length gives you an essentially meaningless value.

Luckily, code like this will also break when encountering emoji. Thats great, because it means my local users will complain about these bugs and they're easy for me to reproduce. As a result these problems are slowly being fixed.

dragonwriter · on April 15, 2021

> The reason is that code that’s broken can be easily debugged and fixed, and will not be shipped to customers until it works.

Having worked in tech support for a piece of very expensive (~$100k per install annual support/license fee in the late 90s) enterprise software that had a GA release shipped to customers with a syntax error in an install script, I would state that more like “code that’s non-subtly broken is less likely be shipped to customers before it works.”

grishka · on April 15, 2021

> If new code isn't working naturally in UTF-8 in 2021 then it's wrong, period.

UTF-8 is a nuisance as an in-memory representation because the characters are variable size. You can't get the length of a string without parsing it start to end, and you can't get a character by index without parsing and counting all the previous ones. 16-bit characters (wchar_t, Java char, whatever NSStrings are made of, etc) work fine in 99% of the cases.

UTF-8 is indisputably a good encoding for when you're sending something over the network or putting it into a file or a database.

dwheeler · on April 15, 2021

> 16-bit characters... work fine in 99% of the cases.

In other words, they don't work :-).

UTF-16 is also variable-length. Sometimes a character fits in 16 bits, and sometimes it doesn't. From a practical view it's worse than UTF-8, because tests are less likely to detect bugs before shipping.

Even UTF-32 is, in reality, variable-length. Many code points are combining characters, so you need multiple code points to get a single grapheme.

If your language or API requires you to do something, then you'll need to do that. But unless there's an API requirement, in most situations UTF-8 is the best choice for network, storage, and processing. There are exceptions, but they're just that... exceptions.

ChrisSD · on April 15, 2021

UTF-16 is a variable length encoding. As is UTF-32 for that matter. A single character can be made of multiple code points.

TwoBit · on April 15, 2021

True, but text processing is still easier with UTF-32, because UTF-8 and UTF-16 strings need to be converted to UTF-32 before you can do anything with them.

WalterBright · on April 14, 2021

The D programming language has from the beginning built-in support for UTF-8, UTF-16 and UTF-32 code units as basic types (char, wchar, and dchar). This was when it wasn't clear which encoding would dominate.

It's pretty clear today that UTF-8 dominates, and the other two are useful only for interfacing to systems that need them.

shadowgovt · on April 14, 2021

This is my way of thinking about the topic these days. It's not that strings are more complicated in Rust than in other languages, it's that a lot of the other low-level languages are presenting an abstraction that assumes implicitly that a string is some type of sequence of uniform-sized cells, one cell per character, and that representation was an artifact of a specific time in computational history. It's like many other abstractions those languages provide... Seemingly simple at first glance, but if you do the details wrong you're just going to get undefined behavior and your program will be incorrect.

Languages that don't expose strings as that abstraction are, in my humble opinion, more reflective of the underlying concept in the modern era.

tsimionescu · on April 14, 2021

What can you actually do with a known-valid UTF-8 (but otherwise of unkown structure) string that you can't do with a UTF-16 string or even byte-based string?

You can't concatenate, split, enumerate, assume they are valid human text, capitalize, count the number of characters, turn to lower/uppercase etc.

In general, you need a text handling library to do anything meaningful with text if you want to handle internationalization, and then the text handling library can also handle all of the encoding problems easily, all that matters is having a known encoding so you don't need to guess.

It's nice to have specific types for specific encodings of strings, but otherwise I think there is little to be gained by representing strings as anything other than either byte arrays or text.

mjevans · on April 15, 2021

Incorrect.

You CAN concatenate, assuming your inputs are valid the outputs will also be valid; BUT normalization is now ONLY correct if all inputs were of the same format. This can be fixed with libraries if you care, and if you don't, it often doesn't matter. Edit-Additional: If something DOES care, it SHOULD enforce the normalization format it wants on the input boundary.

UTF-16 cannot be split any differently than any other encoding, including UTF-32; all must pass through a library that understands compositing characters and sequences. There is no escaping this (other than your own code becoming such a library).

Most of the other issues you fault are shared by _any_ encoding of Unicode. However a notable thing is that for very specific functions, E.G. searching for a given valid Unicode sequence and replacing it with another... (E.G. replace someone's name) you'll nearly always be able to do this in without issue. To always do it without issue additional checks must be made around combining characters at the boundary edges. (Which I wouldn't want to maintain, so I'd still call a library unless matching against known will never be such a value, such as quotes or other configuration file control characters.)

tsimionescu · on April 15, 2021

> You CAN concatenate, assuming your inputs are valid the outputs will also be valid

Is that true even if you mix LTR and RTL text? I have a suspicion there could be problems there, but I'm happy to be wrong. I would still say that safe concatenation is a minor boon for the cost of enforced UTF-8 compliance.

> Most of the other issues you fault are shared by _any_ encoding of Unicode.

That is exactly my point - that UTF-8, UTF-16, UTF-32, byte strings that could be invalid UTF-8 are all basically equally bad if you want to guarantee meaningful text. A text consisting of two valid UTF-8 code points representing combining characters is no more meaningful than a text consisting of 4 bytes that are invalid UTF-8 code points.

> searching for a given valid Unicode sequence and replacing it with another... (E.G. replace someone's name) you'll nearly always be able to do this in without issue.

I don't agree, unless you are talking about something extremely 'programmatic' like someone's username or email. For real text apps, you'd want something even more complex than Unicode, that could recognize that 'Ștefan' and 'Stefan' are the same name.

mjevans · on April 15, 2021

Blindly concatenating any two byte-streams, irrespective of encodings, would also be an invalid assumption; however at that point you've got massive design issues.

You cannot concatenate any random 16 or 32 bit character sequences as the underlying byte forms might be Little or Big Endian stored. UTF-8 does not need that consideration, BUT you must know (or at least contractually expect) valid UTF-8 input; anything else is garbage in garbage out.

The LTR RTL issue... depends on if all RTL strings _must_ end with an LTR character; in your use case. https://en.wikipedia.org/wiki/Bidirectional_text

Your suggestion of delegating all concatenation operations to that is a safe default. Other options might also be valid, depending on the exact context of inputs and use case.

dhosek · on April 15, 2021

What's really interesting to me is that a rust character is 32 bits even though a rust string is encoded in UTF-8. I'm still just beginning to explore rust so I'd only gotten as far in the book as to learn about characters but not strings and their relationship to UTF-8. This bit of handling of Unicode has me especially intrigued now.

mlindner · on April 15, 2021

I believe the strings are still encoded internally as byte arrays however, it's just that when you pull out a character it can be multiple bytes (emojis for example are often 3 bytes), so you need a 4 byte datatype to store them.

dhosek · on April 16, 2021

Yes, that's what I understand about how it works. There's a lot to be said against the idea of dealing with Unicode at the codepoint level in most applications though. I'm writing code that would allow commands to be either \ + non-letter character or \ + one or more letters (a la TeX). So that means that I would allow commands to include \bird, \pták, \طائر and \鳥, but what happens if the á in \pták is input as ´+ a rather than á? Is that supposed to be the same code? Or perhaps the user has \Spi¨nalTap (and there code editor has a type compositor that's willing to put the umlaut on the n?) Some of this can be dealt with through Unicode normalization, although there's also the question of whether, e.g., the ohm and angstrom symbols should be treated as symbols or letters if they're input at their symbolic code points rather than as a Greek or Latin letters. Would \+white man shrugging be a valid \+non-letter command since it's multiple code points? It's amazing how much a "simple" specification gets complicated when you start to look at all the ways that Unicode can complicate matters.

alerighi · on April 14, 2021

All of this is true, IF you assume that you want Unicode string. That, especially on system/embedded software (the kind of software that Rust is targeting) you don't really care about Unicode and you can simply treat strings as array of bytes.

And I live in a country where you usually use Unicode characters. But for the purpose of the software that I write, I mostly stick with ASCII. For example I use strings to print debug messages to a serial terminal, or read commands from the serial terminal, or to put URL in the code, make HTTP requests, publish on MQTT topics... for all of these application I just use ASCII strings.

Even if I have to represent something on the screen... as long as I have a compiler that supports Unicode as input files (all do these days) I can put Unicode string constants in the code and even print them on screen. It's the terminal (or the GUI I guess, but I don't write software with a GUI) that translates the bytes that I send on the line as Unicode characters.

And yes, of course the length of the string doesn't correspond at the characters shown on the screen... but even with Unicode you cannot say that! You can count (and that what Rust does) how many Unicode code points you have, but a characters could be made of more code points (stupid example, the black emoji is composed by a code point that says "make the following black" and then the emoji itself).

So to me it's pointless, and I care more about knowing how many bytes a string takes and being able to index the string in O(1), or take pointers in the middle of the string (useful when you are parsing some kind of structured data), and so on.

In conclusion Rust is better when you have to handle Unicode string, but most application doesn't have to handle them, and handling them I don't mean passing them around as a black box, not caring how they contain (yes, in theory you should care about not truncating the string in the middle of a code point when truncating strings... in reality, how often do you truncate strings?)

pezezin · on April 15, 2021

It's funny that you mention MQTT, as the spec requires strings (including topic names) to be encoded in UTF-8: http://docs.oasis-open.org/mqtt/mqtt/v3.1.1/cos02/mqtt-v3.1....

Granted, ASCII is a subset of UTF-8, so as long as you control all the publishers and enforce the ASCII-only rule, you should be ok. But if some day you need to integrate third-party systems and they use characters outside of ASCII...

stouset · on April 14, 2021

Literally nothing is stopping you from using `&[u8]` for 7-bit ASCII.

estebank · on April 15, 2021

> So to me it's pointless, and I care more about knowing how many bytes a string takes and being able to index the string in O(1), or take pointers in the middle of the string (useful when you are parsing some kind of structured data), and so on.

Which is why you can still sub-slice Strings[1]

> Literally nothing is stopping you from using `&[u8]` for 7-bit ASCII.

Not only that, there's first party support for literals that are that[2].

[1]: https://play.rust-lang.org/?version=stable&mode=debug&editio...

[2]: https://play.rust-lang.org/?version=stable&mode=debug&editio...

burntsushi · on April 15, 2021

Can you explain more specifically how Rust's String/str type does not handle your ASCII case? They are just bytes and ASCII is a subset of UTF-8.