
> I've also written a text editor with a complicated text rope data structure in it. Do you think I should have made a different text rope for each different text encoding the editor should deal with?

No, I think you should make a text rope for UTF-8 and convert the text's encoding at the boundary. Or it might not be UTF-8 but some other representation of Unicode; it depends. I see no reason to create a structure for efficient manipulation of strings without choosing a representation of a character at compile time, because otherwise it would be slower than it could be. The more assumptions about your data you make at compile time, the less runtime conditioning you need, and the faster your code will be.

Believe me, I have dealt with different encodings all the time. I'm Russian, and we had three widely used single-byte encodings for Cyrillic, plus the different encodings of Unicode, so one had to deal with all of them constantly. The easiest way is to work internally with Unicode only and to convert encodings at the boundaries where your program communicates with the external world. There, at the boundaries, you can deal with errors, like a character that cannot be represented in the output encoding (or cannot be represented in the internal one, though if you use Unicode that isn't a problem). You can treat user input as input in an external encoding and throw errors if the user enters something that cannot be encoded in the output encoding. It works every time, while making your program able to deal with different internal encodings is a PITA, with errors thrown from the most unexpected places and spaghetti code trying to deliver those errors to places where they could be handled sensibly.
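For instance, here is a minimal sketch of that boundary conversion in Rust, assuming the encoding_rs crate and KOI8-R as the external encoding (the function names are made up for the example):

    // Bytes in an external encoding come in; a plain Rust String (always
    // valid UTF-8) is what the rest of the program sees. Errors are
    // handled right here, at the boundary. Assumes the encoding_rs crate.
    use encoding_rs::KOI8_R;

    fn decode_input(raw: &[u8]) -> Result<String, String> {
        // decode() substitutes U+FFFD for malformed input and reports it.
        let (text, _, had_errors) = KOI8_R.decode(raw);
        if had_errors {
            Err("input is not valid KOI8-R".to_string())
        } else {
            Ok(text.into_owned())
        }
    }

    fn encode_output(text: &str) -> Result<Vec<u8>, String> {
        // encode() substitutes unmappable characters and reports it.
        let (bytes, _, had_unmappable) = KOI8_R.encode(text);
        if had_unmappable {
            Err("text contains characters KOI8-R cannot represent".to_string())
        } else {
            Ok(bytes.into_owned())
        }
    }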

> There isn't a single definition of "char". What even is a "char"? Is it a byte? Is it a unicode codepoint? Is it any other kind of Unicode combination of codepoints or glyphs or combine sequence and emoji modifiers or whatever all that junk is called?

You can use all of them; just pick distinct names for them, like "char", "glyph", ... and any others you like. But once you have done that, you'll want to know where the boundaries of these things are. You'll want to take slices of sequences of these things. If you cannot rely on the validity of the underlying UTF-8, you'll be in trouble.
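A small Rust illustration of why that validity guarantee matters when taking slices:

    fn main() {
        let s = "привет"; // 6 codepoints, 12 bytes in UTF-8

        // Codepoint boundaries are well defined because &str is
        // guaranteed to hold valid UTF-8.
        let third = s.char_indices().nth(2).map(|(i, _)| i).unwrap();
        assert!(s.is_char_boundary(third));

        // Slices use byte offsets, but the cut must land on a char
        // boundary; &s[..3] would panic because it splits 'р' in half.
        assert_eq!(&s[..third], "пр");
    }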

> If you need a specific subsequence of some UTF-8 encoded text, use a library that fetches it from the byte storage. There is no point in making a programming language type, because you'll lock in to certain usages, and next thing you'll need is a completely different type.

When I need to work with bytes, I use an array of bytes. Not a string, but an array of bytes. It was hard to grasp after years of experience with C, but I did manage it at some point. Char is not a byte. Byte is not a char. An array is not a string, and a string is not an array. When I need to deal with bytes, I use an array. When I need to deal with characters, I use a string. It is a non-trivial idea for a C programmer, because all his experience tells him that a character and a byte are the same thing. So if a character is not a byte, then (he reasons) characters don't exist.
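In Rust terms, the distinction looks roughly like this (a minimal std-only sketch):

    fn main() {
        // An array/vector of bytes: no claim that this is text at all.
        let raw: Vec<u8> = vec![0xD0, 0xA0, 0xD1, 0x83, 0xD1, 0x81, 0xD1, 0x8C];

        // A string: guaranteed valid UTF-8. The "is this actually text?"
        // question is answered once, at the conversion.
        let text: String = String::from_utf8(raw).expect("not valid UTF-8");

        assert_eq!(text, "Русь");
        assert_eq!(text.len(), 8);           // length in bytes
        assert_eq!(text.chars().count(), 4); // length in codepoints
    }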

If characters-as-codepoints is too low an abstraction for my task, I can create another abstraction on top of it that deals with glyphs, words, tokens, sentences, or whatever. But the abstraction of codepoints must be a library feature, or I'd be forced to create it myself, validating UTF-8 all over the place, and so on. And if that were the case, what would be the point of having a string abstraction at all?
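For example, the glyph level can be layered on top of codepoints by a library; a sketch assuming the unicode-segmentation crate (just one possible choice):

    // Codepoints come from std; grapheme clusters ("glyphs" in the loose
    // sense above) come from a library layered on top of valid UTF-8.
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "e\u{301}"; // 'e' followed by a combining acute accent

        assert_eq!(s.chars().count(), 2);         // two codepoints
        assert_eq!(s.graphemes(true).count(), 1); // one user-perceived glyph
    }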

If characters-as-codepoints is too high an abstraction for my task, I can go lower and use an array of bytes.

It is really a simple idea; C as a language just tends to confuse people's minds by teaching them that char == int8_t. At least my mind was confused, and I managed to untangle that mess completely only around my 30th birthday. Several years later I found that Rust's std differentiates chars from bytes and strings from arrays exactly as I do. I fell in love with Rust immediately.



> No, I think you should make a text rope for UTF-8

But that's what I made. UTF-8 is encoded as bytes. I can store the bytes in the rope just fine.

The rope has a very simple API, basically read() and write() functions, just like a standard FILE I/O API. Do you want to go tell file system developers that they should add APIs for write_UTF8(), write_LATIN1(), write_KOI8(), write_BYTES(), etc.? And then go to network API designers to do the same for the socket I/O functions? And so on? Of course you don't do that; it would be very bad factoring.

And it's just the same for a rope API.
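To make that concrete, here is a toy sketch of such an encoding-agnostic byte API (a flat Vec<u8> standing in for the real rope structure; the names are made up for the example):

    // The storage traffics only in bytes, exactly like a file or a
    // socket. Whatever encoding the caller uses is the caller's business.
    struct Rope {
        bytes: Vec<u8>,
    }

    impl Rope {
        fn new() -> Self {
            Rope { bytes: Vec::new() }
        }

        // write(): splice raw bytes in at a byte offset.
        fn write(&mut self, offset: usize, data: &[u8]) {
            self.bytes.splice(offset..offset, data.iter().copied());
        }

        // read(): borrow raw bytes from a byte range.
        fn read(&self, range: std::ops::Range<usize>) -> &[u8] {
            &self.bytes[range]
        }
    }

    fn main() {
        let mut rope = Rope::new();
        rope.write(0, "день".as_bytes());   // UTF-8 goes in as bytes...
        let out = rope.read(0..8).to_vec(); // ...and comes out as bytes
        assert_eq!(String::from_utf8(out).unwrap(), "день");
    }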

> The more assumptions about your data you make at compile time, the less runtime conditioning you need, and the faster your code will be.

This is true in general, but the rope is just storage. No processing happens there. The rope couldn't care less what you store in it. There is no point in having multiple identical read/write implementations.

But if you insist, I recommend asking for a UTF-8-optimized HDD at your local computer shop :-)

> making your program able to deal with different internal encodings is a PITA

If your program has to deal with multiple external encodings, either you can convert at the boundaries to a canonical internal encoding, or you can't, in which case it probably becomes a little more work since you have to convert in different places.

This has nothing to do with what I said, though.


> [...] Char is not a byte. Byte is not a char. [...]

In this paragraph you seem to be confusing C's "char" with the much fuzzier idea of a "character", which has something like 13 valid definitions.

C's "char" is abstractly defined as the smallest addressable unit of memory available on the machine (required to have at least 8 bits), and historically there have existed 8-bit, 9-bit, 16-bit, or even 36-bit chars. In today's practice it is universally taken synonymous for (8-bit) bytes since all hardware is 8-bit by now. Some people like to be pedantic about the distinction between byte and char, but I most often do not, especially since char is the generally interoperable type in C (with respect to type punning etc.), while uint8_t to my knowledge is not.

"Character" is sometimes understood as "Unicode codepoint" (typically represented as a 32-bit entity, or even as a UTF-8 encoded slice of bytes) or in some cases understood as "Unicode glyph" (probably represented as a slice of codepoints), sometimes understood as even other things.


There is a prominent counterexample to the 8-bit char: DSPs. TI has several DSPs with 16-bit chars, and I think I've encountered one with a 32-bit char. These boards are actually pretty common in industrial settings.


(This is also what D does.)



