
> I've also written a text editor with a complicated text rope data structure in it. Do you think I should have made a different text rope for each different text encoding the editor should deal with?

No, I think you should make a text rope for UTF-8 and convert the text's encoding at the boundary. Or it might not be UTF-8 but some other representation of Unicode; it depends. I see no reason to create a structure for efficient manipulation of strings without choosing a representation of a character at compile time, because otherwise it would be slower than it could be. The more assumptions about your data you make at compile time, the less runtime conditioning you need, and the faster your code will be.

Believe me, I have dealt with different encodings all the time. I'm Russian, and we had three widely used single-byte encodings for Cyrillic, plus the different encodings of Unicode, so one had to deal with all of them constantly. The easiest way is to work internally with Unicode only and to convert encodings at the boundaries where your program communicates with the external world. There, at the boundaries, you can deal with errors, like a character that cannot be represented in the output encoding (or cannot be represented in the internal one, though if you use Unicode that isn't a problem). You can treat user input as input in an external encoding and throw errors if the user enters something that cannot be encoded in the output encoding. It works every time, while making your program able to deal with different internal encodings is a PITA, with errors thrown from the most unexpected places and spaghetti code trying to deliver those errors to places where they could be handled sensibly.
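For instance, here is a minimal sketch of that boundary conversion in Rust, assuming the encoding_rs crate and KOI8-R as the external encoding (the function names are made up for the example):

    // Bytes in an external encoding come in; a plain Rust String (always
    // valid UTF-8) is what the rest of the program sees. Errors are
    // handled right here, at the boundary. Assumes the encoding_rs crate.
    use encoding_rs::KOI8_R;

    fn decode_input(raw: &[u8]) -> Result<String, String> {
        // decode() substitutes U+FFFD for malformed input and reports it.
        let (text, _, had_errors) = KOI8_R.decode(raw);
        if had_errors {
            Err("input is not valid KOI8-R".to_string())
        } else {
            Ok(text.into_owned())
        }
    }

    fn encode_output(text: &str) -> Result<Vec<u8>, String> {
        // encode() substitutes unmappable characters and reports it.
        let (bytes, _, had_unmappable) = KOI8_R.encode(text);
        if had_unmappable {
            Err("text contains characters KOI8-R cannot represent".to_string())
        } else {
            Ok(bytes.into_owned())
        }
    }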

> There isn't a single definition of "char". What even is a "char"? Is it a byte? Is it a unicode codepoint? Is it any other kind of Unicode combination of codepoints or glyphs or combine sequence and emoji modifiers or whatever all that junk is called?

You can use all of them; just pick distinct names for them, like "char", "glyph", ... and any others you like. But once you have done that, you'll want to know where the boundaries of these things are. You'll want to take slices of sequences of these things. If you cannot rely on the validity of the underlying UTF-8, you'll be in trouble.
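A small Rust illustration of why that validity guarantee matters when taking slices:

    fn main() {
        let s = "привет"; // 6 codepoints, 12 bytes in UTF-8

        // Codepoint boundaries are well defined because &str is
        // guaranteed to hold valid UTF-8.
        let third = s.char_indices().nth(2).map(|(i, _)| i).unwrap();
        assert!(s.is_char_boundary(third));

        // Slices use byte offsets, but the cut must land on a char
        // boundary; &s[..3] would panic because it splits 'р' in half.
        assert_eq!(&s[..third], "пр");
    }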

> If you need a specific subsequence of some UTF-8 encoded text, use a library that fetches it from the byte storage. There is no point in making a programming language type, because you'll lock in to certain usages, and next thing you'll need is a completely different type.

When I need to work with bytes, I use an array of bytes. Not a string, but an array of bytes. It was hard to grasp after years of experience with C, but I did manage it at some point. Char is not a byte. Byte is not a char. An array is not a string, and a string is not an array. When I need to deal with bytes, I use an array. When I need to deal with characters, I use a string. It is a non-trivial idea for a C programmer, because all his experience tells him that a character and a byte are the same thing. So if a character is not a byte, then (he reasons) characters don't exist.
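In Rust terms, the distinction looks roughly like this (a minimal std-only sketch):

    fn main() {
        // An array/vector of bytes: no claim that this is text at all.
        let raw: Vec<u8> = vec![0xD0, 0xA0, 0xD1, 0x83, 0xD1, 0x81, 0xD1, 0x8C];

        // A string: guaranteed valid UTF-8. The "is this actually text?"
        // question is answered once, at the conversion.
        let text: String = String::from_utf8(raw).expect("not valid UTF-8");

        assert_eq!(text, "Русь");
        assert_eq!(text.len(), 8);           // length in bytes
        assert_eq!(text.chars().count(), 4); // length in codepoints
    }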

If characters-as-codepoints is too low an abstraction for my task, I can create another abstraction on top of it that deals with glyphs, words, tokens, sentences, or whatever. But the abstraction of codepoints must be a library feature, or I'd be forced to create it myself, validating UTF-8 all over the place, and so on. And if that were the case, what would be the point of having a string abstraction at all?
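For example, the glyph level can be layered on top of codepoints by a library; a sketch assuming the unicode-segmentation crate (just one possible choice):

    // Codepoints come from std; grapheme clusters ("glyphs" in the loose
    // sense above) come from a library layered on top of valid UTF-8.
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "e\u{301}"; // 'e' followed by a combining acute accent

        assert_eq!(s.chars().count(), 2);         // two codepoints
        assert_eq!(s.graphemes(true).count(), 1); // one user-perceived glyph
    }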

If characters-as-codepoints is too high an abstraction for my task, I can go lower and use an array of bytes.

It is really a simple idea; C as a language just tends to confuse people's minds by teaching them that char == int8_t. At least my mind was confused, and I managed to untangle that mess completely only around my 30th birthday. Several years later I found that Rust's std differentiates chars from bytes and strings from arrays exactly as I do. I fell in love with Rust immediately.



> No, I think you should make a text rope for UTF-8

But that's what I made. UTF-8 is encoded as bytes. I can store the bytes in the rope just fine.

The rope has a very simple API, basically read() and write() functions, just like a standard FILE I/O API. Do you want to go tell file system developers that they should add APIs for write_UTF8(), write_LATIN1(), write_KOI8(), write_BYTES(), etc.? And then go to network API designers to do the same for the socket I/O functions? And so on? Of course you don't do that; it would be very bad factoring.

And it's just the same for a rope API.
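To make that concrete, here is a toy sketch of such an encoding-agnostic byte API (a flat Vec<u8> standing in for the real rope structure; the names are made up for the example):

    // The storage traffics only in bytes, exactly like a file or a
    // socket. Whatever encoding the caller uses is the caller's business.
    struct Rope {
        bytes: Vec<u8>,
    }

    impl Rope {
        fn new() -> Self {
            Rope { bytes: Vec::new() }
        }

        // write(): splice raw bytes in at a byte offset.
        fn write(&mut self, offset: usize, data: &[u8]) {
            self.bytes.splice(offset..offset, data.iter().copied());
        }

        // read(): borrow raw bytes from a byte range.
        fn read(&self, range: std::ops::Range<usize>) -> &[u8] {
            &self.bytes[range]
        }
    }

    fn main() {
        let mut rope = Rope::new();
        rope.write(0, "день".as_bytes());   // UTF-8 goes in as bytes...
        let out = rope.read(0..8).to_vec(); // ...and comes out as bytes
        assert_eq!(String::from_utf8(out).unwrap(), "день");
    }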

> The more assumptions about your data you make at compile time, the less runtime conditioning you need, and the faster your code will be.

This is true in general, but the rope is just storage. No processing happens there. The rope couldn't care less what you store in it. There is no point in having multiple identical read/write implementations.

But if you insist, I recommend asking for a UTF-8-optimized HDD at your local computer shop :-)

> making your program able to deal with different internal encodings is a PITA

If your program has to deal with multiple external encodings, either you can convert at the boundaries to a canonical internal encoding, or you can't, in which case it probably becomes a little more work since you have to convert in different places.

This has nothing to do with what I said, though.


> [...] Char is not a byte. Byte is not a char. [...]

In this paragraph you seem to be confusing C's "char" with the much fuzzier idea of a "character", which has something like 13 valid definitions.

C's "char" is abstractly defined as the smallest addressable unit of memory available on the machine (required to have at least 8 bits), and historically there have existed 8-bit, 9-bit, 16-bit, or even 36-bit chars. In today's practice it is universally taken synonymous for (8-bit) bytes since all hardware is 8-bit by now. Some people like to be pedantic about the distinction between byte and char, but I most often do not, especially since char is the generally interoperable type in C (with respect to type punning etc.), while uint8_t to my knowledge is not.

"Character" is sometimes understood as "Unicode codepoint" (typically represented as a 32-bit entity, or even as a UTF-8 encoded slice of bytes) or in some cases understood as "Unicode glyph" (probably represented as a slice of codepoints), sometimes understood as even other things.


There is a prominent counterexample to the 8-bit char: DSPs. TI has several DSPs with 16-bit chars, and I think I've encountered one with a 32-bit char. These boards are actually pretty common in industrial settings.


(This is also what D does.)



