Can Unicode be implemented in a thousand lines or so of C?

vidarh · on Dec 25, 2023

Define implementing Unicode. If you want to support RTL/bidi, grapheme clusters, and every little detail, probably not.

But 99% of the utility for most people is there if you can find the right column, move left and right by character instead of by byte, and output UTF-8 sequences correctly. In C it's a minor pain, but not impossible.
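
To give a sense of scale, here is a minimal sketch of the "move by character" and "output UTF-8" parts. It assumes the buffer already holds valid UTF-8 and treats one code point as one character, so combining marks and grapheme clusters are ignored. The names `utf8_next`, `utf8_prev` and `utf8_encode` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
static int is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Byte offset of the code point that starts after the one at pos.
 * Returns len when there is nothing further to the right. */
size_t utf8_next(const char *s, size_t len, size_t pos)
{
    if (pos >= len)
        return len;
    pos++;
    while (pos < len && is_cont((unsigned char)s[pos]))
        pos++;
    return pos;
}

/* Byte offset of the code point that starts before pos.
 * Returns 0 when already at the start of the buffer. */
size_t utf8_prev(const char *s, size_t pos)
{
    if (pos == 0)
        return 0;
    pos--;
    while (pos > 0 && is_cont((unsigned char)s[pos]))
        pos--;
    return pos;
}

/* Encode one code point into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 for surrogates
 * and values above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Cursor movement then becomes `pos = utf8_next(buf, len, pos)` or `pos = utf8_prev(buf, pos)` instead of `pos++` and `pos--`. Validating untrusted input (overlong forms, truncated sequences) is deliberately left out and would add more lines.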

adestefan · on Dec 26, 2023

Libgrapheme will get you there for the most part. It will let you find characters, words, and sentences. It’s nice because it returns byte offsets, so you can use them directly with C data structures. I wish there were a way to get the number of characters along with the byte offsets, which would help with things like line breaking.

https://libs.suckless.org/libgrapheme/

JonChesterfield · on Dec 26, 2023

Thank you for this reference. Deriving a freestanding C99 implementation from the standard sounds great. That can probably be rendered as a single source file to drop into a project that otherwise deals only in ASCII.

vidarh · on Dec 26, 2023

Yeah, but you've added a dependency, which the author doesn't seem to want. If you're willing to add dependencies there are multiple other options too.

torstenvl · on Dec 26, 2023

Rudimentary BMP support, probably. You basically have to account for combining characters and double-width characters.
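
As a rough illustration of what "account for combining characters and double-width characters" means for column counting, here is a sketch that works on already-decoded code points. The ranges are a small hand-picked subset (the Combining Diacritical Marks block plus the main CJK, Hangul and fullwidth ranges), not the full tables; a real implementation would generate them from the Unicode data files. The names `cp_width` and `str_columns` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Terminal column width of one code point: 0, 1 or 2.
 * Assumption: only a few well-known BMP ranges are covered;
 * everything else, including control characters, counts as 1. */
int cp_width(uint32_t cp)
{
    /* Combining Diacritical Marks: drawn over the previous character. */
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;

    /* Common double-width ranges (incomplete). */
    if ((cp >= 0x1100 && cp <= 0x115F) ||                 /* Hangul Jamo initial consonants */
        (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F) || /* CJK radicals through Yi */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||                 /* Hangul syllables */
        (cp >= 0xF900 && cp <= 0xFAFF) ||                 /* CJK compatibility ideographs */
        (cp >= 0xFF00 && cp <= 0xFF60) ||                 /* fullwidth forms */
        (cp >= 0xFFE0 && cp <= 0xFFE6))                   /* fullwidth signs */
        return 2;

    return 1;
}

/* Total columns occupied by n code points. */
size_t str_columns(const uint32_t *cps, size_t n)
{
    size_t cols = 0;
    for (size_t i = 0; i < n; i++)
        cols += (size_t)cp_width(cps[i]);
    return cols;
}
```

With this, "e" followed by U+0301 occupies one column and a CJK ideograph such as U+4E2D occupies two, which is enough to keep a cursor aligned for a lot of everyday text.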

Emojis and multi-character* presentation form code points like U+FDFD would take a lot more work to do correctly.

* I'm using "character" here in the linguistic sense. Unicode did not invent the word "character" and it doesn't only mean code points.