UTF-n: Brainstorming alternate text encodings (jed.github.com)
9 points by tr4nslator on Feb 1, 2010 | 5 comments


Sure, UTF-8 sucks for Chinese and UTF-16 is bad at English, but in practice, high-codepoint languages are rarely mixed with low-codepoint ones. Notice that when sending an email, many mail programs will select the most concise encoding that happens to cover every character in your message, which usually is not UTF-8 or UTF-16.
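
A rough sketch of that "pick the most concise encoding that covers every character" behavior, in Python. The candidate list and the `smallest_encoding` helper are assumptions made up for illustration; real mail clients use their own lists and rules.

```python
# Hypothetical candidate list, ordered from narrowest to widest.
# "utf-16-le" is used instead of "utf-16" so no byte order mark is counted.
CANDIDATES = ["ascii", "latin-1", "gb2312", "utf-8", "utf-16-le"]


def smallest_encoding(text, candidates=CANDIDATES):
    """Return (name, size_in_bytes) of the smallest candidate that can encode text."""
    best = None
    for name in candidates:
        try:
            size = len(text.encode(name))
        except UnicodeEncodeError:
            continue  # this encoding cannot represent every character
        if best is None or size < best[1]:
            best = (name, size)  # strict '<' keeps the earlier candidate on ties
    return best


for sample in ["plain English text", "中文", "English mixed with 中文"]:
    print(repr(sample), smallest_encoding(sample))
```

With this particular list, neither UTF-8 nor UTF-16 wins for any of the three samples, which is the point being made about legacy encodings.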


Counterexample: High-codepoint text in HTML or XML 1.0 vocabularies.
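
To put numbers on the counterexample, here is a small Python sketch. The markup snippet and the `encoded_sizes` helper are made up for illustration: the Chinese text alone is smaller in UTF-16, but once ASCII tags surround it, UTF-8 comes out ahead.

```python
def encoded_sizes(text):
    """Byte counts for text in UTF-8 and UTF-16 (little-endian, no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


bare = "中文"
marked_up = '<p class="note">中文</p>'  # hypothetical markup around the same text

print(encoded_sizes(bare))       # {'utf-8': 6, 'utf-16': 4}
print(encoded_sizes(marked_up))  # {'utf-8': 26, 'utf-16': 44}
```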


Adding some Chinese (from Wikipedia) does actually show UTF-n as the worst case, at least compared to UTF-16 and UTF-8. I got UTF-16: 0%, UTF-8: 5%, and UTF-n: 7%.


It would be good to have demo text that plays to UTF-n's advantage, so I don't have to copy-paste from someplace like jp.wikipedia.org myself :)

It looks like it preserves some of UTF-8's stream synchronization properties, but does it have UTF-8's wonderful property of being recognizable by simple heuristics with great confidence, even for tiny sequences?
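
For reference, the UTF-8 side of that heuristic can be sketched in a few lines of Python. The `looks_like_utf8` name is an assumption for illustration, and this says nothing about how UTF-n would fare: non-ASCII bytes that happen to form valid UTF-8 are rare in other encodings, so a strict decode is already strong evidence.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data has non-ASCII bytes and decodes as strict UTF-8."""
    if all(b < 0x80 for b in data):
        return False  # pure ASCII is valid UTF-8 but gives no evidence either way
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("é".encode("utf-8")))        # True
print(looks_like_utf8("é".encode("latin-1")))      # False
print(looks_like_utf8("中文".encode("utf-16-le")))  # False
```

Even the two-byte sequence for "é" is enough here, because the same character in Latin-1 is a lone lead byte with no continuation byte after it.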


For Russian, the difference between UTF-8 and UTF-n is statistically insignificant.



