Hacker News

I'd appreciate an analysis of how it compresses! The encoding looks highly compressible, so I'd expect it to be competitive with UTF-8 for English text, and it seems like it would beat UTF-8 for East Asian languages.



At first glance I'd say it doesn't compress well at all, especially if the compressor works on bytes.

Bit-streams of characters whose width isn't a multiple of 8 look random-ish when viewed as bytes, and a byte-oriented compressor can't exploit redundancy it can't see at byte boundaries.
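A minimal sketch of the alignment problem (the `pack21` packer here is a hypothetical UTF-21 encoder, not anything from the article): when you pack identical 21-bit code units back to back, the byte sequence only repeats every lcm(21, 8) = 168 bits, i.e. every 21 bytes, so a byte-level matcher sees scattered values instead of one repeated byte.

```python
def pack21(codepoints):
    """Pack code points into a contiguous stream of 21-bit units."""
    bits = nbits = 0
    out = bytearray()
    for cp in codepoints:
        bits = (bits << 21) | cp
        nbits += 21
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:  # pad the final partial byte with zero bits
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

# 'e' repeated 64 times: in UTF-8 this is one byte value repeated 64 times,
# but in the 21-bit packing the byte pattern only repeats every 21 bytes.
packed = pack21([ord('e')] * 64)
print(sorted(set(packed[:21])))      # many distinct byte values per period
print(packed[:21] == packed[21:42])  # the period really is 21 bytes
```

Compare that with the UTF-8 case, where `'e' * 64` encodes to 64 copies of the same byte and any LZ-style compressor collapses it immediately.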


> Especially if the compression uses bytes

Arithmetic coding can work over any symbol alphabet and spend fractional bits per symbol, and it's been around since the seventies.


Arithmetic coding goes one token/symbol at a time, just like most kinds of compression. The fractional bits come after token selection, and aren't really relevant here.

You can split the input into tokens that aren't a multiple of 8 bits, sure. But that's its own decision. 7- or 21-bit tokens (or whatever) could be fed into a Huffman tree just as easily.
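A hedged sketch of that point: the entropy coder doesn't care how wide the tokens are, only their statistics. Here is a toy Huffman builder (names and structure are my own, not from the thread) fed 21-bit code points directly instead of bytes.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code from an iterable of symbols.
    The symbols can be 7-bit, 21-bit, or any hashable tokens."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tiebreak, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0/1.
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (fa + fb, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# 21-bit tokens (full code points) are fed in exactly as bytes would be:
tokens = [ord(c) for c in "compression is about symbol statistics"]
code = huffman_code(tokens)
```

Frequent symbols get shorter codes regardless of token width, which is the sense in which tokenization and entropy coding are independent decisions.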


Yes, but if you run a normal compression tool like gzip, RAR or 7z on it, it's still going to work on bytes.


Arithmetic coding is flexible about its output. Of course you can retokenize unusual input, but you can usually do that for any algorithm if you're willing to modify it. The point stands that UTF-21 cannot have a substantial advantage once you compress; it will usually come out worse.
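A quick experiment supporting this, under my own assumptions (the `pack21` helper below is a hypothetical UTF-21 packer, and zlib stands in for the byte-oriented compressors mentioned above): for English text the 21-bit packing starts out larger than UTF-8 and also compresses worse.

```python
import zlib

def pack21(codepoints):
    """Hypothetical UTF-21: pack code points as contiguous 21-bit units."""
    bits = nbits = 0
    out = bytearray()
    for cp in codepoints:
        bits = (bits << 21) | cp
        nbits += 21
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

text = ("Compression exploits redundancy, and byte-oriented compressors "
        "look for repeated byte strings. ") * 20
utf8 = text.encode("utf-8")
utf21 = pack21(map(ord, text))

# Raw and zlib-compressed sizes for each encoding.
print(len(utf8), len(zlib.compress(utf8, 9)))
print(len(utf21), len(zlib.compress(utf21, 9)))
```

On this kind of input UTF-8 is smaller before compression (one byte per ASCII character versus 21 bits) and the misaligned 21-bit stream gives the byte-level matcher far less to work with afterwards.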



