
Because that's the Punycode representation:

https://en.wikipedia.org/wiki/Punycode

https://www.punycoder.com/
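For anyone who wants to try it locally, here is a minimal sketch using the `punycode` and `idna` codecs that ship with Python's standard library. The helper names `to_ascii` and `to_unicode` are just made up for illustration, and the built-in `idna` codec follows the older IDNA 2003 rules, so treat this as a demonstration rather than a complete IDN implementation.

```python
def to_ascii(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII ("xn--") form, label by label."""
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname: str) -> str:
    """Convert an ASCII ("xn--") hostname back to Unicode."""
    return hostname.encode("ascii").decode("idna")


if __name__ == "__main__":
    # Raw Punycode of a single label, without the "xn--" prefix.
    print("münchen".encode("punycode"))   # b'mnchen-3ya'

    # Full hostname: each non-ASCII label gets the "xn--" prefix.
    print(to_ascii("münchen.de"))         # xn--mnchen-3ya.de

    # Not Chinese-specific, but Chinese labels go through the same path.
    encoded = to_ascii("中文.com")
    print(encoded, "->", to_unicode(encoded))
```

The "xn--" prefix is what marks a label as Punycode-encoded; plain ASCII labels such as "de" or "com" pass through unchanged.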



I wasn't aware of this. I'd seen those URLs before, but only in the context of Chinese ones, and thought it was Chinese-specific.

It's interesting because I just went down a rabbit hole implementing byte-level encoding for using language models with Unicode. There, each byte in a Unicode character is mapped to a printable character, with code points going up into the range 255 < ord(x) < 511 (I don't remember the highest, but the point is that each byte is mapped to another printable Unicode character).
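Roughly, the mapping can be sketched like this. It is modeled on my understanding of the `bytes_to_unicode` helper in the GPT-2 encoder, not copied from it, and `encode_text` / `decode_text` are hypothetical helper names added only to show the round trip.

```python
def bytes_to_unicode() -> dict:
    """Map every byte value 0..255 to a distinct printable Unicode character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))   # 0x21..0x7E
        + list(range(0xA1, 0xAC + 1))         # Latin-1 punctuation block
        + list(range(0xAE, 0xFF + 1))         # skips 0xAD (soft hyphen)
    )
    mapping = {b: chr(b) for b in printable}

    # Everything else (control bytes, space, etc.) is shifted past 255.
    next_code = 256
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(next_code)
            next_code += 1
    return mapping


def encode_text(text: str) -> str:
    """Replace each UTF-8 byte of `text` with its printable stand-in."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


def decode_text(encoded: str) -> str:
    """Invert encode_text."""
    inverse = {ch: b for b, ch in bytes_to_unicode().items()}
    return bytes(inverse[ch] for ch in encoded).decode("utf-8")


if __name__ == "__main__":
    table = bytes_to_unicode()
    print(max(ord(ch) for ch in table.values()))  # highest code point used
    print(encode_text("hello world"))
    print(decode_text(encode_text("héllo 中文")))
```

With this construction 68 bytes get remapped, so the highest code point comes out at 323, and the space byte lands on "Ġ" (code point 288), which is why that character shows up all over GPT-2 style vocabularies.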

See https://github.com/openai/gpt-2/blob/9b63575ef42771a015060c9...

And the actual list of characters:

https://github.com/rbitr/llm.f90/blob/dev/phi2/phi2/pretoken...



