
That's true per unit of time, because languages with lower information density tend to be spoken at a faster rate to compensate (see: Spanish speakers speaking way faster than English speakers). But LLMs don't ingest data per unit of time; they ingest text.
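To make "ingest text" a bit more concrete, here's a rough sketch comparing the same short phrase counted as characters and as UTF-8 bytes. The sample phrases and the `text_sizes` helper are made up for illustration, and neither count is what a real tokenizer would produce; actual token counts depend on the tokenizer's vocabulary.

```python
# Toy comparison: the same phrase measured as characters and as UTF-8 bytes.
# Assumption: these sample phrases are illustrative only, and neither count
# corresponds to the token count of any real LLM tokenizer.
samples = {
    "english": "I love you",
    "chinese": "我爱你",
}


def text_sizes(text):
    """Return the character count and UTF-8 byte count of a string."""
    return {"chars": len(text), "utf8_bytes": len(text.encode("utf-8"))}


sizes = {lang: text_sizes(text) for lang, text in samples.items()}

for lang, size in sizes.items():
    print(lang, size)
```

So the Chinese phrase is much shorter in characters but about the same size in bytes, which is why the representation you pick matters.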


Sure, but, depending on your representation, you can only have so many different tokens, and if you use, say, pinyin, then you don't have much of an advantage over English.

It'll be interesting to see how LLMs do when trained on concepts (i.e. Chinese characters), rather than sounds, though.


> Sure, but, depending on your representation, you can only have so many different tokens

From my understanding (which only comes from reading HN), unique token count isn't an issue that LLMs run into. If it is, then that would be a bummer for the possibilities of exploiting all the features of the Chinese language.

> and if you use, say, pinyin

Well yeah, pinyin would be inefficient because you're stripping away all the "built in" semantics of a Chinese character to focus only on the pronunciation.
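A quick sketch of what gets lost, using a tiny hand-written table (the `TOY_PINYIN` dict and `group_by_pinyin` helper are my own illustration, not a real pinyin library; tones are written as trailing digits):

```python
from collections import defaultdict

# Hand-written toy table (assumption: illustrative only, not a real library).
# Tone numbers are appended to the syllable, e.g. "ta1" for first-tone "ta".
TOY_PINYIN = {
    "他": "ta1",   # he
    "她": "ta1",   # she
    "它": "ta1",   # it
    "是": "shi4",  # to be
    "事": "shi4",  # matter, affair
    "市": "shi4",  # city, market
}


def group_by_pinyin(table):
    """Group characters that share the same pinyin syllable."""
    groups = defaultdict(list)
    for char, syllable in table.items():
        groups[syllable].append(char)
    return dict(groups)


groups = group_by_pinyin(TOY_PINYIN)

for syllable, chars in groups.items():
    print(syllable, chars)
```

Six distinct characters collapse into two syllables, so a model fed pinyin has to recover from context what the characters carried directly.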


I imagine you can always expand your token count, but then is that very different from syllabic alphabets (that encode multiple syllables in one character)?

Granted, that doesn't apply to Chinese, which encodes concepts, so that'll be interesting to see in LLMs.


I think so, because even if you're encoding multiple syllables, that's just the pronunciation.

I've learned languages with alphabets and with glyph scripts like Chinese, and *what I don't like about alphabet languages like English is that the letters themselves give little hint of a word's meaning, although you could learn the Latin root words and guess from there. Of course you know how to say it, but you don't know what it means. With Chinese, you might guess what it means, but you don't know how to say it.* The characters have more meaning in them, although this isn't true of every word, as non-speakers often assume.

In short, in my experience learning Chinese was way easier than learning Vietnamese. The two have historical ties: Vietnamese used a fork of Chinese script until the early 1900s, when it transitioned to Latin script in order to improve literacy rates. Sure, that improved literacy rates, because to read you only need to know how to pronounce, but it doesn't mean the meaning/semantics of glyph languages are harder to learn.



