
> The byte size allowed would need to be about 100x the length limit. That’s… kind of a lot?

Would it need to be, though? ~10x ought to be enough for any realistic string that wasn't specially crafted to be annoying.



Valid question, and I think you're right in the abstract and most of the time. But I also think you end up with a mismatch.

What's the concrete spec for the limit if you've only got 10x storage per grapheme cluster?

Probably you end up providing the limit in bytes. That's fine, but then it's no longer the "hybrid counting" thing.


They show a single Hindi character that is 15 bytes in UTF-8. That's enough over 10 that it would be believable that Hindi words could get uncomfortably close to the 10x limit.


Triple conjuncts are very uncommon in Indic scripts, though a few are in common use: stri, for example, is a single-syllable word that means woman or wife in many languages. Pick your Indic script, and that’ll be LETTER SA, SIGN VIRAMA, LETTER TA, SIGN VIRAMA, LETTER RA, VOWEL SIGN I. Most Indic syllables/grapheme clusters are a single consonant and a single vowel sign, if not the inherent vowel -a. Conjuncts use their script’s SIGN VIRAMA to suppress the inherent vowel and normally graphically join the next consonant (an orthographic choice rarely broken, a little like ß being ss in German).
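To put numbers on that sequence, here is a minimal sketch that builds it in Devanagari (my choice of script for illustration; the comment above leaves the script open) and measures it. It follows the code point names exactly as listed above and uses only the standard library's `unicodedata.lookup`.

```python
import unicodedata

# Code point names as listed in the comment, with DEVANAGARI picked
# as the script (an assumption made for this illustration).
NAMES = [
    "DEVANAGARI LETTER SA",
    "DEVANAGARI SIGN VIRAMA",
    "DEVANAGARI LETTER TA",
    "DEVANAGARI SIGN VIRAMA",
    "DEVANAGARI LETTER RA",
    "DEVANAGARI VOWEL SIGN I",
]

stri = "".join(unicodedata.lookup(name) for name in NAMES)

code_points = len(stri)                  # Unicode scalar values
utf8_bytes = len(stri.encode("utf-8"))   # storage cost in UTF-8

print(code_points, utf8_bytes)  # prints "6 18"
```

Every code point in the Devanagari block takes 3 bytes in UTF-8, so this one syllable already costs 18 bytes, well over a 10-bytes-per-cluster budget if it is counted as a single cluster.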

I’m not so confident about Hindi, though 25% seems very low if we’re talking frequency; in Telugu writing, definitely a lot more than 25% of syllables specify a vowel sign and thus take at least two Unicode scalar values to represent.

My feeling (as a white fellow moved to India, with well above average knowledge of Indian languages and Unicode for a place like HN, but not yet fluent in any Indian language) is that some four-bytes-per-code-point script might conceivably get realistic existing texts above an average of 10 bytes per syllable for at least twenty syllables, and that most Indic languages could sustain it indefinitely in specific deliberate styles of writing.


A single Hindi character, yes. But they also mention that only ~25% of Hindi characters use combining marks.


Most of them are vowels. They're pretty common. (Also, I feel like you of all people would understand the issues with "only 25% of the time this happens, therefore surprising behavior at the edges is unlikely to happen".)


That's why you have a limit on both.
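A rough sketch of what "a limit on both" could look like, assuming the byte cap is derived as 10 bytes per allowed cluster (the ~10x figure from upthread). The names `within_limits` and `approx_cluster_count` are hypothetical, and the default counter is only a crude stand-in: it counts code points that are not combining marks, which is not real UAX #29 grapheme segmentation. Pass a proper segmenter as `count_clusters` for real use.

```python
import unicodedata

def approx_cluster_count(text: str) -> int:
    """Crude approximation (NOT UAX #29): count code points whose
    general category is not a mark (Mn, Mc, Me)."""
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith("M"))

def within_limits(text: str, max_clusters: int,
                  bytes_per_cluster: int = 10,
                  count_clusters=approx_cluster_count) -> bool:
    """Accept text only if it fits BOTH the cluster limit and a
    byte limit of max_clusters * bytes_per_cluster."""
    max_bytes = max_clusters * bytes_per_cluster
    return (count_clusters(text) <= max_clusters
            and len(text.encode("utf-8")) <= max_bytes)

# An "e" followed by 20 combining acute accents: one approximate
# cluster, but 1 + 20 * 2 = 41 bytes in UTF-8.
crafted = "e" + "\u0301" * 20

print(within_limits("hello", 5))   # True: 5 clusters, 5 bytes
print(within_limits(crafted, 1))   # False: the byte cap of 10 is exceeded
```

The cluster limit is the one users see; the byte cap only exists to bound storage, so ordinary text should hit the cluster limit first and only crafted input like the second example should trip the byte cap.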



