
Much easier said than done. Try to actually do that. A good start would be the IDS.TXT I linked above; it gives about 500 approximately atomic characters to start with [1]. Now extend this to the entirety of the Ideographic Variation Database [2], which tries to solve the most-cited problems with Han unification. And then add some kind of semantic annotation, as you've suggested (I have no idea how; maybe you have a better one).

[1] 123 components that are already encoded, 121 components ({01}, ...) that are partial characters, ~123 "unpresentable" components (?), and ~115 "minor" variations (〾) that may require an additional component or two.

[2] https://unicode.org/ivd/




> Much easier said than done. Try to actually do that

Ha-ha, thanks for the offer, but I'll pass. All that complexity is precisely why I'm not even going to try: it requires a lot of dedicated team effort, and a simple "try" is doomed to fail.



