Unicode normalizing library to parse attacker text as English (github.com/wanderingstan)
7 points by fragmede on Sept 28, 2021 | 2 comments



Wow, there was a thread today about moderating user-generated content, where an HN comment described how users try to get away with posting offensive content by using symbols from other languages that look like English letters. I was wondering whether one could build an ML system that takes the visual appearance of words into account. And now I find your post, which does something like that, albeit with regexes. Interesting!


A lot of this can be achieved with Unicode normalization [1], which is already built into Python [2].

I’m not saying this library isn’t more comprehensive (it’s been years since I’ve played with this).

1: https://unicode.org/reports/tr15/#Canon_Compat_Equivalence

2: https://docs.python.org/3.9/library/unicodedata.html
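
For anyone curious what that looks like in practice, here is a minimal sketch using the standard-library `unicodedata.normalize` with the NFKC form from [1]. The helper name `fold_compat` and the sample strings are made up for illustration; this is not how the linked library works. It also shows one limit: normalization folds compatibility variants (fullwidth, mathematical bold) to plain letters, but it does not map look-alike letters from another script, such as Cyrillic "а", to Latin.

```python
import unicodedata


def fold_compat(text: str) -> str:
    """Hypothetical helper: apply NFKC (compatibility decomposition,
    followed by canonical composition) to the input text."""
    return unicodedata.normalize("NFKC", text)


# Fullwidth Latin letters spelling "Hello"
fullwidth = "\uff28\uff45\uff4c\uff4c\uff4f"
# Mathematical bold letters spelling "Hello"
math_bold = "\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428"
# "paypal" where the second character is CYRILLIC SMALL LETTER A (U+0430)
cyrillic = "p\u0430ypal"

print(fold_compat(fullwidth))             # Hello
print(fold_compat(math_bold))             # Hello
print(fold_compat(cyrillic) == cyrillic)  # True: NFKC leaves the Cyrillic letter alone
```

So normalization probably covers the "fancy font" tricks, while cross-script homoglyphs would still need a separate confusables mapping on top.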



