Unicode normalizing library to parse attacker text as English

LuisMondragon · on Sept 29, 2021

Wow, there was a thread today about moderating user-generated content, and an HN comment there described how users try to get away with posting offensive content by using symbols from other languages that look like English letters. I was wondering if one could make an ML system that takes into account the appearance of words. And now I find your post, which does something like that, albeit with regexes. Interesting!

wodenokoto · on Sept 29, 2021

A lot of this can be achieved with Unicode normalization [1], which is already built into Python [2].

I’m not saying this library isn’t more comprehensive (it’s been years since I’ve played with this).

1: https://unicode.org/reports/tr15/#Canon_Compat_Equivalence

2: https://docs.python.org/3.9/library/unicodedata.html
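
For anyone curious what that looks like in practice, here is a minimal sketch using only the standard `unicodedata` module. The function name `fold_lookalikes` is made up for illustration, and this is not how the library in the post works; it only shows how far compatibility normalization (NFKD/NFKC) plus stripping combining marks gets you.

```python
import unicodedata


def fold_lookalikes(text: str) -> str:
    """Hypothetical helper: fold compatibility variants and accents toward plain letters."""
    # NFKD replaces compatibility characters (fullwidth, mathematical bold, etc.)
    # with their plain equivalents and splits accented letters into base + mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (accents, overlays) left over from the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains and fold case for comparison.
    return unicodedata.normalize("NFKC", stripped).casefold()


fullwidth = "\uff48\uff45\uff4c\uff4c\uff4f"                    # fullwidth "hello"
math_bold = "\U0001d421\U0001d41e\U0001d425\U0001d425\U0001d428"  # mathematical bold "hello"
cyrillic = "p\u0430ypal"                                       # Cyrillic "а" in place of Latin "a"

print(fold_lookalikes(fullwidth))
print(fold_lookalikes(math_bold))
print(fold_lookalikes(cyrillic) == "paypal")
```

The first two fold to plain "hello", but the last comparison is False: normalization does not touch cross-script lookalikes such as Cyrillic "а", because they are distinct letters rather than compatibility variants. That gap is presumably where a dedicated confusables mapping, like the one in the post, earns its keep.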