
I'm thinking of something like a comment or an abstract; security is not an issue here because input validation and escaping are done elsewhere.

Basically, a string like "ham, egg." should result in "ham" and "egg", and "Ветчина, яйцо." should likewise result in "Ветчина" and "яйцо".

The challenge is that you cannot whitelist all possible characters, as there are (imho) too many charsets.




Well, barring the practice of specifying meaningful characters, the only thing I can come up with is to have your program use statistics to take its best guess at what the 'special' characters are. Let's say 95% of the characters fall between 65 and 90 (uppercase A to Z), and every now and then there's a 44-32 pair (a comma followed by a space). Then your program could be pretty sure that 44-32 is a delimiter and that 65 to 90 is the range of characters used in keywords. (These code points are ASCII.)
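A rough sketch of that guess in Python, assuming the keywords sit in one contiguous code-point range. The names `dominant_range` and `guess_delimiter` and the `share` threshold are made up for illustration; the default is 0.7 rather than 95% only because the sample string is short.

```python
import math
from collections import Counter

def dominant_range(text, share=0.7):
    """Return the narrowest code-point range (lo, hi) that covers at
    least `share` of the characters in `text`, or None if it is empty."""
    codes = sorted(ord(c) for c in text)
    if not codes:
        return None
    k = math.ceil(share * len(codes))  # how many characters the range must hold
    # Slide a window of k sorted code points and keep the narrowest one.
    start = min(range(len(codes) - k + 1),
                key=lambda i: codes[i + k - 1] - codes[i])
    return codes[start], codes[start + k - 1]

def guess_delimiter(text, share=0.7):
    """Return the most frequent run of out-of-range code points as a
    tuple, e.g. (44, 32) for comma + space, or None if there is none."""
    rng = dominant_range(text, share)
    if rng is None:
        return None
    lo, hi = rng
    runs = Counter()
    run = []
    for c in text:
        if lo <= ord(c) <= hi:
            if run:                      # a keyword character ends the run
                runs[tuple(run)] += 1
                run = []
        else:
            run.append(ord(c))
    if run:                              # trailing run, e.g. a final period
        runs[tuple(run)] += 1
    return runs.most_common(1)[0][0] if runs else None

sample = "HAM, EGG, SPAM, TOAST"
print(dominant_range(sample))   # (65, 84)
print(guess_delimiter(sample))  # (44, 32)
```

On short or mixed-case input the guess gets shaky: a lone capital letter in otherwise lowercase text falls outside the dominant range and is counted as a delimiter candidate.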

However, that does nothing to eliminate words like 'in' and 'of' in a query, which you may want to do. I don't think it's very practical, and you should probably look at more practical ways to list possible delimiters. The approach above could still help you determine which charset you're using, though.
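One more practical route, sketched here as an assumption rather than a recommendation from this thread: let the regex engine decide what counts as a word character instead of listing delimiters yourself. In Python 3, `\w` in a `str` pattern is Unicode-aware, so no per-script whitelist is needed. The `STOP_WORDS` set and the `keywords` helper are hypothetical and would need real per-language lists.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real ones are per-language.
STOP_WORDS = {"in", "of", "a", "the"}

def keywords(text):
    """Split text into runs of Unicode word characters and drop stop words."""
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t.casefold() not in STOP_WORDS]

print(keywords("ham, egg."))                # ['ham', 'egg']
print(keywords("Ветчина, яйцо."))           # ['Ветчина', 'яйцо']
print(keywords("eggs in a basket of ham"))  # ['eggs', 'basket', 'ham']
```

Note that `\w` also matches digits and underscores, and it will not segment scripts that don't put spaces between words.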



