
The primary reason their code is faster is that they've written incorrect code that assumes you always have a UTF-8 locale. The normal isspace() does not assume this:

> The real programs spend most of their time in functions like mbrtowc() to parse multi-byte characters and iswspace() to test if they are spaces

This will produce bad results in the real world. People have previously posted about bugs related to this on Hacker News: https://news.ycombinator.com/item?id=36216389
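
For concreteness, here is a minimal sketch of the locale-aware loop the quoted sentence describes. mbrtowc() and iswspace() are the real POSIX interfaces; the surrounding word-count logic is illustrative, not coreutils' actual code:

    /* Locale-aware word counting: mbrtowc() decodes one multi-byte
       character per call according to the current locale, and
       iswspace() classifies it.  Illustrative sketch only. */
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>
    #include <wctype.h>

    static size_t count_words(const char *buf, size_t len)
    {
        mbstate_t st;
        memset(&st, 0, sizeof st);
        size_t words = 0;
        int in_word = 0;

        while (len > 0) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, buf, len, &st);
            if (n == (size_t)-1 || n == (size_t)-2) {
                /* invalid or incomplete sequence: skip one byte,
                   reset the state, and treat it as a non-space */
                memset(&st, 0, sizeof st);
                n = 1;
                wc = L'\0';
            } else if (n == 0) {
                n = 1;          /* embedded NUL: advance one byte */
            }
            if (iswspace((wint_t)wc)) {
                in_word = 0;
            } else if (!in_word) {
                in_word = 1;
                words++;
            }
            buf += n;
            len -= n;
        }
        return words;
    }

    int main(void)
    {
        setlocale(LC_ALL, "");  /* honor the user's locale */
        const char *s = "deux  mots";
        printf("%zu\n", count_words(s, strlen(s)));
        return 0;
    }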




Are we looking at the same thread? Folks there seem to be complaining that the old interfaces are anything from outdated to confusing to simply wrong, and I agree.

I think it's totally reasonable for a program designed in 2024 to say it only supports ASCII and UTF-8 encodings. Whether/how it should support the full spectrum of Unicode definitions of characters/graphemes/... is more interesting. For a lot of backend code (processing text-based network protocols, for example), focusing on ASCII is arguably best (e.g. functions like isspace and isdigit only returning true for ASCII characters). For more user-focused stuff, most of the world would say they'd like their native language supported well. Programs written with both use cases in mind should probably have a switch. (Or parallel APIs: e.g. Rust has {u8,char}::is_ascii_digit vs. char::is_numeric, which makes much more sense there than having one that switches behaviors based on an environment variable, as the correct behavior really depends on the call site.)
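
To make the ASCII-only option concrete in C (Rust's {u8,char}::is_ascii_digit being the analogous API), here is a pair of hypothetical locale-independent classifiers; the names are made up, only the character sets are standard ASCII:

    /* Hypothetical ASCII-only classifiers for protocol parsing.
       Unlike <ctype.h>'s isspace()/isdigit(), these never change
       meaning with the locale and are trivially inlinable. */
    static inline int ascii_isspace(unsigned char c)
    {
        return c == ' '  || c == '\t' || c == '\n' ||
               c == '\v' || c == '\f' || c == '\r';
    }

    static inline int ascii_isdigit(unsigned char c)
    {
        return c >= '0' && c <= '9';
    }

Because these never consult the locale, they behave identically on every machine, which is exactly what you want when parsing a wire protocol.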

Of course "say it only supports ASCII and UTF-8 encodings" is mutually exclusive with claiming to be a drop-in replacement for a utility that is specified as having POSIX locale support. This project does not make that claim, or even that it's intended to be actually used rather than illustrative of a performance technique.


> I think it's totally reasonable for a program designed in 2024 to say it only supports ASCII and UTF-8 encodings.

I think it should depend on the program; that might be reasonable for some programs, but in a lot of cases I think it won't be. (Your explanation includes some of the examples, although not all of them.)

Sometimes, it is most helpful to support only ASCII (non-ASCII bytes might still be passed through without any special processing to handle them; in some cases this effectively allows other ASCII-compatible encodings as well, such as EUC-JP).

Sometimes, a program should not need to deal with character encoding at all.

Sometimes, it makes sense to deal with whatever character encodings are used in the file formats the program is designed to handle.

Sometimes, it makes sense to support multiple character encodings, with or without conversion (depending on what is being done with them).
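
As a sketch of the "with conversion" case: assuming POSIX iconv(3) is available, a program can accept several input encodings and normalize to one internally. Error handling is abbreviated:

    /* Minimal conversion sketch using POSIX iconv(3): normalize
       Latin-1 input to UTF-8 before processing. */
    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        char in[] = "caf\xe9";          /* "café" in Latin-1 */
        char out[16] = {0};
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof out - 1;

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");
        printf("%s\n", out);            /* UTF-8 "café" */

        iconv_close(cd);
        return 0;
    }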

Even if a program only supports ASCII and UTF-8, depending on what it does with them, mentioning ASCII might be unnecessary, since UTF-8 is a superset of ASCII anyway.

But unfortunately many programs use UTF-8 (or other encodings, but mostly UTF-8) where it is inappropriate to do so, which can result in many problems including inefficiency.


It would be just as fast if it detected the locale correctly and used the UTF-8 state machine only for UTF-8 locales. Other locales could have their own state machines, or just fall back to a generic implementation.
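
Something like the following dispatch, say. nl_langinfo(CODESET) is the real POSIX call; the two scanner functions are hypothetical and left undefined here (the generic one would be the usual mbrtowc() loop):

    #include <langinfo.h>
    #include <locale.h>
    #include <string.h>

    typedef size_t (*counter_fn)(const char *, size_t);

    /* Hypothetical scanners: the fast path would be a UTF-8 DFA,
       the generic path the usual mbrtowc() loop. */
    size_t count_words_utf8(const char *buf, size_t len);
    size_t count_words_generic(const char *buf, size_t len);

    counter_fn select_counter(void)
    {
        setlocale(LC_ALL, "");                 /* adopt the user's locale */
        const char *cs = nl_langinfo(CODESET); /* e.g. "UTF-8", "ISO-8859-1" */
        if (strcmp(cs, "UTF-8") == 0)
            return count_words_utf8;
        return count_words_generic;
    }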


I'm wondering if you could cheaply build the state machine at startup time. Unless it's a multi-byte encoding, it should be trivial.
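
It should be, at least for single-byte encodings: a minimal sketch, assuming the locale has already been set with setlocale(), that classifies all 256 byte values once via btowc()/iswspace():

    #include <stdbool.h>
    #include <stdlib.h>
    #include <wchar.h>
    #include <wctype.h>

    static bool is_space_byte[256];

    /* Returns true if the table is usable, i.e. the locale's encoding
       is single-byte; after that, scanning is a plain table lookup. */
    static bool build_space_table(void)
    {
        if (MB_CUR_MAX != 1)
            return false;       /* multi-byte locale: need a real decoder */
        for (int b = 0; b < 256; b++) {
            wint_t wc = btowc(b);
            is_space_byte[b] = (wc != WEOF) && iswspace(wc);
        }
        return true;
    }

For multi-byte encodings other than UTF-8 (EUC-JP, Shift_JIS, ...) you would still need a real decoder, so the table only covers the easy case.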



