Hacker News new | past | comments | ask | show | jobs | submit login

ripgrep, and to a lesser extent, GNU grep both do. Whenever you run a query and it seems to execute very quickly, it's almost certainly because of SIMD. GNU grep will use SIMD somehow in many patterns. ripgrep uses it in even more.



Does GNU grep actually make explicit use of SIMD via intrinsics or assembly, or just through autovectorization and/or calling libc methods like memchr that are vectorized under the covers?


Yeah, I was being a bit succinct. As far as I'm aware, GNU grep has no explicit SIMD in it other than memchr through glibc. While some libc implementations utilize auto-vectorization of sorts (musl comes to mind), glibc does have Assembly implementations of memchr for several platforms that do indeed make use of SIMD explicitly.

ripgrep does the same, except for Intel at least, its memchr is implemented in Rust using SIMD intrinsics explicitly. And it also has a specialized SIMD algorithm (taken from Hyperscan mostly) for dealing with multiple patterns: https://github.com/BurntSushi/aho-corasick/tree/8b479a60906d...

Hyperscan takes this to a different level though. It has oodles more SIMD. I should have mentioned it in my original comment.


Thanks for the answer.

I was curious mostly because I never recalled any SIMD intrinsics in GNU code (ok, probably GIMP has them, so maybe I should say GNU utilities), so that would be a first.

It's interesting how much stuff leans on memchr, shame there aren't systematic wider versions taking more bytes to avoid false positives for longer literals (ignoring wmemchr): these could be nice and fast with SIMD.


Yeah I think the wider versions get a lot more complicated. memchr is a bit of a sweet spot, since its implementation is relatively simple. And things like glibc end up implementing specialized versions of it for most architectures _and_ instruction set extensions. (So e.g., there's one for SSE2 and for AVX on x86_64.)

And then of course there's PCMPESTRI (and its variants), but that has largely been a failure because of its high latency. :-( That's a shame, because that instruction does accept substrings up to 16 bytes.


Yeah I had some kind of brain fart thinking say a 4-character memchr() could be just as fast using the native method, but no of course it's 4x as slow (only an "aligned" memchr() like wmemchr() works like that). So yeah, it starts to get complicated quite if you want it to be fast.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: