> most speed gain tricks in ripgrep have nothing to do with string algorithms
There are actually several tricks at the "string algorithm" level. I didn't invent any of them, but no other search tool combines all of them AFAIK. They are (a rough sketch of each follows the list):
* Heuristic for choosing the skip byte in Boyer-Moore. (Pick rare bytes.)
* Roll UTF-8 decoding into the regex automaton.
* Use SIMD for multiple string literal search. (This is Teddy, an unpublished algorithm from Intel's Hyperscan project.)
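To make the first trick concrete, here is a minimal sketch of the rare-byte heuristic. The frequency ranking below is a toy stand-in (ripgrep's real table is derived from byte frequencies observed in real-world text and code), but the selection logic is the same idea:

```rust
/// Return the offset of the rarest byte in `needle`, according to a
/// background frequency ranking (lower rank = rarer). Toy sketch only;
/// the real table is tuned against actual corpora.
fn rarest_byte_offset(needle: &[u8], rank: &[u8; 256]) -> usize {
    needle
        .iter()
        .enumerate()
        .min_by_key(|&(_, &b)| rank[b as usize])
        .map(|(i, _)| i)
        .unwrap_or(0)
}

fn main() {
    // Hypothetical ranking: ASCII letters and space are "common",
    // everything else is "rare".
    let mut rank = [0u8; 256];
    for b in 0u16..256 {
        rank[b as usize] = match b as u8 {
            b'a'..=b'z' | b'A'..=b'Z' | b' ' => 255, // common
            _ => 0,                                  // rare
        };
    }
    // Skipping keys off '@' here, not the common letters around it.
    let needle = b"user@example.com";
    let off = rarest_byte_offset(needle, &rank);
    assert_eq!(needle[off], b'@');
    println!("skip on byte {:?} at offset {}", needle[off] as char, off);
}
```

At search time, the inner loop then scans for that rare byte (e.g., with `memchr`) and only verifies the full literal around each hit, which keeps the hot loop out of common bytes like spaces and vowels.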
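The UTF-8 trick amounts to compiling Unicode codepoint ranges down to byte-level transitions at regex-compile time, so the automaton never decodes codepoints while matching. The range-splitting step is exposed by the regex-syntax crate's `utf8` module; this sketch shows just that step, not the full compiler:

```rust
// Turn a Unicode scalar value range into sequences of byte ranges that
// a byte-level automaton can use directly as transitions. Wiring the
// output into a DFA is left out here.
use regex_syntax::utf8::Utf8Sequences;

fn main() {
    // Every codepoint in U+0080..=U+07FF encodes as two bytes; the
    // iterator yields byte-range pairs covering exactly that set.
    for seq in Utf8Sequences::new('\u{0080}', '\u{07FF}') {
        println!("{:?}", seq);
    }
}
```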
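A port of that SIMD algorithm (Teddy) is available these days through the aho-corasick crate's `packed` module. Whether the SIMD path actually engages depends on the patterns and available CPU features, and the searcher falls back to other strategies transparently, so treat this as illustrative usage rather than a guarantee of which code path runs:

```rust
use aho_corasick::packed::Searcher;

fn main() {
    // `new` returns None if the pattern set can't be handled by the
    // packed searchers (e.g. too many patterns), hence the Option.
    let searcher = Searcher::new(["Sherlock", "Moriarty"]).unwrap();
    let haystack = "Sherlock Holmes finally outwits Moriarty";
    for m in searcher.find_iter(haystack) {
        println!("pattern {:?} at {}..{}", m.pattern(), m.start(), m.end());
    }
}
```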
There was also significant work required for scaling glob matching. Some popular repositories have gitignore files with over 3000 rules.
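The scaling approach there, in the `globset` crate that ripgrep uses, is to compile the entire set of globs into a single matcher, so a path is tested against all rules at once instead of looping over thousands of globs individually:

```rust
use globset::{Glob, GlobSetBuilder};

fn main() {
    let mut builder = GlobSetBuilder::new();
    // A stand-in for the thousands of rules a big .gitignore can carry.
    builder.add(Glob::new("*.o").unwrap());
    builder.add(Glob::new("target/**").unwrap());
    builder.add(Glob::new("**/*.min.js").unwrap());
    let set = builder.build().unwrap();

    assert!(set.is_match("target/debug/rg"));
    // `matches` reports the index of every rule the path hits.
    assert_eq!(set.matches("assets/app.min.js"), vec![2]);
}
```

Reporting every matching rule (not just the first) is what gitignore semantics need, since the last matching rule is the one that wins.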
All of these tricks are important because ripgrep isn't just for searching code repositories. It should be able to do most things that grep can do, but faster. (The smarter UTF-8 handling is particularly beneficial when searching non-ASCII text.)
Within the context of this thread, though, and speaking as someone who has written a linear-time suffix array construction algorithm: the suffix array isn't particularly well suited to the problem of code search. Its constant factors are just too high, so generating an index would take a long time. (Although I haven't read the OP yet.)
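For readers who haven't met suffix arrays: the appeal is that once you pay the large up-front cost of sorting all suffix offsets, any substring query becomes a binary search. A deliberately naive sketch follows; a real index would use a linear-time construction like SA-IS, but construction cost is exactly where those constant factors bite:

```rust
// Naive suffix array construction plus substring lookup. Construction
// here is far worse than linear time; SA-IS gets it to O(n), but even
// linear-time construction over a big corpus is slow in practice.
// Lookup is the cheap part.
fn suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

/// Binary-search the suffix array for any one occurrence of `pat`.
fn find(text: &[u8], sa: &[usize], pat: &[u8]) -> Option<usize> {
    sa.binary_search_by(|&i| {
        // Compare the suffix truncated to the pattern's length.
        let end = (i + pat.len()).min(text.len());
        text[i..end].cmp(pat)
    })
    .ok()
    .map(|rank| sa[rank])
}

fn main() {
    let text: &[u8] = b"the quick brown fox jumps over the lazy dog";
    let sa = suffix_array(text);
    let pos = find(text, &sa, b"lazy").unwrap();
    assert!(text[pos..].starts_with(b"lazy"));
    println!("found at byte offset {pos}");
}
```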
> some system call tricks for reading files as quickly as possible.
ripgrep does two different types of searches. One uses memory maps. The other uses standard `read` calls. The latter is actually faster on Linux when searching code repositories. So no real tricks there. :-)
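In miniature, the two modes look something like the following. The `memmap2` crate stands in for ripgrep's own abstractions here, and the read-based mode is simplified: ripgrep actually streams through a fixed-size buffer rather than reading whole files into memory.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};

// Mode 1: plain `read` calls through a buffer. Counterintuitively, this
// tends to win on Linux for the many small files in a code repository,
// because mmap setup/teardown per file isn't free.
fn search_via_read(path: &str, needle: &[u8]) -> io::Result<bool> {
    let mut buf = Vec::new();
    BufReader::new(File::open(path)?).read_to_end(&mut buf)?;
    Ok(buf.windows(needle.len()).any(|w| w == needle))
}

// Mode 2: memory map the file and search it as one big slice. This is
// the better fit for a small number of very large files.
fn search_via_mmap(path: &str, needle: &[u8]) -> io::Result<bool> {
    let file = File::open(path)?;
    // SAFETY: assumes the file is not truncated while the map is live.
    let map = unsafe { memmap2::Mmap::map(&file)? };
    Ok(map.windows(needle.len()).any(|w| w == needle))
}

fn main() -> io::Result<()> {
    println!("{}", search_via_read("/etc/hosts", b"localhost")?);
    println!("{}", search_via_mmap("/etc/hosts", b"localhost")?);
    Ok(())
}
```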
Is your suffix array implementation open source or available online anywhere? I'm always curious to see what other people's versions of these sorts of things look like.
(I wrote an implementation of Ukkonen's suffix tree algorithm in C as maybe my second C program; it was pretty frustrating for a long time.)
> (I wrote an implementation of Ukkonen's suffix tree algorithm in C as maybe my second C program; it was pretty frustrating for a long time.)
Oh dear. You are brave. I briefly considered implementing Ukkonen's algorithm, but ran away scared. :-)
The reason I wrote the SAIS implementation was that I am generally interested in text search, and I wanted to explore the idea of building an index around suffix arrays. Once I finished my implementation and realized just how slow the cutting-edge suffix array construction algorithm was at the time, I mostly gave up on that idea.
Since then, I've been itching to find use cases for it. As a former computational biologist, I'm familiar with how they are used in that field, but I've been looking for other places as well. Haven't found one yet, although I can think of some specialized (hypothetical) things where it might be useful.