Currently the main branch is undergoing a refactor to add support for custom extractors (calling out to other tools) and more flexible chains of extractors.
Ripgrep itself has integrated functionality for calling custom extractors via the `--pre` flag, but by adding it here we can retain the benefits of the rga wrapper (more accurate file type matchers, caching, recursion into archives, adapter chaining, no slow shell scripts in between, etc).
Sadly, while rewriting it to allow this, I got hung up and couldn't figure out how to design it cleanly in Rust. I'd be really glad if a Rust expert could help me out here:
In the currently stable version, the main interface of each "adapter" is `fn(Read, Write) -> ()`. To allow custom adapter chaining I have to change it to be `fn(Read) -> Read` where each chained adapter wraps the read stream and converts it while reading. But then I get issues with how to handle threading etc, as well as a random deadlock that I haven't figured out how to solve so far :/
> In the currently stable version, the main interface of each "adapter" is `fn(Read, Write) -> ()`. To allow custom adapter chaining I have to change it to be `fn(Read) -> Read` where each chained adapter wraps the read stream and converts it while reading. But then I get issues with how to handle threading etc, as well as a random deadlock that I haven't figured out how to solve so far :/
I don't quite grok the problem here. If you file an issue against ripgrep proper with code links and some more details, I can try to assist.
Taken literally, ripgrep uses that exact same approach. There are potentially multiple adapters being used. Each adapter is just defined to wrap a `std::io::Read` implementation, and the adapter in turn implements `std::io::Read` so that it can be composed with others. The part that I'm missing is why this has anything to do with threading or deadlocks. I/O adapters shouldn't have anything to do with synchronization. So I'm probably misunderstanding your problem.
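Concretely, the composition pattern looks something like this toy sketch (the `Upper` adapter here is made up for illustration, not anything from ripgrep's actual code):

    use std::io::{self, Read};

    // Toy adapter: wraps any reader and upper-cases ASCII bytes as they are
    // pulled through it. Because it implements Read itself, adapters compose:
    // e.g. Upper::new(SomeDecoder::new(file)).
    struct Upper<R> { inner: R }

    impl<R: Read> Upper<R> {
        fn new(inner: R) -> Self { Upper { inner } }
    }

    impl<R: Read> Read for Upper<R> {
        fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
            let n = self.inner.read(buf)?;
            buf[..n].make_ascii_uppercase(); // the per-adapter "transformation"
            Ok(n)
        }
    }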
> If you file an issue against ripgrep proper with code links and some more details
Sorry, I don't think I explained my issue very well. In general it has nothing to do with the interaction with ripgrep, that works fine.
It's that each adapter (e.g. zip -> list of file streams) needs to have an interface of `fn(Read) -> Iter<ReadWithMeta>`.
But then if there's a PDF within the zip, I have to give the returned ReadWithMeta to the PDF adapter - but it can't take ownership, because the Archive file iterators only give borrowed reads. I sort of worked around this by creating a wrapper type [3] and adding an unsafe block here [2], but something currently deadlocks when adapting zip files.
Also, for external programs, I have to copy the data from the Read into a Write (the stdin of the program) - which needs to happen in a separate thread, otherwise the stdout is never read [1]. But some Reads I have aren't Send, since they come from e.g. zip-rs, so they can't be passed to a thread.
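To illustrate the external-program part, here's a simplified sketch (not my actual code; `spawn_external` is just a made-up name). The `Read + Send` bound on the input is exactly what the zip-rs readers can't satisfy:

    use std::io::{self, Read};
    use std::process::{Command, Stdio};
    use std::thread;

    // Run an external adapter: feed `input` to its stdin and hand back its
    // stdout as the converted stream.
    fn spawn_external(cmd: &str, mut input: impl Read + Send + 'static)
        -> io::Result<impl Read>
    {
        let mut child = Command::new(cmd)
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        let mut stdin = child.stdin.take().unwrap();
        // stdin has to be fed from another thread: if we wrote it all here and
        // only read stdout afterwards, both sides can block on full pipe
        // buffers and deadlock. Moving `input` into the thread is what
        // requires the Send bound.
        thread::spawn(move || {
            let _ = io::copy(&mut input, &mut stdin);
            // dropping stdin closes the pipe so the child sees EOF
        });
        // (real code would keep `child` around to wait() on it)
        Ok(child.stdout.take().unwrap())
    }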
I don't have time to dig into this right now, but I just wanted to say that I did at least gather that it wasn't an interaction problem with ripgrep. :-) I just figured the ripgrep issue tracker would be a good place to discuss it. But now that I think about it, the ripgrep-all issue tracker might be a better spot. Maybe post an issue there and ping me? That way we can chat with email notifications and what not.
One possibility is the almost dirt-simple solution wherein you just have a "make"/"Makefile" (or your favorite other build system) maintain a shadow tree of parallel pre-translated files. You get parallelism via `make -j$(nproc)` or its equivalent.
Every name in the shadow is built from the name in the origin, but maybe with ".txt" added (or ".txt.gz" if you want to keep it compressed with whatever the fastest decompressor built into ripgrep as a library, not called as an external program, happens to be). Untranslated names can be just symbolic/hard links back to the origin. Build rules become as flexible as your build system.
This also scales to deployments that have more disk space than memory. Admittedly, in that case, the whole procedure probably becomes disk-IO bound, but maybe not. Maybe some translations cannot even keep up with disk IO - NVMe storage is pretty fast, for example. Or available memory may vary dynamically a lot, sometimes allowing the shadow to be fully in the buffer cache, other times not. It strikes me as less presumptuous to assume you can find disk space vs. having that much memory available. (EDIT2: though I may be confused about how `rga` operates - your doc says "memory cache", though.)
On the pro side, apart from updating the shadows based on origins, the user could even just `rg` from within the shadow and translate filenames "in their head", although stripping an always-present string is obviously trivial. Indeed, you wouldn't need `rg --pre` at all, and the grep itself could become pluggable. I doubt any of your other `fzf`/etc. integrations would be made more complicated by this design, either.
This all strikes me as simple/nice enough that someone has probably already done it...EDIT1: Oh, I see from thumbs ups and other comments over at [1] and [2] that @phiresky is probably already aware of this design idea, but maybe some HN person knows of an existing solution along these lines.
Thanks for this tool, I'm already getting a ton of use from it.
For fun, I pointed a 12-core/32GB RAM 2018 MBP at a 9GB network share full of PDFs, while still using the laptop for other things (so not a benchmark, just an anecdote).
Initial cold/uncached run:
rga -j 12 testword share 1140.70s user 77.58s system 31% cpu 1:03:55.85 total
Cached:
rga -j 12 testword share 8.09s user 4.88s system 92% cpu 14.048 total
AUR has both a ripgrep-all [1] and a ripgrep-all-bin [2] package. Both were added by you. The bin package has a newer version. What is the difference between the two?
Love this. I appreciate your building on ripgrep versus my own bulky lucene-based approach a while back (https://github.com/maximz/sift), and that you don’t require pre-indexing but build up a cache as you go.
thanks but it's way faster to have my stuff in G drive
that way I can open a browser tab, wait 5 seconds for it to load, locate the new screen location of the search bar, click it, wait for javascript to finish loading so I can click the search bar, click it for real this time, mistype because there's some kind of contenteditable event jank, wait 5 seconds for my results to come up, fix the typo, and just have my results waiting for me
I'm not going to learn a new tool when web is fine
Firefox supports custom search engines; the most bang-for-the-buck custom search engine must be https://duckduckgo.com/?q=%s with the keyword being the letter d. Then you get all these 13000+ bangs without having to configure the custom search engines. E.g. write "d !drive term" in the URL bar. And "d !w hacker news" sends you directly to https://en.wikipedia.org/wiki/Hacker_News
Firefox keyword search has one little known killer feature: You can combine it with data URIs and JavaScript to run small "command line snippets" stored in your bookmarks from your browser bar.
To get started, create a keyword search from any form (like the search bar on duckduckgo.com) and edit the URL of the entry in the bookmark manager to point to
data:text/html,<script>alert("%s")</script>
instead.
What you can do with this is (fortunately) limited by cross-origin restrictions but there are some useful applications. For example, I use this snippet
if god wanted me to access my files in less than 15 seconds, they wouldn't have commanded google to package the search bar as a separate JS bundle that only gets downloaded when you focus the search bar
I'm no frontend dev but I know a thing or two about HTML + there's no built-in way to input text into a box -- this is the best we can do and we'll just have to wait for 5G + moore's law to solve this
Laugh all you want but try looking for a Fullstack/Frontend role in today's job market. What do they want? AnGuLaRr with oBsErVaBlEs! Why do they want it? Because Google can't be wrong.
I would rather slit my wrist than use AnGuLaRr. Google is notorious for over-engineering problems: great for search, horrible for UI/UX stuff. Keep it simple, stupid.
Death to SPA (Angular, React)
Long live SPA (Mithril, Vue)
Good in what sense? A sarcastic endorsement of A that is indistinguishable from an earnest one is a poor argument for ~A.
This is in fact what's happened with Schrödinger's Cat: it was meant as an argument from absurdity against the Copenhagen interpretation of quantum mechanics, but it's presented seriously and so people take it that way.
> A sarcastic endorsement of A that is indistinguishable from an earnest one is a poor argument for ~A.
The purpose of sarcasm is not to make an argument, it's to have fun. The best fun is had when exactly half of the audience does not get the joke (as the other half makes fun of them).
I'm not sure on how many levels this statement is off. Sure, if it's files where you don't care about your privacy, the kind of files you don't mind posting on Facebook, then sure, put them in G Drive. But don't think for a second that those files in the cloud are yours. They are NOT! Especially when we are talking about FAANG here. You will have NO legal protection!
A lot of us don't want our stuff on G-drive for privacy and security concerns. Tools like this are valuable to us. It's an old problem and there are plenty of indexers out there; this more real-time scanner is more than welcome to join the bunch, of course.
I love that we’re seeing fast & flexible solutions for personal search.
I’ve recently been playing with Recoll for full-text-search on content. Since it indexes content up front, the search is pretty fast. It can also easily accommodate tag metadata on files.
It would be interesting to consider how ripgrep-based tools can fit into generically broad "search your database of content" workflows (as opposed to remembering or going through your file system paths).
FZF + ripgrep is really killer for me. I don't even bother organizing my notes anymore; I just throw everything into markdown files in a flat directory, and then I have a script that uses FZF + ripgrep to search through it when I need it. I search by "last modified first", so unless I'm digging for something very old the results are instant. Code snippets, finances, TODO lists, cake recipes... It's all in there.
I use the same system in Vim to browse source code. It's very powerful, very fast, works with any language and requires zero configuration.
You'll need to set NOTES_DIR in your environment to wherever you want your notes to be stored. Then you can write `note something` to create or open $NOTES_DIR/something.md with your $EDITOR.
If you type "note" without parameter you'll start a search on all the note names, ordered by last use. If you type "note -f" it starts a full text search.
For best results you should have the fzf.vim's preview.sh somewhere in your fs, otherwise it'll use "cat" but it won't be as good looking (see FZF_PREVIEW in the script).
Hopefully despite being shell it should be readable enough to tweak to your liking.
Note that it was written and used exclusively on Linux, but I did try to avoid GNU-isms so hopefully it should work on BSDs and maybe even on MacOS with a bit of luck.
rga also indexes them when you search. To be honest I like that approach a lot more since it saves space and I generally know where I'm looking for things
ls -sh ~/.cache/rga/
total 336M
336M data.mdb 4.0K lock.mdb
That kind of caching is an interesting solution to incrementally building a database instead of spending hours up-front indexing. So the tool is ready for immediate use. Quite nifty :-)
For mlocate you can edit /etc/updatedb.conf to specify what to index. One trick I use is "locate -Ai", which lets you search for multiple patterns and makes the search case insensitive. So you can use "locate -Ai linux .pdf" to search for all PDF files related to Linux.
Also for GNOME there is Tracker, which does search and indexing built into the system. I think by default it's set for minimal use, but it can be configured via the Settings/Search panel to index many locations. I haven't played with it much recently though.
Great tip.. thanks! I've now mounted all my drives, including Windows/NTFS/etc., using fstab. Do you reckon this will have any negative impact performance-wise?
Just wondering, since Linux knows about these drives but doesn't mount them automatically at startup; is this for a reason or just convention?
fd (https://github.com/sharkdp/fd) is the best command-line search utility IMO. It's crazy fast and has always found what I was looking for.
If you want a GUI alternative, check out Drill (https://github.com/yatima1460/Drill).
Although development seems stalled, it works well for normal use cases.
Hmm.. I seem to remember creating an Excel file for this client a while back.. open Everything -> filter client.xlsx.. boom. Or maybe I didn't name it properly at all? Well, still, just a simple '*.xlsx' and sort by date; I can generally find anything this way. As long as you let Everything open on Windows startup, it will be instant whenever you use it.
To traverse my files I use the combo ranger + autojump. It's not a GUI and you need to traverse a directory at least once before accessing it automatically, but I just wanted to mention it. Another CLI tool that seems to do what you want is fzf.
Seriously - I miss it as well. But my access patterns have changed as well. I spend more time on the terminal, and with autojump, the alternatives (with similar features) on Linux aren't really that useful to my usage.
Big fan of rga! I use it almost every day for the academic part of my life, when I want to know the location of some specific keywords in my lecture slides, books or papers I've been reading. Even for single ebooks, it is often more useful than the search in Acrobat Reader.
The search in PDF viewers is an anti-feature in terms of UI and performance. Their advantage is that they let you scroll to and highlight the found phrase in the document.
$ sudo dnf install -y ripgrep-all
[...]
No match for argument: ripgrep-all
Error: Unable to find a match: ripgrep-all
Rust's package manager fails:
$ cargo install ripgrep_all
[...]
failed to select a version for the requirement `cachedir = "^0.1.1"`
candidate versions found which didn't match: 0.2.0
location searched: crates.io index
required by package `ripgrep_all v0.9.6`
A quick search on the web shows that more people have problems with the cachedir version.
It looks like cachedir yanked version 0.1.1. This is usually only done when a very serious issue is discovered, though I don't know what the reason is in this case.
You can do `cargo install --locked ripgrep_all` as a workaround. It uses the lockfile that's part of the ripgrep_all package, so you miss out on some package updates, but you get the cachedir version it requires.
There is a GitHub issue to make this the default behaviour of cargo, but then you'd miss out on updates which might fix security bugs, so the cargo team is unwilling to change the default.
The idea behind rga is cool.
Anyway, I tried it on Mac, installed via Homebrew. The formula already says it depends on ripgrep (that's fine, since I already have ripgrep installed and use it regularly). I was still surprised when I executed rga for the first time and got an error message that 'pdftotext' was not found. Since pdftotext has been officially discontinued, I am not sure I want to install an old version just to make rga work on my machine. I don't think it's a good idea to rely on a project which is not actively maintained.
I don't see any indication that pdftotext has been discontinued [1]. It looks like a Mac-specific installer available via Homebrew Cask has been discontinued [2], but pdftotext is still available through the normal poppler formula [3].
That looks like a problem with that specific package, and not pdftotext that is in poppler. I don't even know what that package is. It links to bluem.net?
Yeah, in my opinion poppler should be a dependency of rga in Homebrew (since it's kinda useless without the default adapters), but I don't maintain that package.
rga uses pdftotext (from poppler) internally for PDFs, but wraps it in parallelization and a very fast cache layer, since you usually want to do multiple queries per file :)
If anyone is interested in gron [0], I have an open PR [1] to add it as an adapter to ripgrep-all. The patch is based on the most recent release, since master is currently not functional.
I noticed that you can use Tesseract as an OCR adapter for rga. Tesseract is written in C++, IIRC, and in the OP it comes with a warning that it's slow and not enabled by default. Are there any other fast, reliable OCR libs out there? Or any Rust OCR backends?
I don't think the problem necessarily is that Tesseract is slow, but that the whole process of rendering a PDF to a series of PNGs on which you can then run OCR is slow (which is what it does in the background).
The process of converting all pages to raster images and then OCR-ing each one takes hours for PDFs that are hundreds of pages long. This workflow is not suitable for instant search. For non-OCRed PDFs it's worth pregenerating the text.
That's why rga comes with a cache. I've occasionally used the Tesseract adapter with good success (results-wise), and after the initial rendering and indexing it's fast enough to use.
Can it (or any tool) perform proximity searches on scanned PDFs? E.g. word1 within 20 words of word2? (I think this is non-trivial but very useful.)
Scanned PDFs only work well if they already have an OCR layer. There's some optional integration of rga with Tesseract, but it's pretty slow and not as good as external OCR tools.
ripgrep-all can do the same regexes as rg on any file type it supports. So you could do something like --multiline and foo\W+(\w+\W+){0,20}bar
It won't work exactly like this, but something similar should do it:
* --multiline enables multiline matching, so the proximity window can span line breaks
* foo matches the first word
* \W+ matches at least one non-word character (spaces, newlines, punctuation)
* (\w+\W+){0,20} matches at most 20 intervening words, each followed by more non-word characters
* bar matches the second word
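If you want to sanity-check the pattern outside of rg: ripgrep's default engine uses the regex crate, so the same syntax works there. A tiny test, with a smaller bound so the example stays short:

    use regex::Regex; // add `regex` to Cargo.toml; rg uses the same syntax

    fn main() {
        // "foo" within at most 3 words of "bar"
        let re = Regex::new(r"foo\W+(\w+\W+){0,3}bar").unwrap();
        assert!(re.is_match("foo one two three bar"));       // 3 words apart: match
        assert!(!re.is_match("foo one two three four bar")); // 4 words apart: no match
        println!("pattern behaves as expected");
    }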
If it's a scanned PDF (essentially a collection of one image per page), there would need to be an OCR step to get some text out first. Tesseract would work for this.
Once that's done, you have all the options available to perform that search. But I don't know of a search tool that does the OCR for you. I did read a blog post by someone uploading PDFs to Google Drive (it OCRs them on upload) as an easy way to do this.
For PDFs, how does it deal (does it?) with phrases getting ripped apart by the layout? Like if you search for a multi-word phrase, it's often foiled by word wrap or by being in a table.
Can it produce links to open the file yet? (I don't know Rust, so I can't easily add a PR.) At least gnome-terminal supports that (and normally it should also support opening a specific PDF page)!
This is great. I have 100+ programming ebooks/PDFs and textbooks whose index pages I've been extracting. My intention was always to make some sort of search index out of them. I will definitely be trialing this (the initial few searches seem promising!)
Curious why this isn't a pull request to ripgrep? Maybe it was, and rejected? It'd be nice to just have one tool, and this doesn't feel like it's a stretch to add to ripgrep.
Why is there an expectation that every application should be free or cheap? IMHO $60 is very reasonable for a program that can save a lot of time for the user. And developers also have to eat, and might want to some day retire.
Something perhaps more helpful but so far unmentioned (and somewhat OS-specific) is that statically linked executables usually fork & exec (especially exec) much faster than dynamically linked ones. This difference is usually only like 50..150 us vs 500..3000 us but can multiply up over thousands of files.
This only matters on the first run of `rga`, of course. While the dispatched-to decoder is likely mostly out of one's linking control, this overhead can be saved for the dispatcher, at least. So, I would suggest `rga-preproc` should have a static linking option/suggestion, at least on Linux.
Of course, this overhead may also fall below the noise of PDF/ebook/etc. parsing, but maybe not the decompression of small files in some dark horse format. :-)
It would be nice to have a direct comparison with ugrep. In the case of rg the benchmarks are already enough to switch. Why should I use rga instead of ugrep?
Just to be clear, I meant that I had switched to ripgrep because its speed was convincing enough on its own (so I did not even need extra features to switch).
I'm currently not using any of ugrep or rga, although I have used pdfgrep in the past. It'd be nice for casual users like me to know more about why I should use rga over ugrep (or vice-versa).
I can understand it might be nice to have a personal library of PDF books and searching in them. I can't think of a time I've ever wished I could search my bookshelf in that way, but you never know.
Obviously I use tools like ripgrep for searching codebases and the like.
But the extreme flexibility of this one in particular (and others like macOS Spotlight) makes it seem more like a data recovery tool to me. If my directory structures and databases ever completely failed for some reason, I might need to search through everything to find the data again. It's good to know such tools exist, I suppose.
But my fear is that tools like this teach people not to worry about organisation of data and to just fill up their disks with no structure at all. I think that unless something goes terribly wrong, nobody should ever need a tool like this. Once you rely on it, you're out of luck if it ever fails you. What if you just can't remember a single searchable phrase from some document, but you just know it must exist somewhere?
It's similar to what Google has done to the web. When I was growing up it used to be a skill to use the web. People used tools like bookmarks and followed links from one place to another. Now it's just type it into Google and if Google doesn't know, it doesn't exist.
Hierarchical organization of data is not a productive way to organize things, simply because of how much information people accumulate, and because structures often break down.
It's more intuitive to simply search for the thing you are looking for and click it.
I haven't used a folder organization structure in many many years. Other than the defaults for my cloud folders and a separation between Personal + Work.
I mean, I understand what you mean when it comes to Google -- the web essentially becomes locked into a particular proprietary solution to finding information. I definitely still have hundreds (maybe into the thousands?) of bookmarks of sites that store information I care about.
But I don't think this tool deserves the same sort of mixed feelings. I don't think this replaces structure -- there's still value to having a conceptual mapping of where documents are stored, and for grouping sets of documents together. It's just that having a structure doesn't help if you don't know where in the structure something is stored. This sort of tool is a bottom-up approach for the times when the top-down approach doesn't work very well.
Do you have similarly mixed feelings if sometimes, even with my carefully-crafted set of bookmarks with all their nested folders, I use the search tool to find the bookmark I'm looking for? It's the same idea. Sometimes a top-down structure is beneficial. But sometimes things get misclassified, or you forget about some piece of the structure, or you aren't familiar with some new structure, and in those cases, having bottom-up tools are immensely useful. There's no risk of vendor lock-in here. It's just a difference of approach in information retrieval.
There is nothing wrong with the original Google's postulate. Your local search results are less likely to be hijacked by entities bidding for your attention. I agree with the argument for organizing the data anyway.