Currently the main branch is undergoing a refactor to add support for custom extractors (calling out to other tools) and more flexible chains of extractors.
Ripgrep itself has integrated functionality for calling custom extractors via the `--pre` flag, but by adding it here we can retain the benefits of the rga wrapper (more accurate file type matchers, caching, recursion into archives, adapter chaining, no slow shell scripts in between, etc).
Sadly, while rewriting it to allow this, I got hung up and couldn't figure out how to design it cleanly in Rust. I'd be really glad if a Rust expert could help me out here:
In the currently stable version, the main interface of each "adapter" is `fn(Read, Write) -> ()`. To allow custom adapter chaining I have to change it to be `fn(Read) -> Read` where each chained adapter wraps the read stream and converts it while reading. But then I get issues with how to handle threading etc, as well as a random deadlock that I haven't figured out how to solve so far :/
> In the currently stable version, the main interface of each "adapter" is `fn(Read, Write) -> ()`. To allow custom adapter chaining I have to change it to be `fn(Read) -> Read` where each chained adapter wraps the read stream and converts it while reading. But then I get issues with how to handle threading etc, as well as a random deadlock that I haven't figured out how to solve so far :/
I don't quite grok the problem here. If you file an issue against ripgrep proper with code links and some more details, I can try to assist.
Taken literally, ripgrep uses that exact same approach. There are potentially multiple adapters being used. Each adapter is just defined to wrap a `std::io::Read` implementation, and the adapter in turn implements `std::io::Read` so that it can be composed with others. The part that I'm missing is why this has anything to do with threading or deadlocks. I/O adapters shouldn't have anything to do with synchronization. So I'm probably misunderstanding your problem.
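Concretely, the composition pattern looks something like this toy sketch (the `Upper` adapter here is made up for illustration, not anything from ripgrep's actual code):

    use std::io::{self, Read};

    // Toy adapter: wraps any reader and upper-cases ASCII bytes as they are
    // pulled through it. Because it implements Read itself, adapters compose:
    // e.g. Upper::new(SomeDecoder::new(file)).
    struct Upper<R> { inner: R }

    impl<R: Read> Upper<R> {
        fn new(inner: R) -> Self { Upper { inner } }
    }

    impl<R: Read> Read for Upper<R> {
        fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
            let n = self.inner.read(buf)?;
            buf[..n].make_ascii_uppercase(); // the per-adapter "transformation"
            Ok(n)
        }
    }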
> If you file an issue against ripgrep proper with code links and some more details
Sorry, I don't think I explained my issue very well. In general it has nothing to do with the interaction with ripgrep, that works fine.
It's that each adapter (e.g. zip -> list of file streams) needs to have an interface of `fn(Read) -> Iter<ReadWithMeta>`.
But then if there's a PDF within the zip, I have to give the returned ReadWithMeta to the PDF adapter - but it can't take ownership, because the Archive file iterators only give borrowed reads. I sort of worked around this by creating a wrapper type [3] and adding an unsafe block here [2], but something currently deadlocks when adapting zip files.
Also, for external programs, I have to copy the data from the Read into a Write (the stdin of the program) - which needs to happen in a separate thread, otherwise the stdout is never read [1]. But some Reads I have aren't Send, since they come from e.g. zip-rs, so they can't be passed to a thread.
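To illustrate the external-program part, here's a simplified sketch (not my actual code; `spawn_external` is just a made-up name). The `Read + Send` bound on the input is exactly what the zip-rs readers can't satisfy:

    use std::io::{self, Read};
    use std::process::{Command, Stdio};
    use std::thread;

    // Run an external adapter: feed `input` to its stdin and hand back its
    // stdout as the converted stream.
    fn spawn_external(cmd: &str, mut input: impl Read + Send + 'static)
        -> io::Result<impl Read>
    {
        let mut child = Command::new(cmd)
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        let mut stdin = child.stdin.take().unwrap();
        // stdin has to be fed from another thread: if we wrote it all here and
        // only read stdout afterwards, both sides can block on full pipe
        // buffers and deadlock. Moving `input` into the thread is what
        // requires the Send bound.
        thread::spawn(move || {
            let _ = io::copy(&mut input, &mut stdin);
            // dropping stdin closes the pipe so the child sees EOF
        });
        // (real code would keep `child` around to wait() on it)
        Ok(child.stdout.take().unwrap())
    }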
I don't have time to dig into this right now, but I just wanted to say that I did at least gather that it wasn't an interaction problem with ripgrep. :-) I just figured the ripgrep issue tracker would be a good place to discuss it. But now that I think about it, the ripgrep-all issue tracker might be a better spot. Maybe post an issue there and ping me? That way we can chat with email notifications and what not.
One possibility is the almost dirt-simple solution wherein you just have a "make"/"Makefile" (or your favorite other build system) maintain a shadow tree of parallel pre-translated files. You get parallelism via `make -j$(nproc)` or its equivalent.
Every name in the shadow is built from the name in the origin, but maybe with ".txt" added (or ".txt.gz" if you want to keep it compressed with whatever the fastest decompressor built into ripgrep as a library, not called as an external program, happens to be). Untranslated names can be just symbolic/hard links back to the origin. Build rules become as flexible as your build system.
This also scales to deployments that have more disk space than memory. Admittedly, in that case, the whole procedure probably becomes disk-IO bound, but maybe not. Maybe some translations cannot even keep up with disk IO - NVMe storage is pretty fast, for example. Or available memory may vary dynamically a lot, sometimes allowing the shadow to be fully in the buffer cache, other times not. It strikes me as less presumptuous to assume you can find disk space vs. having that much memory available. (EDIT2: though I may be confused about how `rga` operates - your doc says "memory cache", though.)
On the pro side, apart from updating the shadows based on origins, the user could even just `rg` from within the shadow and translate filenames "in their head", although stripping an always-present string is obviously trivial. Indeed, you wouldn't need `rg --pre` at all, and the grep itself could become pluggable. I doubt any of your other `fzf`/etc. integrations would be made more complicated by this design, either.
This all strikes me as simple/nice enough that someone has probably already done it...EDIT1: Oh, I see from thumbs ups and other comments over at [1] and [2] that @phiresky is probably already aware of this design idea, but maybe some HN person knows of an existing solution along these lines.
Thanks for this tool, I'm already getting a ton of use from it.
For fun, I pointed a 12-core/32GB RAM 2018 MBP at a 9GB network share full of PDFs, while still using the laptop for other things (so not a benchmark, just an anecdote).
Initial cold/uncached run:
rga -j 12 testword share 1140.70s user 77.58s system 31% cpu 1:03:55.85 total
Cached:
rga -j 12 testword share 8.09s user 4.88s system 92% cpu 14.048 total
AUR has both a ripgrep-all [1] and a ripgrep-all-bin [2] package. Both were added by you. The bin package has a newer version. What is the difference between the two?
Love this. I appreciate your building on ripgrep versus my own bulky lucene-based approach a while back (https://github.com/maximz/sift), and that you don’t require pre-indexing but build up a cache as you go.
thanks but it's way faster to have my stuff in G drive
that way I can open a browser tab, wait 5 seconds for it to load, locate the new screen location of the search bar, click it, wait for javascript to finish loading so I can click the search bar, click it for real this time, mistype because there's some kind of contenteditable event jank, wait 5 seconds for my results to come up, fix the typo, and just have my results waiting for me
I'm not going to learn a new tool when web is fine
Firefox supports custom search engines; the most bang-for-the-buck custom search engine must be https://duckduckgo.com/?q=%s with the keyword being the letter d. Then you get all these 13000+ bangs without having to configure the custom search engines. E.g. write "d !drive term" in the URL bar. And "d !w hacker news" sends you directly to https://en.wikipedia.org/wiki/Hacker_News
Firefox keyword search has one little known killer feature: You can combine it with data URIs and JavaScript to run small "command line snippets" stored in your bookmarks from your browser bar.
To get started, create a keyword search from any form (like the search bar on duckduckgo.com) and edit the URL of the entry in the bookmark manager to point to
data:text/html,<script>alert("%s")</script>
instead.
What you can do with this is (fortunately) limited by cross-origin restrictions but there are some useful applications. For example, I use this snippet
if god wanted me to access my files in less than 15 seconds, they wouldn't have commanded google to package the search bar as a separate JS bundle that only gets downloaded when you focus the search bar
I'm no frontend dev but I know a thing or two about HTML + there's no built-in way to input text into a box -- this is the best we can do and we'll just have to wait for 5G + moore's law to solve this
Laugh all you want but try looking for a Fullstack/Frontend role in today's job market. What do they want? AnGuLaRr with oBsErVaBlEs! Why do they want it? Because Google can't be wrong.
I would rather slit my wrist than use AnGuLaRr. Google is notorious for over-engineering problems: great for search, horrible for UI/UX stuff. Keep it simple, stupid.
Death to SPA (Angular, React)
Long live SPA (Mithril, Vue)
Good in what sense? A sarcastic endorsement of A that is indistinguishable from an earnest one is a poor argument for ~A.
This is in fact what's happened with Schrödinger's Cat: it was meant as an argument from absurdity against the Copenhagen interpretation of quantum mechanics, but it's presented seriously and so people take it that way.
> A sarcastic endorsement of A that is indistinguishable from an earnest one is a poor argument for ~A.
The purpose of sarcasm is not to make an argument, it's to have fun. The best fun is had when exactly half of the audience does not get the joke (as the other half makes fun of them).
I'm not sure on how many levels this statement is off. Sure, if it's files where you don't care about your privacy, the kind of files you don't mind posting on Facebook, then sure, put them in G Drive. But don't think for a second that those files in the cloud are yours. They are NOT! Especially when we are talking about FAANG here. You will have NO legal protection!
A lot of us don't want our stuff on G-drive for privacy and security concerns. Tools like this are valuable to us. It's an old problem and there are plenty of indexers out there; this more real-time scanner is more than welcome to join the bunch, of course.
I love that we’re seeing fast & flexible solutions for personal search.
I’ve recently been playing with Recoll for full-text-search on content. Since it indexes content up front, the search is pretty fast. It can also easily accommodate tag metadata on files.
It would be interesting to consider how ripgrep-based tools can fit into generically broad "search your database of content" workflows (as opposed to remembering or going through your file system paths).
FZF + ripgrep is really killer for me. I don't even bother organizing my notes anymore; I just throw everything into markdown files in a flat directory, and then I have a script that uses FZF + ripgrep to search through it when I need it. I search by "last modified first", so unless I'm digging for something very old the results are instant. Code snippets, finances, TODO lists, cake recipes... It's all in there.
I use the same system in Vim to browse source code. It's very powerful, very fast, works with any language and requires zero configuration.
You'll need to set NOTES_DIR in your environment to wherever you want your notes to be stored. Then you can write `note something` to create or open $NOTES_DIR/something.md with your $EDITOR.
If you type "note" without parameter you'll start a search on all the note names, ordered by last use. If you type "note -f" it starts a full text search.
For best results you should have the fzf.vim's preview.sh somewhere in your fs, otherwise it'll use "cat" but it won't be as good looking (see FZF_PREVIEW in the script).
Hopefully despite being shell it should be readable enough to tweak to your liking.
Note that it was written and used exclusively on Linux, but I did try to avoid GNU-isms so hopefully it should work on BSDs and maybe even on MacOS with a bit of luck.
rga also indexes them when you search. To be honest I like that approach a lot more since it saves space and I generally know where I'm looking for things
ls -sh ~/.cache/rga/
total 336M
336M data.mdb 4.0K lock.mdb
That kind of caching is an interesting solution to incrementally building a database instead of spending hours up-front indexing. So the tool is ready for immediate use. Quite nifty :-)
For mlocate you can edit /etc/updatedb.conf to specify what to index. One trick I use is "locate -Ai", which lets you search for multiple patterns and makes the search case insensitive. So you can use "locate -Ai linux .pdf" to search for all PDF files related to Linux.
Also for GNOME there is Tracker, which does search and indexing built into the system. I think by default it's set for minimal use, but it can be configured via the Settings/Search panel to index many locations. I haven't played with it much recently though.
Great tip.. thanks! I've now mounted all my drives, including Windows/NTFS/etc., using fstab. Do you reckon this will have any negative impact performance-wise?
Just wondering, since Linux knows about these drives but doesn't mount them automatically at startup; is this for a reason or just convention?
fd (https://github.com/sharkdp/fd) is the best command-line search utility IMO. It's crazy fast and has always found what I was looking for.
If you want a GUI alternative, check out Drill (https://github.com/yatima1460/Drill).
Although development seems stalled, it works well for normal use cases.
Hmm.. I seem to remember creating an Excel file for this client a while back.. open Everything -> filter client.xlsx.. boom. Or maybe I didn't name it properly at all? Well, still, just a simple '*.xlsx' and sort by date; I can generally find anything this way. As long as you let Everything open on Windows startup, it will be instant whenever you use it.
To traverse my files I use the combo ranger + autojump. It's not a GUI and you need to traverse a directory at least once before accessing it automatically, but I just wanted to mention it. Another CLI tool that seems to do what you want is fzf.
Seriously - I miss it as well. But my access patterns have changed as well. I spend more time on the terminal, and with autojump, the alternatives (with similar features) on Linux aren't really that useful to my usage.
Big fan of rga! I use it almost every day for the academic part of my life, when I want to know the location of some specific keywords in my lecture slides, books or papers I've been reading. Even for single ebooks, it is often more useful than the search in Acrobat Reader.
The search in PDF viewers is an anti-feature in terms of UI and performance. Their advantage is that they let you scroll to and highlight the found phrase in the document.
$ sudo dnf install -y ripgrep-all
[...]
No match for argument: ripgrep-all
Error: Unable to find a match: ripgrep-all
Rust's package manager fails:
$ cargo install ripgrep_all
[...]
failed to select a version for the requirement `cachedir = "^0.1.1"`
candidate versions found which didn't match: 0.2.0
location searched: crates.io index
required by package `ripgrep_all v0.9.6`
A quick search on the web shows that more people have problems with the cachedir version.
It looks like cachedir yanked version 0.1.1. This is usually only done when a very serious issue is discovered, though I don't know what the reason is in this case.
You can do `cargo install --locked ripgrep_all` as a workaround. It uses the lockfile that's part of the ripgrep_all package, so you miss out on some package updates, but you get the cachedir version it requires.
There is a GitHub issue to make this the default behaviour of cargo, but then you'd miss out on updates which might fix security bugs, so the cargo team is unwilling to change the default.
The idea behind rga is cool.
Anyway, I tried it on Mac, installed via Homebrew. The formula already says it depends on ripgrep (that's fine, since I already have ripgrep installed and use it regularly). I was still surprised when I executed rga for the first time and got an error message that 'pdftotext' was not found. Since pdftotext has been officially discontinued, I am not sure I want to install an old version just to make rga work on my machine. I don't think it's a good idea to rely on a project which is not actively maintained.
I don't see any indication that pdftotext has been discontinued [1]. It looks like a Mac-specific installer available via Homebrew Cask has been discontinued [2], but pdftotext is still available through the normal poppler formula [3].
That looks like a problem with that specific package, and not pdftotext that is in poppler. I don't even know what that package is. It links to bluem.net?
Yeah, in my opinion poppler should be a dependency of rga in Homebrew (since it's kinda useless without the default adapters), but I don't maintain that package.
rga uses pdftotext (from poppler) internally for PDFs, but wraps it in parallelization and a very fast cache layer, since you usually want to do multiple queries per file :)
If anyone is interested in gron [0], I have an open PR [1] to add it as an adapter to ripgrep-all. The patch is based on the most recent release, since master is currently not functional.
I noticed that you can use Tesseract as an OCR adapter for rga. Tesseract is written in C++, IIRC, and in the OP it comes with a warning that it's slow and not enabled by default. Are there any other fast, reliable OCR libs out there? Or any Rust OCR backends?
I don't think the problem necessarily is that Tesseract is slow, but that the whole process of rendering a PDF to a series of PNGs on which you can then run OCR is slow (which is what it does in the background).
The process of converting all pages to raster images and then OCR-ing each one takes hours for PDFs that are hundreds of pages long. This workflow is not suitable for instant search. For non-OCRed PDFs it's worth pregenerating the text.
That's why rga comes with a cache. I've occasionally used the Tesseract adapter with good success (results-wise), and after the initial rendering and indexing it's fast enough to use.
Can it (or any tool) perform proximity searches on scanned PDFs? E.g. word1 within 20 words of word2? (I think this is non-trivial but very useful.)
Scanned PDFs only work well if they already have an OCR layer. There's some optional integration of rga with Tesseract, but it's pretty slow and not as good as external OCR tools.
ripgrep-all can do the same regexes as rg on any file type it supports. So you could do something like --multiline and foo\W+(\w+\W+){0,20}bar
It won't work exactly like this, but something similar should do it:
* --multiline enables multiline matching, so the proximity window can span line breaks
* foo matches the first word
* \W+ matches at least one non-word character (spaces, newlines, punctuation)
* (\w+\W+){0,20} matches at most 20 intervening words, each followed by more non-word characters
* bar matches the second word
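If you want to sanity-check the pattern outside of rg: ripgrep's default engine uses the regex crate, so the same syntax works there. A tiny test, with a smaller bound so the example stays short:

    use regex::Regex; // add `regex` to Cargo.toml; rg uses the same syntax

    fn main() {
        // "foo" within at most 3 words of "bar"
        let re = Regex::new(r"foo\W+(\w+\W+){0,3}bar").unwrap();
        assert!(re.is_match("foo one two three bar"));       // 3 words apart: match
        assert!(!re.is_match("foo one two three four bar")); // 4 words apart: no match
        println!("pattern behaves as expected");
    }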
If it's a scanned PDF (essentially a collection of one image per page), there would need to be an OCR step to get some text out first. Tesseract would work for this.
Once that's done, you have all the options available to perform that search. But I don't know of a search tool that does the OCR for you. I did read a blog post by someone uploading PDFs to Google Drive (it OCRs them on upload) as an easy way to do this.
For PDFs, how does it deal (does it?) with phrases getting ripped apart by the layout? Like if you search for a multi-word phrase, it's often foiled by word wrap or by being in a table.
Can it produce links to open the file yet? (I don't know Rust, so I can't easily add a PR.) At least gnome-terminal supports that (and normally it should also support opening a specific PDF page)!
This is great. I have 100+ programming ebooks/PDFs and textbooks whose index pages I've been extracting. My intention was always to make some sort of search index out of them. I will definitely be trialing this (the initial few searches seem promising!)
Curious why this isn't a pull request to ripgrep? Maybe it was, and rejected? It'd be nice to just have one tool, and this doesn't feel like it's a stretch to add to ripgrep.
Why is there an expectation that every application should be free or cheap? IMHO $60 is very reasonable for a program that can save a lot of time for the user. And developers also have to eat, and might want to some day retire.
Something perhaps more helpful but so far unmentioned (and somewhat OS-specific) is that statically linked executables usually fork & exec (especially exec) much faster than dynamically linked ones. This difference is usually only like 50..150 us vs 500..3000 us but can multiply up over thousands of files.
This only matters on the first run of `rga`, of course. While the dispatched-to decoder is likely mostly out of one's linking control, this overhead can be saved for the dispatcher, at least. So, I would suggest `rga-preproc` should have a static linking option/suggestion, at least on Linux.
Of course, this overhead may also fall below the noise of PDF/ebook/etc. parsing, but maybe not the decompression of small files in some dark horse format. :-)
It would be nice to have a direct comparison with ugrep. In the case of rg the benchmarks are already enough to switch. Why should I use rga instead of ugrep?
Just to be clear, I meant that I had switched to ripgrep because its speed was convincing enough on its own (so I did not even need extra features to switch).
I'm currently not using any of ugrep or rga, although I have used pdfgrep in the past. It'd be nice for casual users like me to know more about why I should use rga over ugrep (or vice-versa).
I can understand it might be nice to have a personal library of PDF books and searching in them. I can't think of a time I've ever wished I could search my bookshelf in that way, but you never know.
Obviously I use tools like ripgrep for searching codebases and the like.
But the extreme flexibility of this one in particular (and others like macOS Spotlight) makes it seem more like a data recovery tool to me. If my directory structures and databases ever completely failed for some reason, I might need to search through everything to find the data again. It's good to know such tools exist, I suppose.
But my fear is that tools like this teach people not to worry about organisation of data and to just fill up their disks with no structure at all. I think that unless something goes terribly wrong, nobody should ever need a tool like this. Once you rely on it, you're out of luck if it ever fails you. What if you just can't remember a single searchable phrase from some document, but you just know it must exist somewhere?
It's similar to what Google has done to the web. When I was growing up it used to be a skill to use the web. People used tools like bookmarks and followed links from one place to another. Now it's just type it into Google and if Google doesn't know, it doesn't exist.
Hierarchical organization of data is not a productive way to organize things, simply because of how much information people accumulate, and because structures often break down.
It's more intuitive to simply search for the thing you are looking for and click it.
I haven't used a folder organization structure in many many years. Other than the defaults for my cloud folders and a separation between Personal + Work.
I mean, I understand what you mean when it comes to Google -- the web essentially becomes locked into a particular proprietary solution to finding information. I definitely still have hundreds (maybe into the thousands?) of bookmarks of sites that store information I care about.
But I don't think this tool deserves the same sort of mixed feelings. I don't think this replaces structure -- there's still value to having a conceptual mapping of where documents are stored, and for grouping sets of documents together. It's just that having a structure doesn't help if you don't know where in the structure something is stored. This sort of tool is a bottom-up approach for the times when the top-down approach doesn't work very well.
Do you have similarly mixed feelings if sometimes, even with my carefully-crafted set of bookmarks with all their nested folders, I use the search tool to find the bookmark I'm looking for? It's the same idea. Sometimes a top-down structure is beneficial. But sometimes things get misclassified, or you forget about some piece of the structure, or you aren't familiar with some new structure, and in those cases, having bottom-up tools are immensely useful. There's no risk of vendor lock-in here. It's just a difference of approach in information retrieval.
There is nothing wrong with the original Google's postulate. Your local search results are less likely to be hijacked by entities bidding for your attention. I agree with the argument for organizing the data anyway.