Ask HN: OCR framework for extracting formatted text?

undebuggable · on May 22, 2020

To extract text from photos and non-OCRed PDFs, Tesseract[1] with a language-specific model[2] never fails me.

I use my shell utility[3] to automate the workflow with ImageMagick and Tesseract, using monochrome TIFFs as an intermediate step. Extracting each page into a separate text file lets me ag/grep a phrase and then easily find it in the original PDF.

Having greppable libraries of books on various domains, and not having to crawl through web search results each time, is very useful and time-saving.

[1] https://tesseract-ocr.github.io/

[2] https://github.com/tesseract-ocr/tessdata

[3] https://github.com/undebuggable/pdf2txt
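The grep step described above can be sketched in Python. This is only an illustration and not the code in pdf2txt: the `page-0001.txt` naming scheme and the `find_phrase` helper are assumptions made for the example.

```python
import re
from pathlib import Path


def find_phrase(text_dir, phrase):
    """Return sorted page numbers whose text file contains the phrase.

    Assumes one text file per page, named like page-0001.txt
    (a hypothetical naming scheme chosen for this sketch).
    """
    pages = []
    for path in Path(text_dir).glob("page-*.txt"):
        match = re.search(r"page-(\d+)\.txt$", path.name)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        text = path.read_text(encoding="utf-8", errors="replace")
        if phrase.lower() in text.lower():  # case-insensitive substring match
            pages.append(int(match.group(1)))
    return sorted(pages)
```

The returned page numbers can then be used to jump to the matching pages in the original PDF.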

jdc · on May 22, 2020

If you're interested in grepping PDFs (among other formats) another option is ripgrep-all.

https://github.com/phiresky/ripgrep-all

Jaruzel · on May 22, 2020

I struggled to get tesseract to OCR my image based PDFs directly, so resorted to using GhostScript to extract the pages to pngs which I then put through tesseract. As an added bonus though, I gained the ability to have a thumbnail png for the search front end.
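A rough sketch of that two-step pipeline, for anyone who wants to try it. The 300 dpi resolution, the `png16m` device, and the `page-%04d.png` naming are assumptions for the example and not necessarily what was used here; `gs` and `tesseract` must be on the PATH.

```python
import subprocess
from pathlib import Path


def ghostscript_cmd(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a PNG."""
    pattern = str(Path(out_dir) / "page-%04d.png")
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",            # 24-bit colour PNG output device
        f"-r{dpi}",                   # rendering resolution in dpi
        f"-sOutputFile={pattern}",    # %04d is replaced by the page number
        str(pdf_path),
    ]


def tesseract_cmd(png_path, lang="eng"):
    """Build a Tesseract command; text is written to <image stem>.txt."""
    png_path = Path(png_path)
    output_base = png_path.with_suffix("")  # Tesseract appends .txt itself
    return ["tesseract", str(png_path), str(output_base), "-l", lang]


def ocr_pdf(pdf_path, out_dir, lang="eng"):
    """Render all pages with Ghostscript, then OCR each PNG with Tesseract."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_cmd(pdf_path, out_dir), check=True)
    for png in sorted(Path(out_dir).glob("page-*.png")):
        subprocess.run(tesseract_cmd(png, lang), check=True)
```

The rendered PNGs stay on disk next to the text files, so they remain available for other uses such as a search front end.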

undebuggable · on May 23, 2020

> so resorted to using GhostScript to extract the pages to pngs which I then put through tesseract

I understand you use this to extract text from non OCR-ed PDFs, especially consisting of low quality scans or photos (e.g. low resolution, JPEG artifacts).

Occasionally, passing a higher resolution to ImageMagick when converting a page to TIFF helped, but this sounds like a reasonable fallback as well.

bufferoverflow · on May 22, 2020

I tried using Tesseract to OCR just some numbers on an almost plain background, and it failed around 2-3% of the time. Which made the whole thing useless, because I needed 100% correctness.
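A back-of-the-envelope calculation shows why a failure rate that small still hurts. If each read fails independently with probability p (independence is an assumption here, and real OCR errors may well be correlated), the chance that n reads are all correct is (1 - p)^n:

```python
def prob_all_correct(error_rate, reads):
    """Probability that every one of `reads` independent reads is correct."""
    return (1.0 - error_rate) ** reads


# Chance that 100 reads are all correct at a 2% per-read failure rate.
batch_success = prob_all_correct(0.02, 100)
```

At a 2% failure rate, `batch_success` comes out at roughly 13%, so a batch of 100 numbers would almost always contain at least one wrong value.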

for_your_info · on May 22, 2020

Your bash script is totally broken. It doesn't properly parse command line arguments, and it ignores the page range set with -t -f.

undebuggable · on May 22, 2020

This should work now, thanks.

ryanfox · on May 22, 2020

I built an application for exactly this. It's called A Personal Search Engine, APSE for short.[0]

It OCRs screenshots and stores the text in a search index, so you can query by keyword, date, boolean operators, the whole shebang.

It's all local. It is really useful for me - yesterday it saved me after Firefox wigged out and lost all my tabs. It's in a great place to try out, and I am actively developing it.

[0] https://apse.io
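For readers curious how keyword and boolean queries over OCRed text can work in principle, here is a toy inverted index. It is a generic sketch and says nothing about how APSE is actually implemented; the `TinyIndex` class and its method names are invented for the example.

```python
import re
from collections import defaultdict


class TinyIndex:
    """A toy inverted index mapping each word to the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index the OCR text of one document (for example, one screenshot)."""
        for word in re.findall(r"\w+", text.lower()):
            self.postings[word].add(doc_id)

    def search_all(self, *words):
        """AND query: documents that contain every given word."""
        sets = [self.postings.get(w.lower(), set()) for w in words]
        return set.intersection(*sets) if sets else set()

    def search_any(self, *words):
        """OR query: documents that contain at least one given word."""
        return set().union(*(self.postings.get(w.lower(), set()) for w in words))
```

A real search index would add things like date fields, ranking, and stemming on top of this basic structure.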

pezo1919 · on May 22, 2020

That's cool. How can I make sure it does not send my data somewhere? :)

ryanfox · on May 22, 2020

Thanks!

You could block it at the firewall - same as you would for any application.

Jaruzel · on May 22, 2020

Ahh the paranoia of HN strikes again.

Maybe, stop and think before you ask this. Someone offers up an example of their hard work and you instantly accuse them of being a malware author that steals your data. Nice.

stragies · on May 22, 2020

I would rather call it the new-found paranoia most legal departments now have, when IT dept. mentions rolling out new, unknown, non-auditable software in the company from a new vendor that wasn't grand-fathered in (= the legacy windows/cisco deployments). I'm happy about it. I'm just waiting for them to forbid closed-source firmwares/hardware.

asguy · on May 22, 2020

ocrmypdf and friends.

I've built an archival system based around Tesseract and PostgreSQL. It takes Images/PDFs, either scanned or generated, and rebuilds them as searchable PDFs before being extracted and inserted into Postgres' full-text search. I keep all of the original media because disk is cheap.

Originally I used Tesseract directly. But I found that ocrmypdf did a better job than my home-grown pipeline, so I switched.

tuddman · on May 22, 2020

I also built a system that extracted structured and unstructured text from images/pdfs. For the generated pdfs, I found pdftotext could pull with 100% fidelity, and so that was 'option #1'. for scanned-images-saved-as-pdfs, then tesseract could sometimes extract with 90+% accuracy. But never 100%. Combining pdftotext (with the right flags set) with some of the other associated pdf-tools, we were able to achieve what we were after: Building a searchable DB and auto-informing corpus of information derived entirely from various pdf sources. All in-house. No sending off to 3rd parties.
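The "pdftotext first, OCR as a fallback" idea can be sketched roughly like this. The `-layout` flag, the 20-character threshold, and the `ocr_fallback` callback are assumptions for the example; the flags actually used in the system described above are not stated.

```python
import subprocess


def pdftotext_cmd(pdf_path):
    """Build a pdftotext command that keeps layout and writes to stdout ("-")."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def has_text_layer(extracted, min_chars=20):
    """Heuristic: enough non-whitespace characters means a usable text layer.

    Scanned PDFs typically yield only whitespace and form feeds.
    The threshold of 20 characters is an arbitrary choice for this sketch.
    """
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def extract_text(pdf_path, ocr_fallback):
    """Try pdftotext first; hand scanned PDFs to ocr_fallback(pdf_path)."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path), capture_output=True, text=True, check=True
    )
    if has_text_layer(result.stdout):
        return result.stdout
    return ocr_fallback(pdf_path)
```

Here `ocr_fallback` would be whatever Tesseract-based routine handles the scanned-image case.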

undebuggable · on May 22, 2020

> For the generated pdfs, I found pdftotext could pull with 100% fidelity, and so that was 'option #1'. for scanned-images-saved-as-pdfs, then tesseract could sometimes extract with 90+% accuracy.

I arrived at a similar conclusion, although I never bothered with a DB or a locally running web interface. Simply grepping the text files works flawlessly for me.

thibautg · on May 22, 2020

Cool, I've done exactly the same with ocrmypdf! With a Django web app to search through the scanned documents (around 30k documents and 200k pages).

Ologn · on May 22, 2020

Tesseract is the best FOSS one I found when I looked a while back. I don't foresee a superior FOSS one any time soon unless a commercial one is open-sourced, or one utilizing deep learning comes out.

kranner · on May 22, 2020

Tesseract 4 has switched to using a deep learning engine by default.

crocodiletears · on May 22, 2020

Very nice. This is quite similar to what I've been building for myself.

tyingq · on May 22, 2020

Here's a blog post showing self hosted PyTesseract finding text in an image and preserving the format: https://stackabuse.com/pytesseract-simple-python-optical-cha...

There's a reason why the external services are popular though...lots of training data and tweaks to make them much more accurate. Try the Google demo here, for example: https://cloud.google.com/vision/docs/ocr

flicken · on May 22, 2020

Although https://www.willus.com/k2pdfopt/ is meant for reformatting PDFs to view on e-readers, it does do a reasonable job of extracting text via OCR and storing as a PDF layer. The underlying engine can be either https://github.com/tesseract-ocr/tesseract or http://jocr.sourceforge.net/

faustomorales · on May 22, 2020

If you are (a) willing to take the word bounding boxes and convert them to paragraphs yourself, and (b) okay with a deep learning approach, you may want to give keras-ocr [0] a try.

Full disclosure: I'm the primary package developer. Shameless plug. :)

[0] https://github.com/faustomorales/keras-ocr
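Step (a), turning word boxes into lines of text, can be approximated with a simple heuristic. The `(text, x, y, height)` tuple format below is a simplification invented for this sketch and is not keras-ocr's actual output format; the tolerance value is likewise an assumption.

```python
def group_words_into_lines(words, tolerance=0.5):
    """Group (text, x, y, height) word boxes into lines of text.

    x and y are the top-left corner of each box. A word joins the current
    line when its vertical centre is within tolerance * height of the
    centre of the word that started that line.
    """
    lines = []  # each entry: {"center": float, "words": [(x, text), ...]}
    by_center = sorted(words, key=lambda w: w[2] + w[3] / 2)
    for text, x, y, height in by_center:
        center = y + height / 2
        if lines and abs(center - lines[-1]["center"]) <= tolerance * height:
            lines[-1]["words"].append((x, text))
        else:
            lines.append({"center": center, "words": [(x, text)]})
    # Within each line, order the words left to right by x coordinate.
    return [" ".join(t for _, t in sorted(line["words"])) for line in lines]
```

Grouping the resulting lines into paragraphs could follow the same pattern, using the vertical gap between consecutive lines.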

nacho_man · on May 22, 2020

This doesn't meet most of your requirements (Go, Node, Python, and it's a manual process...), but... maybe this would be helpful?

On Mac I use a modified version of this Keyboard Maestro script, to OCR a user selected area of the screen.

This script puts the OCR text on the clipboard. I'm sure Keyboard Maestro could automagically append it to a text file or something. I'm kinda a noob with Keyboard Maestro, so I don't know all of its functionality.

I have a couple variations of this script, one that will use the Mac's speak this command to read aloud the OCR text, as I am a slow reader, and an auditory learner.

My father had a bunch of newspaper clippings scanned into the family tree application and wanted the text. I used this method to get the text instead of typing it all out.

https://forum.keyboardmaestro.com/t/ocr-user-selected-area-m...

kamalfariz · on May 22, 2020

OCR techniques are general purpose in trying to map any conceivable text-looking shapes into actual text. Accuracy can vary wildly but the good ones will match against plausible words to eliminate low quality guesses.

Is there an accuracy optimization to be found if I can pre-train the OCR engine to look for a limited set of words instead of the entire dictionary- and printable character space?

The use case I have is OCRing shipping labels for packages that arrive at an office. The set of plausible matches is incredibly small as it is the set of employee names that work in said office.

Further optimizations include reducing the problem space by only considering computer printed glyphs and not bothering with handwritten labels, and the insight that the distribution of packages follow a power law where a disproportionately small group of people receive the largest number of packages.

The end goal is to perform this entirely on device, with low latency and high accuracy.
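One cheap approach that avoids retraining the OCR engine is to post-correct its output against the known name list. The sketch below uses fuzzy string matching from Python's standard library; it is a swapped-in technique, not pre-training, and the 0.6 cutoff is an assumption that would need tuning.

```python
import difflib


def match_name(ocr_text, employee_names, cutoff=0.6):
    """Return the employee name closest to the OCR output, or None.

    Matching is case-insensitive and uses difflib's similarity ratio.
    """
    lowered = {name.lower(): name for name in employee_names}
    hits = difflib.get_close_matches(
        ocr_text.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None
```

Because the candidate set is tiny, even a noisy read such as "A1ice Johnsan" can be resolved to "Alice Johnson", while reads that resemble no known name are rejected rather than guessed.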

hsson · on May 22, 2020

Consider looking into language models such as KenLM. It is used by ASR models like wav2letter and DeepSpeech to correct speech-to-text transcripts

kranner · on May 22, 2020

Try https://screenotate.com/

(no affiliation, just a user)

inetsee · on May 22, 2020

One problem that I have with OCR is dealing with images of pages that are warped. I have some books that I would like to turn into electronic books, but not enough to justify setting up a book scanning rig (framework, two cameras, platen, etc). Setting up a document camera is fairly easy, but using it to take pictures of a book lying flat on the base produces images where the pages are warped, and most OCR software seems to have problems with warped pages.

After a fair amount of searching I found ScanTailor: https://github.com/4lex4/scantailor-advanced#scan-tailor-adv... which seems to have the capability of dealing with warped page images. I haven't actually gone through the complete workflow with it yet, but it seems to be a very capable package for preparing scans for OCR.

umvi · on May 22, 2020

I used this[0] in conjunction with Tesseract, and it worked pretty well.

[0] https://github.com/mzucker/page_dewarp

inetsee · on May 22, 2020

Thank you. This does look like it has an easier workflow than ScanTailor. I'll have to give it a try.

coderguy123 · on May 22, 2020

I hate to post my own app but it does do part of what you ask and it does it locally. Nothing is sent to any server.

https://www.dizzybits.com/Photoplex

It does on-device text recognition on your photos, stores on local SQLite database and lets you full text search.

Jugurtha · on May 22, 2020

Site: https://openpaper.work/

Repo: https://gitlab.gnome.org/World/OpenPaperwork/paperwork

Brainsnail · on May 22, 2020

https://github.com/axa-group/Parsr

FloatArtifact · on May 22, 2020

I'm interested in drawing bounding boxes around text that can be displayed to the end user. In this case I don't care about OCR accuracy, only about the ability to detect text accurately across different mediums of type. Any thoughts on a framework for this with low latency, under 150 ms or so?

jangia · on May 22, 2020

You may set up your OCR service on AWS Lambda.

I wrote a guide on how to do it here:

https://typless.com/2020/05/21/tesseract-on-aws-lambda-ocr-a...

Hope it helps

cl0rkster · on May 22, 2020

just search for "tesseract GUI". if you are more technical, you can write code around tesseract. for what you get for free, it's really impressive what Google has done with this in just a few years to make it something that the average person can really consider using for free.

ex. https://github.com/tesseract4java/tesseract4java

misiti3780 · on May 22, 2020

I know you said you didn't want to upload stuff to third parties, but Amazon Textract works great and supports HIPAA data.

crocodiletears · on May 22, 2020

Plenty of fantastic suggestions in the comments, any one of which looks like it could do the trick. Not having any experience in the problem domain, I'm afraid I don't have much to contribute in response, but I look forward to evaluating each framework/service.

lowdose · on May 22, 2020

Why not upload it to Google Photos? It will do the OCR and make the text in your photos / screenshots searchable with a sweet UI in the browser.

If you still want to grab the text yourself, you can make a copy to Google Keep and use the "grab text" function.

Works for me, I take full screenshots of interesting stuff so the url is still visible when I want to go back to the original.

Obviously I have a paid G Suite account at Google. That comes with a very good set of privacy-protecting rules. It doesn't matter how you roll your stack, eventually you are going to be dependent on a third party. Better to use one that offers full encryption and 2FA to lock up your data.

https://gsuite.google.com/learn-more/security/security-white...

crocodiletears · on May 22, 2020

I have a number of screenshots containing conversations, documents, and PII that I don't necessarily trust in the hands of third parties, and that I don't feel I'm at liberty to share with them.

Beyond that, as exceptional as Gsuite is, I've been making a conscious effort to excise Alphabet/Google services from my life - it's just not a company I trust.

lowdose · on May 22, 2020

Isn't that data already in the hands of third parties when they are screenshots of conversations and documents, or did you also build that communication stack from the ground up?

crocodiletears · on May 22, 2020

I'd frame it like this:

With respect to online conversations - most of them are on the open-web, anyone can see them. I don't care if their content gets out. Private conversations should be kept between their participants, their host, and their host's infrastructure provider.

More saliently however, many of these screenshots contain incidental data which I wouldn't necessarily want to be centralized off of my own hardware. This ranges from the identities of multiple alt-accounts, who they follow on social media, to generic information about my social graph. They also include receipts of much of my online transaction history.

While I'm under no delusion that much of that data doesn't travel all over the universe via data brokers and information sharing agreements, I'm just not comfortable directly handing it all to any one company.

If I was working on a commercial project, I'd leap at the opportunity to outsource the task of content transcription - it would save me time, money, and quite probably give me better results.

But since I want to feed it all into my personal archive, which runs on my own hardware and is as much a learning project as it is a utility, and since I like to keep my personal life as personal as possible, I make a point of keeping everything self-hosted wherever possible.

I'll fully admit that it's paranoid, labor-intensive, likely ineffectual, and by most measures a bit excessive.

But there are few places where one is at liberty to draw a line in the sand anymore with how their data is distributed. This is simply where I've chosen to draw one of mine.

lowdose · on May 22, 2020

Look, I fully agree with you if that is what you want, and you are fully aware of the trade-off you are making.

When you pull this off you are a very talented skilled engineer. I hope you open source your solution so friction is removed for other people with a similar dilemma in the future.

Our time is the only currency we have, and we can pursue activities we love or fear. The line between paranoia and choosing personal freedom is thin and very personal.

I came to the conclusion for myself that I have spent too much time on home-grown solutions for problems others have solved better and cheaper. Getting from "it works 80% of the time" to "it works 99% of the time and I can blindly trust my infra" is the difference between a weekend and a year of full-time work.

I chose G Suite because at least Google offers me a paid option to exclude my account from their advertisement data monetizing branch.

I do really respect that you make a deliberate effort in this.