Ask HN: OCR framework for extracting formatted text?
146 points by crocodiletears on May 22, 2020 | 42 comments

I'm a serial information hoarder, and often use screenshots in order to store comments, passages and fragments of conversations I find useful or insightful. This works well if I want to reference something recent, but obviously doesn't scale well.

I'd like to integrate these into my personal archive, but don't know any frameworks (preferably for Go, Node, or Python) which could automatically extract the text from the images while retaining its formatting.

I'm not against doing some image preprocessing myself, but I don't feel comfortable passing the images to a 3rd party service, since a portion of the images contain private or sensitive information that I can't readily sort out of my collection.
I use my own shell utility [3] to automate the workflow with ImageMagick and Tesseract, using monochrome TIFFs as an intermediate step. Extracting each page into a separate text file lets me ag/grep for a phrase and then easily find it again in the original PDF.
Having greppable libraries of books on various domains, instead of having to crawl through web search results each time, is very useful and saves a lot of time.
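
For anyone who would rather script this in Python, here is a rough sketch of the same idea: render each PDF page to a monochrome TIFF with ImageMagick, OCR it with Tesseract into one text file per page, then search those files for a phrase. This is not how pdf2txt [3] is implemented; the function names, the page-NNNN file naming, the 300 DPI default and the `eng` language are all assumptions made for illustration, and the caller has to supply the page count. It assumes the `convert` and `tesseract` binaries are on the PATH (newer ImageMagick releases ship the command as `magick`).

```python
import re
import subprocess
from pathlib import Path


def convert_command(pdf: Path, page: int, tiff: Path, dpi: int = 300) -> list[str]:
    """ImageMagick call that renders one PDF page (0-based) as a monochrome TIFF."""
    return ["convert", "-density", str(dpi), f"{pdf}[{page}]",
            "-monochrome", "-compress", "Group4", str(tiff)]


def tesseract_command(tiff: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """Tesseract call; it writes the recognized text to '<out_base>.txt'."""
    return ["tesseract", str(tiff), str(out_base), "-l", lang]


def ocr_pdf(pdf: Path, out_dir: Path, page_count: int) -> None:
    """OCR every page into out_dir/page-0001.txt, page-0002.txt, ...

    page_count is supplied by the caller (hypothetical design choice).
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    for page in range(page_count):
        base = out_dir / f"page-{page + 1:04d}"
        tiff = base.with_suffix(".tif")
        subprocess.run(convert_command(pdf, page, tiff), check=True)
        subprocess.run(tesseract_command(tiff, base), check=True)
        tiff.unlink()  # the TIFF is only an intermediate artifact


def find_phrase(out_dir: Path, phrase: str) -> list[int]:
    """Return the 1-based page numbers whose text contains the phrase (case-insensitive)."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    hits = []
    for txt in sorted(out_dir.glob("page-*.txt")):
        if pattern.search(txt.read_text(encoding="utf-8", errors="replace")):
            hits.append(int(txt.stem.split("-")[1]))
    return hits
```

Because each page gets its own text file, the page number of a hit comes straight from the file name, which is what makes it easy to jump back to the right place in the original PDF. Plain ag/grep over the output directory does the same job as `find_phrase`.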
[1] https://tesseract-ocr.github.io/
[2] https://github.com/tesseract-ocr/tessdata
[3] https://github.com/undebuggable/pdf2txt