
> It takes images and PDFs as input

If you are working with PDF, I would suggest a hybrid process.

It is feasible to extract information with 100% accuracy from PDFs that were generated with mappable AcroForm fields, because the values are stored as structured data and can be read directly. In many domains, you have a fixed set of forms to process, and you can leverage that to build a custom tool for extracting the data.

Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc., should you need to reach for OCR.

The moment you need to reach for OCR, you are in a completely different regime of what the business will (or should) tolerate in terms of accuracy.
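To make the hybrid idea concrete, here is a rough sketch of the routing step: read the form fields if the PDF has them, check them against a registry of forms you already know, and only fall back to OCR when nothing matches. The registry, the form name, the field names, and the `route` helper are all made up for illustration. `read_form_fields` assumes the third-party pypdf library and is just one possible way to get the field values out.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

# Hypothetical registry: form id -> field names that form is expected to carry.
KNOWN_FORMS = {
    "account_opening_v2": frozenset(
        {"customer_name", "account_type", "date_signed"}
    ),
}


@dataclass(frozen=True)
class Extraction:
    method: str              # "acroform" or "ocr"
    form_id: Optional[str]   # matched registry entry, if any
    values: Mapping[str, str]


def read_form_fields(path: str) -> Dict[str, str]:
    """Read AcroForm field values from a PDF (assumes pypdf is installed)."""
    from pypdf import PdfReader  # imported lazily; third-party dependency

    fields = PdfReader(path).get_fields() or {}
    # "/V" is the PDF key that holds a field's current value.
    return {name: str(field.get("/V") or "") for name, field in fields.items()}


def match_known_form(fields: Mapping[str, str]) -> Optional[str]:
    """Return the id of the first known form whose fields are all present."""
    for form_id, expected in KNOWN_FORMS.items():
        if expected.issubset(fields):
            return form_id
    return None


def route(fields: Mapping[str, str]) -> Extraction:
    """Use direct field extraction for known forms, otherwise flag for OCR."""
    form_id = match_known_form(fields)
    if form_id is None:
        # Unknown layout, scan, or photo: this file goes to the OCR pipeline.
        return Extraction("ocr", None, {})
    wanted = KNOWN_FORMS[form_id]
    return Extraction("acroform", form_id, {k: fields[k] for k in wanted})
```

In practice you would call `route(read_form_fields(path))` and send anything that comes back with `method == "ocr"` down the slower, lower-confidence path with whatever review the business requires.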




> Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc., should you need to reach for OCR.

It's always safer to OCR every file. Sometimes you'll have a "clean" PDF that contains a screenshot of an Excel table, or a scanned image that has already been OCR'd by a lower-quality tool (like the built-in Adobe OCR). If you rely on the embedded text in those cases, you're going to get pretty unpredictable results.

It's way easier (and more standardized) to run OCR on every file, rather than trying to guess at the contents based on the metadata.


It's not guessing if the form is known and you can read the information directly.

This is a common scenario at many banks. You can expect nearly perfect metadata for anything pushed into their document storage system within the last decade.


Oh yeah, if the form is known and standardized, everything is a lot easier.

But we work with banks on our side, and one of the most common scenarios is customers uploading financials/bills/statements from thousands of different providers, in which case it's impossible to know every format in advance.



