> It takes images and PDFs as input. If you are working with PDF, I would suggest...

themanmaran · 2025-03-06T18:43:23 1741286603

> Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc should you need to reach for OCR.

It's always safer to run OCR on every file. Sometimes you'll have a "clean" PDF that contains a screenshot of an Excel table, or a scanned image that has already been OCR'd by a lower-quality tool (like the built-in Adobe OCR). If you rely on the existing text layer in those cases, you're going to get pretty unpredictable results.

It's way easier (and more standardized) to run OCR on every file, rather than trying to guess at the contents based on the metadata.
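
A rough sketch of how guessing can go wrong. The `PageInfo` fields and the `min_chars` threshold are hypothetical and not taken from any particular PDF library. A page with a real text layer plus an embedded screenshot passes a naive "has text" check, so the table inside the image never gets read:

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    # Hypothetical per-page summary; a real pipeline would fill this in
    # from whatever PDF parser it uses.
    text_chars: int          # characters found in the embedded text layer
    image_area_ratio: float  # fraction of the page covered by raster images


def looks_clean(pages, min_chars=200):
    # Naive heuristic: trust the text layer if every page has enough text.
    return all(p.text_chars >= min_chars for p in pages)


def should_ocr(pages, always_ocr):
    # Policy switch: either OCR everything, or guess from the text layer.
    return True if always_ocr else not looks_clean(pages)


# A "clean" PDF whose second page is mostly a screenshot of an Excel table.
doc = [
    PageInfo(text_chars=1800, image_area_ratio=0.0),
    PageInfo(text_chars=350, image_area_ratio=0.7),
]

print(should_ocr(doc, always_ocr=False))  # False: the screenshot is skipped
print(should_ocr(doc, always_ocr=True))   # True
```

You could tighten the heuristic by also looking at `image_area_ratio`, but every extra rule is one more guess to maintain.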

bob1029 · 2025-03-06T19:01:03 1741287663

It's not guessing if the form is known and you can read the information directly.

This is a common scenario at many banks. You can expect nearly perfect metadata for anything pushed into their document storage system within the last decade.

themanmaran · 2025-03-06T20:07:14 1741291634

Oh yeah, if the form is known and standardized, everything is a lot easier.

But we work with banks on our side, and one of the most common scenarios is customers uploading financials/bills/statements from thousands of different providers. In that case, it's impossible to know every format in advance.
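
A minimal sketch of how the two cases might coexist, assuming documents arrive with an optional form identifier: read known forms directly and fall back to OCR for everything else. `KNOWN_FORMS`, the form IDs, and `route` are hypothetical names for illustration only.

```python
# Hypothetical registry of form IDs whose layout and metadata are trusted.
KNOWN_FORMS = {"acme-bank-statement-v2", "internal-loan-application-v5"}


def route(form_id):
    # Known, standardized form: read the fields directly.
    if form_id is not None and form_id in KNOWN_FORMS:
        return "direct_read"
    # Unknown provider or missing metadata: fall back to OCR.
    return "ocr"


print(route("acme-bank-statement-v2"))              # direct_read
print(route("utility-bill-from-unknown-provider"))  # ocr
print(route(None))                                  # ocr
```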