
We're approaching the point where OCR becomes "solved" — very exciting! Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.

However, IMO there's still a large gap for businesses between raw OCR output and document processing deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation eventually, but it's going to take time and effort. But the future is on the horizon!
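Roughly, the orchestration ends up looking something like this (a minimal sketch; classify/split/extract are placeholders for whatever models you plug in, and the confidence threshold is arbitrary):

    CONFIDENCE_THRESHOLD = 0.9  # arbitrary cutoff for routing to human review

    def process_document(pages, classify, split, extract):
        """Classify a document, split it into sub-documents, extract fields,
        and flag low-confidence extractions for human review."""
        doc_type = classify(pages)              # e.g. "invoice", "w2", "lease"
        results, review_queue = [], []
        for sub_doc in split(pages, doc_type):  # split multi-document PDFs
            fields, confidence = extract(sub_doc, doc_type)
            if confidence < CONFIDENCE_THRESHOLD:
                review_queue.append((sub_doc, fields))  # human-in-the-loop
            else:
                results.append(fields)
        return results, review_queue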

Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)




One problem I've encountered at my small startup in evaluating OCR technologies is precisely convincing stakeholders that the "human-in-the-loop" part is both unavoidable and ultimately beneficial.

PMs want to hear that an OCR solution will be fully automated out-of-the-box. My gut says that anything offering that is snake oil, and I try to convey that the OCR solution they want is possible, but if they're unwilling to pay the tuning cost, it's going to flop out of the gate. At that point they lose interest and move on to other priorities.


Yup definitely, and this is exactly why I built my startup. I've heard this a bunch across the startups & large enterprises that we work with. 100% automation is an impossible target, because even humans are not 100% perfect. So how can we expect LLMs to be?

But that doesn't mean you have to abandon the effort. You can still definitely achieve production-grade accuracy! It just requires having the right tooling in place, which reduces the upfront tuning cost. We typically see folks get there on the order of days or 1-2 weeks (it doesn't necessarily need to take months).


It really depends on their fault tolerance. I think there are a ton of useful applications where OCR that's 99.9%, 99%, or even 98% reliable is good enough. A skillful product manager can keep these limitations in mind and work around them.


... unavoidable "human in the loop" - depends imo.

From the comments here, it certainly seems that for general OCR it's not up to snuff yet. Luckily, I don't have great ambitions.

I can see this working for me with just a little careful upfront preprocessing, now that I know where it falls over. It casually skips portions of the document and misses certain lines consistently. Knowing that, I can do a bit of massaging, feed it what I know it likes, and then reassemble.

I found in testing that it failed consistently at certain parts, but where it worked, it worked extremely well in contrast to other methods/services that I've been using.
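Concretely, the "feed it what it likes and reassemble" step can be as simple as splitting the PDF page by page before OCR (a rough sketch; pypdf for the splitting, and run_ocr stands in for whatever model/API call is being made):

    from io import BytesIO
    from pypdf import PdfReader, PdfWriter

    def ocr_page_by_page(pdf_path, run_ocr):
        """Split a PDF into single pages, OCR each separately, and reassemble.
        run_ocr(pdf_bytes) -> str is whatever model/API you're calling;
        smaller chunks make it less likely to silently skip sections."""
        reader = PdfReader(pdf_path)
        chunks = []
        for page in reader.pages:
            writer = PdfWriter()
            writer.add_page(page)
            buf = BytesIO()
            writer.write(buf)
            chunks.append(run_ocr(buf.getvalue()))
        return "\n\n".join(chunks)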


>> Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.

-OR- they can just use these APIs. Considering they have a client base that would prefer not to rewrite integrations to get the same result, they can get rid of most of their code base, replace it with an LLM API, increase margins by 90%, and enjoy the good life.


They're going to become commoditized unless they add value elsewhere. Good news for customers.


They are (or at least could easily be) adding value in the form of an SLA - charging money for guarantees on accuracy. This is better both for the customer, who gets concrete guarantees and someone to shift liability to, and for the vendor, who can focus on creating techniques and systems for getting that extra % of reliability out of the LLM OCR process.

All of the above are things companies - particularly larger ones - are happy to pay for, because OCR is just a cog in the machine, and this makes it more reliable and predictable.

On top of that, there are auxiliary value-adds such a vendor could provide - such as being fully compliant with every EU directive and regulation that's in force, or about to be. There's plenty of those, they overlap, and no one wants to deal with it if they can outsource it to someone who has already figured it out.

(And, again, will take the blame for fuckups. Being a liability sink is always a huge value-add, in any industry.)


The challenge I have is how to get bounding boxes for the OCR, for things like redaction/de-identification.


AWS Textract works pretty well for this and is much cheaper than running LLMs.
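For reference, getting word-level boxes out of Textract is only a few lines with boto3 (a sketch; it assumes a single image's bytes, and the boxes come back normalized to 0-1 page coordinates):

    import boto3

    def textract_word_boxes(image_bytes):
        """Return (text, bounding_box) pairs for each detected word.
        BoundingBox has Left/Top/Width/Height normalized to the page size."""
        client = boto3.client("textract")
        response = client.detect_document_text(Document={"Bytes": image_bytes})
        return [(block["Text"], block["Geometry"]["BoundingBox"])
                for block in response["Blocks"]
                if block["BlockType"] == "WORD"]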


Textract is more expensive than this (for your first 1M pages per month at least) and significantly more than something like Gemini Flash. I agree it works pretty well though - definitely better than any of the open source pure OCR solutions I've tried.


yeah that's a fun challenge - what we've seen work well is a system that forces the LLM to generate citations for all extracted data, maps those back to the original OCR content, and then generates bounding boxes that way. Tons of edge cases for sure that we've built a suite of heuristics for over time, but overall it works really well.
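One way to picture the mapping step (a simplified illustration, not the actual system; it fuzzy-matches each extracted value against consecutive OCR words and returns the union of the matching word boxes):

    from difflib import SequenceMatcher

    def find_bounding_box(value, ocr_words, max_span=20, min_score=0.8):
        """ocr_words is a list of (text, (x0, y0, x1, y1)) pairs in reading
        order. Find the contiguous run of words whose joined text best matches
        the extracted value; return the union of their boxes, or None to fall
        back to human review."""
        best_score, best_span = 0.0, None
        for i in range(len(ocr_words)):
            joined = ""
            for j in range(i, min(i + max_span, len(ocr_words))):
                joined = (joined + " " + ocr_words[j][0]).strip()
                score = SequenceMatcher(None, value.lower(), joined.lower()).ratio()
                if score > best_score:
                    best_score, best_span = score, (i, j)
        if best_span is None or best_score < min_score:
            return None
        boxes = [box for _, box in ocr_words[best_span[0]:best_span[1] + 1]]
        return (min(b[0] for b in boxes), min(b[1] for b in boxes),
                max(b[2] for b in boxes), max(b[3] for b in boxes))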


Why would you do this and not use Textract?


I too have this question.


An LLM with billions of parameters for extracting text from a PDF (which isn't even a rasterized image) really does not "solve OCR".


Your customers include Checkr? Impressive. Are they referenceable?


btw - what 'dark patterns' does portkey contain?



