
We're approaching the point where OCR becomes "solved" — very exciting! Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.

However, IMO there's still a large gap for businesses between raw OCR output and document processing deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation eventually, but it's going to take time and effort. But the future is on the horizon!
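Roughly, the orchestration ends up looking something like this (a minimal sketch; classify/split/extract are placeholders for whatever models you plug in, and the confidence threshold is arbitrary):

    CONFIDENCE_THRESHOLD = 0.9  # arbitrary cutoff for routing to human review

    def process_document(pages, classify, split, extract):
        """Classify a document, split it into sub-documents, extract fields,
        and flag low-confidence extractions for human review."""
        doc_type = classify(pages)              # e.g. "invoice", "w2", "lease"
        results, review_queue = [], []
        for sub_doc in split(pages, doc_type):  # split multi-document PDFs
            fields, confidence = extract(sub_doc, doc_type)
            if confidence < CONFIDENCE_THRESHOLD:
                review_queue.append((sub_doc, fields))  # human-in-the-loop
            else:
                results.append(fields)
        return results, review_queue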

Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)




One problem I've encountered at my small startup in evaluating OCR technologies is precisely convincing stakeholders that the "human-in-the-loop" part is both unavoidable and ultimately beneficial.

PMs want to hear that an OCR solution will be fully automated out-of-the-box. My gut says that anything offering that is snake oil, and I try to convey that the OCR solution they want is possible, but if they're unwilling to pay the tuning cost, it's going to flop out of the gate. At that point they lose interest and move on to other priorities.


Yup definitely, and this is exactly why I built my startup. I've heard this a bunch across the startups & large enterprises that we work with. 100% automation is an impossible target, because even humans are not 100% perfect. So how can we expect LLMs to be?

But that doesn't mean you have to abandon the effort. You can still definitely achieve production-grade accuracy! It just requires having the right tooling in place, which reduces the upfront tuning cost. We typically see folks get there on the order of days or 1-2 weeks (it doesn't necessarily need to take months).


It really depends on their fault tolerance. I think there are a ton of useful applications where OCR that's 99.9%, 99%, or even 98% reliable is good enough. A skillful product manager can keep these limitations in mind and work around them.


... unavoidable "human in the loop" - depends imo.

From the comments here, it certainly seems that for general OCR it's not up to snuff yet. Luckily, I don't have great ambitions.

I can see this working for me with just a little careful upfront preprocessing, now that I know where it falls over. It casually skips portions of the document and misses certain lines consistently. Knowing that, I can do a bit of massaging, feed it what I know it likes, and then reassemble.

I found in testing that it failed consistently at certain parts, but where it worked, it worked extremely well in contrast to other methods/services that I've been using.
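Concretely, the "feed it what it likes and reassemble" step can be as simple as splitting the PDF page by page before OCR (a rough sketch; pypdf for the splitting, and run_ocr stands in for whatever model/API call is being made):

    from io import BytesIO
    from pypdf import PdfReader, PdfWriter

    def ocr_page_by_page(pdf_path, run_ocr):
        """Split a PDF into single pages, OCR each separately, and reassemble.
        run_ocr(pdf_bytes) -> str is whatever model/API you're calling;
        smaller chunks make it less likely to silently skip sections."""
        reader = PdfReader(pdf_path)
        chunks = []
        for page in reader.pages:
            writer = PdfWriter()
            writer.add_page(page)
            buf = BytesIO()
            writer.write(buf)
            chunks.append(run_ocr(buf.getvalue()))
        return "\n\n".join(chunks)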


>> Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.

-OR- they can just use these APIs. Considering they have a client base that would prefer not to rewrite integrations to get the same result, they can get rid of most of their code base, replace it with an LLM API, increase margins by 90%, and enjoy the good life.


They're going to become commoditized unless they add value elsewhere. Good news for customers.


They are (or at least could easily be) adding value in the form of an SLA - charging money for guarantees on accuracy. This is better both for the customer, who gets concrete guarantees and someone to shift liability to, and for the vendor, who can focus on creating techniques and systems for getting that extra % of reliability out of the LLM OCR process.

All of the above are things companies - particularly larger ones - are happy to pay for, because OCR is just a cog in the machine, and this makes it more reliable and predictable.

On top of that, there are auxiliary value-adds such a vendor could provide - such as being fully compliant with every EU directive and regulation that's in force, or about to be. There's plenty of those, they overlap, and no one wants to deal with it if they can outsource it to someone who has already figured it out.

(And, again, will take the blame for fuckups. Being a liability sink is always a huge value-add, in any industry.)


The challenge I have is how to get bounding boxes for the OCR, for things like redaction/de-identification.


AWS Textract works pretty well for this and is much cheaper than running LLMs.
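For reference, getting word-level boxes out of Textract is only a few lines with boto3 (a sketch; it assumes a single image's bytes, and the boxes come back normalized to 0-1 page coordinates):

    import boto3

    def textract_word_boxes(image_bytes):
        """Return (text, bounding_box) pairs for each detected word.
        BoundingBox has Left/Top/Width/Height normalized to the page size."""
        client = boto3.client("textract")
        response = client.detect_document_text(Document={"Bytes": image_bytes})
        return [(block["Text"], block["Geometry"]["BoundingBox"])
                for block in response["Blocks"]
                if block["BlockType"] == "WORD"]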


Textract is more expensive than this (for your first 1M pages per month at least) and significantly more than something like Gemini Flash. I agree it works pretty well though - definitely better than any of the open source pure OCR solutions I've tried.


yeah that's a fun challenge - what we've seen work well is a system that forces the LLM to generate citations for all extracted data, maps those back to the original OCR content, and then generates bounding boxes that way. Tons of edge cases for sure that we've built a suite of heuristics for over time, but overall it works really well.
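One way to picture the mapping step (a simplified illustration, not the actual system; it fuzzy-matches each extracted value against consecutive OCR words and returns the union of the matching word boxes):

    from difflib import SequenceMatcher

    def find_bounding_box(value, ocr_words, max_span=20, min_score=0.8):
        """ocr_words is a list of (text, (x0, y0, x1, y1)) pairs in reading
        order. Find the contiguous run of words whose joined text best matches
        the extracted value; return the union of their boxes, or None to fall
        back to human review."""
        best_score, best_span = 0.0, None
        for i in range(len(ocr_words)):
            joined = ""
            for j in range(i, min(i + max_span, len(ocr_words))):
                joined = (joined + " " + ocr_words[j][0]).strip()
                score = SequenceMatcher(None, value.lower(), joined.lower()).ratio()
                if score > best_score:
                    best_score, best_span = score, (i, j)
        if best_span is None or best_score < min_score:
            return None
        boxes = [box for _, box in ocr_words[best_span[0]:best_span[1] + 1]]
        return (min(b[0] for b in boxes), min(b[1] for b in boxes),
                max(b[2] for b in boxes), max(b[3] for b in boxes))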


Why would you do this and not use Textract?


I too have this question.


An LLM with billions of parameters for extracting text from a PDF (which isn't even a rasterized image) really does not "solve OCR".


Your customers include Checkr? Impressive. Are they referenceable?


btw - what 'dark patterns' does portkey contain?



