Thing is, the majority of OCR errors aren't character issues, but layout issues: things like complex tables with cells being returned under the wrong header. And if the numbers in an income statement end up one column off, that creates a pretty big risk.
Confidence intervals are a red herring, and only as good as the code interpreting them. If the OCR model gives you back 500 words all ranging from 0.70 to 0.95 confidence, what do you do? Reject the entire document if there's a single value below 0.90?
If so, you'd be passing every single document to human review and might as well not run the OCR. But if you're not rejecting based on CI, then you're exposed to just as much risk as using an LLM.
Having experience in this area (audit, legal): confidence intervals are essential. No, you don't end up "passing every single document" to human review. That's made-up nonsense. But confidence intervals can pretty easily flag poorly OCR'd documents, and then yes, those go to human review.
If you try to pitch hallucinations to these fields, they'll just choose 100% manual instead. It's a non-starter.
I work in a health insurance adjacent field. I can see my work going the way of the dodo as soon as VLMs take off at interpreting historical health records with physicians’ handwriting.
> But if you're not rejecting based on CI, then you're exposed to just as much risk as using an LLM.
That's not true. LLMs and OCR have very different failure modes. With LLMs, there is unbounded potential for hallucination, and the entire document is at risk. For example: if something in the lower right-hand corner of the page takes the model to a sparsely sampled part of the latent space, it can end up deciding that it makes sense to rewrite the document title! Or anything else. LLMs also have a pernicious habit of "helpfully" completing partial sentences that appear at the beginning or end of a page of text.
With OCR, errors are localized and have a greater chance of being detected when read.
I think for a lot of cases, the best solution is to fine-tune a model like LayoutLM, which can classify the actual text tokens in a document (whether obtained from OCR or a native text layer) using visual and spatial information. Then, there are no hallucinations and you can use uncertainty information from both the OCR (if used) and the text classification. But it does mean that you have to do the work of annotating data and training a model, rather than prompt engineering...
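A rough sketch of what that looks like with the HuggingFace LayoutLM checkpoint. The label set, page size, and example words/boxes below are made-up placeholders, and a real setup would fine-tune the classification head on annotated pages before the outputs mean anything:

```python
# Token classification over OCR'd words + bounding boxes with LayoutLM.
# Assumes OCR (or a native text layer) already produced words and pixel boxes.
import torch
from transformers import LayoutLMTokenizerFast, LayoutLMForTokenClassification

labels = ["O", "B-HEADER", "B-AMOUNT"]  # hypothetical label set
tokenizer = LayoutLMTokenizerFast.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=len(labels)
)  # classifier head is untrained here; fine-tune on annotated pages first

words = ["Revenue", "1,250"]                          # from OCR / text layer
boxes = [[60, 100, 180, 120], [400, 100, 470, 120]]   # pixel boxes on the page
page_width, page_height = 1000, 1400                  # placeholder page size

def normalize(box, width, height):
    # LayoutLM expects boxes on a 0-1000 grid.
    x0, y0, x1, y1 = box
    return [int(1000 * x0 / width), int(1000 * y0 / height),
            int(1000 * x1 / width), int(1000 * y1 / height)]

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Repeat each word's box for every sub-token it was split into.
word_ids = encoding.word_ids(batch_index=0)
bbox = [normalize(boxes[i], page_width, page_height) if i is not None else [0, 0, 0, 0]
        for i in word_ids]
encoding["bbox"] = torch.tensor([bbox])

with torch.no_grad():
    logits = model(**encoding).logits   # (1, seq_len, num_labels)
probs = logits.softmax(-1)
# Per-token class probabilities double as an uncertainty signal, alongside
# whatever word-level confidence the OCR engine already reports.
```

The point is that every prediction is tied to a specific token at a specific position, so there is nothing for the model to invent out of thin air.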
The thing is, regardless of the confidence number, you can scan an OCR'd document and mark garbled words and grammatical errors.
With VLM/LLM powered methods, the missing/misread data will be hallucinated, and you can't know whether something was scanned correctly or not. I personally scan and OCR tons of personal documents, and I prefer "gibberish" over "hallucinations", because gibberish is easier to catch.
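That "gibberish is easier to catch" point can even be partly automated. A crude sketch; the character-pattern heuristics here are purely illustrative, not a real quality model:

```python
import re

# Flag tokens that look like OCR gibberish: characters that rarely appear
# inside real words, or longish runs of consonants with no vowels at all.
SUSPICIOUS = re.compile(r"[^A-Za-z0-9,.$%/()'\-]")
NO_VOWEL = re.compile(r"^[^aeiouAEIOU0-9\W]{4,}$")

def flag_gibberish(text: str) -> list[str]:
    flagged = []
    for token in text.split():
        if SUSPICIOUS.search(token) or NO_VOWEL.match(token):
            flagged.append(token)
    return flagged

print(flag_gibberish("Total revenue 1,250 ~#@ wrttn qnthly"))
# -> ['~#@', 'wrttn', 'qnthly']
```

A hallucinated-but-fluent sentence sails straight past any check like this, which is exactly the problem.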
We had this problem before [0], on some Xerox scanners and copiers. Results will be disastrous. It's not a question of if, but when.
I personally tried Gemini and OpenAI's models for OCR, and no, I won't continue using them further.
Then use an LLM to extract layout information. Don’t trust it to read the text.
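One way that split can look in practice: ask the vision model only for region geometry, then let a conventional OCR engine read the text inside each region. The `ask_vlm_for_layout` function below is a hypothetical stand-in for whatever vision model you call; pytesseract does the actual reading.

```python
from PIL import Image
import pytesseract

def ask_vlm_for_layout(page_image):
    """Hypothetical placeholder for a vision-model call that returns labeled
    regions as (label, (left, top, right, bottom)) pixel boxes. Hardcoded
    here; only the geometry is trusted downstream, never the model's own
    transcription of the text."""
    return [("net_income", (420, 880, 640, 915))]

page = Image.open("income_statement.png")
extracted = {}
for label, box in ask_vlm_for_layout(page):
    crop = page.crop(box)                               # isolate one cell/region
    extracted[label] = pytesseract.image_to_string(crop).strip()
# The text comes from deterministic OCR; the LLM only decided where to look.
```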
> If the OCR model gives you back 500 words all ranging from 0.70 to 0.95 confidence, what do you do? Reject the entire document if there's a single value below 0.90?
No, of course not. You have a human review the words/segments with low confidence.
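Concretely, with an engine like Tesseract that is a per-word routing decision, not a whole-document reject. A sketch; the 0.90 cutoff is just the number from the comment above, mapped onto Tesseract's 0-100 scale:

```python
from PIL import Image
import pytesseract
from pytesseract import Output

THRESHOLD = 90  # Tesseract reports word confidence on a 0-100 scale

data = pytesseract.image_to_data(Image.open("page.png"), output_type=Output.DICT)

accepted, needs_review = [], []
for word, conf in zip(data["text"], data["conf"]):
    if not word.strip():
        continue              # skip empty boxes
    conf = float(conf)
    if conf < 0:
        continue              # -1 marks block/line rows, not words
    (needs_review if conf < THRESHOLD else accepted).append((word, conf))

# Only the low-confidence words (plus surrounding context) go to a human,
# not the entire document.
print(f"{len(needs_review)} of {len(accepted) + len(needs_review)} words flagged")
```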
That’s assuming that confidence intervals are even independently comparable. Anecdotally, major OCR services report average confidence intervals for some languages that are wildly divergent from similar services on other languages, for the same relative quality of result. Acting as if the confidence interval is in any way absolute, or otherwise able to reliably and consistently indicate the relative quality of results, is a mischaracterization at best. In the worst case, the CI is as good as an RNG. The value of the CI is in the ability to tune usage of the results based on observations of the users and the characteristics of the request; sometimes it is meaningful, but not always. In this case "good" code essentially hardcodes handling for all the idiosyncrasies of the common usage and the OCR service.