
> This is perfect!

Just a nit, but I wouldn’t call it perfect when it uses U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”. These are ordinal markers: https://fr.wikipedia.org/wiki/Adverbe_ordinal#Premiers_adver....

There are also extra spaces after the “1607” and around the hyphen in “Diane-Henriette”.

Lastly, U+2019 would be more appropriate than U+0027 for the apostrophe, especially since in the image it looks like the former and not the latter.
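
If you wanted to batch-clean output like this, a blunt post-processing pass could patch those specific nits. Rough Python sketch only; the sample string is made up to mirror the errors, not copied from the image:

    import re

    # Blunt post-OCR cleanup for the specific nits above; the sample
    # string below is invented to mirror the errors, not taken from the image.
    FIXUPS = str.maketrans({
        "\u25CB": "\u00BA",  # WHITE CIRCLE -> MASCULINE ORDINAL INDICATOR
        "\u0027": "\u2019",  # straight apostrophe -> RIGHT SINGLE QUOTATION MARK
    })

    def clean(text):
        text = text.translate(FIXUPS)
        text = re.sub(r"\s*-\s*", "-", text)            # tighten spaced hyphens
        text = re.sub(r"(\d)\s+([.,])", r"\1\2", text)  # drop stray space before punctuation
        return text

    print(clean("1607  , Diane - Henriette, n\u25CB 3, l'an"))
    # 1607, Diane-Henriette, nº 3, l’an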



Slightly unrelated, but I once used Apple’s built-in OCR feature, Live Text, to copy a short string out of an image. It appeared to work, but I later realized it had copied “M” as U+041C (Cyrillic Capital Letter Em), causing a regex to fail to match. OCR producing visually identical characters is only good enough until it’s not.
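
If anyone wants to catch that kind of thing, dumping the code points is usually enough to spot the impostor. Toy example; the string and regex are made up to reproduce the failure mode, not my actual case:

    import re
    import unicodedata

    s = "\u041CODEL"  # looks like "MODEL", but the first letter came back as Cyrillic
    for ch in s:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+041C CYRILLIC CAPITAL LETTER EM   <- the lookalike
    # U+004F LATIN CAPITAL LETTER O
    # ...

    print(bool(re.fullmatch(r"[A-Z]+", s)))  # False: the regex only accepts ASCII letters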


> Just a nit, but I wouldn’t call it perfect when using U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”

Or the degree sign, U+00B0. Although it should be able to figure out which one to use from the context.
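
That context dependence is why a blanket character substitution wouldn’t be enough. A toy heuristic, purely illustrative and nowhere near what a real recognizer would do:

    import re

    # Toy disambiguation of U+25CB based on surrounding context.
    def fix_white_circles(text):
        # looks like a temperature: "20○C" -> "20°C"
        text = re.sub("(\\d)\\s*\u25CB\\s*(?=[CF]\\b)", "\\1\u00B0", text)
        # French ordinal after n/N: "n○ 3" -> "nº 3"
        text = re.sub("([nN])\u25CB", "\\1\u00BA", text)
        return text

    print(fix_white_circles("20\u25CBC outside, entry n\u25CB 3"))
    # 20°C outside, entry nº 3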


This is "reasoning model" stuff even for humans :).


There is OCR software that detects which language is being used and then applies language-specific heuristics, such as character-sequence likelihoods and punctuation rules, to steer the character recognition.

I don’t think you need a reasoning model for that, just better training. Conversely, though, a reasoning model should hopefully notice such errors, although LLM tokenization might still throw a wrench into that.
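
The heuristic doesn’t need to be fancy, either. A caricature of the idea, with invented probabilities, is just rescoring confusable candidate readings against a per-language character model:

    import math

    # Caricature of language-aware rescoring: score each candidate reading
    # by character-bigram likelihood under the detected language's model.
    # The probabilities here are invented for illustration.
    FR_BIGRAM_LOGP = {
        ("n", "\u00BA"): math.log(0.015),  # "n" + ordinal indicator: plausible French
        ("n", "\u25CB"): math.log(1e-9),   # "n" + white circle: essentially never
    }
    FLOOR = math.log(1e-6)

    def score(candidate, model):
        return sum(model.get(pair, FLOOR) for pair in zip(candidate, candidate[1:]))

    candidates = ["n\u00BA", "n\u25CB"]  # readings the recognizer hesitates between
    print(max(candidates, key=lambda c: score(c, FR_BIGRAM_LOGP)))  # nº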


It feels like, after the OCR step, there should be language and subject-matter detection, followed by a final sweep with a spelling/grammar checker that has the right “dictionary” selected. (That, right there, is my naivety on the subject, but I would have thought that the type of problem you're describing isn't OCR but classical spelling and grammar checking?)
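
Something like this crude sketch, just to make the idea concrete; the wordlist is a tiny stand-in for whatever dictionary the detection step would have picked:

    import re

    # Crude sketch of the proposed sweep: tokenize the OCR output and flag
    # anything missing from the selected language's wordlist. The wordlist
    # here is a stand-in, not a real dictionary.
    FRENCH_WORDS = {"diane", "henriette"}

    def flag_suspect_tokens(ocr_text, dictionary):
        tokens = re.findall(r"[^\W\d_]+", ocr_text)  # letters only
        return [t for t in tokens if t.lower() not in dictionary]

    print(flag_suspect_tokens("1607 , Diane - Henriette, n\u25CB 3", FRENCH_WORDS))
    # ['n']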


It’s OCR because the wrong characters are being recognized. This is not about fixing spelling or punctuation mistakes present in the source image; rather, errors are being introduced because this OCR lacks accuracy with regard to punctuation and typography. The punctuation errors are no different in principle from the OCR producing a misspelled word that wasn’t misspelled in the image being OCRed.

A subsequent cleanup pass that fixes grammar/spelling errors, as you propose, wouldn’t be appropriate when the goal is to faithfully reproduce the original text.

And specifically for the “white circle” character, it would be difficult to correctly infer the original ordinal markers after the fact. I myself could only do so by inspecting the original image, i.e. by having my brain redo the OCR.


> A subsequent cleanup pass that fixes grammar/spelling errors, as you propose, wouldn’t be appropriate when the goal is to faithfully reproduce the original text

I suppose that depends on why it's wrong. Did the model accurately read a real typo in the image or did it incorrectly decipher a character? If a spelling & grammar pass fixes the latter, isn't it valid?


Not unrelated: OneNote's 'copy text from image' has started producing lots of incorrect OCR results, but they're all non-words.

For example, from a clear image of a printed page (in a standard font), it will give me 'cornprising' instead of 'comprising' and 'niatter' instead of 'matter'. Without the spell-check underline they'd be hard to spot, since with relatively tight kerning all the errors look like the originals.

I'm surprised because 1) I've not had these sorts of errors before, and 2) they're not words, and real words must be heavily weighted in the OCR engine (I'd have thought).
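
At least these rn/ni-style misreads are easy to flag after the fact, since the substitution turns a real word into a non-word. A rough sketch; the confusion list and wordlist are just examples:

    # Rough sketch: flag OCR tokens that become dictionary words once a
    # common glyph confusion is undone. Confusion pairs and wordlist are
    # illustrative, not exhaustive.
    CONFUSIONS = [("rn", "m"), ("ni", "m"), ("cl", "d"), ("vv", "w")]

    def likely_misreads(token, dictionary):
        fixes = []
        for wrong, right in CONFUSIONS:
            fixed = token.replace(wrong, right)
            if fixed != token and fixed.lower() in dictionary:
                fixes.append(fixed)
        return fixes

    words = {"comprising", "matter"}
    print(likely_misreads("cornprising", words))  # ['comprising']
    print(likely_misreads("niatter", words))      # ['matter']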



