
> This is perfect!

Just a nit, but I wouldn’t call it perfect when it uses U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”. These are ordinal markers: https://fr.wikipedia.org/wiki/Adverbe_ordinal#Premiers_adver....

There are also extra spaces after the “1607” and around the hyphen in “Diane-Henriette”.

Lastly, U+2019 would be more appropriate than U+0027 for the apostrophe, especially since in the image it looks like the former and not the latter.
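
If you wanted to batch-clean output like this, a blunt post-processing pass could patch those specific nits. Rough Python sketch only; the sample string is made up to mirror the errors, not copied from the image:

    import re

    # Blunt post-OCR cleanup for the specific nits above; the sample
    # string below is invented to mirror the errors, not taken from the image.
    FIXUPS = str.maketrans({
        "\u25CB": "\u00BA",  # WHITE CIRCLE -> MASCULINE ORDINAL INDICATOR
        "\u0027": "\u2019",  # straight apostrophe -> RIGHT SINGLE QUOTATION MARK
    })

    def clean(text):
        text = text.translate(FIXUPS)
        text = re.sub(r"\s*-\s*", "-", text)            # tighten spaced hyphens
        text = re.sub(r"(\d)\s+([.,])", r"\1\2", text)  # drop stray space before punctuation
        return text

    print(clean("1607  , Diane - Henriette, n\u25CB 3, l'an"))
    # 1607, Diane-Henriette, nº 3, l’an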



Slightly unrelated, but I once used Apple’s built-in OCR feature, Live Text, to copy a short string out of an image. It appeared to work, but I later realized it had copied “M” as U+041C (Cyrillic Capital Letter Em), causing a regex to fail to match. OCR producing visually identical characters is only good enough until it’s not.
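
If anyone wants to catch that kind of thing, dumping the code points is usually enough to spot the impostor. Toy example; the string and regex are made up to reproduce the failure mode, not my actual case:

    import re
    import unicodedata

    s = "\u041CODEL"  # looks like "MODEL", but the first letter came back as Cyrillic
    for ch in s:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+041C CYRILLIC CAPITAL LETTER EM   <- the lookalike
    # U+004F LATIN CAPITAL LETTER O
    # ...

    print(bool(re.fullmatch(r"[A-Z]+", s)))  # False: the regex only accepts ASCII letters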


> Just a nit, but I wouldn’t call it perfect when using U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”

Or the degree sign, U+00B0. Although it should be able to figure out which one to use from the context.
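
That context dependence is why a blanket character substitution wouldn’t be enough. A toy heuristic, purely illustrative and nowhere near what a real recognizer would do:

    import re

    # Toy disambiguation of U+25CB based on surrounding context.
    def fix_white_circles(text):
        # looks like a temperature: "20○C" -> "20°C"
        text = re.sub("(\\d)\\s*\u25CB\\s*(?=[CF]\\b)", "\\1\u00B0", text)
        # French ordinal after n/N: "n○ 3" -> "nº 3"
        text = re.sub("([nN])\u25CB", "\\1\u00BA", text)
        return text

    print(fix_white_circles("20\u25CBC outside, entry n\u25CB 3"))
    # 20°C outside, entry nº 3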


This is "reasoning model" stuff even for humans :).


There is OCR software that detects which language is being used and then applies language-specific heuristics, such as character-sequence likelihoods and punctuation rules, to steer the character recognition.

I don’t think you need a reasoning model for that, just better training. Conversely, though, a reasoning model should hopefully notice such errors, although LLM tokenization might still throw a wrench into that.
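
The heuristic doesn’t need to be fancy, either. A caricature of the idea, with invented probabilities, is just rescoring confusable candidate readings against a per-language character model:

    import math

    # Caricature of language-aware rescoring: score each candidate reading
    # by character-bigram likelihood under the detected language's model.
    # The probabilities here are invented for illustration.
    FR_BIGRAM_LOGP = {
        ("n", "\u00BA"): math.log(0.015),  # "n" + ordinal indicator: plausible French
        ("n", "\u25CB"): math.log(1e-9),   # "n" + white circle: essentially never
    }
    FLOOR = math.log(1e-6)

    def score(candidate, model):
        return sum(model.get(pair, FLOOR) for pair in zip(candidate, candidate[1:]))

    candidates = ["n\u00BA", "n\u25CB"]  # readings the recognizer hesitates between
    print(max(candidates, key=lambda c: score(c, FR_BIGRAM_LOGP)))  # nº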


It feels like, after the OCR step, there should be language and subject-matter detection, followed by a final sweep with a spelling/grammar checker that has the right “dictionary” selected. (That, right there, is my naivety on the subject, but I would have thought that the type of problem you're describing isn't OCR but classical spelling and grammar checking?)
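
Something like this crude sketch, just to make the idea concrete; the wordlist is a tiny stand-in for whatever dictionary the detection step would have picked:

    import re

    # Crude sketch of the proposed sweep: tokenize the OCR output and flag
    # anything missing from the selected language's wordlist. The wordlist
    # here is a stand-in, not a real dictionary.
    FRENCH_WORDS = {"diane", "henriette"}

    def flag_suspect_tokens(ocr_text, dictionary):
        tokens = re.findall(r"[^\W\d_]+", ocr_text)  # letters only
        return [t for t in tokens if t.lower() not in dictionary]

    print(flag_suspect_tokens("1607 , Diane - Henriette, n\u25CB 3", FRENCH_WORDS))
    # ['n']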


It’s OCR because the wrong characters are being recognized. This is not about fixing spelling or punctuation mistakes present in the source image; rather, errors are being introduced because this OCR lacks accuracy with regard to punctuation and typography. The punctuation errors are no different in principle from the OCR producing a misspelled word that wasn’t misspelled in the image being OCRed.

A subsequent cleanup pass that fixes grammar/spelling errors, as you propose, wouldn’t be appropriate when the goal is to faithfully reproduce the original text.

And specifically for the “white circle” character, it would be difficult to correctly infer the original ordinal markers after the fact. I myself could only do so by inspecting the original image, i.e. by having my brain redo the OCR.


> A subsequent cleanup pass that fixes grammar/spelling errors, as you propose, wouldn’t be appropriate when the goal is to faithfully reproduce the original text

I suppose that depends on why it's wrong. Did the model accurately read a real typo in the image or did it incorrectly decipher a character? If a spelling & grammar pass fixes the latter, isn't it valid?


Not unrelated: OneNote's 'copy text from image' has started producing lots of incorrect OCR results, but they're all non-words.

For example, from a clear image of a printed page (in a standard font), it will give me 'cornprising' instead of 'comprising' and 'niatter' instead of 'matter'. Without the spell-check underline they'd be hard to spot, since with relatively tight kerning all the errors look like the originals.

I'm surprised because 1) I've not had these sorts of errors before, and 2) they're not words, and real words must be heavily weighted in the OCR engine (I'd have thought).
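
At least these rn/ni-style misreads are easy to flag after the fact, since the substitution turns a real word into a non-word. A rough sketch; the confusion list and wordlist are just examples:

    # Rough sketch: flag OCR tokens that become dictionary words once a
    # common glyph confusion is undone. Confusion pairs and wordlist are
    # illustrative, not exhaustive.
    CONFUSIONS = [("rn", "m"), ("ni", "m"), ("cl", "d"), ("vv", "w")]

    def likely_misreads(token, dictionary):
        fixes = []
        for wrong, right in CONFUSIONS:
            fixed = token.replace(wrong, right)
            if fixed != token and fixed.lower() in dictionary:
                fixes.append(fixed)
        return fixes

    words = {"comprising", "matter"}
    print(likely_misreads("cornprising", words))  # ['comprising']
    print(likely_misreads("niatter", words))      # ['matter']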



