Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

But what's the need exactly for OCR when you have multimodal LLMs that can read the same info and directly answer any questions about it ?

For a VLLM, my understanding is that OCR corresponds to a sub-field of questions, of the type 'read exactly what's written in this document'.




The biggest risk of vision LLMs for OCR is that they might accidentally follow instructions is the text that they are meant to be processing.

(I asked Mistral if their OCR system was vulnerable to this and they said "should be robust, but curious to see if you find any fun examples" - https://twitter.com/simonw/status/1897713755741368434 and https://twitter.com/sophiamyang/status/1897719199595720722 )


Fun, but LLMs would follow them post OCR anyways ;)

I see OCR much like phonemes in speech, once you have end to end systems, they become latent constructs from the past.

And that is actually good, more code going into models instead.


Getting PDFs into #$@ Confluence apparently. Just had to do this and Mistral saved me a ton of hassle compared to this: https://community.atlassian.com/forums/Confluence-questions/...


It's useful to have the plain text down the line for operations not involving a language model (e.g. search). Also if you have a bunch of prompts you want to run it's potentially cheaper, although perhaps less accurate, to run the OCR once and save yourself some tokens or even use a smaller model for subsequent prompts.


Tons of uses: Storage (text instead of images), search (user typing in a text box and you want instant retrieval from a dataset), etc. And costs: run on images once - then the rest of your queries will only need to run on text.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: