But what's the need exactly for OCR when you have multimodal LLMs that can read ...

simonw · 2025-03-06T20:10:43 1741291843

The biggest risk of vision LLMs for OCR is that they might accidentally follow instructions is the text that they are meant to be processing.

(I asked Mistral if their OCR system was vulnerable to this and they said "should be robust, but curious to see if you find any fun examples" - https://twitter.com/simonw/status/1897713755741368434 and https://twitter.com/sophiamyang/status/1897719199595720722 )

pilooch · 2025-03-06T22:01:11 1741298471

Fun, but LLMs would follow them post OCR anyways ;)

I see OCR much like phonemes in speech, once you have end to end systems, they become latent constructs from the past.

And that is actually good, more code going into models instead.

troyvit · 2025-03-06T21:06:48 1741295208

Getting PDFs into #$@ Confluence apparently. Just had to do this and Mistral saved me a ton of hassle compared to this: https://community.atlassian.com/forums/Confluence-questions/...

daemonologist · 2025-03-06T19:57:09 1741291029

It's useful to have the plain text down the line for operations not involving a language model (e.g. search). Also if you have a bunch of prompts you want to run it's potentially cheaper, although perhaps less accurate, to run the OCR once and save yourself some tokens or even use a smaller model for subsequent prompts.

ks2048 · 2025-03-06T20:10:18 1741291818

Tons of uses: Storage (text instead of images), search (user typing in a text box and you want instant retrieval from a dataset), etc. And costs: run on images once - then the rest of your queries will only need to run on text.