Auntie PDF – an open source app built using Mistral OCR

sbarre · 2025-03-08T04:25:11 1741407911

I find it challenging to accept something that talks about "OCR" but then I upload a PDF with text in images, and when I query the document after upload, I get a message that says "I can't interpret images"..

Then are you actually doing OCR, or are you just extracting embedded text?

0x62 · 2025-03-08T04:39:05 1741408745

I’d imagine their capabilities mirror that of Mistral OCR [1]. Mistral outputs markdown, the image would have to be convertible to a reasonably useful markdown structure (charts, tables etc).

[1] https://mistral.ai/en/news/mistral-ocr

themanmaran · 2025-03-08T04:48:42 1741409322

This highlights the biggest issue I've found with Mistral OCR. Many of the documents I upload are entirely classified as images, which means no OCR is being run.

Pretty much anything with a different colored background gets returned as (image)[image_001].

Example: https://omni-demo-data.s3.us-east-1.amazonaws.com/test/17398...

mtillman · 2025-03-08T05:19:25 1741411165

LLMs tend to be a hammer in search of a nail when it comes to documents that have imagery. We decided on CV models which results in a high 90s midpoint for the docs our customers care about. If you can afford to go with a cv pipeline, it can outperform all of the LLMs by some margin.

bilater · 2025-03-08T16:04:05 1741449845

Yes - unfortunately it seems they don't read images in the pdf.

tmpz22 · 2025-03-08T04:43:44 1741409024

Same vein - YouTube most (all?) llm integrations just scrape the transcript. I -think- google's aistudio does more but I'm unsure.

I mean I get it bulk video processing would be crazy expensive, but at least mention you're only analyzing the transcript especially if you're a paid product.

wildzzz · 2025-03-08T05:27:23 1741411643

Whisper does do text to speech but yes, nearly all just read off the subtitles. There's a video by f4mi on YouTube where she tricked the summarizing bots with off-screen captions filled with nonsense.

setnone · 2025-03-08T04:35:39 1741408539

Sweet branding! Grandma told me she's not happy with lack of privacy policy.

bilater · 2025-03-08T16:04:26 1741449866

Its open source.

simonw · 2025-03-08T05:42:55 1741412575

I built a CLI tool for experimenting with Mistral OCR here: https://simonwillison.net/2025/Mar/7/mistral-ocr/

Honestly, the vibes aren't great. Gemini is a lot more flexible for handling PDFs - you can prompt it to do a bunch of other things - and Mistral OCR appears to hallucinate if it can't correctly read handwriting, a common problem with vision LLM based OCR tools.

The way Mistral OCR handles images within the text is disappointing - it doesn't attempt to interpret them, just extracts them out as binary blobs. A vision LLM can usually do a great job of describing an image, but with Mistral OCR you have to manually run that as a separate step.

brianjking · 2025-03-08T15:07:04 1741446424

Knowing that you have to do that as a separate step adds a whole additional level of complexity too.

For example, if some content has the images and some don't, you need to add whole additional steps to your processing and potentially add hallucinations in.

What are you using for document extraction lately, Simon?

simonw · 2025-03-08T21:12:02 1741468322

I'm really impressed with Gemini - Gemini 2.0 Pro Exp seems remarkably good at even really complex scrappy documents.

bilater · 2025-03-08T16:02:24 1741449744

Agreed - I am surprised they did are not using Pixtral to read images as well.

bilater · 2025-03-08T03:15:27 1741403727

OK I've been critical of Mistral AI but credit where credit is due. Mistral OCR seems cool.

So cool in fact, I got distracted and ended up building an open source PDF parser and chat app!

Presenting Auntie PDF - your all-knowing guide that unpacks every PDF into clear, actionable insights.

You can upload a pdf or point to a public link, parse it, and then ask questions. All open source and free.

onebitwise · 2025-03-08T03:59:08 1741406348

Lovely idea.

Not working for me on a file like this: https://files.catbox.moe/gii0pu.pdf It says that is larger than 10MB (it's 7MB), or failed on url.

bilater · 2025-03-09T01:28:42 1741483722

thanks. yeah there is an annoying limit on vercel's end for body size of 4.5 mb. there are solutions to get around it by uploading directly to a storage solution but i wanted to keep the app simple for folks who want to dive into the code.

pogue · 2025-03-08T03:33:21 1741404801

That's pretty impressive for a project just released yesterday. I wish I had the energy and wherewithal to put stuff together that fast.

bilater · 2025-03-08T03:41:14 1741405274

Thanks! Cursor definitely helps me speed things up haha.

jbaudanza · 2025-03-08T05:21:38 1741411298

I have a question about Mistral OCR. If I give the model a PDF that is 90% text, is it actually performing OCR on an image representation of the text? Or is it smart enough to extract the text directly and only use OCR on images?

foundzen · 2025-03-08T06:39:46 1741415986

Love the creativity in the branding but it did not work in my case either. Gibberish raw content and error in answering any question.

t-3 · 2025-03-08T06:06:56 1741414016

What are people using these OCR programs for? Are there really that many PDFs being made without embedded text these days?

qingcharles · 2025-03-08T08:08:50 1741421330

I have a million+ scanned PDF documents going back 300 years that are mostly just images of text.

elanning · 2025-03-08T04:00:13 1741406413

It looks great, nice work. I’m impressed at the quick development too.

bilater · 2025-03-08T16:04:57 1741449897

thanks!

JoelJacobson · 2025-03-08T06:06:51 1741414011

Thanks for creating, really useful!

Would be nice with a [Download Combined Rendered] button to download a self-contained .html web page of the rendered combined page.

bilater · 2025-03-08T16:48:24 1741452504

Added!

shnpln · 2025-03-08T04:44:30 1741409070

I would like it if my chat session did not clear if go to Document Content and back to chat. Or I wish I could see my document when chatting.

daft_pink · 2025-03-09T07:02:50 1741503770

Is there a way to use mistral ocr on us servers so your data never leaves our borders?

mjyoon · 2025-03-08T04:35:22 1741408522

Unfortunate that Mistral OCR can't tell me details presented in charts and graphs.

yannis · 2025-03-08T04:57:56 1741409876

Pretty impressive and did a good job for an academic pdf I uploaded. Nice UI also.

bilater · 2025-03-08T22:27:39 1741472859

thanks!

ab_testing · 2025-03-08T04:56:26 1741409786

This is amazing. Could you share the prompts that were used for this product ?

bilater · 2025-03-08T16:05:13 1741449913

its not something you build with just a prompt. at least not yet.

triyambakam · 2025-03-08T06:23:53 1741415033

The coolest thing about this is the short and easy to pronounce .com

bilater · 2025-03-08T22:28:03 1741472883

haha yes I was very happy to get the domain

n8m8 · 2025-03-08T04:11:55 1741407115

im on mobile and don’t have a pdf to test it with, but I love your styling and text copy.

bilater · 2025-03-08T16:05:28 1741449928

thanks!

throwaway81348 · 2025-03-08T04:11:19 1741407079

what about privacy?

eastoeast · 2025-03-08T05:10:41 1741410641

Awesome UI!

bilater · 2025-03-08T16:05:39 1741449939

thanks!