It's shocking how much our industry fails to see past its own nose. Not a single...

merb · 2025-03-06T20:38:37 1741293517

Mistral is Europe-based, where invoices are sent digitally in something like 95% of cases anyway. Some are even proper digital invoices, which will at some point be mandatory in the EU. For orders there are proposals for that, too. And basically invoice data extraction is a different beast.

codetrotter · 2025-03-06T21:05:21 1741295121

One use-case is digitising receipts from business-related travel: expenses that employees paid for out of their own pocket and for which they submit pictures to the business for reimbursement.

Bus travels, meals including dinners and snacks, etc. for which the employee has receipts on paper.

keepsweet · 2025-03-10T15:13:40 1741619620

Yeah, digitizing receipts is still a huge challenge for most companies, especially for expense reimbursements. Even though invoices are increasingly digital, employees still end up with physical receipts for work-related expenses. From what I've seen, there are some interesting contenders like Klippa that seem to solve exactly this problem [1].

Curious to know if anyone has heard of or used their OCR or a similar tool. Apparently it's not an LLM in disguise but an actual AI trained on gazillions of documents, so the risk of hallucination might be lower than with LLM OCR solutions like Mistral.

[1] https://www.klippa.com/en/ocr/ocr-api/

merb · 2025-03-07T05:17:57 1741324677

Receipts are different. And they are harder to OCR. Thermal prints are most often awful in quality. Most often you need to correct some stuff when dealing with them. I doubt that this tech changes that significantly.

revnode · 2025-03-06T21:04:13 1741295053

So an invoice attached to an email as a PDF is sent digitally ... those unfamiliar with PDF will think text and data extraction is trivial then, but this isn't true. You can have a fully digital, non-image PDF that is vector based and has what looks like text, but doesn't have a single piece of extractable text in it. It's all about how the PDF was generated. Tables can be formatted in a million ways, etc.

Your best bet is to always convert it to an image and OCR it to extract structured data.

merb · 2025-03-07T05:13:11 1741324391

This is simply not true. Maybe it's easier, and maybe you do not need 100% precision. But it is actually possible to extract the text and layout of digital PDFs; otherwise it would be impossible to display them. Of course some people still add image fragments to a PDF, but that practice is basically dying. I did not see a single PDF in the last year where it was impossible to extract the layout.
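One way to reconcile the two positions is to try the text layer first and fall back to OCR only for pages where it looks unusable. The sketch below is a rough heuristic, not anyone's production method: the function names and both thresholds are made up for illustration, and it assumes the per-page text has already been extracted by whatever PDF library you use.

```python
def text_layer_usable(page_text: str,
                      min_chars: int = 20,
                      min_printable_ratio: float = 0.9) -> bool:
    """Heuristic check on the text extracted from one PDF page.

    Both thresholds are arbitrary assumptions; tune them on real documents.
    """
    stripped = page_text.strip()
    # Vector-only or scanned pages typically yield little or no text.
    if len(stripped) < min_chars:
        return False
    # Broken font encodings often show up as unprintable characters.
    readable = sum(1 for ch in stripped if ch.isprintable() or ch.isspace())
    return readable / len(stripped) >= min_printable_ratio


def pages_needing_ocr(page_texts: list[str]) -> list[int]:
    """Return zero-based indices of pages that should go through OCR."""
    return [i for i, text in enumerate(page_texts)
            if not text_layer_usable(text)]


pages = [
    "Invoice 2025-001\nTotal due: 1,234.56 EUR",  # normal text layer
    "",                                            # nothing extractable
    "\x00\x01\x02" * 20,                           # garbage encoding
]
ocr_pages = pages_needing_ocr(pages)
```

This says nothing about table layout, which is the harder part of the problem; it only decides which pages have a text layer worth trusting at all.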

wolfi1 · 2025-03-06T20:42:24 1741293744

Even in Europe this is still a thing. I know of systems which are still unable to read items spanning more than one line (costing a sh*tload of money).

kiratp · 2025-03-06T22:08:40 1741298920

This isn't even close to true.

Source: We have large EU customers.

merb · 2025-03-07T05:15:34 1741324534

So your EU customers will send you the invoice via letters? Wow. There are some companies that still deal with printed invoices, but they are most often smaller companies that deal with health-related things.

kiratp · 2025-03-10T03:55:48 1741578948

Our EU customers use our technology to deal with all the invoices etc. they get sent as PDFs.

napolux · 2025-03-06T21:00:00 1741294800

Can confirm, in Italy electronic invoicing has been mandatory since 2019.

simpaticoder · 2025-03-06T20:35:43 1741293343

Another good example would be contracts of any kind. Imagine photographing a contract (like a car loan) and on the spot getting an AI to read it, understand it, forecast scenarios, highlight red flags, and do some comparison shopping for you.

JBiserkov · 2025-03-06T22:53:42 1741301622

... imagining ...

... hallucinating during read ...

... hallucinating during understand ...

... hallucinating during forecast ...

... highlighting a hallucination as red flag ...

... missing an actual red flag ...

... consuming water to cool myself...

Phew, being an AI is hard!

simpaticoder · 2025-03-06T23:48:31 1741304911

Your points are well-taken, but I think that contracts are small enough, and well enough represented in the corpus, for the results to actually be pretty solid. This is especially true with good prompting and some sort of feedback loop.

kashnote · 2025-03-06T20:15:09 1741292109

Fwiw, they have an example of a parking receipt in a cookbook: https://colab.research.google.com/github/mistralai/cookbook/...

sha16 · 2025-03-06T21:09:43 1741295383

I wanted to apply OCR to my company's invoicing since they basically did purchasing for a bunch of other large companies, but the variability in the conversion was not tolerable. Even rounding something differently could catch an accountant's eye, let alone detecting an "8" as a "0" or worse.
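Some digit misreads can at least be caught mechanically by cross-checking the arithmetic on the invoice. The sketch below is only an illustration: the function names are hypothetical, it assumes each line item carries a quantity, unit price, and line total, it ignores tax and discounts, and the half-up rounding mode is a guess that a real invoice may not follow.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def line_total(quantity: str, unit_price: str) -> Decimal:
    """Quantity times unit price, rounded to cents (rounding mode assumed)."""
    return (Decimal(quantity) * Decimal(unit_price)).quantize(
        CENT, rounding=ROUND_HALF_UP)


def check_invoice(lines: list[tuple[str, str, str]],
                  stated_total: str) -> list[str]:
    """Cross-check OCR'd values; each line is (qty, unit_price, line_total).

    Returns human-readable descriptions of every mismatch found.
    """
    problems = []
    running = Decimal("0")
    for idx, (qty, price, stated) in enumerate(lines):
        expected = line_total(qty, price)
        if expected != Decimal(stated):
            problems.append(
                f"line {idx}: expected {expected}, OCR read {stated}")
        running += Decimal(stated)
    if running != Decimal(stated_total):
        problems.append(
            f"total: lines sum to {running}, OCR read {stated_total}")
    return problems


items = [("2", "18.50", "37.00"), ("1", "8.80", "8.80")]
clean = check_invoice(items, "45.80")    # consistent invoice
misread = check_invoice(items, "45.00")  # an "8" read as a "0" in the total
```

A check like this flags inconsistencies for human review; it cannot tell which of the disagreeing numbers is the wrong one, and it misses errors in fields that no other field depends on.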

guiomie · 2025-03-06T20:22:44 1741292564

Agreed. In general I've had such bad performance for complex table-based invoice parsing that every few months I try the latest models to see if it's better. It does say "96.12" on top-tier benchmark under the Table category.

arpinum · 2025-03-06T20:44:39 1741293879

Businesses at scale use EDI to handle purchase orders and invoices, no OCR needed.

cdolan · 2025-03-06T21:06:18 1741295178

That's simply not a factual statement.

Scaled businesses do USE EDI, but they still receive hundreds of thousands of PDF documents a month.

Source: built a SaaS product that handles PDFs for a specific industry.

dotnetkow · 2025-03-06T22:18:03 1741299483

Agreed, though in this case, they are going for general-purpose OCR. That's fine in some cases, but purpose-built models trained on receipts, invoices, tax documents, etc., definitely perform better. We've got a similar API solution coming out soon (https://digital.abbyy.com/code-extract-automate-your-new-mus...) that should work better for businesses automating their docs at scale.

mtillman · 2025-03-06T20:28:51 1741292931

We find CV models to be better (higher midpoint on an ROC curve) for the types of docs you mention.

mentalgear · 2025-03-06T21:07:50 1741295270

To be fair: Reading the blog post, the main objective seems to have been to enable information extraction with high confidence for the academic sector (e.g. unlocking all these paper pdfs), and not necessarily to be another receipt scanner.

kiratp · 2025-03-06T22:11:19 1741299079

It's hilarious that the academic sector 1. publishes as PDF, 2. spends all this energy on how to extract that info back from PDF, and 3. publishes that research as PDF as well.

Receipt scanning is a business that is multiple orders of magnitude more valuable. Mistral at this point is looking for a commercial niche (like how Claude is aiming at software development).