It's shocking how much our industry fails to see past its own nose.

Not a single example on that page is a Purchase Order, Invoice etc. Not a single example shown is relevant to industry at scale.



Mistral is based in Europe, where invoices are sent digitally in something like 95% of cases anyway. Some are even digital invoices, which will become mandatory in the EU at some point. There are proposals to do the same for orders, too. And invoice data extraction is basically a different beast.


One use case is digitising receipts from business travel: expenses that employees paid for out of their own pocket, and for which they submit pictures to the business for reimbursement.

Bus trips, meals including dinners and snacks, etc., for which the employee has paper receipts.


Yeah, digitizing receipts is still a huge challenge for most companies, especially for expense reimbursements. Even though invoices are increasingly digital, employees still end up with physical receipts for work-related expenses. From what I've seen, there are some interesting contenders like Klippa that seem to solve exactly this problem [1].

Curious to know if anyone has heard of or used their OCR or a similar tool. Apparently it's not an LLM in disguise but an actual AI trained on gazillions of documents, so the risk of hallucination might be lower than with LLM OCR solutions like Mistral.

[1] https://www.klippa.com/en/ocr/ocr-api/


Receipts are different, and they are harder to OCR. Thermal prints are most often awful in quality, and you usually need to correct some things by hand when dealing with them. I doubt that this tech changes that significantly.


So an invoice attached to an email as a PDF is sent digitally ... those unfamiliar with PDF will think text and data extraction is trivial then, but this isn't true. You can have a fully digital, non-image PDF that is vector based and has what looks like text, but doesn't have a single piece of extractable text in it. It's all about how the PDF was generated. Tables can be formatted in a million ways, etc.

Your best bet is to always convert it to an image and OCR it to extract structured data.
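A minimal sketch of how a pipeline might detect the case described above (a page whose text layer is empty or unusable) before handing it to OCR. It assumes some PDF library has already returned the raw text of the page as a string. The function name and both thresholds are hypothetical and would need tuning on real documents.

```python
def needs_ocr(page_text: str, min_chars: int = 20, min_alnum_ratio: float = 0.5) -> bool:
    """Return True when the text layer of a PDF page looks too thin or too garbled to trust.

    page_text is whatever a PDF library extracted for one page (assumption:
    extraction already happened elsewhere). Both thresholds are illustrative.
    """
    # Drop all whitespace so layout padding does not count as content.
    stripped = "".join(page_text.split())

    # Vector-only or scanned pages typically yield little or no text.
    if len(stripped) < min_chars:
        return True

    # Broken font mappings tend to produce mostly symbols instead of letters/digits.
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) < min_alnum_ratio
```

With a check like this, pages that pass keep their exact text layer, and only the rest get rendered to an image and OCR'd.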


This is simply not true. Maybe it's easier if you do not need 100% precision. But it is actually possible to extract the text and layout of digital PDFs; otherwise it would be impossible to display them. Of course some people still add image fragments to a PDF, but that practice is basically dying. In the last year I did not see a single PDF where it was impossible to extract the layout.


Even in Europe this is still a thing. I know of systems that still cannot read line items spanning more than one line (costing a sh*tload of money).
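To illustrate the multi-line item problem, here is a rough sketch of one way to fold continuation lines back into the item they belong to. It assumes the extraction step yields one (description, amount) pair per printed line, and that the amount sits on the first line of each item. Both are assumptions that real invoice layouts often break, and the names are hypothetical.

```python
from decimal import Decimal
from typing import Optional

# One printed table line: description text plus an amount, if one was found on that line.
Row = tuple[str, Optional[Decimal]]


def merge_multiline_items(rows: list[Row]) -> list[Row]:
    """Attach lines without an amount to the preceding line item."""
    merged: list[Row] = []
    for description, amount in rows:
        if amount is None and merged:
            # Continuation line: append its text to the previous item's description.
            prev_description, prev_amount = merged[-1]
            merged[-1] = (f"{prev_description} {description.strip()}", prev_amount)
        else:
            # New item (or an orphan first line, kept as-is for manual review).
            merged.append((description.strip(), amount))
    return merged
```

For example, rows like ("Widget A", 10.00) followed by ("extended warranty", None) end up as the single item "Widget A extended warranty" with amount 10.00.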


This isn't even close to true.

Source: We have large EU customers.


So your EU customers send you invoices by letter? Wow. There are some companies that still deal with printed invoices, but they are most often smaller companies that deal with health-related things.


Our EU customers use our technology to deal with all the invoices etc. they get sent as PDFs.


Can confirm: in Italy, electronic invoicing has been mandatory since 2019.


Another good example would be contracts of any kind. Imagine photographing a contract (like a car loan) and on the spot getting an AI to read it, understand it, forecast scenarios, highlight red flags, and do some comparison shopping for you.


... imagining ...

... hallucinating during read ...

... hallucinating during understand ...

... hallucinating during forecast ...

... highlighting a hallucination as red flag ...

... missing an actual red flag ...

... consuming water to cool myself...

Phew, being an AI is hard!


Your points are well taken, but I think that contracts are a small enough domain, and well enough represented in the corpus, for this to actually be pretty solid. This is especially true with good prompting and some sort of feedback loop.


Fwiw, they have an example of a parking receipt in a cookbook: https://colab.research.google.com/github/mistralai/cookbook/...


I wanted to apply OCR to my company's invoicing, since they basically did purchasing for a bunch of other large companies, but the variability in the conversion was not tolerable. Even rounding something differently could catch an accountant's eye, let alone detecting an "8" as a "0" or worse.
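A minimal sketch of the kind of cross-check that catches this class of error: re-add the extracted line amounts with exact decimal arithmetic and compare against the extracted total. The function name and the rounding mode are assumptions for illustration. Real invoices also carry tax, discounts, and currency-specific rounding rules that this ignores.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def reconcile(line_amounts: list[str], stated_total: str) -> bool:
    """Return True when the OCR'd line amounts add up exactly to the OCR'd total.

    Amounts are passed as strings so no binary floating-point rounding sneaks in.
    """
    computed = sum((Decimal(a) for a in line_amounts), Decimal("0"))
    # Compare at cent precision; ROUND_HALF_UP is an assumed convention.
    return (
        computed.quantize(CENT, rounding=ROUND_HALF_UP)
        == Decimal(stated_total).quantize(CENT, rounding=ROUND_HALF_UP)
    )
```

If "18.50" is misread as "10.50", the sum no longer matches the stated total and the document can be routed to a human. It does not catch errors that happen to cancel out, or a total that was misread in the same way.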


Agreed. In general I've had such bad performance with complex table-based invoice parsing that every few months I try the latest models to see if it's better. It does say "96.12" on the top-tier benchmark under the Table category.


Businesses at scale use EDI to handle purchase orders and invoices, no OCR needed.


That's simply not a factual statement.

Scaled businesses do use EDI, but they still receive hundreds of thousands of PDF documents a month.

Source: built a SaaS product that handles PDFs for a specific industry.


Agreed, though in this case, they are going for general-purpose OCR. That's fine in some cases, but purpose-built models trained on receipts, invoices, tax documents, etc., definitely perform better. We've got a similar API solution coming out soon (https://digital.abbyy.com/code-extract-automate-your-new-mus...) that should work better for businesses automating their docs at scale.


We find CV models to be better (higher midpoint on an ROC curve) for the types of docs you mention.


To be fair: Reading the blog post, the main objective seems to have been to enable information extraction with high confidence for the academic sector (e.g. unlocking all these paper pdfs), and not necessarily to be another receipt scanner.


It's hilarious that the academic sector 1. publishes as PDF, 2. spends all this energy on how to extract that info back from PDF, and 3. publishes that research as PDF as well.

Receipt scanning is a business that is multiple orders of magnitude more valuable. Mistral at this point is looking for a commercial niche (like how Claude is aiming at software development).



