Mistral is based in Europe, where invoices are sent digitally in something like 95% of cases anyway. Some are even actual digital invoices, which will at some point become mandatory in the EU. There are proposals to do the same for orders, too. And invoice data extraction is basically a different beast.
One use case is digitising receipts from business travel: expenses that employees paid out of their own pocket and for which they submit pictures to the business for reimbursement.
Bus trips, meals including dinners and snacks, and so on, for which the employee has paper receipts.
Yeah, digitizing receipts is still a huge challenge for most companies, especially for expense reimbursements. Even though invoices are increasingly digital, employees still end up with physical receipts for work-related expenses. From what I've seen, there are some interesting contenders like Klippa that seem to solve exactly this problem [1].
Curious to know if anyone has heard of or used their OCR or a similar tool. Apparently it's not an LLM in disguise but an actual AI trained on gazillions of documents, so the risk of hallucination might be lower than with LLM-based OCR solutions like Mistral's.
Receipts are different, and they are harder to OCR. Thermal prints are usually awful in quality, and most of the time you need to correct some things when dealing with them. I doubt that this tech changes that significantly.
So an invoice attached to an email as a PDF is sent digitally... Those unfamiliar with PDF will then think text and data extraction is trivial, but it isn't. You can have a fully digital, non-image PDF that is vector based and has what looks like text, but doesn't have a single piece of extractable text in it. It all depends on how the PDF was generated. Tables can be formatted in a million ways, etc.
Your best bet is to always convert it to an image and OCR it to extract structured data.
This is simply not true. Maybe OCR is easier if you do not need 100% precision. But it is actually possible to extract the text and layout of digital PDFs. Otherwise it would be impossible to display them.
Of course some people still add image fragments to a PDF, but that practice is basically dying out. In the last year I did not see a single PDF where it was impossible to extract the layout.
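For what it's worth, the two positions in this subthread can be combined: try the text layer first and only fall back to OCR for pages where it is missing or garbled. A minimal Python sketch of that decision, assuming you already have the per-page text from whatever PDF extractor you use. The function name, the `PageDecision` type, and both thresholds are made up for illustration and do not come from any library:

```python
from dataclasses import dataclass


@dataclass
class PageDecision:
    page_number: int
    use_ocr: bool
    reason: str


def decide_extraction(page_texts, min_chars=20, min_printable_ratio=0.9):
    """Decide per page whether the text layer is usable or OCR is needed.

    page_texts: one string per page, as returned by some PDF text
    extractor (hypothetical input, not tied to a specific library).
    The thresholds are arbitrary starting points, not tuned values.
    """
    decisions = []
    for number, text in enumerate(page_texts, start=1):
        stripped = text.strip()
        # Vector-outline or image-only pages typically yield (almost) no text.
        if len(stripped) < min_chars:
            decisions.append(PageDecision(number, True, "no usable text layer"))
            continue
        # Broken font encodings tend to show up as unprintable characters.
        printable = sum(ch.isprintable() or ch.isspace() for ch in stripped)
        if printable / len(stripped) < min_printable_ratio:
            decisions.append(PageDecision(number, True, "text layer looks garbled"))
            continue
        decisions.append(PageDecision(number, False, "text layer looks fine"))
    return decisions


if __name__ == "__main__":
    pages = [
        "Invoice 2024-001\nTotal due: 119.00 EUR\nVAT: 19.00 EUR",
        "",
        "\x00\x01\x02" * 20,
    ]
    for decision in decide_extraction(pages):
        print(decision)
```

This only decides which path to take per page. The actual text extraction and the OCR step are left to whatever tools you already have.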
So your EU customer will send you the invoice by letter? Wow. There are some companies that still deal with printed invoices, but they are most often smaller companies that deal with health-related things.