Because for example Abby replaces the image with an PDF with invisible text over...

milesokeefe · on Nov 9, 2017

Abbyy has had that ability for a longer period of time so it may be more accurate, but this is something tesseract supports[1]. In fact I'd say most OCR systems support it with varying degrees of accuracy.

[1]:https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-do-i...

RandomBookmarks · on Nov 10, 2017

The ability to create searchable PDFs is very useful and convenient. But creating searchable PDFs does not require a deep understanding of the document format (like column detection etc). You just place the words at the right coordinates of their bounding boxes. You can test this for example here: https://ocr.space - select the option to create a searchable PDF. It works even for the most complex documents.

Now, creating a Word document from a scan is a different beast because it requires layout analysis. This is where Abbyy with its long experience still has a good lead.

pault · on Nov 9, 2017

My issue was that I needed one that exposes an API for doing OCR on pictures taken from mobile devices. It was really difficult to find non-desktop packages.

staticautomatic · on Nov 9, 2017

Out of curiosity, what did you need beyond something like OpenCV/Metal + Tesseract?

pault · on Nov 9, 2017

The client needed an OCR solution for supplier invoices with a variety of layouts and a combination of printed and hand-written characters, and didn't have the budget for a bespoke solution. To be fair, it's a very hard problem, I was just surprised that given all the much hyped recent advancements in deep learning for computer vision, most of the solutions in the market seem to be running on decades old technology.

ocrcustomserver · on Nov 10, 2017

Well it is more complex than it appears.

Extracting data from documents requires a solution which uses OCR but is a different product (e.g. ABBYY FlexiCapture).

This is most commonly referred to as zonal OCR and comes with the added functionality of handling multiple templates, defining zones/fields, specifying special rules for fields, verification process for manual inspection (e.g. triggered when the image receives a low confidence recognition score) etc. This is different and more complex than a product that does full page OCR (e.g. ABBYY Finereader).

Handwritten OCR is a whole different story. The products that do zonal OCR will fail to recognize handwritten text, unless it's in boxes (PDF forms). I'm working on a prototype that can handle handwritten text outside of boxes too.