If this can get me tables out of PDFs generated by Crystal Reports it would be ...

mjt58 · on Nov 28, 2018

Have you tried e.g. https://tabula.technology, https://pdftables.com, https://pypi.org/project/Camelot/?

counciltime · on Dec 6, 2018

I have a friend who has developed a number of applications that use Tesseract-based OCR specifically for PDFs. The Report Miner application does a nice job of locating and extracting PDF tables.

https://www.opait.com/tesseractstudio/

https://www.opait.com/Pdfreportminer/

minhtripham · on Dec 17, 2018

Would love to learn more about the apps your friend developed. I'm currently researching different OCR use cases and tech. Can you shoot me an email at minh@docucharm.com?

BasHamer · on Nov 28, 2018

https://pdftables.com failed on the test file. It was pretty good, but its interpretation was inconsistent across rows: sometimes it split a cell, sometimes it did not. Tabula failed to detect multi-line rows; after I manually changed the table, it did better than pdftables.com at splitting cells. Both failed on the non-printable whitespace characters, which produced garbled output in Excel. The other one would take some time to rig up.
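
For the whitespace problem, a minimal post-processing sketch might help, whichever extractor is used. It assumes the garbling comes from Unicode control, format, or separator characters (non-breaking spaces, zero-width spaces, form feeds and the like); the actual characters in the test file are not known, and `clean_cell` is a hypothetical helper name, not part of any of the tools above.

```python
import unicodedata


def clean_cell(text: str) -> str:
    """Normalize one extracted table cell.

    Assumption: the garbling is caused by characters in the Unicode
    categories Cc (control), Cf (format), Zs/Zl/Zp (separators).
    """
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat in ("Cc", "Zs", "Zl", "Zp"):
            # Tabs, form feeds, non-breaking spaces, etc. become a plain space.
            out.append(" ")
        elif cat == "Cf":
            # Zero-width spaces, soft hyphens, BOMs are dropped entirely.
            continue
        else:
            out.append(ch)
    # Collapse runs of spaces and trim both ends.
    return " ".join("".join(out).split())


def clean_row(row):
    """Apply clean_cell to every cell of one extracted row."""
    return [clean_cell(cell) for cell in row]
```

Running each extracted row through `clean_row` before writing it to Excel would remove the invisible characters, though it does nothing for cells that were split or merged incorrectly.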

ocrcustomserver · on Nov 28, 2018

You can also try https://docparser.com/.

If nothing works for you and you're comfortable with sharing an example file, you can send it to me and I could take a look.

cdolan · on Nov 29, 2018

Rather than the Camelot link you provided, I think you meant Excalibur? https://github.com/camelot-dev/excalibur

mjt58 · on Nov 29, 2018

Oh yes, thanks :-)

RandomBookmarks · on Nov 28, 2018

How about https://ocr.space/tablerecognition?

It returns table data line by line.

BasHamer · on Nov 28, 2018

It handled the non-printable whitespace but butchered the multi-line table headers. Rebuilding the headers is rough: the output is line by line, so you need to know which words go together, and the structure has been lost.

cdolan · on Nov 29, 2018

Can you send me a copy of what you are trying to extract? We use proprietary stuff (we're in the business of extracting data and performing analysis on invoices for waste, recycling, cellular, etc., stuff that gets "lost" in the AP department).

Happy to see if our tools can help. I've tried everything on the market: DocParser, MediusFlow, KOFAX, Ephesoft, etc. None work well enough, in my opinion.

BasHamer · on Nov 29, 2018

I should be able to get you some files; I'm getting approval now. Can you let me know how to contact you?

cdolan · on Nov 30, 2018

I changed my about section to include a phonetic spelling of my email address, hosted on a very popular domain name. Feel free to toss me an email.