If this can get me tables out of PDFs generated by Crystal Reports, it would be a godsend for testing. This has been a nightmare to solve. The best option so far has been Adobe cloud, but they don't offer an API for that. I'm excited to try it out.
I have a friend who has also developed a number of applications that use Tesseract for OCR, specifically on PDFs. The Report Miner application does a nice job of locating and extracting PDF tables.
Would love to learn more about the apps your friend developed. I'm currently researching different OCR use cases and tech. Can you shoot me an email at minh@docucharm.com?
https://pdftables.com failed the test file. It was pretty good, but its interpretation was inconsistent across rows: sometimes it split a cell, sometimes it did not.
Tabula failed to detect multi-line rows. After I manually changed the table, it did better than pdftables.com at splitting cells.
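For anyone stuck on the multi-line row problem, here is a rough post-processing sketch in Python. It assumes the extractor already gives you rows as lists of strings of equal width, and that a row with an empty first column is a continuation of the row above it. That heuristic and the function name merge_continuation_rows are my own assumptions, not something Tabula or pdftables.com provides.

```python
def merge_continuation_rows(rows):
    """Fold rows with an empty first cell into the previous row.

    Assumption: every real record has a value in column 0, and all
    rows have the same number of columns.
    """
    merged = []
    for row in rows:
        is_continuation = bool(merged) and not row[0].strip()
        if is_continuation:
            prev = merged[-1]
            # Append each non-empty continuation cell to the cell above it.
            merged[-1] = [
                (a + " " + b).strip() if b.strip() else a
                for a, b in zip(prev, row)
            ]
        else:
            merged.append(list(row))
    return merged


rows = [
    ["1001", "Waste pickup", "40.00"],
    ["", "weekly service", ""],
    ["1002", "Recycling", "15.00"],
]
print(merge_continuation_rows(rows))
```

It will not help if the key column itself wraps onto a second line, so treat it as a starting point only.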
Both failed on the non-printable whitespace characters, which produced garbled output in Excel.
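If the garbling comes from non-breaking spaces, zero-width characters, or stray control characters, cleaning each cell before writing to Excel may be enough. A minimal sketch using only the standard library follows. I am guessing at which characters are in your file, and clean_cell is a made-up helper name.

```python
import unicodedata


def clean_cell(text: str) -> str:
    """Normalize odd whitespace and drop invisible characters in one cell."""
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat in ("Zs", "Zl", "Zp") or ch in "\t\n\r":
            # Any space or line separator becomes a plain space.
            out.append(" ")
        elif cat in ("Cc", "Cf"):
            # Drop control and format characters (e.g. zero-width space).
            continue
        else:
            out.append(ch)
    # Collapse runs of spaces and trim the ends.
    return " ".join("".join(out).split())


print(repr(clean_cell("Total\u00a0Due\u200b ")))
```

Note that dropping the "Cf" category also removes soft hyphens and joiners, which may not be what you want for every script.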
The other one would take some time to rig up.
handled the non-printable whitespace but butchered the multi-line table headers, so rebuilding the headers is rough: it is line by line, you need to know which words go together, and you have lost the structure.
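On rebuilding multi-line headers: if the tool at least keeps the header lines split into the same columns as the data, you can join them column by column. A small sketch under that assumption is below. merge_header_rows is a hypothetical helper, and it will not work if the column structure is really gone.

```python
def merge_header_rows(header_rows):
    """Join stacked header lines into one label per column.

    Assumption: each header line is a list of cells aligned with the
    data columns, with empty strings where a line has no text.
    """
    columns = zip(*header_rows)
    return [
        " ".join(part.strip() for part in col if part.strip())
        for col in columns
    ]


header = [
    ["Invoice", "Service", ""],
    ["Number", "Period", "Amount"],
]
print(merge_header_rows(header))
```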
Can you send me a copy of what you are trying to extract? We use proprietary stuff (we're in the business of extracting data and performing analysis on invoices for waste, recycling, cellular, etc., the stuff that gets "lost" in the AP department).
Happy to see if our tools can help. I've tried everything on the market (DocParser, MediusFlow, KOFAX, Ephesoft, etc.) and none work well enough in my opinion.