
If this can get me tables out of PDFs generated by Crystal Reports, it would be a godsend for testing. This has been a nightmare to try to solve. The best option so far has been Adobe cloud, but they don't offer an API for that. I'm excited to try it out.



I have a friend who has also developed a number of OCR applications for PDFs, built on Tesseract. The Report Miner application does a nice job of locating and extracting PDF tables.

https://www.opait.com/tesseractstudio/

https://www.opait.com/Pdfreportminer/


Would love to learn more about the apps your friend developed. I'm currently doing research into different OCR use cases and tech. Can you shoot me an email at minh@docucharm.com?


https://pdftables.com failed on the test file. It was pretty good, but its interpretation was inconsistent across rows: sometimes it split a cell, sometimes it did not. Tabula failed to detect multi-line rows; after I manually adjusted the table, it did better than pdftables.com at splitting cells. Both failed on the non-printable whitespace characters, which produced garbled output in the Excel file. The other one would take some time to rig up.
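
For the non-printable whitespace problem, one workaround is to clean each extracted cell before it reaches Excel. This is only a minimal sketch: it assumes the extracted cells are available as Python strings, and `clean_cell` is a hypothetical helper name, not part of Tabula or pdftables.com. It turns any Unicode whitespace into a plain space, drops control and format characters such as zero-width spaces, and collapses runs of spaces.

```python
import unicodedata


def clean_cell(text: str) -> str:
    """Normalize whitespace and drop invisible characters in one cell.

    Hypothetical helper: assumes the cell text was already extracted
    by some other tool and is passed in as a plain string.
    """
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch.isspace() or category == "Zs":
            # Any kind of whitespace (tab, no-break space, form feed)
            # becomes an ordinary space.
            kept.append(" ")
        elif category in ("Cc", "Cf"):
            # Control and format characters (NUL, zero-width space,
            # soft hyphen) are dropped entirely.
            continue
        else:
            kept.append(ch)
    # Collapse runs of spaces and trim both ends.
    return " ".join("".join(kept).split())
```

Running every cell through a function like this before writing the spreadsheet will not fix split or merged cells, but it should keep invisible characters from garbling the output.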


You can also try https://docparser.com/.

If nothing works for you and you're comfortable with sharing an example file, you can send it to me and I could take a look.


Rather than the Camelot link you provided, I think you meant Excalibur? https://github.com/camelot-dev/excalibur


Oh yes, thanks :-)


How about https://ocr.space/tablerecognition ?

It returns table data line by line.


It handled the non-printable whitespace but butchered the multi-line table headers. Rebuilding the headers is rough because the output is line by line: you need to know which words go together, and the structure has been lost.


Can you send me a copy of what you are trying to extract? We use proprietary stuff (we're in the business of extracting data and performing analysis on invoices for waste, recycling, cellular, etc.; stuff that gets "lost" in the AP department).

Happy to see if our tools can help. I've tried everything on the market (DocParser, MediusFlow, KOFAX, Ephesoft, etc.), and none work well enough in my opinion.


I should be able to get you some files; I'm getting approval now. Can you let me know how to contact you?


I changed my about section to have a phonetic spelling of my email address, hosted on a very popular domain name. Feel free to toss me an email.



