
Apache Tika Server is very easy to set up, and it can be configured to use Tesseract for OCR.
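For anyone who wants to try it, here is a rough sketch of sending a PDF to a running Tika Server and getting plain text back. It assumes the server is on its default port 9998 (for example from the 'full' Docker image, which bundles Tesseract) and that the `X-Tika-OCRLanguage` header is honored by your Tika version. The function names `build_tika_request` and `extract_text` are made up for illustration.

```python
import urllib.request

# Assumption: a Tika Server is already listening here, e.g. started with
#   docker run -p 9998:9998 apache/tika:latest-full
TIKA_URL = "http://localhost:9998"


def build_tika_request(pdf_bytes, base_url=TIKA_URL, ocr_lang="eng"):
    """Build a PUT request to the /tika endpoint asking for plain text."""
    headers = {
        "Accept": "text/plain",             # plain text instead of XHTML
        "Content-Type": "application/pdf",
        "X-Tika-OCRLanguage": ocr_lang,     # Tesseract language code (assumed supported)
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/tika",
        data=pdf_bytes,
        headers=headers,
        method="PUT",
    )


def extract_text(path, base_url=TIKA_URL, ocr_lang="eng"):
    """Send the file at `path` to Tika Server and return the extracted text."""
    with open(path, "rb") as f:
        req = build_tika_request(f.read(), base_url, ocr_lang)
    # OCR can be slow on image-heavy PDFs, so allow a generous timeout.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `extract_text("some.pdf")` should return whatever text Tika pulls out, which may include OCR output if the server is configured for it.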


Came here to mention Tika. I just set up a small POC with the 'full' Tika Docker container, which bundles OCR by default (with... 5 languages? English, Spanish, etc.).

I parsed a PDF and, looking at the output, noticed 'united stotes of america' in the text. It didn't make any sense... Digging further, I saw that Tika had also parsed the images in the PDF, and one of them was some govt logo with bad artifacting. It did indeed read more like 'stotes' than 'states'.

Edit: That said, the OP asked about tables. I haven't tested any table stuff with Tika (not something I need right now). Is the Tika table support any good? Does it even exist? Seems like it might not really matter for many Tika use cases (but I might be missing something obvious!)



