Apache Tika Server is very easy to set up - it can be configured to use tesserac...

mgkimsal · on July 30, 2024

Came here to mention Tika. I just set up a small POC with the 'full' Tika Docker container, which bundles OCR by default (with... 5 languages? English, Spanish, etc.).

I parsed a PDF and, looking at the output, noticed 'united stotes of america' in the text. It didn't make any sense... Digging further, I saw that Tika had also run OCR on the images embedded in the PDF, and one of them was some govt logo with bad artifacting. The logo did indeed read more like 'stotes' than 'states'.
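For anyone who wants to reproduce this kind of check, here is a rough sketch of sending a PDF to a local Tika Server and getting plain text back. It assumes the server is listening on localhost:9998 (the usual default for the Docker image) and uses the `PUT /tika` endpoint with `Accept: text/plain`. The helper names `tika_headers` and `extract_text` are made up for illustration, and the `extra` parameter is just a pass-through for whatever per-request headers your Tika version supports.

```python
import http.client


def tika_headers(extra=None):
    """Build headers for a plain-text extraction request.

    `extra` is an optional dict of additional per-request headers;
    which ones are honored depends on the Tika Server version.
    """
    headers = {"Accept": "text/plain", "Content-Type": "application/pdf"}
    headers.update(extra or {})
    return headers


def extract_text(pdf_bytes, host="localhost", port=9998, extra=None):
    """PUT the PDF bytes to /tika and return the extracted text."""
    # host and port are assumptions; adjust to wherever the container runs.
    conn = http.client.HTTPConnection(host, port, timeout=120)
    try:
        conn.request("PUT", "/tika", body=pdf_bytes, headers=tika_headers(extra))
        resp = conn.getresponse()
        body = resp.read()
        if resp.status != 200:
            raise RuntimeError(f"Tika returned HTTP {resp.status}")
        return body.decode("utf-8")
    finally:
        conn.close()
```

Calling `extract_text(open("doc.pdf", "rb").read())` returns the text, OCR output from embedded images included if the server is configured that way. Tika 2.x documents per-request OCR headers (for example `X-Tika-OCRskipOcr: true`), which could be passed via `extra` to compare output with and without OCR, but check the header names against the docs for your version before relying on them.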

Edit: That said, the OP asked about tables. I haven't tested any table extraction with Tika (not something I need right now). Is Tika's table support any good? Does it even exist? Seems like it might not really matter for many Tika use cases (but I might be missing something obvious!)