Cool! Which OCR engine/model do you use?

pierre · on Feb 20, 2024

EasyOCR; I may switch to PaddleOCR in the future.

vikp · on Feb 20, 2024

You may want to try https://github.com/VikParuchuri/surya (I'm the author). I've only benchmarked against Tesseract, but surya outperforms it by a lot (benchmarks in repo). Happy to discuss.

You could also try https://github.com/VikParuchuri/marker for general PDF parsing (I'm also the author) - it seems like you're more focused on tables.

raffraffraff · on Feb 20, 2024

How does surya compare to AWS Textract? A previous employer went through a bunch of different OCRs and ended up using Textract because they found it to be the most accurate overall.

vikp · on Feb 21, 2024

Unfortunately, I haven't had time to benchmark against anything other than Tesseract.

kergonath · on Feb 21, 2024

That’s my experience as well. I am still looking for alternatives, but Textract is now the baseline.

pryelluw · on Feb 21, 2024

Thanks for sharing.

joaquincabezas · on Feb 21, 2024

PaddleOCR works pretty well. How are you planning to integrate it into your workflow? I found huge differences in throughput between plain Python serving and dedicated serving frameworks (e.g. NVIDIA Triton Inference Server).

helloericsf · on Feb 21, 2024

Grateful for your insight! Could you explain the reason for the switch? Is there any benchmark data available for sharing?

pierre · on Feb 21, 2024

Performance depends on the language and type of documents. The main reason I'm contemplating a switch is that EasyOCR no longer seems to be maintained (no commits to the repo in the last 5 months).