
You may want to try https://github.com/VikParuchuri/surya (I'm the author). I've only benchmarked it against Tesseract, but it outperforms Tesseract by a lot (benchmarks are in the repo). Happy to discuss.

You could also try https://github.com/VikParuchuri/marker for general PDF parsing (I'm also the author), though it seems like you're more focused on tables.



How does surya compare to AWS Textract? A previous employer went through a bunch of different OCR tools and ended up using Textract because they found it to be the most accurate overall.


Unfortunately, I haven't had time to benchmark against anything other than Tesseract.


That’s my experience as well. I am still looking for alternatives, but Textract is now the baseline.


Thanks for sharing.



