To get good RAG performance, you need a good chunking strategy. Simply extracting all the text is not enough; knowing the boundaries of tables, lists, paragraphs, sections, etc. is helpful.
Feel free to try https://github.com/nlmatics/llmsherpa. It is fully open source - both client and server - and it is not ML augmented, so it is very fast and cheap to run.
Thanks for the post. Please use this server with the llmsherpa LayoutPDFReader to get optimal chunks for your LLM/RAG project: https://github.com/nlmatics/llmsherpa. See the examples and notebook in the repo.
You can see examples in the llmsherpa project - https://github.com/nlmatics/llmsherpa. This project, nlm-ingestor, provides the backend that llmsherpa talks to. The llmsherpa library is very convenient for extracting clean chunks for your LLM/RAG project.
No, we are not doing the same thing. Most cloud parsers use a vision model; they are a lot slower and more expensive, and you need to write code on top of them to extract good chunks.
To run the Docker image on Apple silicon, you can use the following command to pull it - it will be slower, but it works:
docker pull --platform linux/x86_64 ghcr.io/nlmatics/nlm-ingestor:latest
Thanks, I always forget I can do that! I've given it a go and it's really impressive - the default chunker is very smart and manages to keep most of the chunk's context together.
The table parser in particular is really good. Is the trick that you draw guide lines and rectangles around tables? I'm trying to understand the GraphicsStreamProcessor class, since I'm not familiar with Tika - how does it know where to draw in the first place?
Great work by the LlamaIndex team. Also, feel free to try https://github.com/nlmatics/llmsherpa, which takes into account some of the things I mentioned.