
To get good RAG performance you will need a good chunking strategy. Simply getting all the text is not good enough; knowing the boundaries of tables, lists, paragraphs, sections, etc. is helpful.

Great work by the LlamaIndex team. Also feel free to try https://github.com/nlmatics/llmsherpa, which takes into account some of the things I mentioned.
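For example, a minimal sketch of layout-aware chunking with llmsherpa (the endpoint is the hosted URL from the llmsherpa README; swap in your own server if you self-host, and the file name is just a placeholder):

    from llmsherpa.readers import LayoutPDFReader

    # hosted parser endpoint from the llmsherpa README
    api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
    reader = LayoutPDFReader(api_url)
    doc = reader.read_pdf("sample.pdf")  # local path or URL

    # chunks respect paragraph/list/table/section boundaries;
    # to_context_text() prepends the section hierarchy to each chunk
    for chunk in doc.chunks():
        print(chunk.to_context_text())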


Feel free to try - https://github.com/nlmatics/llmsherpa. It is fully open source - both client and server - and it is not ML-augmented, so it is very fast and cheap to run.


Thank you for sharing.


I wrote about split points and the need for including section hierarchy in this post: https://ambikasukla.substack.com/p/efficient-rag-with-docume...

All of this is automated in the llmsherpa parser - https://github.com/nlmatics/llmsherpa - which you can use as an API over this library.
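A rough sketch of what that looks like (the localhost endpoint assumes a self-hosted nlm-ingestor server; to_text's include_children/recurse arguments follow the repo examples, so treat them as an assumption):

    from llmsherpa.readers import LayoutPDFReader

    # self-hosted nlm-ingestor endpoint (adjust host/port to your setup)
    reader = LayoutPDFReader("http://localhost:5010/api/parseDocument?renderFormat=all")
    doc = reader.read_pdf("sample.pdf")  # placeholder file name

    # the parser builds a section tree, so split points can follow headings
    for section in doc.sections():
        print(section.title)

    # pull one section's full text, child blocks included
    first = doc.sections()[0]
    print(first.to_text(include_children=True, recurse=True))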


Thanks for the post. Please use this server with the llmsherpa LayoutPDFReader to get optimal chunks for your LLM/RAG project: https://github.com/nlmatics/llmsherpa. See the examples and notebooks in the repo.


You can see examples in the llmsherpa project - https://github.com/nlmatics/llmsherpa. This project, nlm-ingestor, provides the backend that llmsherpa talks to. The llmsherpa library is very convenient for extracting nice chunks for your LLM/RAG project.


No, we are not doing the same thing. Most cloud parsers use a vision model; they are a lot slower and more expensive, and you need to write code on top of them to extract good chunks.

You can use the llmsherpa library - https://github.com/nlmatics/llmsherpa - with this server to get nice layout-friendly chunks for your LLM/RAG project.
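For instance, tables come back as structured nodes rather than flattened text (a sketch; to_html() is the accessor used in the repo's examples, and the endpoint/file name are placeholders for your setup):

    from llmsherpa.readers import LayoutPDFReader

    reader = LayoutPDFReader("http://localhost:5010/api/parseDocument?renderFormat=all")
    doc = reader.read_pdf("report.pdf")  # placeholder file name

    # each table keeps its row/column structure; HTML is easy for an LLM to read
    for table in doc.tables():
        print(table.to_html())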


You can use the library in conjunction with llmsherpa LayoutPDFReader.

Some examples with a notebook are here: https://github.com/nlmatics/llmsherpa. Here's another notebook with examples from this repo: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks...


To run the docker image on Apple silicon, use the following command to pull - it will be slower (x86 emulation) but it works:

    docker pull --platform linux/x86_64 ghcr.io/nlmatics/nlm-ingestor:latest
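Then run it with the same platform flag (the 5010:5001 port mapping here follows the repo README - double-check it against your version) and point LayoutPDFReader at the local endpoint:

    docker run --platform linux/x86_64 -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest
    # client side: LayoutPDFReader("http://localhost:5010/api/parseDocument?renderFormat=all")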


Thanks, I always forget I can do that! I've given it a go and it's really impressive – the default chunker is very smart and manages to keep most of the chunk context together.

The table parser in particular is really good. Is the trick that you draw some guide lines and rectangles around tables? I'm trying to understand the GraphicsStreamProcessor class, as I'm not familiar with Tika - how does it know where to draw in the first place?

