To get good RAG performance, you need a good chunking strategy. Simply extracting all the text is not enough; knowing the boundaries of tables, lists, paragraphs, sections, etc. is helpful.
Feel free to try https://github.com/nlmatics/llmsherpa. It is fully open source - both client and server - and it is not ML augmented, so it is very fast and cheap to run.
Thanks for the post. Please use this server with the llmsherpa LayoutPDFReader to get optimal chunks for your LLM/RAG project: https://github.com/nlmatics/llmsherpa. See the examples and notebook in the repo.
You can see examples in the llmsherpa project - https://github.com/nlmatics/llmsherpa. This project, nlm-ingestor, provides the backend that llmsherpa talks to. The llmsherpa library is very convenient for extracting clean chunks for your LLM/RAG project.
No, we are not doing the same thing. Most cloud parsers use a vision model; they are a lot slower and more expensive, and you need to write code on top of them to extract good chunks.
To run the Docker image on Apple silicon, you can use the following command to pull it - it will be slower, but it works:
docker pull --platform linux/x86_64 ghcr.io/nlmatics/nlm-ingestor:latest
Thanks, I always forget I can do that! I've given it a go and it's really impressive - the default chunker is very smart and manages to keep most of the chunk's context together.
The table parser in particular is really good. Is the trick that you draw guide lines and rectangles around tables? I'm trying to understand the GraphicsStreamProcessor class, since I'm not familiar with Tika - how does it know where to draw in the first place?
Great work by the LlamaIndex team. Also, feel free to try https://github.com/nlmatics/llmsherpa, which takes into account some of the things I mentioned.