Hmm, I have to say I'm pretty unimpressed with my initial experience here. 1. Th...

verdverm · on Feb 21, 2024

LlamaIndex is way more than a PDF parser. It's the most widely used RAG toolchain, and their cloud looks to be a managed version of that.

Specific to the parser, they do show where tools like the ones you mentioned fail, and how their LLM-based parser captures the data those tools miss.

kurts_mustache · on Feb 21, 2024

Yeah, but their platform is basically a janky PDF parser, which is why I don't understand what the hype is about.

It's easy to cherry-pick a PDF for marketing purposes and claim you're better. I didn't miss it, I just don't take marketing announcements at face value. I tried their parser on a PDF with a bit of complex formatting (multiple columns, tables, and a couple of images) and it choked, spitting out one big markdown header with jumbled text. Not impressed.

cheesyFishes · on Feb 21, 2024

There is a PDF parser, LlamaParse, which is open to everyone, and a managed ingestion/retrieval service that is currently invite-only.

Planning broader releases in the future for sure.

asukla · on Feb 22, 2024

To get good RAG performance you need a good chunking strategy. Simply getting all the text is not good enough; knowing the boundaries of tables, lists, paragraphs, sections, etc. is helpful.
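To make the boundary point concrete, here is a rough sketch of structure-aware chunking over markdown-like parser output. It is only an illustration, not how LlamaParse or llmsherpa work internally; the function names (`classify`, `split_blocks`, `chunk_blocks`) and the `max_chars` limit are made up for this example, and the line-prefix heuristics are deliberately naive.

```python
import re


def classify(line: str) -> str:
    """Label a line by the kind of block it belongs to (naive heuristics)."""
    s = line.strip()
    if not s:
        return "blank"
    if s.startswith("#"):
        return "heading"
    if s.startswith("|"):
        return "table"
    if re.match(r"([-*+]|\d+\.)\s", s):
        return "list"
    return "paragraph"


def split_blocks(text: str) -> list[tuple[str, str]]:
    """Group consecutive lines of the same kind into (kind, text) blocks."""
    blocks: list[tuple[str, str]] = []
    prev = "blank"
    for line in text.splitlines():
        kind = classify(line)
        if kind == "blank":
            prev = "blank"  # a blank line always ends the current block
            continue
        if kind == prev and kind != "heading":
            k, body = blocks[-1]
            blocks[-1] = (k, body + "\n" + line)
        else:
            blocks.append((kind, line))
        prev = kind
    return blocks


def chunk_blocks(blocks: list[tuple[str, str]], max_chars: int = 500) -> list[str]:
    """Pack whole blocks into chunks; never cut inside a table, list, or paragraph.

    A new chunk starts at every heading or when the size limit would be
    exceeded. A single block larger than max_chars is kept whole.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for kind, body in blocks:
        if current and (kind == "heading" or size + len(body) > max_chars):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(body)
        size += len(body)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "# Intro\nSome text here.\n\n| a | b |\n| 1 | 2 |\n\n# Next\n- one\n- two\n"
chunks = chunk_blocks(split_blocks(doc), max_chars=20)
```

Even with a tiny `max_chars`, the table rows stay together in one chunk instead of being split mid-table, which is the property a fixed-size character splitter does not give you.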

Great work by the LlamaIndex team. Also feel free to try https://github.com/nlmatics/llmsherpa, which takes into account some of the things I mentioned.