Hmm, I have to say I'm pretty unimpressed with my initial experience here.
1. Signing up with email just redirected endlessly: click the link in the email, get asked to sign up with email, put in email, click the link in the email, etc.
2. Fine, I'll sign in with Google.
3. A PDF parser? Seriously, that's what all this fuss is about? There are so many options already out there: PDFBox, iText, Unstructured, PyPDF, PDF.js, PDFMiner, not to mention extraction services available from the hyperscalers. Super confused why anyone needs this.
LlamaIndex is way more than a PDF parser. It's the most widely used RAG toolchain, and their cloud looks to be a managed version of that.
Specific to the parser, they do show where tools like the ones you mentioned fail, and their LLM-based parser captures the data those tools miss.
Yeah, but their platform is basically a janky PDF parser which is why I don't understand what the hype is about.
It's easy to cherry-pick a PDF for marketing purposes and claim you're better. I didn't miss it, I just don't take marketing announcements at face value. I tried their parser on a PDF with a bit of complex formatting (multiple columns, tables and a couple of images) and it choked, spitting out one big markdown header with jumbled text. Not impressed.
To get good RAG performance you need a good chunking strategy. Simply getting all the text is not good enough; knowing the boundaries of tables, lists, paragraphs, sections, etc. is helpful.
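To make the chunking point concrete, here is a rough sketch of boundary-aware chunking. It assumes the parser hands back markdown with blank lines between blocks. `split_blocks` and `chunk_blocks` are made-up names for illustration, not part of LlamaIndex or any other library. The block detection is deliberately naive.

```python
import re


def split_blocks(markdown: str) -> list[tuple[str, str]]:
    """Split markdown into (kind, text) blocks separated by blank lines."""
    blocks = []
    for raw in re.split(r"\n\s*\n", markdown.strip()):
        text = raw.strip()
        if not text:
            continue  # skip empty input or stray whitespace
        first = text.splitlines()[0]
        if first.startswith("#"):
            kind = "heading"
        elif first.startswith("|"):
            kind = "table"
        elif re.match(r"([-*+]|\d+\.)\s", first):
            kind = "list"
        else:
            kind = "paragraph"
        blocks.append((kind, text))
    return blocks


def chunk_blocks(blocks: list[tuple[str, str]], max_chars: int = 500) -> list[str]:
    """Pack whole blocks into chunks, starting a new chunk at every heading.

    A block is never split, so a table or list always stays in one chunk
    (a single oversized block simply becomes its own oversized chunk).
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for kind, text in blocks:
        if current and (kind == "heading" or size + len(text) > max_chars):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    doc = "# A\n\npara one\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\n# B\n\n- x\n- y"
    for chunk in chunk_blocks(split_blocks(doc)):
        print(repr(chunk))
```

The idea is just that a table or list never gets cut in half and every chunk starts at a section boundary when one exists, which only works if the parser preserved those boundaries in the first place.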