We have been using different tools for text, images, and tables. I think it's worth pointing out that PDFs are extremely messy under the hood, so expecting perfect output is a fool's errand; transformers are extremely powerful and can often do surprisingly well even when you've accidentally mashed a set of footnotes into the middle of a paragraph or something.
For text, unstructured seems to work quite well and does a good job of quickly processing easy documents while falling back to OCR when required. It is also quite flexible with regard to chunking and categorization, which is important when you start thinking about your embedding step. OTOH it can definitely be computationally expensive to process long documents which require OCR.
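For reference, the basic flow looks something like this (a sketch; the strategy names and the chunk_by_title helper are from unstructured's docs, so check them against your installed version):

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# strategy="auto" tries fast native text extraction first and only
# falls back to OCR / layout detection when a page has no text layer
elements = partition_pdf(filename="report.pdf", strategy="auto")

# every element carries a category ("NarrativeText", "Title",
# "Table", ...) that you can filter on before embedding
text_elements = [el for el in elements if el.category == "NarrativeText"]

# group elements into chunks that respect section boundaries,
# which tends to produce more coherent units for embedding
chunks = chunk_by_title(elements, max_characters=1000)
```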
For images, we've used PyMuPDF. The main weakness we've found is that it doesn't seem to have a good story for dealing with vector images: it seems to output its own proprietary vector type. If anyone knows how to get it to output SVG, that'd obviously be amazing.
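The raster side is straightforward at least (a sketch; the colorspace conversion is the usual PyMuPDF recipe for CMYK images):

```python
import fitz  # PyMuPDF

doc = fitz.open("report.pdf")
for page_index, page in enumerate(doc):
    for img_index, img in enumerate(page.get_images(full=True)):
        xref = img[0]  # cross-reference number of the image object
        pix = fitz.Pixmap(doc, xref)
        # convert CMYK and friends to RGB before writing a PNG
        if pix.n - pix.alpha > 3:
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(f"page{page_index}_img{img_index}.png")
```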
For tables, we've used Camelot. Tables are pretty hard though; most libraries are totally fine for simple tables, but there are a ton of wild tables in PDFs out there which are barely human-readable to begin with.
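Basic usage, for anyone curious (a sketch; "lattice" wants ruled tables, "stream" guesses from whitespace, and the wild tables are exactly where both fall over):

```python
import camelot

# "lattice" relies on visible ruling lines; use flavor="stream" for
# tables that are only aligned with whitespace
tables = camelot.read_pdf("report.pdf", pages="1-end", flavor="lattice")

for table in tables:
    print(table.parsing_report)  # accuracy / whitespace heuristics
    df = table.df                # the cells as a pandas DataFrame
```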
For tables and images specifically, I'd think about what exactly you want to do with the output. Are you trying to summarize these things (using something like GPT-4 Vision)? Are you trying to present them alongside your usual RAG output? This may inform your methodology.
> I think it's worth pointing out that PDFs are extremely messy under the hood, so expecting perfect output is a fool's errand
This.
A while ago someone asked me why their banking solution doesn't allow pasting payment amounts (among other things) from PDFs, and insisted that surely there must be a way to do it correctly.
Not with PDF. What a person reads as a single number may be stored as any grouping of text objects, which may or may not paste correctly.
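You can see this for yourself by dumping what an extractor actually gets back (a sketch using PyMuPDF since it's already in the thread; get_text("words") returns whitespace-delimited chunks with their coordinates):

```python
import fitz  # PyMuPDF

page = fitz.open("statement.pdf")[0]

# a figure that *looks* like one number on screen can come back as
# several independently positioned chunks, in any order
for x0, y0, x1, y1, word, *_ in page.get_text("words"):
    print(round(x0), round(y0), repr(word))
```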
Some banks simply don't want to deal with this sort of headache.
We just skip several of unstructured's categories, such as tables and images. We also do some deduplication post-ANN, as we want to optimize for novelty as well as relevance (rough sketch below). That being said, how are you planning to embed an image or a table to make it searchable? It sounds simple in theory, but how do you generate an actually good image summary (without spending huge amounts of money filling OpenAI's coffers for negligible benefit)? How do you embed a table?
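For concreteness, post-ANN dedup in that spirit can be done as MMR-style re-ranking (a sketch, assuming L2-normalized embeddings; the names mmr and lam are illustrative, not from any particular library):

```python
import numpy as np

def mmr(query_vec, cand_vecs, k=5, lam=0.7):
    """Re-rank ANN candidates, trading relevance to the query against
    similarity to results already picked (higher lam = more relevance,
    lower lam = more novelty)."""
    rel = cand_vecs @ query_vec  # cosine sims, vectors pre-normalized
    picked, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(picked) < k:
        if not picked:
            best = max(remaining, key=lambda i: rel[i])
        else:
            chosen = cand_vecs[picked]
            best = max(
                remaining,
                key=lambda i: lam * rel[i]
                - (1 - lam) * float(np.max(chosen @ cand_vecs[i])),
            )
        picked.append(best)
        remaining.remove(best)
    return picked  # indices into cand_vecs, most useful first
```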
Thanks for answering! In my case, I don't directly use RAG, but rather post-process documents via LLMs to extract a set of specific answers. That's also why I asked about deduplication: asking an LLM to provide an answer from two different data sources (invalid unstructured table text and valid structured table contents) quickly ramps up errors.