We have been using different tools for text, images, and tables. I think it's worth pointing out that PDFs are extremely messy under the hood, so expecting perfect output is a fool's errand; transformers are extremely powerful and can often do surprisingly well even when you've accidentally mashed a set of footnotes into the middle of a paragraph or something.
For text, unstructured seems to work quite well and does a good job of quickly processing easy documents while falling back to OCR when required. It is also quite flexible with regard to chunking and categorization, which is important when you start thinking about your embedding step. OTOH it can definitely be computationally expensive to process long documents which require OCR.
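For reference, the basic flow looks something like this (a sketch; the strategy names and the chunk_by_title helper are from unstructured's docs, so check them against your installed version):

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# strategy="auto" tries fast native text extraction first and only
# falls back to OCR / layout detection when a page has no text layer
elements = partition_pdf(filename="report.pdf", strategy="auto")

# every element carries a category ("NarrativeText", "Title",
# "Table", ...) that you can filter on before embedding
text_elements = [el for el in elements if el.category == "NarrativeText"]

# group elements into chunks that respect section boundaries,
# which tends to produce more coherent units for embedding
chunks = chunk_by_title(elements, max_characters=1000)
```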
For images, we've used PyMuPDF. The main weakness we've found is that it doesn't seem to have a good story for dealing with vector images: it seems to output its own proprietary vector type. If anyone knows how to get it to output SVG, that'd obviously be amazing.
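The raster side is straightforward at least (a sketch; the colorspace conversion is the usual PyMuPDF recipe for CMYK images):

```python
import fitz  # PyMuPDF

doc = fitz.open("report.pdf")
for page_index, page in enumerate(doc):
    for img_index, img in enumerate(page.get_images(full=True)):
        xref = img[0]  # cross-reference number of the image object
        pix = fitz.Pixmap(doc, xref)
        # convert CMYK and friends to RGB before writing a PNG
        if pix.n - pix.alpha > 3:
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(f"page{page_index}_img{img_index}.png")
```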
For tables, we've used Camelot. Tables are pretty hard though; most libraries are totally fine for simple tables, but there are a ton of wild tables in PDFs out there which are barely human-readable to begin with.
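Basic usage, for anyone curious (a sketch; "lattice" wants ruled tables, "stream" guesses from whitespace, and the wild tables are exactly where both fall over):

```python
import camelot

# "lattice" relies on visible ruling lines; use flavor="stream" for
# tables that are only aligned with whitespace
tables = camelot.read_pdf("report.pdf", pages="1-end", flavor="lattice")

for table in tables:
    print(table.parsing_report)  # accuracy / whitespace heuristics
    df = table.df                # the cells as a pandas DataFrame
```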
For tables and images specifically, I'd think about what exactly you want to do with the output. Are you trying to summarize these things (using something like GPT-4 Vision)? Are you trying to present them alongside your usual RAG output? This may inform your methodology.
> I think it's worth pointing out that PDFs are extremely messy under the hood, so expecting perfect output is a fool's errand
This.
A while ago someone asked me why their banking solution doesn't allow pasting payment amounts (among other things) from PDFs, and insisted that surely there must be a way to do it correctly.
Not with PDF. What a person reads as a single number may be stored as any grouping of text objects, which may or may not paste correctly.
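You can see this for yourself by dumping what an extractor actually gets back (a sketch using PyMuPDF since it's already in the thread; get_text("words") returns whitespace-delimited chunks with their coordinates):

```python
import fitz  # PyMuPDF

page = fitz.open("statement.pdf")[0]

# a figure that *looks* like one number on screen can come back as
# several independently positioned chunks, in any order
for x0, y0, x1, y1, word, *_ in page.get_text("words"):
    print(round(x0), round(y0), repr(word))
```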
Some banks simply don't want to deal with this sort of headache.
We just skip several of unstructured's categories, such as tables and images. We also do some deduplication post-ANN, as we want to optimize for novelty as well as relevance (rough sketch below). That being said, how are you planning to embed an image or a table to make it searchable? It sounds simple in theory, but how do you generate an actually good image summary (without spending huge amounts of money filling OpenAI's coffers for negligible benefit)? How do you embed a table?
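For concreteness, post-ANN dedup in that spirit can be done as MMR-style re-ranking (a sketch, assuming L2-normalized embeddings; the names mmr and lam are illustrative, not from any particular library):

```python
import numpy as np

def mmr(query_vec, cand_vecs, k=5, lam=0.7):
    """Re-rank ANN candidates, trading relevance to the query against
    similarity to results already picked (higher lam = more relevance,
    lower lam = more novelty)."""
    rel = cand_vecs @ query_vec  # cosine sims, vectors pre-normalized
    picked, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(picked) < k:
        if not picked:
            best = max(remaining, key=lambda i: rel[i])
        else:
            chosen = cand_vecs[picked]
            best = max(
                remaining,
                key=lambda i: lam * rel[i]
                - (1 - lam) * float(np.max(chosen @ cand_vecs[i])),
            )
        picked.append(best)
        remaining.remove(best)
    return picked  # indices into cand_vecs, most useful first
```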
Thanks for answering! In my case, I don't directly use RAG, but rather post-process documents via LLMs to extract a set of specific answers. That's also why I asked about deduplication: asking an LLM to provide an answer from two different data sources (invalid unstructured table text and valid structured table contents) quickly ramps up errors.