I have actually made (what I think is) a working summarizer at $dayjob, and it took a lot more hand-holding to get results than I initially expected. A straight summary wasn't very good, and the "summary of summaries" approach as implemented by LangChain was garbage; it didn't produce a summary at all, even a wrong one. The algorithm that actually worked (rough sketches of a few of the steps follow the list):
1. Take the documents, chunk them up on paragraph then sentence then word boundaries using spaCy.
2. Generate embeddings for each chunk and cluster them using the silhouette score to estimate the number of clusters.
3. Take the 3 chunks closest to the centroid of each cluster and expand the context before and after each, so every cluster contributes 9 chunks in 3 groups.
4. For each cluster ask the LLM to extract the key points as direct quotes from the document.
5. Take those quotes and match them up to the real document to make sure it didn't just make stuff up.
6. Then put all the quotes together and ask the LLM not to summarize, but to write out the information they contain in paragraph form.
7. Then, because LLMs just can't seem to shut up about their answers, make them return JSON {"summary": "", "commentary": ""} and discard the commentary.
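For anyone curious, step 1 looks roughly like this. It's a minimal sketch: the spaCy model name and the 200-word chunk budget are assumptions, not exactly what I ran.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed model; any pipeline with sentence segmentation works
MAX_WORDS = 200                      # assumed chunk budget

def chunk_document(text):
    chunks = []
    for para in text.split("\n\n"):          # paragraph boundaries
        doc = nlp(para)
        current = []
        for sent in doc.sents:               # sentence boundaries
            words = [t.text for t in sent]
            if current and len(current) + len(words) > MAX_WORDS:
                chunks.append(" ".join(current))
                current = []
            if len(words) > MAX_WORDS:       # fall back to word boundaries for monster sentences
                for i in range(0, len(words), MAX_WORDS):
                    chunks.append(" ".join(words[i:i + MAX_WORDS]))
            else:
                current.extend(words)
        if current:
            chunks.append(" ".join(current))
    return chunks
```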
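Steps 2 and 3 are basically this. sentence-transformers and scikit-learn stand in for whatever embedding/clustering stack you prefer; the embedding model, the k range, and the seed are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_representatives(chunks, max_k=10, per_cluster=3):
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(chunks)  # assumed model

    # choose the number of clusters by silhouette score
    best = None
    for k in range(2, min(max_k, len(chunks) - 1) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb)
        score = silhouette_score(emb, km.labels_)
        if best is None or score > best[0]:
            best = (score, km)
    km = best[1]

    # for each cluster, take the chunk indices nearest the centroid;
    # the neighbouring chunks (index +/- 1) get pulled in afterwards for context
    groups = []
    for c in range(km.n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(emb[members] - km.cluster_centers_[c], axis=1)
        nearest = members[np.argsort(dists)[:per_cluster]]
        groups.append(sorted(nearest.tolist()))
    return groups
```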
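The check in step 5 can be as simple as a substring test with a fuzzy fallback; the 0.9 threshold here is an assumption.

```python
from difflib import SequenceMatcher

def quote_appears(quote, source, threshold=0.9):
    """True if the quote shows up in the source, modulo whitespace/case drift."""
    q = " ".join(quote.lower().split())
    s = " ".join(source.lower().split())
    if not q:
        return False
    if q in s:
        return True
    # otherwise require a long contiguous overlap relative to the quote length
    match = SequenceMatcher(None, q, s, autojunk=False).find_longest_match(0, len(q), 0, len(s))
    return match.size >= threshold * len(q)
```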
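And steps 6-7 combined end up looking something like this; the OpenAI client and the model name are stand-ins for whatever you're actually calling.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; substitute your own client

def compose(quotes):
    prompt = (
        "Do not summarize. Using only the quotes below, write the information "
        "they contain in paragraph form. Respond with JSON of the form "
        '{"summary": "...", "commentary": "..."}.\n\n' + "\n".join(quotes)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                       # assumed model name
        response_format={"type": "json_object"},   # force valid JSON back
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)["summary"]  # commentary gets discarded
```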
The LLM performs much better (according to human reviewers) at keyphrase extraction than TextRank, so I think there's genuinely some value there, and obviously nothing else can really compose English like these models, but I think we perhaps expect too much out of the "raw" model.
It's cool to hear about something substantial that isn't just an API call to the "God" machine. This sounds like a pretty sophisticated and well-considered approach, but it does call into question whether a system that has to be embedded as a subsystem in a larger one to be reliable is worth the amount of money currently being invested in it.