I have actually made (what I think is) a working summarizer at $dayjob, and it took a lot more hand-holding to get results than I initially expected. A straight summary wasn't very good, and the "summary of summaries" approach as implemented by LangChain was garbage; it didn't produce a summary at all, even a wrong one. The algorithm that actually worked (rough sketches of a few of the steps follow the list):
1. Take the documents, chunk them up on paragraph then sentence then word boundaries using spaCy.
2. Generate embeddings for each chunk and cluster them using the silhouette score to estimate the number of clusters.
3. Take the 3 chunks closest to the centroid of each cluster and expand the context before and after each, so every cluster contributes 9 chunks in 3 groups.
4. For each cluster ask the LLM to extract the key points as direct quotes from the document.
5. Take those quotes and match them up to the real document to make sure it didn't just make stuff up.
6. Then put all the quotes together and ask the LLM not to summarize, but to write out the information they contain in paragraph form.
7. Then, because LLMs just can't seem to shut up about their answers, make them return JSON {"summary": "", "commentary": ""} and discard the commentary.
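For anyone curious, step 1 looks roughly like this. It's a minimal sketch: the spaCy model name and the 200-word chunk budget are assumptions, not exactly what I ran.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed model; any pipeline with sentence segmentation works
MAX_WORDS = 200                      # assumed chunk budget

def chunk_document(text):
    chunks = []
    for para in text.split("\n\n"):          # paragraph boundaries
        doc = nlp(para)
        current = []
        for sent in doc.sents:               # sentence boundaries
            words = [t.text for t in sent]
            if current and len(current) + len(words) > MAX_WORDS:
                chunks.append(" ".join(current))
                current = []
            if len(words) > MAX_WORDS:       # fall back to word boundaries for monster sentences
                for i in range(0, len(words), MAX_WORDS):
                    chunks.append(" ".join(words[i:i + MAX_WORDS]))
            else:
                current.extend(words)
        if current:
            chunks.append(" ".join(current))
    return chunks
```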
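Steps 2 and 3 are basically this. sentence-transformers and scikit-learn stand in for whatever embedding/clustering stack you prefer; the embedding model, the k range, and the seed are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_representatives(chunks, max_k=10, per_cluster=3):
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(chunks)  # assumed model

    # choose the number of clusters by silhouette score
    best = None
    for k in range(2, min(max_k, len(chunks) - 1) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb)
        score = silhouette_score(emb, km.labels_)
        if best is None or score > best[0]:
            best = (score, km)
    km = best[1]

    # for each cluster, take the chunk indices nearest the centroid;
    # the neighbouring chunks (index +/- 1) get pulled in afterwards for context
    groups = []
    for c in range(km.n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(emb[members] - km.cluster_centers_[c], axis=1)
        nearest = members[np.argsort(dists)[:per_cluster]]
        groups.append(sorted(nearest.tolist()))
    return groups
```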
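The check in step 5 can be as simple as a substring test with a fuzzy fallback; the 0.9 threshold here is an assumption.

```python
from difflib import SequenceMatcher

def quote_appears(quote, source, threshold=0.9):
    """True if the quote shows up in the source, modulo whitespace/case drift."""
    q = " ".join(quote.lower().split())
    s = " ".join(source.lower().split())
    if not q:
        return False
    if q in s:
        return True
    # otherwise require a long contiguous overlap relative to the quote length
    match = SequenceMatcher(None, q, s, autojunk=False).find_longest_match(0, len(q), 0, len(s))
    return match.size >= threshold * len(q)
```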
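And steps 6-7 combined end up looking something like this; the OpenAI client and the model name are stand-ins for whatever you're actually calling.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; substitute your own client

def compose(quotes):
    prompt = (
        "Do not summarize. Using only the quotes below, write the information "
        "they contain in paragraph form. Respond with JSON of the form "
        '{"summary": "...", "commentary": "..."}.\n\n' + "\n".join(quotes)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                       # assumed model name
        response_format={"type": "json_object"},   # force valid JSON back
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)["summary"]  # commentary gets discarded
```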
The LLM performs much better (according to human reviewers) at keyphrase extraction than TextRank, so I think there's genuinely some value there, and obviously nothing else can really compose English like these models, but I think we perhaps expect too much out of the "raw" model.
It's cool to hear about something substantial that isn't just an API call to the "God" machine. This sounds like a pretty sophisticated and well-considered approach, but it does call into question whether a system that has to be embedded as a subsystem in a larger one to be reliable is worth the amount of money currently being invested in it.