The underlying dilemma is that so many of these stories are not really "related"....

basicallydan · on April 18, 2012

It helps that the same story is often written off of PA or AP articles and thus includes pretty much the same info, but that doesn't change the fact that stories on the same subject almost always include the same set of keywords unique to that subject, whether or not they were rewritten from a press release. That's the beauty of TF-IDF weighting: it clusters stuff based on words that are frequent in one article but rare across the rest of the collection.
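To make the TF-IDF point concrete, here is a minimal sketch in plain Python. The three headlines, the whitespace tokenizer, and the helper names `tfidf_vectors` and `cosine` are all made up for illustration, and it uses the textbook tf * log(N / df) weighting rather than whatever the real system does.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weighted by tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical headlines: two about the same event, one unrelated.
docs = [
    "the senate passed the budget bill after a long debate",
    "after the debate the senate approved the budget bill",
    "the striker scored twice as the home team won the cup final",
]
vecs = tfidf_vectors(docs)
related = cosine(vecs[0], vecs[1])
unrelated = cosine(vecs[0], vecs[2])
print(related, unrelated)
```

"the" appears in every headline, so its weight is log(3/3) = 0 and it contributes nothing. The similarity between the first two comes entirely from subject words like "senate" and "budget", which is the effect described above.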

jrfinkel · on April 18, 2012

Sometimes they're just rewrites of a press release - and those ones are easy to get right - but a lot of the time they really are totally different articles about the same event. Go back and look at the three sets of example clusters and you'll see what I mean.

mattdeboard · on April 17, 2012

Doesn't Google do something similar to this for their news aggregator?

rabidsnail · on April 18, 2012

They also have access to the link graph, so they probably use that for clustering instead of looking at the text. Pages with lots of inbound links in common are likely to be similar.
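A rough sketch of what "inbound links in common" could look like in code, using Jaccard overlap of the sets of linking pages. This is only a guess at the general idea, not a description of what Google actually does; the link data and the helper names `inbound_sets` and `jaccard` are hypothetical.

```python
def inbound_sets(links):
    """Map each target page to the set of pages that link to it."""
    inbound = {}
    for source, target in links:
        inbound.setdefault(target, set()).add(source)
    return inbound


def jaccard(a, b):
    """Shared members divided by total distinct members of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical (source, target) link pairs.
links = [
    ("blog1", "story_a"), ("blog2", "story_a"), ("blog3", "story_a"),
    ("blog1", "story_b"), ("blog2", "story_b"), ("blog4", "story_b"),
    ("blog5", "story_c"),
]
inbound = inbound_sets(links)
sim_ab = jaccard(inbound["story_a"], inbound["story_b"])
sim_ac = jaccard(inbound["story_a"], inbound["story_c"])
print(sim_ab, sim_ac)
```

Here story_a and story_b share two of four distinct linkers, while story_c shares none with either, so a clusterer working on these scores would group the first two without ever reading the text.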