I worked on something similar for clustering BBC news articles. The (Ruby) code I used is here: https://github.com/bbcrd/similarity

I didn't account for named entities or n-grams in the feature vector, though. That's a very interesting idea.
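
For illustration, here is a minimal sketch of what folding word n-grams into a bag-of-words feature vector could look like. It is not taken from the linked repo: the `ngrams` and `feature_vector` helpers, the tokenizing regex, and the choice of unigrams plus bigrams are all assumptions made for the example.

```ruby
# Hypothetical helper (not from the linked repo): lowercase the text,
# split it into simple word tokens, and return every run of n
# consecutive tokens joined by a space.
def ngrams(text, n)
  tokens = text.downcase.scan(/[a-z0-9']+/)
  tokens.each_cons(n).map { |gram| gram.join(" ") }
end

# Hypothetical helper: a term-frequency hash over unigrams and bigrams.
# Each key is one dimension of the feature vector; the value is its count.
def feature_vector(text)
  terms = ngrams(text, 1) + ngrams(text, 2)
  terms.each_with_object(Hash.new(0)) { |term, counts| counts[term] += 1 }
end

p feature_vector("Prime Minister meets Prime Minister")
```

With bigrams included, "prime minister" becomes its own dimension rather than just two unrelated words, which is roughly the effect named entities would also give.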

@mattdeboard - what algorithm did you use to count the occurrence and size of clusters?