Around 200m download logs, 2m articles, and a few million IP addresses. I suspect that interest in research papers is inherently high-dimensional, and dimensionality reduction would probably damage the results.
I don't have much hardware to throw at it either. I just started looking at randomized algorithms - trying to produce a random walk on the download graph that links articles with probability proportional to some measure of similarity (probably cosine similarity or the Jaccard index).
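Roughly what I have in mind, as a minimal sketch (toy data and names are made up, and it uses Jaccard on reader sets; at real scale you'd sample candidates rather than score every co-downloaded article):

```python
import random
from collections import defaultdict

# Toy (ip, article) events standing in for the real ~200m download logs.
downloads = [
    ("1.2.3.4", "a1"), ("1.2.3.4", "a2"),
    ("5.6.7.8", "a2"), ("5.6.7.8", "a3"),
    ("9.9.9.9", "a1"), ("9.9.9.9", "a3"),
]

readers = defaultdict(set)          # article -> IPs that fetched it
articles_by_ip = defaultdict(set)   # ip -> articles it fetched
for ip, article in downloads:
    readers[article].add(ip)
    articles_by_ip[ip].add(article)

def jaccard(a, b):
    """Jaccard index of the two articles' reader sets."""
    inter = len(readers[a] & readers[b])
    return inter / (len(readers[a]) + len(readers[b]) - inter)

def step(current):
    """One step of the walk: jump to a co-downloaded article with
    probability proportional to Jaccard similarity. Only articles
    sharing at least one reader are candidates, which keeps the
    neighbourhood small even on a big graph."""
    candidates = set()
    for ip in readers[current]:
        candidates |= articles_by_ip[ip]
    candidates.discard(current)
    if not candidates:
        return current  # isolated article: stay put
    candidates = list(candidates)
    weights = [jaccard(current, a) for a in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

node = "a1"
for _ in range(5):
    node = step(node)
    print(node)
```

The nice part is that nothing here needs the full similarity matrix in memory - each step only touches the current article's readers and their other downloads.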