Around 200m download logs, 2m articles, and a few million IP addresses. I suspect that interest in research papers is inherently high-dimensional, and dimensionality reduction would probably damage the results.
I don't have much hardware to throw at it either. I just started looking at randomized algorithms - trying to produce a random walk on the download graph that links articles with probability proportional to some measure of similarity (probably cosine similarity or the Jaccard index).
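Roughly what I have in mind, as a minimal sketch (toy data and names are made up, and it uses Jaccard on reader sets; at real scale you'd sample candidates rather than score every co-downloaded article):

```python
import random
from collections import defaultdict

# Toy (ip, article) events standing in for the real ~200m download logs.
downloads = [
    ("1.2.3.4", "a1"), ("1.2.3.4", "a2"),
    ("5.6.7.8", "a2"), ("5.6.7.8", "a3"),
    ("9.9.9.9", "a1"), ("9.9.9.9", "a3"),
]

readers = defaultdict(set)          # article -> IPs that fetched it
articles_by_ip = defaultdict(set)   # ip -> articles it fetched
for ip, article in downloads:
    readers[article].add(ip)
    articles_by_ip[ip].add(article)

def jaccard(a, b):
    """Jaccard index of the two articles' reader sets."""
    inter = len(readers[a] & readers[b])
    return inter / (len(readers[a]) + len(readers[b]) - inter)

def step(current):
    """One step of the walk: jump to a co-downloaded article with
    probability proportional to Jaccard similarity. Only articles
    sharing at least one reader are candidates, which keeps the
    neighbourhood small even on a big graph."""
    candidates = set()
    for ip in readers[current]:
        candidates |= articles_by_ip[ip]
    candidates.discard(current)
    if not candidates:
        return current  # isolated article: stay put
    candidates = list(candidates)
    weights = [jaccard(current, a) for a in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

node = "a1"
for _ in range(5):
    node = step(node)
    print(node)
```

The nice part is that nothing here needs the full similarity matrix in memory - each step only touches the current article's readers and their other downloads.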