
I've been working on something similar - https://github.com/jamii/springer-recommendations

All the open-source engines I've tried scale poorly. Since our input data already contains >200m interactions, I suspect recommendify would struggle (from my quick reading, it looks like the co-occurrence matrix is stored in redis, i.e. in memory).

The approach I'm leaning towards at the moment is collating the article->IPs and IP->articles tables in leveldb and then distributing read-only copies to each worker. Everything else can easily be partitioned by article id.
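
A minimal sketch of that collation step, assuming Python with the plyvel LevelDB bindings and a log of tab-separated (ip, article_id) lines - the names and file format are invented, and a real 200m-row run would sort/merge externally rather than hold dicts in memory:

    # Collate download logs into article->IPs and IP->articles LevelDB
    # tables. Assumes plyvel; log format and paths are illustrative.
    import collections
    import plyvel

    article_to_ips = collections.defaultdict(set)
    ip_to_articles = collections.defaultdict(set)

    with open('downloads.log') as logs:
        for line in logs:
            ip, article = line.rstrip('\n').split('\t')
            article_to_ips[article].add(ip)
            ip_to_articles[ip].add(article)

    # Each table goes in its own db; values are newline-joined keys
    # from the other side of the bipartite graph.
    for path, table in [('article_to_ips.ldb', article_to_ips),
                        ('ip_to_articles.ldb', ip_to_articles)]:
        db = plyvel.DB(path, create_if_missing=True)
        with db.write_batch() as batch:
            for key, values in table.items():
                batch.put(key.encode(), '\n'.join(sorted(values)).encode())
        db.close()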




I can't tell from the README - is your data fairly wide? I'm playing with using Postgres's new K-Nearest-Neighbor support to calculate similarity on 20D cubes, but I suspect my approach won't work well for an arbitrary number of columns (i.e. users x products) unless you first do some sort of PCA or SVD to narrow it down, and it isn't optimized for binary ratings at all. I started writing it up here: http://parapoetica.wordpress.com/2012/02/15/feature-space-si...
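
For anyone curious what that looks like in practice, here's a rough sketch assuming psycopg2 and a Postgres where the cube extension exposes a <-> Euclidean-distance operator that a GiST index can drive (operator support varies by Postgres version; the table and column names are invented):

    # K-nearest-neighbour lookup on 20-D cube columns. Assumes the
    # cube extension provides <-> with GiST support on this Postgres.
    import psycopg2

    conn = psycopg2.connect('dbname=recs')
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS cube")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS items (
            id      serial PRIMARY KEY,
            feature cube  -- 20-D point, e.g. from PCA/SVD
        )
    """)
    cur.execute("CREATE INDEX IF NOT EXISTS items_feature_gist "
                "ON items USING gist (feature)")

    # 10 nearest neighbours of a query vector by Euclidean distance;
    # psycopg2 adapts the Python list to a float8[] for cube().
    query_vec = [0.1] * 20
    cur.execute("SELECT id, feature <-> cube(%s) AS dist "
                "FROM items ORDER BY dist LIMIT 10", (query_vec,))
    for item_id, dist in cur.fetchall():
        print(item_id, dist)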


> ... is your data fairly wide?

Around 200m download logs, 2m articles and a few million IP addresses. I suspect that interest in research papers is inherently high-dimensional, and dimensionality reduction would probably damage the results.

I don't have much hardware to throw at it either. I just started looking at randomized algorithms - trying to produce a random walk on the download graph that links articles with probability proportional to some measure of similarity (probably cosine distance or Jaccard index).
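
To make the walk concrete, here's a toy sketch (the restart probability and graph are arbitrary, and visit counts stand in for the similarity measure):

    # Random walk with restarts on the bipartite download graph:
    # article -> random downloader -> random article they fetched.
    # Articles sharing more downloaders with the seed get visited
    # more often. Toy data; the real tables come from leveldb as above.
    import collections
    import random

    article_to_ips = {'a1': ['ip1', 'ip2'], 'a2': ['ip2', 'ip3'], 'a3': ['ip3']}
    ip_to_articles = {'ip1': ['a1'], 'ip2': ['a1', 'a2'], 'ip3': ['a2', 'a3']}

    def similar_articles(seed, steps=10000, restart=0.15, top=10):
        visits = collections.Counter()
        article = seed
        for _ in range(steps):
            ip = random.choice(article_to_ips[article])
            article = random.choice(ip_to_articles[ip])
            if article != seed:
                visits[article] += 1
            if random.random() < restart:
                article = seed
        return visits.most_common(top)

    print(similar_articles('a1'))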



