
At billion-vector scale, doing this yourself is pretty impossible



Faiss has long discussed strategies for scaling to 1B - 1T records here - https://github.com/facebookresearch/faiss/wiki/Indexing-1G-v...

There are plenty of options available for running your own local vector database; txtai is one of them. It ultimately depends on whether you have a sizable development team. But saying it's impossible is a step too far.
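For a sense of what the recipes in those Faiss notes look like, here's a minimal sketch using the Faiss Python API. The factory string, training-sample size, and nprobe value are illustrative guesses, not tuned settings, and a real 1B-vector build would train on a much larger sample and add vectors in batches:

  import numpy as np
  import faiss

  d = 1536  # e.g. the embedding width discussed downthread

  # OPQ rotation + 65k inverted lists + 64-byte product quantization,
  # the general shape of index the Faiss wiki suggests for ~1B vectors
  index = faiss.index_factory(d, "OPQ64,IVF65536,PQ64")

  train = np.random.rand(1_000_000, d).astype("float32")  # stand-in training sample
  index.train(train)
  index.add(train)  # in reality, stream the full dataset in batches

  # trade recall for speed by choosing how many inverted lists to probe per query
  faiss.ParameterSpace().set_index_parameter(index, "nprobe", 32)

  queries = np.random.rand(10, d).astype("float32")
  distances, ids = index.search(queries, 10)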


Even in that article, with much smaller vectors than what GPT puts out (1536 dimensions), QPS drops below 100 once recall@1 exceeds 0.4. That's to say nothing of the cost of regenerating the index with incremental updates. I don't get why people on HN are so adamant that nobody ever needs to scale beyond one machine.


The comment said that having an instance with 1B+ vectors yourself is impossible. Clearly that's not the case.


If you have a billion vectors, is "yourself" a large tech company that rolls its own browsers and programming languages and invents Kubernetes? That kind of company could probably roll this, and indeed sell it.


The last time I had to deal with vector representations of documents was more than 10 years ago, so I'm a bit rusty, but billion-vector scale sounds relatively trivial.


With retrieval times in the milliseconds? The entries may be ads, or something else user-facing. Your users are not going to sit around while you leisurely retrieve them.


not particularly?

1B vectors * 300 dimensions * float32 (4 bytes) ~= 1.2 TB

This pretty much still runs on consumer hardware.

Just run that on a 4 TB NVMe SSD, or a RAID array of SSDs if you're frisky.
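To make that concrete, here's a minimal sketch of streaming such a file off the SSD with numpy. The file name, shape, and chunk size are assumptions, and this is plain brute-force scanning rather than an index:

  import numpy as np

  N, D = 1_000_000_000, 300  # ~1.2 TB of float32 on disk
  vectors = np.memmap("vectors.f32", dtype=np.float32, mode="r", shape=(N, D))

  def brute_force_topk(query, k=10, chunk=1_000_000):
      # stream the file in chunks so only ~1.2 GB of vectors is resident at a time
      best_d = np.full(k, np.inf, dtype=np.float32)
      best_i = np.full(k, -1, dtype=np.int64)
      for start in range(0, N, chunk):
          block = np.asarray(vectors[start:start + chunk])
          d = ((block - query) ** 2).sum(axis=1)        # squared L2 distances
          top = np.argpartition(d, min(k, d.size - 1))[:k]
          cand_d = np.concatenate([best_d, d[top]])
          cand_i = np.concatenate([best_i, top + start])
          keep = np.argpartition(cand_d, k - 1)[:k]     # merge into the running top-k
          best_d, best_i = cand_d[keep], cand_i[keep]
      return best_i[np.argsort(best_d)], np.sort(best_d)

  # hypothetical usage:
  # ids, dists = brute_force_topk(np.random.rand(D).astype(np.float32))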


You do realize you have to query an index over all of that data for every single query your user makes, right? Computing that index is not entirely trivial, nor is partitioning the data so it fits in RAM across a pool of nodes.

Sure, roll your own, but don't act like making a highly scalable database is a weekend project.


Consumer hardware can still handle that: 1 TB of RAM plus a Threadripper Pro.

> You do realize you have to query an index over all of that data for every single query your user makes, right? Computing that index is not entirely trivial, nor is partitioning the data so it fits in RAM across a pool of nodes.

I don't know what any of this means -- and it sounds like you're slapping a bunch of terminology together, rather than communicating a well-thought-out idea.

Yes, in the general case you're going to have to use an index. Computing the index, or a key into that index? Computing the index is a solved problem that has no hard real-time component -- you can do it outside of normal query execution. Computing the key into the index on each query is also a solved problem.

Store the dimensions in columnar format, generate a sparse primary index on those columns, then use binary search to quickly find the blocks of interest and run a sequential scan over them with the distance function. Or you could just use regular old SS-trees, SR-trees, or M-trees for high-dimensional indexing -- they're not expensive to use at all.

There, you can easily run a query over a single dimension (1 billion entries) in under a second. You want 300 dimensions? OK, parallelize it: 128 threads, easy. At most this takes 3 seconds if everything is configured properly (a big if, which few seem to get right).
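A rough sketch of the kind of parallel scan being described, assuming the vectors already sit in RAM as a numpy array; the thread count is just the number from the comment, and the squared-L2 metric is an assumption, not a benchmark:

  import numpy as np
  from concurrent.futures import ThreadPoolExecutor

  def _scan_block(block, query, offset, k):
      # brute-force squared L2 over one contiguous slice of rows
      if len(block) == 0:
          return np.empty(0, dtype=np.float32), np.empty(0, dtype=np.int64)
      d = ((block - query) ** 2).sum(axis=1)
      top = np.argpartition(d, min(k, d.size - 1))[:k]
      return d[top], top + offset

  def parallel_topk(vectors, query, k=10, threads=128):
      # split the row range into one slice per thread; numpy releases the GIL
      # inside the heavy arithmetic, so plain threads do run in parallel here
      bounds = np.linspace(0, len(vectors), threads + 1, dtype=int)
      with ThreadPoolExecutor(max_workers=threads) as pool:
          parts = list(pool.map(
              lambda lo_hi: _scan_block(vectors[lo_hi[0]:lo_hi[1]], query, lo_hi[0], k),
              zip(bounds[:-1], bounds[1:]),
          ))
      dists = np.concatenate([p[0] for p in parts])
      ids = np.concatenate([p[1] for p in parts])
      order = np.argsort(dists)[:k]
      return ids[order], dists[order]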

This is literally a weekend project. Anyone can build something like this, but not everyone has the integrity to be upfront that they're reinventing the wheel, instead of spinning it as if they've just broken new ground in database R&D.


A second is orders of magnitude off the typical SLA for these things. It's user-facing. That's why these databases are a thing.


What kind of QPS are you looking at? How are you handling 1536 dimensions? How long does an incremental index update take? These are the problems you run into when building such a system.


I'm not familiar with the index part, but you can get at least 2TB on a single CPU socket these days. You shouldn't need multiple machines to fit it in RAM. Depending on what QPS you need to handle, you might also be fine without the whole thing fitting in RAM.


My point was, specifically, that this data doesn't have to fit in RAM.

All of it fits on a single machine on one or a few big, fast SSDs.


A big SSD is 30 TB now: https://www.newegg.com/micron-30-72-tb-9400/p/N82E1682036315...

So that kind of dataset fits on a small SSD. :-)


lol, not true. Even for huge vectors (1000-page docs), today you can do this on a single node with enough disk storage using something like LevelDB, and in memory with something like ScaNN for nearest-neighbor search.
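For the in-memory piece, here's a rough sketch with ScaNN's Python bindings, following the shape of its README example; the dataset is a toy stand-in and the leaf counts, quantization threshold, and reorder size are placeholder values, not tuned for any particular workload:

  import numpy as np
  import scann

  # toy in-memory dataset; real use would load precomputed embeddings
  dataset = np.random.rand(1_000_000, 128).astype(np.float32)
  dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)  # normalize for dot product

  searcher = (
      scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
      .tree(num_leaves=2000, num_leaves_to_search=100, training_sample_size=250_000)
      .score_ah(2, anisotropic_quantization_threshold=0.2)
      .reorder(100)
      .build()
  )

  query = dataset[0]
  neighbors, distances = searcher.search(query, final_num_neighbors=10)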


What kind of QPS are you getting and how fast are incremental index updates? That's the hard part.



