Hacker News

Karpathy (OpenAI, ex-Tesla) recently tweeted about KISS and just using np.array instead of a vector database!

Searching for similar vectors is basically the (approximate) KNN problem, although I imagine more specialized search methods might apply depending on what you are doing.




100% agree. There are so many simple use cases where people are jumping to a complex option to start. For something like 10K records, a NumPy or PyTorch matrix operation could be enough.
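For a sense of scale, here's a rough sketch of what that looks like in plain NumPy: brute-force cosine-similarity top-k over an in-memory matrix (the sizes and data are just illustrative):

```python
import numpy as np

def topk_cosine(query, vectors, k=5):
    """Brute-force cosine-similarity search over an in-memory matrix."""
    # Normalize rows so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    # argpartition is O(n); only the top-k slice gets fully sorted.
    idx = np.argpartition(-scores, k)[:k]
    return idx[np.argsort(-scores[idx])]

# ~10K records, 384-dim embeddings (illustrative random data)
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 384))
query = vectors[42] + 0.01 * rng.normal(size=384)  # noisy copy of row 42
top = topk_cosine(query, vectors, k=5)
```

At this size the whole scan is a single matrix-vector product; the same idea ports directly to a PyTorch tensor on a GPU if it ever gets slow.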

Recently published an article discussing this: https://neuml.hashnode.dev/customize-your-own-embeddings-dat...


Well, we have about 4B vectors we want to index, with a constantly changing dataset. We are actually running just fine on Elasticsearch with ~1B docs indexed right now, but the hardware costs are getting expensive.


With 4B vectors, you can look at methods like quantization and compression, both detailed here for Faiss - https://github.com/facebookresearch/faiss/wiki/Indexing-1G-v...
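To illustrate the compression idea (this is a toy scalar quantizer, not Faiss's implementation): mapping float32 components to int8 cuts per-vector storage 4x at the cost of a small, bounded reconstruction error.

```python
import numpy as np

def quantize_int8(vectors):
    """Per-dimension scalar quantization: float32 -> int8 (4x smaller)."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0  # guard constant dimensions
    codes = np.round((vectors - lo) / scale - 128).astype(np.int8)
    return codes, lo, scale

def dequantize_int8(codes, lo, scale):
    """Approximate reconstruction; error is at most scale/2 per dimension."""
    return (codes.astype(np.float32) + 128) * scale + lo

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1_000, 384)).astype(np.float32)
codes, lo, scale = quantize_int8(vectors)
approx = dequantize_int8(codes, lo, scale)
```

Product quantization (what the Faiss wiki page linked above covers) compresses far more aggressively by coding subvectors against learned codebooks, but the trade-off is the same shape: less memory, slightly noisier distances.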

Elasticsearch uses HNSW; I'm not sure what options it exposes, but quantization/compression will help reduce disk storage requirements. Alternatively, you can look at dimensionality reduction algorithms and only store that output in ES. Or pick a model with a smaller number of dimensions. For example, https://huggingface.co/sentence-transformers/all-MiniLM-L6-v... only has 384 dims vs 768/1024/2048/4096.
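The dimensionality-reduction route can be sketched with plain NumPy PCA via SVD (sizes are illustrative; in practice you'd fit on a sample and keep the mean/components around to transform new vectors):

```python
import numpy as np

def pca_reduce(vectors, out_dim):
    """Fit PCA via SVD and project vectors down to out_dim dimensions."""
    mean = vectors.mean(axis=0)
    centered = vectors - mean
    # Rows of vt are the principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:out_dim]
    reduced = centered @ components.T
    return reduced, mean, components

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1_000, 768))
reduced, mean, components = pca_reduce(vectors, 128)  # 768 -> 128 dims
```

At 4B vectors, going from 768 to 128 float32 dims is the difference between ~12 TB and ~2 TB of raw vector data before any index overhead.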


>Karpathy (OpenAI, ex. Tesla) recently tweeted about KISS and just using np.array instead of a vector database

The context was a very underwhelming side project of his: a movie search engine where you had to use the exact titles of the movies to get results. It only revealed that he doesn't appreciate what similarity search actually is.

It feels almost blasphemous to call a Karpathy side project underwhelming. He is a genius and it really felt unlike him to write that "just use np.array" tweet.


I don't recall the context in that much detail, but I'd have to give him the benefit of the doubt!

Surely the whole point of a vector "database" in that context would be to store semantic sentence embeddings of the movie titles to support approximate / semantically-related search? Could do the same thing for movie plot synopses too - allow users to search via vague descriptions of a movie. ChatGPT actually does very well at this, although it's massive overkill.


It definitely depends on your use case. If you are just searching through the entire array at all times, then this is certainly an acceptable option (you could even flip it all onto a GPU too).

But when you start to require filtering or combining the vector search with a lexical search, then something like Pinecone, Vespa, Qdrant, or the Lucene-based options (e.g. Solr and ES) becomes a lot more practical than building all that functionality yourself.
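To be fair, a simple pre-filter is still easy at small scale - here's a sketch where a boolean mask stands in for a real metadata predicate (the hard part the databases solve is doing this efficiently against an ANN index, not the brute-force version):

```python
import numpy as np

def filtered_search(query, vectors, mask, k=5):
    """Apply a boolean metadata filter, then rank the survivors by cosine."""
    candidates = np.flatnonzero(mask)
    v = vectors[candidates]
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = v @ q
    order = np.argsort(-scores)[: min(k, len(candidates))]
    return candidates[order]  # indices into the original matrix

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1_000, 64))
mask = rng.random(1_000) < 0.2      # stand-in for e.g. "year >= 2020"
first = np.flatnonzero(mask)[0]
hits = filtered_search(vectors[first], vectors, mask, k=3)
```

Once the filter interacts with an HNSW/IVF index (pre- vs post-filtering, selectivity, hybrid lexical scoring), you really do want a system that has already solved it.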



