Over the years, I've found myself building hacky solutions to serve and manage my embeddings. I’m excited to share Embeddinghub, an open-source vector database for ML embeddings. It is built with four goals in mind:
Store embeddings durably and with high availability
Allow for approximate nearest neighbor operations
Enable other operations like partitioning, sub-indices, and averaging
Manage versioning, access control, and rollbacks painlessly
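As a toy illustration of the "averaging" operation from the list above, here's a minimal numpy sketch (made-up vectors, not the Embeddinghub API): averaging several item embeddings into a single composite embedding, e.g. to build a user embedding from the items they interacted with.

```python
import numpy as np

# Hypothetical item embeddings (3 items, 3 dimensions each).
item_embeddings = np.array([
    [0.1, 0.3, 0.5],
    [0.2, 0.1, 0.4],
    [0.3, 0.2, 0.6],
], dtype=np.float32)

# Average the item embeddings into one composite "user" embedding.
user_embedding = item_embeddings.mean(axis=0)

# Normalize so that cosine similarity reduces to a dot product.
user_embedding /= np.linalg.norm(user_embedding)
```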
It's still in the early stages, and before we commit more dev time to it we want to get your feedback. Let us know what you think and what you'd like to see!
In the "Definitive Guide to Embeddings", in the figure "An illustration of One Hot Encoding", the "One Hot Encoding" table doesn't make any sense whatsoever. Am I wrong?
You are both right. I just realized this and would be embarrassed if I wasn’t laughing so hard. I gave an original drawing to our designer with the correct values and we didn’t inspect their final image. We’ll get this fixed, thanks for pointing this out and sorry for the confusion :)
I see you've got examples for NLP use cases in your docs. Can't wait to read them. Embeddings are a constant source of complexity when I'm trying to move certain operations to Lambda, this looks like it would speed the initializations up big time.
We actually use HNSWLIB by NMSLIB on the backend. NMSLIB is solving the approximate nearest neighbor problem, not the storage problem. It's not a database, it's an index. We handle everything needed to turn their index into a full-fledged database with a data science workflow around it (versioning, monitoring, etc.)
Where can I find documentation on versioning? My first use case would be to version different embeddings and use it more as a storage backend than for KNN search. Would it be possible to skip creating the NN graph and just use it for versioned storage? We currently use OpenDistro, which nicely allows pre- and post-filtering based on other document fields (other than the embedding). So I think this could never be a full replacement without figuring out how to combine the rest of the document structure.
Hey! We're actually polishing up a PR that'll add documentation and finalize the versioning API; it should be merged this weekend. Would you be up for a quick chat with someone on our team? It would be interesting to get your feedback and see what else we're missing to be a drop-in replacement for OpenDistro. Join our Slack if so, and we'll DM you :) https://join.slack.com/t/featureform-community/shared_invite...
Pinecone is closed source and only available as a SaaS service. We have more overlap with Milvus, but we're focused on the embeddings workflow, like versioning and using embeddings with other features, while Milvus is entirely focused on nearest neighbor operations.
Faiss is solving the approximate nearest neighbor problem, not the storage problem. It’s not a database, it’s an index. We use a lightweight version of Faiss (HNSWLIB) to index embeddings in Embeddinghub.
The biggest difference, as cyrusthegreat pointed out, is that we're a fully managed service. You sign up, spin up a database service with a single API call[0], and go from there. There's no infrastructure to build and keep available, even as you scale to billions of items.
Pinecone also comes with features like metadata filtering[1] for better control over results, and hybrid storage for up to 10x lower compute costs. EmbeddingHub has a few features Pinecone doesn't yet have, like versioning -- though with our architecture it's straightforward to add if someone asks.
Hope that helps! And I'm glad to see more projects in this space, especially from the feature-store side.
Not yet, this is very much an early release to get it in people's hands and to get feedback on the API and the functionality. We've purposely held off optimizing too much until we feel more confident that this is useful and our API approach makes sense for people. That said, Simba, one of the main devs, actually comes from a performance-tuning background at Google. Also, it's built on HNSWLIB and RocksDB, and is being used in real-world workloads today.
This is really great! It speaks very much to my use-case (building user embeddings and serving them both to analysts + other ML models).
I was wondering if there was a reasonable way to store raw data next to the embeddings such that:
1. Analysts can run queries to filter down to a space they understand (the raw data).
2. Nearest neighbors can be run on top of their selection on the embedding space.
Our main use case is segmentation, so giving analysts access to the raw feature space is very important.
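One way to approximate the two-step workflow described above today is to pre-filter on the raw feature space and then run nearest neighbors over the surviving subset. A minimal numpy sketch with made-up metadata (brute force, so only practical for modest subset sizes):

```python
import numpy as np

# Hypothetical raw metadata stored alongside the embeddings, keyed by row.
ages = np.array([23, 35, 29, 41, 31])
countries = np.array(["us", "de", "us", "us", "fr"])
embeddings = np.random.rand(5, 8).astype(np.float32)

# 1. Analysts filter down to a space they understand (the raw data).
mask = (countries == "us") & (ages < 40)
subset_ids = np.where(mask)[0]

# 2. Nearest neighbors run on top of that selection in embedding space.
query = embeddings[0]
subset = embeddings[subset_ids]
dists = np.linalg.norm(subset - query, axis=1)
nearest = subset_ids[np.argsort(dists)]
```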
Great work! Looks like you are using HNSWLIB. From what I understand, the HNSW graph-based approach can be memory intensive compared to a PQ-code-based approach. Faiss has support for both HNSW and PQ codes. Any plans to extend your work to support a PQ-code-based index in the future?
Yes! We plan to bring Faiss in and utilize a lot of its functionality; our goal for this release was to get an end-to-end version working so we could get feedback on the API. HNSW was a good default with this in mind.
How does it compare to the OpenDistro for Elastic KNN plugin - which also uses HNSW (and also includes scalable storage, high availability, backups, and filtering)?
Our API is built from the ground up with the machine learning workflow in mind. For example, we have a training API that allows you to batch requests and even download your embeddings and generate an HNSW index locally. Our view of versioning, rollbacks, and more makes a lot of sense for an ML index, but very little sense for a search index.
Gensim is great for generating certain types of embeddings, but not for operationalizing them. It doesn't do approximate nearest neighbor lookup, which is a deal breaker for most models that use embeddings at scale. It also does not manage versioning, so you end up having to hack a workflow around it to manage embeddings. Finally, it's not really data infrastructure like this is, so you end up doing hacky things like copying all your embeddings into every Docker image. With regards to serving embeddings, gensim is just a library that supports in-memory brute-force nearest neighbor lookups.
Faiss actually also uses HNSW internally; HNSWLIB is just a lighter-weight implementation, which allowed us to iterate faster. In the future we will switch it out for Faiss to take advantage of its full array of functionality.
Repo: https://github.com/featureform/embeddinghub
Docs: https://docs.featureform.com/
What's an Embedding? The Definitive Guide to Embeddings: https://www.featureform.com/post/the-definitive-guide-to-emb...