What I really want to see is an HNSW index type for SQLite so finding similar ve...

tomd · on Feb 7, 2023

You might be interested in https://datasette.io/plugins/datasette-faiss, which I'm using alongside openai-to-sqlite for similarity search of embeddings, following @simonw's excellent instructions at https://simonwillison.net/2023/Jan/13/semantic-search-answer...

uh_uh · on Feb 7, 2023

Thanks, but the index being in-memory makes it unsuitable for large data sets :/

simonw · on Feb 7, 2023

There is a way of running disk-backed FAISS indexed that don't all fit in memory but I've not quite figured out how to do that yet: https://github.com/facebookresearch/faiss/issues/2675

jamesblonde · on Feb 7, 2023

OpenSearch K-NN plugin supports FAISS and it's disk based:

https://opensearch.org/docs/latest/search-plugins/knn/index/

uh_uh · on Feb 9, 2023

OpenSearch looks like the best so far, all my requirements combined!

iandanforth · on Feb 7, 2023

Can you say more? Usually projects that gravitate to SQLlite are not those that require massive scale and a FAISS index of a few GB covers a lot of documents.

uh_uh · on Feb 7, 2023

My dataset is going to be around 10M documents. With OpenAI embeddings, that will be around 62GB. AFAIK SQLite should be able to handle that size, but I haven't tried.

This is not going to be my primary DB. I would update this maybe once a day and the update doesn't have to be super fast.

sharemywin · on Feb 7, 2023

you might check out some vector databases:

https://milvus.io/

AND

pinecone.io

there are others too

CGamesPlay · on Feb 7, 2023

You could do it using an interface similar to the FTS tables in SQLite, where the “index” is actually a table that needs to be kept up to date with the underlying data (you can use triggers to keep them in sync). Querying is a little awkward compared to a normal index, but for these complex, inexact indexes you often want the extra flexibility (like computing the match score).

uh_uh · on Feb 7, 2023

That's promising, is there a way to integrate this with other filters in the SQL query? Scenario: let's say I have documents and each one has a timestamp and an embedding vector. I should be able to write a query like this:

select * from documents where timestamp=today() order by similarity(embedding, query_vector)

In this case the query planner should be smart enough such that it first filters by timestamp and within those records remaining it does the HNSW search. I'm unsure whether SQLite's extension interface is flexible enough to support this.

giraffe_lady · on Feb 7, 2023

I don't think so. You can have unindexed columns on fts tables but I don't think you can directly combine non-fts query filters in that way.

You can do it, but you need to manually filter from the main table then subquery the fts index, or use a join, etc. It's powerful but the interface isn't as friendly as it could be in that way.

The sqlite extension system is powerful enough that an extension could unify the behaviors as you describe. It's just not how the fts5 extension interface is exposed, and afaik no one has written an extension on top of it to make it easier to use.

CGamesPlay · on Feb 7, 2023

An SQLite virtual table can specify a “cost” to a particular query plan, you you could weight your index search higher and discourage SQLite from using it to filter many rows. The tuning process works by specifying a number of “disk reads” or estimated rows for a particular query plan.

More info: https://www.sqlite.org/vtab.html#the_xbestindex_method

tomd · on Feb 7, 2023

> I'm unsure whether SQLite's extension interface is flexible enough to support this.

I think it is, if I've understood your requirements correctly. e.g. from the datasette-faiss docs:

  with related as (
    select value from json_each(
      faiss_search(
        'intranet',
        'embeddings',
        (select embedding from embeddings where id = :id),
        5
      )
    )
  )
  select id, title from articles, related
  where id = value