> that I need to search fairly infrequently (but sometimes in bulk).
What do you mean by search? Full-text search? Do you need to run custom code on the original data?
> A solution we came up with was a small, hot, in-memory index that points to the location of the data in a file on S3.
Yes, it's like keeping the block index of an SSTable (in RocksDB) in memory. The next step is a local cache on the EC2 node, and the step after that is a "distributed" cache across your EC2 nodes, so you don't query S3 for a chunk if it's already present on one of the other nodes.
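A minimal sketch of that index pattern, with a local bytes blob standing in for the S3 object (the `pack`/`lookup` helpers and the `(offset, length)` layout are illustrative, not anyone's actual schema); against real S3 you'd issue a ranged GET for the byte span instead of slicing:

```python
# Hot in-memory index: doc_id -> (offset, length) into one packed blob.
# Here the blob is local bytes; on S3 you'd do a ranged GET instead.

def pack(docs):
    """Concatenate documents into one blob, recording each one's offset."""
    index, chunks, offset = {}, [], 0
    for doc_id, body in docs.items():
        data = body.encode("utf-8")
        index[doc_id] = (offset, len(data))
        chunks.append(data)
        offset += len(data)
    return index, b"".join(chunks)

def lookup(index, blob, doc_id):
    """Resolve doc_id via the hot index, then read just that byte range."""
    offset, length = index[doc_id]
    return blob[offset:offset + length].decode("utf-8")

index, blob = pack({"a": "first doc", "b": "second doc"})
print(lookup(index, blob, "b"))  # -> second doc
```

The index stays tiny (two small integers per document), which is what makes it cheap to keep fully in memory.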
Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can sit in front of S3 or any other dataset. You can use nginx/varnish as a reverse proxy, but neither is distributed. There is Alluxio, but it's single-master.
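The peer-routing half of such a cache can be as simple as consistent hashing: every node hashes a chunk key onto the same ring, so they all agree on which peer should hold it and only that peer ever fetches it from S3. A rough sketch (node names and vnode count are made up for illustration):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Consistent-hash ring: maps chunk keys to cache nodes so every
    node agrees on which peer owns a given chunk."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets `vnodes` points on the ring to smooth the load.
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def owner(self, chunk_key):
        # First ring point at or past the key's hash, wrapping around.
        idx = bisect(self.points, self._hash(chunk_key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("sstable-0042/block-17"))
```

On a miss, the owning node fetches the chunk from S3 and keeps it on local disk; adding or removing a node only remaps the keys adjacent to its ring points.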
> Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can sit in front of S3 or any other dataset. You can use nginx/varnish as a reverse proxy, but neither is distributed. There is Alluxio, but it's single-master.
If you think about it more, it ends up being a distributed key-value store that supports both disk and memory access. You could write one using one of the open-source Raft libraries; a possible candidate is TiKV from PingCAP.
> If you think about it more, it ends up being a distributed key-value store that supports both disk and memory access. You could write one using one of the open-source Raft libraries; a possible candidate is TiKV from PingCAP.
"Search" is maybe too strong a word; "lookup" is probably more accurate. I have a couple of identifiers for each document, from which I want to retrieve the full doc.
I'm not sure what you mean by running custom code on the data. I usually do some kind of transformation afterwards.
I didn't find anything either, which is why I was wondering if I was searching for the wrong thing.
How big is each document? If documents are big, keep each one as a separate file and store the ids in a database. If documents are small, then you want something like https://github.com/rockset/rocksdb-cloud as a building block.
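For the big-document case, the id-to-location mapping can be a plain SQLite table (the table name, identifiers, and the S3-style keys below are all made up for illustration; the documents themselves would live as separate objects):

```python
import sqlite3

# Tiny id -> storage-key mapping; the documents live elsewhere as
# individual files/S3 objects, so the database stays small and fast.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (doc_id TEXT PRIMARY KEY, s3_key TEXT)")
db.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("inv-2023-001", "docs/2023/001.json"),
     ("inv-2023-002", "docs/2023/002.json")],
)

def key_for(doc_id):
    """Look up where a document is stored; None if the id is unknown."""
    row = db.execute(
        "SELECT s3_key FROM docs WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None

print(key_for("inv-2023-002"))  # -> docs/2023/002.json
```

Multiple identifiers per document are just extra indexed columns (or a second mapping table) pointing at the same key.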