> that I need to search fairly infrequently (but sometimes in bulk).
What do you mean by search? Full-text search? Do you need to run custom code on the original data?
> A solution we came up with was a small, hot, in-memory index that points to the location of the data in a file on S3.
Yes, it's like keeping the block index of an SSTable (in RocksDB) in memory. The next step is a local cache on the EC2 node, and the step after that is a "distributed" cache across your EC2 nodes, so you don't query S3 for a chunk if it's already present on one of the other nodes.
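A minimal sketch of that index pattern, with a local bytes blob standing in for the S3 object (the `pack`/`lookup` helpers and the `(offset, length)` layout are illustrative, not anyone's actual schema); against real S3 you'd issue a ranged GET for the byte span instead of slicing:

```python
# Hot in-memory index: doc_id -> (offset, length) into one packed blob.
# Here the blob is local bytes; on S3 you'd do a ranged GET instead.

def pack(docs):
    """Concatenate documents into one blob, recording each one's offset."""
    index, chunks, offset = {}, [], 0
    for doc_id, body in docs.items():
        data = body.encode("utf-8")
        index[doc_id] = (offset, len(data))
        chunks.append(data)
        offset += len(data)
    return index, b"".join(chunks)

def lookup(index, blob, doc_id):
    """Resolve doc_id via the hot index, then read just that byte range."""
    offset, length = index[doc_id]
    return blob[offset:offset + length].decode("utf-8")

index, blob = pack({"a": "first doc", "b": "second doc"})
print(lookup(index, blob, "b"))  # -> second doc
```

The index stays tiny (two small integers per document), which is what makes it cheap to keep fully in memory.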
Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can sit in front of S3 or any other dataset. You can use nginx/varnish as a reverse proxy, but neither is distributed. There is Alluxio, but it's single-master.
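The peer-routing half of such a cache can be as simple as consistent hashing: every node hashes a chunk key onto the same ring, so they all agree on which peer should hold it and only that peer ever fetches it from S3. A rough sketch (node names and vnode count are made up for illustration):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Consistent-hash ring: maps chunk keys to cache nodes so every
    node agrees on which peer owns a given chunk."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets `vnodes` points on the ring to smooth the load.
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def owner(self, chunk_key):
        # First ring point at or past the key's hash, wrapping around.
        idx = bisect(self.points, self._hash(chunk_key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("sstable-0042/block-17"))
```

On a miss, the owning node fetches the chunk from S3 and keeps it on local disk; adding or removing a node only remaps the keys adjacent to its ring points.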
> Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can sit in front of S3 or any other dataset. You can use nginx/varnish as a reverse proxy, but neither is distributed. There is Alluxio, but it's single-master.
If you think about it more, it ends up being a distributed key-value store that supports both disk and memory access. You could write one using one of the open-source Raft libraries; a possible candidate is TiKV from PingCAP.
> If you think about it more, it ends up being a distributed key-value store that supports both disk and memory access. You could write one using one of the open-source Raft libraries; a possible candidate is TiKV from PingCAP.
"Search" is maybe too strong a word; "lookup" is probably more accurate. I have a couple of identifiers for each document, from which I want to retrieve the full doc.
I'm not sure what you mean by running custom code on the data. I usually do some kind of transformation afterwards.
I didn't find anything either, which is why I was wondering if I was searching for the wrong thing.
How big is each document? If documents are big, keep each one as a separate file and store the ids in a database. If documents are small, then you want something like https://github.com/rockset/rocksdb-cloud as a building block.
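For the big-document case, the id-to-location mapping can be a plain SQLite table (the table name, identifiers, and the S3-style keys below are all made up for illustration; the documents themselves would live as separate objects):

```python
import sqlite3

# Tiny id -> storage-key mapping; the documents live elsewhere as
# individual files/S3 objects, so the database stays small and fast.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (doc_id TEXT PRIMARY KEY, s3_key TEXT)")
db.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("inv-2023-001", "docs/2023/001.json"),
     ("inv-2023-002", "docs/2023/002.json")],
)

def key_for(doc_id):
    """Look up where a document is stored; None if the id is unknown."""
    row = db.execute(
        "SELECT s3_key FROM docs WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None

print(key_for("inv-2023-002"))  # -> docs/2023/002.json
```

Multiple identifiers per document are just extra indexed columns (or a second mapping table) pointing at the same key.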