The use case is caching 20 million API responses that almost never change, each about 20 KB of JSON, for a high-traffic site.
Yes, I can pay for a 400 GB RAM instance of Redis, but it's expensive.
I can also cache it on disk, but then I need to think about cache expiration myself.
Or I can use something more appropriate like a document database, but then I need additional code and configuration for a piece of infrastructure we otherwise don't need in our stack.
It would be a lot easier if I could just store it in Redis with the other (more reasonably sized) things that I need to cache.
In other abuses of SQLite, I wrote a tool [0] that exposes blobs in SQLite via an Amazon S3 API. It doesn't do expiry (but that would be easy enough to add if S3 does it).
We were using it to manage millions of images for machine learning, since many tools support S3 and the ability to attach custom metadata to objects is useful (harder with plain files). It is one SQLite database per bucket, but at the bucket level it is transactional.
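For context, the core of such a store is tiny. A minimal sketch of the blob-plus-metadata idea (this schema is invented for illustration, not the actual tool's):

```python
import sqlite3

conn = sqlite3.connect("bucket.db")  # one database per bucket
conn.execute("""
    CREATE TABLE IF NOT EXISTS objects (
        key      TEXT PRIMARY KEY,   -- S3 object key
        body     BLOB NOT NULL,      -- the object bytes
        metadata TEXT                -- JSON of custom metadata headers
    )
""")

# Writes are transactional at the bucket (database) level.
with conn:
    conn.execute(
        "INSERT OR REPLACE INTO objects (key, body, metadata) VALUES (?, ?, ?)",
        ("images/0001.png", b"...png bytes...", '{"label": "cat"}'),
    )
```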
Redis Data Tiering - Redis Enterprise and AWS ElastiCache for Redis support data tiering (using SSD for 80% of the dataset and moving things in and out). On AWS, a cache.r6gd.4xlarge with 100GB of memory can handle 500GB of data.
Local Files
> I can also cache it on disk, but then I need to think about cache expiration myself.
Is the challenge that you need it shared among many machines? On a single machine you can put 20 million files in a directory hierarchy and let the fs cache keep things hot in memory as needed. Or use SQLite, which will only load the pages needed for each query and also relies on the fs cache.
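A sketch of the single-machine file approach, with expiry handled lazily by checking each file's mtime on read (the cache path, TTL, and helper names here are all assumptions):

```python
import hashlib
import os
import time

CACHE_DIR = "/var/cache/api"   # placeholder path
TTL = 7 * 24 * 3600            # e.g. one week; tune to taste

def _path(key: str) -> str:
    h = hashlib.sha256(key.encode()).hexdigest()
    # Two levels of fan-out so no single directory holds millions of files.
    return os.path.join(CACHE_DIR, h[:2], h[2:4], h)

def get(key: str) -> bytes | None:
    p = _path(key)
    try:
        if time.time() - os.path.getmtime(p) > TTL:
            os.remove(p)       # expired: treat as a miss
            return None
        with open(p, "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None

def put(key: str, value: bytes) -> None:
    p = _path(key)
    os.makedirs(os.path.dirname(p), exist_ok=True)
    tmp = p + ".tmp"
    with open(tmp, "wb") as f:  # write-then-rename so readers never see partial files
        f.write(value)
    os.replace(tmp, p)
```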
S3 - An interesting solution is one of the SQLite S3 VFSes. Those will query S3 fairly efficiently for specific data in a large dataset.
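What makes those efficient is that S3 supports ranged GETs, so a VFS can fetch just the database pages a query touches instead of the whole file. A sketch of that underlying primitive with boto3 (bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def read_range(bucket: str, key: str, offset: int, length: int) -> bytes:
    # Fetch only `length` bytes starting at `offset`, not the whole object.
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{offset + length - 1}",
    )
    return resp["Body"].read()

# e.g. read just the 100-byte SQLite header of a database stored in S3
header = read_range("my-cache-bucket", "cache.db", 0, 100)
```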
You could try using Amazon S3 Express, a low-latency alternative for S3 buckets [0]. I imagine cache invalidation would be relatively simple to implement using lifecycle policies.
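A sketch of such a rule with boto3, expiring every object 30 days after creation (the bucket name is a placeholder, and it's worth verifying which lifecycle features S3 Express directory buckets actually support):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-cache-bucket",                      # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-cached-responses",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},          # apply to every object
                "Expiration": {"Days": 30},        # delete 30 days after creation
            }
        ]
    },
)
```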
Or shard it - divide your objects up based on some criterion (hash the name of the object, use the first N digits of the hash to assign it to a shard) and distribute them across multiple redis instances. Yes, you then need to maintain some client code to pick the right redis instance to fetch from (a sketch follows below), but you can now pick the most $/memory-efficient instance types to run redis, and you don't have to worry about introducing disk read latency and the edge cases that brings with it.
Edit: looks like redis has some built-in support for data sharding when used as a cluster (https://redis.io/docs/latest/commands/cluster-shards/) - I haven't used that, so not sure how easy it is to apply, and exactly what you'd have to change.
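The client-side picking logic mentioned above is only a few lines. A sketch with redis-py (hostnames, shard count, and the choice of hash are all arbitrary):

```python
import hashlib

import redis  # redis-py client

# One connection per shard; hosts are placeholders.
SHARDS = [
    redis.Redis(host="redis-0.internal"),
    redis.Redis(host="redis-1.internal"),
    redis.Redis(host="redis-2.internal"),
    redis.Redis(host="redis-3.internal"),
]

def shard_for(key: str) -> redis.Redis:
    # Use a real digest, not hash(), which is randomized per process.
    digest = hashlib.sha256(key.encode()).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

def cache_get(key: str) -> bytes | None:
    return shard_for(key).get(key)

def cache_set(key: str, value: bytes, ttl: int = 86400) -> None:
    shard_for(key).set(key, value, ex=ttl)
```

Note that modulo sharding means adding a shard reshuffles most keys; for a cache that's usually just a temporary hit-rate drop rather than a correctness problem.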
You're trying to get redis to be what it isn't. Use a thing that has the properties you want: a document or relational database. If you insist on this, then running a system that allows a ton of swap onto a reasonably fast disk might work, but it's still gonna perform worse than a system designed for concurrently serving queries of wildly differing latencies.
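For what it's worth, the relational version of this cache is about as simple as tables get. A sketch in Postgres terms with psycopg2 (all names and the DSN are placeholders):

```python
import psycopg2

conn = psycopg2.connect("dbname=cache")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS api_cache (
            key        TEXT PRIMARY KEY,
            body       JSONB NOT NULL,
            expires_at TIMESTAMPTZ NOT NULL
        )
    """)
    # Reads filter out expired rows; a periodic DELETE reclaims the space.
    cur.execute(
        "SELECT body FROM api_cache WHERE key = %s AND expires_at > now()",
        ("some-key",),
    )
    row = cur.fetchone()
```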
Have you looked at Varnish for caching API responses? Varnish lets you back the cache with disk and relies on the page cache to keep the most-accessed items in memory.
If the reverse proxy approach doesn't work, I think memcached has two-level storage like that now (extstore, iirc).