But you don't actually need that level of performance? You've made this system more complex and expensive to achieve a requirement that doesn't matter?
You seem to have a deeper knowledge of the business & organisational context that dictates the true requirements than someone working there. Please share these details so we can all learn!
Sure: the network request time of a person making a request over the open internet is going to be one or two orders of magnitude longer than a DB lookup (in the right style, with a reverse index) on the scale of data this person is describing. So making the lookup 10x faster saves you... maybe 1% of the request latency.
And at the qps they've described, it's not a throughput issue either. So I'm pretty confident in saying that this is a case of premature optimization.
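To put illustrative numbers on it (round figures of my own, not measurements from their system): say the round trip over the public internet is ~100 ms and a reverse-indexed lookup is ~1 ms, so ~101 ms total. Cut the lookup to 0.1 ms and you're at ~100.1 ms, i.e. the user-visible latency improves by less than 1%.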
And at some point the increase in parallelization of scans dominates mmap speed, unless you're redundantly sharding your mmapped hash table across multiple machines. And there are cases where network bandwidth is the bottleneck before disk bandwidth, though probably not this case. But yeah, basically the answer is something like "if this is the optimal choice, it probably didn't matter that much".
This reads to me as if you have never really used mmap in a dedicated C/C++ application. Just to give you a data point, looking up one word_id in the LUT and reading 20 document_ids from it takes on average 0.0000015 ms (about 1.5 ns).
So if that alternative database takes on average 0.1ms per index read, then it's starting out roughly 65000x slower.
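To give a feel for why it's that fast, here is a stripped-down sketch of the hot path (file name and layout are illustrative, not my actual format; for simplicity, pretend word_id indexes a flat table of fixed-size postings directly):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define DOCS_PER_WORD 20

    int main(void) {
        int fd = open("lut.bin", O_RDONLY);          /* hypothetical file name */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file read-only; the kernel pages it in on demand. */
        const uint64_t *lut = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (lut == MAP_FAILED) { perror("mmap"); return 1; }

        uint64_t word_id = 42;                        /* example query */
        const uint64_t *docs = lut + word_id * DOCS_PER_WORD;

        /* If the page is already resident, this is just a few cache-line reads. */
        for (int i = 0; i < DOCS_PER_WORD; i++)
            printf("doc %llu\n", (unsigned long long)docs[i]);

        munmap((void *)lut, (size_t)st.st_size);
        close(fd);
        return 0;
    }

Once the mapping is warm there isn't even a syscall on the lookup path, which is where the nanosecond-scale numbers come from.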
"than a DB lookup (in the right style, with a reverse-index)"
Unless, of course, you're managing petabytes of data ;)
"at the qps they've described, it's not a throughput issue either"
It's mostly a cost thing. If a single request takes 2x the time, that's also a 2x on the hosting bill.
"parallelization of scans dominates mmap speed"
Yes, eventually that might happen. Roughly when you have 100,000 servers. But before that, your 10 Gbit/s node-to-node link will saturate. Oops.
> Unless, of course, you're managing petabytes of data ;)
Are...are you saying that you've purchased petabyte(s) of RAM, and that that multi-million dollar investment is somehow cheaper than...well really anything else?
> But before that your 10gbit/s node-to-node link will saturate. Oops.
Only if you're returning dense results, which it sounds like you aren't (and there are ways to address this anyhow), which is why I said the issue of saturating network before disk probably wasn't an issue for you ;)
No, of course I have a tiered architecture: HDDs + SSDs + RAM. By mmap-ing the file, I let the Linux kernel make sure that whatever data I access is in RAM, and it does best-effort read-ahead and caching, which works very well.
BTW, this is precisely how "real databases" also handle their storage IO internally. So all of the performance cost I have to pay here, they have to pay, too.
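And if I ever need to nudge that caching behaviour, the kernel takes hints. A minimal sketch (the regions and parameters here are made up for illustration):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hints to the kernel after mmap (Linux); addresses must be page-aligned. */
    void hint_access_pattern(void *map, size_t map_len, void *hot, size_t hot_len) {
        madvise(map, map_len, MADV_RANDOM);    /* point lookups scattered over the file */
        madvise(hot, hot_len, MADV_WILLNEED);  /* prefetch a region we know we'll touch */
    }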
But the key difference is that with a regular database and indices, the database needs to be able to handle read and write loads, which leads to all sorts of undesirable trade-offs in its indices. I can use a mathematically perfect index if I split dataset generation off from dataset hosting.
It's really quite difficult to explain, so I'll just redirect you to the algorithms. A regular database will typically use a B-tree index, which is O(log(N)). I'm using a direct hash bucket look-up, which is O(1).
For a mental model, you can think of "mmap" as "all the results are already in RAM, you just need to read the correct variable". There is no network connection, no SQL parsing, no query planning, no index scan, no data retrieval. All those steps would just consume unnecessary RAM bandwidth and CPU usage. So where a proper DB needs 1000+ CPU cycles, I might get away with just 1.
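As a sketch of what that looks like in code (simplified layout, not my real one; the real thing has variable-length postings and a collision-free hash built offline):

    #include <stddef.h>
    #include <stdint.h>

    #define NUM_BUCKETS (1u << 24)          /* fixed at dataset-build time, power of two */
    #define DOCS_PER_BUCKET 20

    typedef struct {
        uint64_t word_id;                   /* key kept around for verification */
        uint64_t doc_ids[DOCS_PER_BUCKET];  /* fixed-size postings for simplicity */
    } bucket_t;

    /* 'table' points into the mmap'd file. Because the table is generated
     * offline and never written to at serve time, collisions can be
     * eliminated up front (perfect hashing); here we just assume hash()
     * is collision-free for the indexed vocabulary. */
    static inline const uint64_t *lookup(const bucket_t *table, uint64_t word_id,
                                         uint64_t (*hash)(uint64_t)) {
        const bucket_t *b = &table[hash(word_id) & (NUM_BUCKETS - 1)];
        return b->word_id == word_id ? b->doc_ids : NULL;   /* O(1): hash + one index */
    }

No tree traversal, no locking: that's what the offline/online split buys.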
No modern DB uses mmap because it's unreliable and hard to tune for performance.
A custom cache manager will always perform better than mmap provided by the kernel.
The problem is you haven't explained why the overhead of a DB is too much. Sure, on paper it sounds like a lot of work for your servers and the DB compared to reading from a hashmap.
Where I work right now we fire around 1.5B queries a day... to Mongo.
What kind of servers are you running? What's your max QPS?
The fact is, with your mmap implementation you're probably relying on RAM + virtual memory, and provisioning more RAM than needed to compensate for the fact that you don't keep just the most-used keys in memory, which a DB will do for you.
Point is, if you have petabytes of data and your access patterns mean you only ever touch a subset of it, even Mongo might be cheaper to run.
Just FYI, MongoDB storage also uses mmap internally.
So we are comparing here "just mmap" with "mmap + all that connection handling, query parsing, JSON formatting, buffering, indexing, whatever stuff that MongoDB does".
And no, MongoDB is effectively never a cheap solution. It's used because it's super convenient to work with, with everything being JSON documents. But all that conversion to and from JSON comes at a price: it'll eat up 1000s of CPU cycles just to read a single document. With raw mmap, you could read 1000s of documents instead.
MongoDB uses the WiredTiger storage engine internally. The MMAP storage engine was removed from MongoDB in v4.2, released in August 2019, and had been deprecated since v4.0.
In MongoDB, conversion between raw JSON and BSON (Binary JSON) is done on the client side (i.e. in the driver), so those CPU cycles are not consumed on the server.