Hacker News
Indexing semantic versions in RocksDB (aawadia.dev)
28 points by adamretter on Jan 5, 2024 | 17 comments



> "The technical primitive data structure here is a hashmap where the keys are sorted."

...somewhere, this person's algorithms teacher is pondering his life choices.


I don’t see why that’s a particularly bad explanation of how RocksDB works


It's not using hashes.

(OK, well, it is maybe using hashes for the per-SST Bloom filter, but that's not what's interesting here. It's maybe also using hashes in the memtable in which case they're using it very inefficiently since the key encoding is not prefix-scan-friendly.)


I don't know anything about RocksDB, but on the surface this approach seems like it could be very slow? Wouldn't it be more efficient to encode the semver in a format more suitable to sorting?
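
For example, a minimal sketch of one such encoding (my own illustration, assuming the components are non-negative 32-bit integers and ignoring pre-release tags): pack major.minor.patch as fixed-width big-endian bytes, and bytewise order then matches semver order.

    import java.nio.ByteBuffer;
    import java.util.Arrays;

    public class SemverKey {
        // Encode "major.minor.patch" as 12 big-endian bytes so that plain
        // bytewise comparison of the keys matches numeric semver ordering.
        static byte[] encode(int major, int minor, int patch) {
            return ByteBuffer.allocate(12)
                    .putInt(major)
                    .putInt(minor)
                    .putInt(patch)
                    .array();
        }

        public static void main(String[] args) {
            byte[] a = encode(1, 2, 10);
            byte[] b = encode(1, 10, 0);
            // Prints a negative number: a sorts before b, matching 1.2.10 < 1.10.0.
            // The naive string keys "1.2.10" and "1.10.0" would sort the other way.
            System.out.println(Arrays.compareUnsigned(a, b));
        }
    }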


By default RocksDB uses a BytewiseComparator to sort the keys in the SSTs. However, RocksDB allows you to provide any comparator you wish, so ultimately it will depend on the performance of the comparator that you implement.
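
Roughly like this with RocksJava, say (an untested sketch assuming a recent release where AbstractComparator compares ByteBuffers; the exact signatures have moved around between versions, and pre-release tags are ignored):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import org.rocksdb.AbstractComparator;
    import org.rocksdb.ComparatorOptions;
    import org.rocksdb.Options;

    // Orders keys of the form "major.minor.patch" numerically instead of bytewise.
    class SemverComparator extends AbstractComparator {
        SemverComparator(ComparatorOptions opts) { super(opts); }

        @Override
        public String name() { return "example.SemverComparator"; }

        @Override
        public int compare(ByteBuffer a, ByteBuffer b) {
            int[] va = parse(a);
            int[] vb = parse(b);
            for (int i = 0; i < 3; i++) {
                int c = Integer.compare(va[i], vb[i]);
                if (c != 0) return c;
            }
            return 0;
        }

        private static int[] parse(ByteBuffer buf) {
            byte[] raw = new byte[buf.remaining()];
            buf.get(raw);
            String[] parts = new String(raw, StandardCharsets.UTF_8).split("\\.");
            return new int[] { Integer.parseInt(parts[0]),
                               Integer.parseInt(parts[1]),
                               Integer.parseInt(parts[2]) };
        }
    }

    // Usage: every key in the database is then kept in semver order.
    // Options options = new Options()
    //         .setCreateIfMissing(true)
    //         .setComparator(new SemverComparator(new ComparatorOptions()));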


Sure, the old performance/productivity tradeoff.

This solves the problem quickly, and then you can iterate for performance.


Using RocksDB here seems fairly insane unless you need to keep (at least) several billion constantly-updating versions sorted. Otherwise you can just use a SortedMap and a regular Java comparator.
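
For instance (a minimal in-memory sketch of that, with a comparator that ignores pre-release tags):

    import java.util.Comparator;
    import java.util.TreeMap;

    public class Versions {
        // Order "major.minor.patch" strings numerically rather than lexically.
        static final Comparator<String> SEMVER = Comparator
                .comparingInt((String v) -> part(v, 0))
                .thenComparingInt(v -> part(v, 1))
                .thenComparingInt(v -> part(v, 2));

        static int part(String version, int i) {
            return Integer.parseInt(version.split("\\.")[i]);
        }

        public static void main(String[] args) {
            TreeMap<String, String> releases = new TreeMap<>(SEMVER);
            releases.put("1.10.0", "minor release");
            releases.put("1.2.10", "patch release");
            releases.put("1.2.3", "older patch");

            // Iterates in semver order: 1.2.3, 1.2.10, 1.10.0
            releases.forEach((v, note) -> System.out.println(v + " -> " + note));
            System.out.println("latest: " + releases.lastKey());
        }
    }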


They might just want cheap disk persistence. I'm not a Java developer, but if SortedMaps are a memory-only data structure, persisting them reliably doesn't look like an easy feat.


RocksDB is disk persistence but it sure ain't (for this scale) cheap.

Persisting just about anything in Java is easier than persisting it in Java in RocksDB.


How would you persist a SortedMap in Java easily? Serializing and deserializing is OK, but keep in mind that if the program crashes you still need to have persisted all the data that was saved before the crash (so you can't keep it in memory and serialize at exit, or at a time interval).


Given the size of any dataset of the type being discussed here, you dump the whole thing to disk on every change.
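
Something like this, for example (a minimal sketch assuming string values; write to a temp file and rename it over the old one, so a crash mid-write leaves the previous snapshot intact; a real version would also fsync the file and directory):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.SortedMap;

    public class SnapshotWriter {
        // Rewrite the whole map on every change; readers only ever see the old
        // file or the fully written new one, never a half-written snapshot.
        static void persist(SortedMap<String, String> map, Path target) throws IOException {
            Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
            StringBuilder sb = new StringBuilder();
            map.forEach((k, v) -> sb.append(k).append('\t').append(v).append('\n'));
            Files.write(tmp, sb.toString().getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE,
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }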


> The technical primitive data structure here is a hashmap where the keys are sorted.

Why not a tree-based map instead?


HashMap and sorted keys sound like a contradiction. I think they just use "HashMap" as a generic term for "Map", and it's actually a tree-based map.


It is an LSM tree, but I cannot imagine why they would need one here.


I recently wrote a naive hashmap in C, and I am curious about iterating in insertion order and in sort order.

Is it possible for a hash function to maintain a sort relationship between its input and output?


Supporting insertion order is straightforward (but has some trade-offs): you store the values in a backing array or linked list, and the hash table array stores pointers to the values.

You could do something similar with a sorted data structure backing the hashmap to get sorted order (a b-tree or something similar).

You could have the hash function preserve order at the cost of it being a very bad hash function. If you had n buckets, then the first 1/n of the keyspace would map to bucket 0, the next 1/n to bucket 1, and so on.
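
A toy version of that last idea, assuming non-negative integer keys with a known bound (the bucket index grows with the key, which is exactly what makes it a bad hash: the distribution is only as even as the keys themselves):

    public class OrderPreservingHash {
        // Map keys in [0, keyspace) to buckets so a larger key never lands in an
        // earlier bucket. Order-preserving, but clusters badly on skewed keys.
        static int bucket(int key, int keyspace, int nBuckets) {
            return (int) ((long) key * nBuckets / keyspace);
        }

        public static void main(String[] args) {
            int keyspace = 1_000, buckets = 4;
            for (int key : new int[] {0, 10, 249, 250, 600, 999}) {
                System.out.println(key + " -> bucket " + bucket(key, keyspace, buckets));
            }
            // 0..249 -> bucket 0, 250..499 -> 1, 500..749 -> 2, 750..999 -> 3
        }
    }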


If you iterate a Python dictionary it will return the keys in insertion order.

For sort order you need to sort separately.



