Hacker News
Indexing semantic versions in RocksDB (aawadia.dev)
28 points by adamretter on Jan 5, 2024 | 17 comments



> "The technical primitive data structure here is a hashmap where the keys are sorted."

...somewhere, this person's algorithms teacher is pondering his life choices.


I don’t see why that’s a particularly bad explanation of how RocksDB works


It's not using hashes.

(OK, well, it is maybe using hashes for the per-SST Bloom filter, but that's not what's interesting here. It's maybe also using hashes in the memtable in which case they're using it very inefficiently since the key encoding is not prefix-scan-friendly.)


I don't know anything about RocksDB, but on the surface this approach seems like it could be very slow? Wouldn't it be more efficient to encode the semver in a format more suitable to sorting?
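
For example, a minimal sketch of one such encoding (my own illustration, assuming the components are non-negative 32-bit integers and ignoring pre-release tags): pack major.minor.patch as fixed-width big-endian bytes, and bytewise order then matches semver order.

    import java.nio.ByteBuffer;
    import java.util.Arrays;

    public class SemverKey {
        // Encode "major.minor.patch" as 12 big-endian bytes so that plain
        // bytewise comparison of the keys matches numeric semver ordering.
        static byte[] encode(int major, int minor, int patch) {
            return ByteBuffer.allocate(12)
                    .putInt(major)
                    .putInt(minor)
                    .putInt(patch)
                    .array();
        }

        public static void main(String[] args) {
            byte[] a = encode(1, 2, 10);
            byte[] b = encode(1, 10, 0);
            // Prints a negative number: a sorts before b, matching 1.2.10 < 1.10.0.
            // The naive string keys "1.2.10" and "1.10.0" would sort the other way.
            System.out.println(Arrays.compareUnsigned(a, b));
        }
    }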


By default RocksDB uses a BytewiseComparator to sort the keys in the SSTs. However, RocksDB allows you to provide any comparator you wish, so ultimately it will depend on the performance of the comparator that you implement.
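
Roughly like this with RocksJava, say (an untested sketch assuming a recent release where AbstractComparator compares ByteBuffers; the exact signatures have moved around between versions, and pre-release tags are ignored):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import org.rocksdb.AbstractComparator;
    import org.rocksdb.ComparatorOptions;
    import org.rocksdb.Options;

    // Orders keys of the form "major.minor.patch" numerically instead of bytewise.
    class SemverComparator extends AbstractComparator {
        SemverComparator(ComparatorOptions opts) { super(opts); }

        @Override
        public String name() { return "example.SemverComparator"; }

        @Override
        public int compare(ByteBuffer a, ByteBuffer b) {
            int[] va = parse(a);
            int[] vb = parse(b);
            for (int i = 0; i < 3; i++) {
                int c = Integer.compare(va[i], vb[i]);
                if (c != 0) return c;
            }
            return 0;
        }

        private static int[] parse(ByteBuffer buf) {
            byte[] raw = new byte[buf.remaining()];
            buf.get(raw);
            String[] parts = new String(raw, StandardCharsets.UTF_8).split("\\.");
            return new int[] { Integer.parseInt(parts[0]),
                               Integer.parseInt(parts[1]),
                               Integer.parseInt(parts[2]) };
        }
    }

    // Usage: every key in the database is then kept in semver order.
    // Options options = new Options()
    //         .setCreateIfMissing(true)
    //         .setComparator(new SemverComparator(new ComparatorOptions()));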


Sure, the old performance/productivity tradeoff.

This solves the problem quickly, and then you can iterate for performance.


Using RocksDB here seems fairly insane unless you need to keep (at least) several billion constantly-updating versions sorted. Otherwise you can just use a SortedMap and a regular Java comparator.
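
For instance (a minimal in-memory sketch of that, with a comparator that ignores pre-release tags):

    import java.util.Comparator;
    import java.util.TreeMap;

    public class Versions {
        // Order "major.minor.patch" strings numerically rather than lexically.
        static final Comparator<String> SEMVER = Comparator
                .comparingInt((String v) -> part(v, 0))
                .thenComparingInt(v -> part(v, 1))
                .thenComparingInt(v -> part(v, 2));

        static int part(String version, int i) {
            return Integer.parseInt(version.split("\\.")[i]);
        }

        public static void main(String[] args) {
            TreeMap<String, String> releases = new TreeMap<>(SEMVER);
            releases.put("1.10.0", "minor release");
            releases.put("1.2.10", "patch release");
            releases.put("1.2.3", "older patch");

            // Iterates in semver order: 1.2.3, 1.2.10, 1.10.0
            releases.forEach((v, note) -> System.out.println(v + " -> " + note));
            System.out.println("latest: " + releases.lastKey());
        }
    }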


They might just want cheap disk persistence. I'm not a Java developer, but if SortedMaps are a memory-only data structure, persisting them reliably doesn't look like an easy feat.


RocksDB is disk persistence but it sure ain't (for this scale) cheap.

Persisting just about anything in Java is easier than persisting it in Java in RocksDB.


How would you persist a SortedMap in Java easily? Serializing and deserializing is OK, but keep in mind that if the program crashes you still need to have persisted all the data that was saved before the crash (so you can't keep it in memory and serialize at exit, or at a time interval).


Given the size of any dataset of the type being discussed here, you dump the whole thing to disk on every change.
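
Something like this, for example (a minimal sketch assuming string values; write to a temp file and rename it over the old one, so a crash mid-write leaves the previous snapshot intact; a real version would also fsync the file and directory):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.SortedMap;

    public class SnapshotWriter {
        // Rewrite the whole map on every change; readers only ever see the old
        // file or the fully written new one, never a half-written snapshot.
        static void persist(SortedMap<String, String> map, Path target) throws IOException {
            Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
            StringBuilder sb = new StringBuilder();
            map.forEach((k, v) -> sb.append(k).append('\t').append(v).append('\n'));
            Files.write(tmp, sb.toString().getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE,
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }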


> The technical primitive data structure here is a hashmap where the keys are sorted.

Why not a tree-based map instead?


HashMap and sorted keys sound like a contradiction. I think they just use "HashMap" as a generic term for "Map", and it's actually a tree-based map.


It is an LSM tree, but I cannot imagine why they would need one here.


I recently wrote a naive hashmap in C, and I am curious about iterating in insertion order and in sort order.

Is it possible for a hash function to maintain a sort relationship between its input and output?


Supporting insertion order is straightforward (but has some trade-offs): you store the values in a backing array or linked list, and the hash table array stores pointers to the values.

You could do something similar with a sorted data structure backing the hashmap to get sorted order (a b-tree or something similar).

You could have the hash function preserve order at the cost of it being a very bad hash function. If you had n buckets, then the first 1/n of the keyspace would map to bucket 0, the next 1/n to bucket 1, and so on.
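
A toy version of that last idea, assuming non-negative integer keys with a known bound (the bucket index grows with the key, which is exactly what makes it a bad hash: the distribution is only as even as the keys themselves):

    public class OrderPreservingHash {
        // Map keys in [0, keyspace) to buckets so a larger key never lands in an
        // earlier bucket. Order-preserving, but clusters badly on skewed keys.
        static int bucket(int key, int keyspace, int nBuckets) {
            return (int) ((long) key * nBuckets / keyspace);
        }

        public static void main(String[] args) {
            int keyspace = 1_000, buckets = 4;
            for (int key : new int[] {0, 10, 249, 250, 600, 999}) {
                System.out.println(key + " -> bucket " + bucket(key, keyspace, buckets));
            }
            // 0..249 -> bucket 0, 250..499 -> 1, 500..749 -> 2, 750..999 -> 3
        }
    }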


If you iterate a Python dictionary it will return the keys in insertion order.

For sort order you need to sort separately.



