
I was actually just looking at this problem last week, when I realized I needed to keep items in a database table positioned arbitrarily, ideally without needing to manipulate the rest of the list. So if a user adds in a new element after element 5, that element becomes 6, without needing to update the original item that came after element 5. There are indeed very sophisticated algorithms to manage this problem and minimize theoretical bounds. But it also seemed like the simplest solution for this particular version of the problem is to just use fractional amounts, and occasionally pay a penalty to relayout the list.
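A minimal sketch of that idea, in Python (the helper names here are my own, purely for illustration): each row keeps a float position, inserting takes the midpoint of its neighbours, and an occasional relayout pass pays the penalty of renumbering everything.

```python
# Sketch of fractional positioning: each item carries a float "position";
# inserting between two neighbours takes their midpoint, so no other row
# needs updating. A relayout pass renumbers everything once gaps get tiny.

def insert_between(prev_pos, next_pos):
    """Position for a new item between two existing ones."""
    return (prev_pos + next_pos) / 2

def needs_relayout(positions, min_gap=1e-9):
    """True once some pair of neighbours is too close to split again."""
    return any(b - a < min_gap for a, b in zip(positions, positions[1:]))

def relayout(items):
    """The occasional penalty: rewrite every position with integer spacing."""
    return [(name, float(i)) for i, (name, _) in enumerate(items, start=1)]

# Add a new element after element 5 without touching any existing row.
items = [(f"item{i}", float(i)) for i in range(1, 8)]
items.insert(5, ("new", insert_between(5.0, 6.0)))  # position 5.5
```

Repeated midpoint inserts in the same spot halve the gap each time, which is why the relayout step eventually becomes necessary.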


Wikipedia has a section on this algorithm under exponential labels: https://en.m.wikipedia.org/wiki/List-labeling_problem

Basically it works as long as your label space is large compared to the number of items. The more sophisticated methods are necessary when that isn’t the case. For example, say you have 4 bytes for the label and 1 billion items.
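A quick back-of-envelope check of that example (the helper functions are mine, just to make the arithmetic concrete):

```python
# With b-bit labels spread evenly over n items, the average gap between
# neighbouring labels is 2**b / n; midpoint insertion halves a gap each
# time, so a gap of size g survives only about log2(g) repeated inserts.
import math

def avg_gap(label_bits, n_items):
    return 2 ** label_bits / n_items

def inserts_before_exhaustion(gap):
    return max(0, math.floor(math.log2(gap)))

# 64-bit labels over a million items: gaps of ~1.8e13, ~44 inserts deep.
# 32-bit labels over a billion items: gaps of ~4, dead after ~2 inserts.
```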


This sort of problem also occurs when you're trying to do CRDTs, which can roughly be described also as "design something that does Git better."

So e.g. to frame this, one approach to a CRDT is to just treat the document as a list of facts, "line 1 is 'foo', line 2 is 'bar'", and each fact has a number of times it has been asserted, and to "merge" you just add together the assertion counts, and then you can detect conflicts when a fact has been asserted more than once or fewer than zero times. So a patch says "change line 2 to 'baz'", this becomes "unassert that line 2 is 'bar', assert that line 2 is 'baz'" and it conflicts with a patch that says "change line 2 to 'quux'" because the fact "line 2 is 'bar'" has an assertion count of -1.
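A toy version of that assertion-count model, assuming facts are just hashable tuples and a patch is a bag of count deltas (all names here are illustrative, not from any real CRDT library):

```python
# Toy model: a document is a multiset of facts with assertion counts;
# a patch is a bag of deltas; merging just adds counts together.
from collections import Counter

def apply_patch(doc, patch):
    """Merge a patch of count deltas into a document of fact counts."""
    merged = Counter(doc)
    merged.update(patch)  # Counter.update() adds counts, negatives allowed
    return merged

def conflicts(doc):
    """Facts asserted more than once, or unasserted below zero."""
    return {fact: n for fact, n in doc.items() if n > 1 or n < 0}

base = Counter({("line", 2, "bar"): 1})
patch_a = Counter({("line", 2, "bar"): -1, ("line", 2, "baz"): 1})
patch_b = Counter({("line", 2, "bar"): -1, ("line", 2, "quux"): 1})

# Applying both patches drives "line 2 is 'bar'" to a count of -1,
# which is exactly how the conflict between them shows up.
merged = apply_patch(apply_patch(base, patch_a), patch_b)
```

Either patch alone merges cleanly; only the combination trips the conflict check.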

But anyway, in this context you might want to allow inserting lines, and then you have the list-labeling problem, you don't want the patch to unassert lines 4,5,6 just to insert a new line after line 3. So then an obvious thing is to just use a broader conception of numbers, say "line 3.5 is <X>" when you insert, and then we hide the line numbers from the user anyways, they don't need to know that internally the line numbers of the 7 lines go "1, 2, 3, 3.5, 4, 5, 6".

So then you need a relabeling step because you eventually have some line at 3.198246315 and you want to be able to say "yeah, that's actually line 27, let's have some sanity again in this thing."

This also maybe hints at the fun of adding randomization, consider that one person might add line 3.5, then add line 3.75, and then remove line 3.5; meanwhile someone else might add a different line 3.5, add a line 3.25, and then remove their line 3.5, and these patches would both amount to "assert line 3.25 is A, assert line 3.75 is B", and would merge without conflict. This means that in general if two people are messing with the same part of the same document asynchronously, this model is not able to consistently flag a merge failure in that case, but will sometimes instead just randomly order the lines that were added.

We can then just make that a feature rather than a fault: you don't insert at 3.5, which is 3 + (4 - 3) / 2, rather you insert at 3 + (4 - 3) * rand(). And then when two people both try to insert 12 lines between line 3 and 4 independently, when you merge them together, you get 24 lines from both, in their original orders but interleaved randomly, and that's not the end of the world; it might be legitimately better than just declaring a merge failure and harrumphing at the user.


> This sort of problem also occurs when you're trying to do CRDTs, which can roughly be described also as "design something that does Git better."

Aren't the goals of git and CRDTs different? With git you want the merged result to be semantically correct. With CRDTs you want to achieve convergence (so no merge conflicts). As far as I know, semantically correct convergence (not sure what the correct term is) is not really possible, as it is too difficult to encode for CRDTs. Isn't that why CRDTs are mostly used for multiplayer interactive applications, where these kinds of mismatches are quickly seen?


The technically correct term -- at least in reduction systems -- would be confluence, not convergence.


Ha, I had this exact problem asked as an interview question.

IIRC their real-life solution was to leave gaps between elements (eg index 0, 100, 200… instead of 0, 1, 2) and re-index when needed. That probably works well enough; what I came up with is (as you say) the idea of fractional indexing, but dealing with decimals is a pain, so you can instead represent them as vectors, which you can in turn represent as a string of digits that you sort lexicographically.

So an element inserted between elements 1 and 2 gets index 11 (anything between 11 and 19 is fine). Between 1 and 11 is 101. Between 11 and 2 is 12 (or anything between 12 and 19). Note that these indexes are not numbers; they're strings that are compared lexicographically.
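One way to sketch a midpoint function for such string keys (the `between` helper is my own, and assumes keys never end in "0"): treat each key as a base-10 fraction and find the shortest digit string strictly between the two neighbours.

```python
# Lexicographic string keys as base-10 fractions: "11" means 0.11, so
# "1" < "11" < "12" < "2" both as strings and as values. between(a, b)
# returns the shortest digit string strictly between a and b.
from fractions import Fraction

def value(s):
    """Interpret a digit string as a base-10 fraction in [0, 1)."""
    return Fraction(int(s or "0"), 10 ** len(s))

def between(a, b):
    """Shortest digit string whose value lies strictly between a and b."""
    lo, hi = value(a), value(b)
    assert lo < hi
    k = 1
    while True:
        scaled = 10 ** k
        n = lo * scaled // 1 + 1   # smallest k-digit numerator above lo
        if Fraction(n, scaled) < hi:
            return str(n).zfill(k)
        k += 1
```

This reproduces the examples above: between "1" and "2" it yields "11", between "1" and "11" it yields "101", and between "11" and "2" it yields "12".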

I’m sure there are downsides, eg it takes a lot more memory to sort these indexes (strings are much larger than numbers). It feels a bit too smart not to have unexpected downsides.


> IIRC their real life solution was to leave gaps between elements (eg index 0, 100, 200… instead of 0, 1, 2) and re index when needed. Probably works well enough, what I came up with is (as you say) the idea of fractional indexing

Those are the same thing. Leaving gaps is fractional indexing. It's just fixed-point rather than floating point.

> an element inserted between elements 1 and 2 gets an index 11 (anything between 11-19 is fine). Between 1 and 11 is 101. Between 11 and 2 is 12 (or anything between 12-19). Note that these indexes are not numbers, they’re string that are compared lexicographically.

This reminds me of the most interesting method of generating random integers in an arbitrary range from random bits: interpret the bitstream as a string representing a real number (in binary) between 0 and 1. If, for example, you want to use bits to generate a number between 0 and 5, you divide the unit interval into sixths, and examine bits until you've determined conclusively that your number lies within one of those intervals:

     +---- 0 = 0.000000000... ---+
     |         0.000 -> 0        |
     |         0.00100 -> 0      |
     +-- 1/6 = 0.001010101...  --+
     |         0.0011 -> 1       |
     |         0.0100 -> 1       |
     +-- 2/6 = 0.010101010...  --+
     |         0.011 -> 2        |
     +-- 3/6 = 0.100000000...  --+
     |         0.100 -> 3        |
     +-- 4/6 = 0.101010101...  --+
     |         0.1011 -> 4       |
     |         0.1100 -> 4       |
     +-- 5/6 = 0.110101010...  --+
     |         0.11011 -> 5      |
     |         0.111 -> 5        |
     +---------------------------+
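The interval-narrowing idea in the diagram has a compact known formulation, sometimes called the "Fast Dice Roller" (Lumbroso): track how many equal-width slices the bits read so far could still land in, and stop the moment a slice fits entirely inside one of the n target intervals. A sketch, assuming a source of fair bits:

```python
# Uniform integer in [0, n) from a stream of fair coin flips, narrowing
# the unit interval as in the diagram above: v is how many equal slices
# the bits read so far could still fall into, c is which slice.
import random

def rand_below(n, next_bit):
    v, c = 1, 0
    while True:
        v, c = 2 * v, 2 * c + next_bit()  # read one more bit
        if v >= n:
            if c < n:
                return c               # the prefix pins the number down
            v, c = v - n, c - n        # recycle the leftover randomness

rng = random.Random(42)
samples = [rand_below(6, lambda: rng.getrandbits(1)) for _ in range(6000)]
```

Unlike simple rejection sampling, the `v - n, c - n` step reuses the randomness left over from a failed attempt instead of throwing it away.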


> solution was to leave gaps between elements (eg index 0, 100, 200… instead of 0, 1, 2)

Reminds me of my days writing BASIC programs.


Very similar to my first intuition on how to approach this problem. An interesting question that comes to mind is how should you pick index sizes and how often should you rebalance the whole thing. Make the index too large and you're wasting a lot of space you'll never need, too small and you're forced to rebalance too often. I'm guessing an ideal index size is such that you can rebalance at most once a night with a cron job and then avoid rebalances the whole rest of the day.

To be fair, this sounds like one of those classic problems that someone for sure already figured out in the 50s or 60s, just under a different name and/or context. Hash chaining is a similar idea, but not quite the same.


The "real life" solution of leaving gaps & reindexing is actually my earliest programming memory (as a teenager)/lesson in cleverness, of feeling like I should have been able to come up with a more clever/optimal/~~scalable~~ solution but settling for this awkward 100 200 300 thing. Nowadays I've used that approach like...dozens of times over the decades of real world projects since, and I'm still maintaining that web app I made in 2003, so I'm very glad I never came up with something unduly clever.


That is how I implemented it at work around 9 years ago. Use strings for ordering and if you use the full byte range they end up fairly compact (rather than just 1-9 as in your example). There are some edge cases that can cause it to balloon in size so there is a separate reordering step but it almost never needs to be invoked.


like line numbers in an old BASIC program


Theoretically, using fractions as list labels requires an unbounded amount of memory to store them; in practice the available precision is extremely limited. The difference really becomes a problem if you are not just assigning ordered labels to a collection, but want to use those labels directly as array indices to store the elements. That would be a more literal modelling of the library sorting problem.


Isn't that hash table chaining?


More like tree rebalancing.


A linked list? I guess you are forgetting to mention other requirements.



