Very nice post! Another interesting direction you can take reservoir sampling is...

samwho · 2025-05-08T22:08:09 1746742089

I actually read that post on the alias method just the other day and was blown away. I think I’d like to try making a post on it. Wouldn’t be able to add anything that link hasn’t already said, but I think I can make it more accessible.

eru · 2025-05-08T23:31:33 1746747093

I have a few more topics we could cooperate on, if you are interested.

https://claude.ai/public/artifacts/62d0d742-3316-421b-9a7b-d... has a 'very static' visualisation of sorting algorithms. Basically, we have a 2d plane, and we colour a pixel (x, y) black iff the sorting algorithm compares x with y when it runs. It's a resurrection (with AI) of an older project I was coding up manually at https://github.com/matthiasgoergens/static-sorting-visualisa...
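
In case the idea is hard to picture, here is a minimal Python sketch of it. It is not the code from either link. The names `comparison_grid` and `render` are made up for illustration, and it assumes that "x" and "y" mean the values being sorted (0 to n-1) rather than array positions.

```python
import random


def comparison_grid(n, sort_fn, seed=0):
    """Return an n x n grid; grid[y][x] is True iff sort_fn compared values x and y."""
    grid = [[False] * n for _ in range(n)]

    class Tracked:
        """Wraps a value and records every '<' comparison it takes part in."""

        def __init__(self, value):
            self.value = value

        def __lt__(self, other):
            grid[self.value][other.value] = True
            grid[other.value][self.value] = True
            return self.value < other.value

    values = list(range(n))
    random.Random(seed).shuffle(values)
    sort_fn([Tracked(v) for v in values])  # sort_fn must only use '<'
    return grid


def render(grid):
    """Draw the grid as text: '#' for a compared pair, '.' otherwise."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)


if __name__ == "__main__":
    # Python's built-in sort only uses '<', so it can be traced directly.
    print(render(comparison_grid(16, sorted)))
```

Because the picture is a single static image of which pairs were compared, it does not care whether the algorithm sorts in place.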

I'm also working on making https://cs.stackexchange.com/q/56643/50292 with its answer https://cs.stackexchange.com/a/171695/50292 more accessible. It's a little algorithmic problem I've been working on: 'simulate' a heap in O(n) time. I'm also developing a new, really simple implementation of soft heaps, and working on my write-up of the solution to https://github.com/matthiasgoergens/TwoTimePad/blob/master/d...

> I actually read that post on the alias method just the other day and was blown away. I think I’d like to try making a post on it. Wouldn’t be able to add anything that link hasn’t already said, but I think I can make it more accessible.

If memory serves right, they don't say much about how you can efficiently support changes to your discrete probability distribution.

samwho · 2025-05-09T11:41:51 1746790911

I appreciate the offer (and your contributions in the comments here!) but collaborations are very difficult for me atm. Most of the work I do on these posts I do when I can steal time away from other aspects of my life, which can sometimes take weeks. I wouldn’t be a dependable collaboration partner.

eru · 2025-05-09T12:27:33 1746793653

No worries.

I'd mostly just appreciate a beta tester / beta reader.

samwho · 2025-05-09T12:48:53 1746794933

Totally happy to do that! You’ll find where to contact me on my homepage. :)

smusamashah · 2025-05-09T16:50:20 1746809420

I made a tool to visualize sorting algos https://xosh.org/VisualizingSorts/sorting.html where you can put your own algo too if you like.

ncruces · 2025-05-09T22:48:21 1746830901

I love the idea behind that sorting visualization, and found it extremely useful to validate the properties of my Quicksort implementation.

https://github.com/ncruces/sort

eru · 2025-05-10T02:12:14 1746843134

That's interesting. Alas, it only works for in-place sorting algorithms (and it's also an animation).

tmoertel · 2025-05-09T02:57:15 1746759435

A while ago I tried to create a more self-explanatory implementation:

https://github.com/tmoertel/practice/blob/master/libraries%2...

It is limited to integer weights only to make it easy to verify that the algorithm implements the requested distribution exactly. (See the test file in the same directory.)

eru · 2025-05-09T03:19:07 1746760747

You could probably restrict to rational numbers, and still verify? Languages like Python, Haskell, Rust etc. have good support for arbitrary-precision rational numbers.

Each floating point number is also a rational number, and thus you could then restrict again to floating point afterwards.
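
A small sketch of what that could look like in Python, using the standard `fractions` module. The helper name `exact_probabilities` is hypothetical and not taken from the linked repository.

```python
from fractions import Fraction


def exact_probabilities(weights):
    """Turn int or float weights into exact rational probabilities that sum to 1."""
    # Fraction(float) converts the binary value exactly, with no rounding.
    exact = [Fraction(w) for w in weights]
    total = sum(exact)
    return [w / total for w in exact]


if __name__ == "__main__":
    probs = exact_probabilities([0.1, 0.2, 0.7])
    print(sum(probs) == 1)  # exact arithmetic, so no tolerance is needed
```

Note that `Fraction(0.1)` is the exact value of the float nearest to 0.1, not 1/10, so the verified distribution is the one the floats actually describe.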

dan-robertson · 2025-05-09T09:40:20 1746783620

Alias tables are neat and not super well known. We used to have an interview question around sampling from a weighted distribution (typical answer: prefix sum -> binary search) and I don’t think anyone produced this. I like the explanation in that blog. The way it was explained to me was first ‘imagine drawing a bar chart and throwing a dart at it, retrying if you miss. This simulates the distribution but runs in expected linear time’. Then you can describe how to chop up the bars to fit in the rectangle you would get if all weights were equal. Proof that the greedy algorithm works is reasonably straightforward.
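
For anyone who wants to see the "chop up the bars" step written down, here is a rough Python sketch of the greedy construction. It is my own illustration rather than code from the blog post. The function names are made up, and it uses exact `Fraction` arithmetic so the resulting table can be checked against the requested distribution without any tolerance.

```python
import random
from fractions import Fraction


def build_alias_table(weights):
    """Return (threshold, alias): column i yields i with probability threshold[i], else alias[i]."""
    n = len(weights)
    total = sum(Fraction(w) for w in weights)
    # Rescale so the average bar height is exactly 1 (the "all weights equal" rectangle).
    scaled = [Fraction(w) * n / total for w in weights]
    threshold = [Fraction(1)] * n
    alias = list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1]
    large = [i for i, s in enumerate(scaled) if s >= 1]
    while small and large:
        s, l = small.pop(), large.pop()
        threshold[s] = scaled[s]   # column s keeps its own short bar...
        alias[s] = l               # ...and is topped up with a slice of bar l
        scaled[l] -= 1 - scaled[s]
        (small if scaled[l] < 1 else large).append(l)
    return threshold, alias


def sample(threshold, alias, rng=random):
    """Draw one outcome in O(1): pick a column, then one of its two occupants."""
    i = rng.randrange(len(threshold))
    return i if rng.random() < threshold[i] else alias[i]


def implied_distribution(threshold, alias):
    """Exact probability of each outcome according to the table."""
    n = len(threshold)
    probs = [Fraction(0)] * n
    for i in range(n):
        probs[i] += threshold[i] / n
        probs[alias[i]] += (1 - threshold[i]) / n
    return probs
```

With exact arithmetic, any columns left over when the loop ends have height exactly 1, which is why the default threshold of 1 is safe. The table itself is exact; `sample` still relies on a floating point `rng.random()`, so the draws are only as exact as that generator.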

eru · 2025-05-09T11:23:19 1746789799

I'm not actually sure this makes for a good interview question. Doesn't it mostly just test whether you've heard of the alias method?

Btw, a slightly related question:

Suppose you have a really long text file. How would you randomly sample a line, such that all lines in the text file have exactly the same probability? Ideally, you want to do this without spending O(size of file) time preprocessing.

(I don't think this is a good interview question, but it is an interesting question.)

One way: sample random characters until you randomly hit a newline. That's the newline at the end of your line.
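
A minimal Python sketch of that idea, operating on an in-memory `bytes` object for simplicity. The function name `sample_line` is made up, and it assumes every line, including the last one, ends with a newline, so that each line owns exactly one newline byte.

```python
import random


def sample_line(data: bytes, rng=random) -> bytes:
    """Rejection-sample a uniformly random line from newline-terminated bytes."""
    if not data.endswith(b"\n"):
        raise ValueError("expects every line, including the last, to end with a newline")
    while True:
        pos = rng.randrange(len(data))
        if data[pos] == 0x0A:  # hit a newline: it terminates the chosen line
            # Walk back to the previous newline (rfind returns -1 if there is none).
            start = data.rfind(b"\n", 0, pos) + 1
            return data[start:pos]


if __name__ == "__main__":
    text = b"short\na much, much longer line\nmid\n"
    print(sample_line(text))
```

Every newline byte is equally likely to be hit, so every line is equally likely regardless of its length. The expected number of tries equals the average line length, and walking back to the start of the line costs time proportional to that line's length.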

dan-robertson · 2025-05-09T17:33:41 1746812021

It’s a retired question (so I’m not really disagreeing that it wasn’t very good). No one was expected to come up with alias tables (if they did, we would just ask for updateable weights), and in fact there isn’t even much point in telling people about them, as they can then get the impression they failed the interview. The point was more to get some kind of binary search and an understanding of probability.
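
For reference, the kind of prefix sum plus binary search answer meant here looks roughly like this in Python. It is a sketch that assumes non-negative integer weights with a positive total, and the name `make_sampler` is just for illustration.

```python
import bisect
import itertools
import random


def make_sampler(weights, rng=random):
    """Return a function that samples index i with probability weights[i] / sum(weights)."""
    cumulative = list(itertools.accumulate(weights))  # O(n) preprocessing
    total = cumulative[-1]

    def sample():
        r = rng.randrange(total)  # uniform integer in [0, total)
        # First index whose running total exceeds r; O(log n) per sample.
        return bisect.bisect_right(cumulative, r)

    return sample


if __name__ == "__main__":
    draw = make_sampler([1, 0, 3])
    print([draw() for _ in range(10)])
```

Using `bisect_right` means zero-weight entries are never returned, because their running total never exceeds the value drawn.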

The Monte Carlo method you propose probably works for files where there are many short lines but totally fails in the degenerate case of one very long line. It also may not work that well in practice, because most of the cost of reading a random byte is reading a big block from disk, and you could likely scan such a block in RAM faster than you could do the random read of the block from disk.

hansvm · 2025-05-09T00:04:02 1746749042

That's exactly the blog post that clicked when I put my alias method [0] together. Their other writing is delightful as well.

[0] https://github.com/hmusgrave/zalias It's nothing special, just an Array-of-Struct-of-Array implementation so that biases and aliases are always in the same cache line.

jononor · 2025-05-09T14:13:49 1746800029

Skipping like that is very interesting in battery-powered sensor systems, where you can put the system to sleep until it is time to sample.