This was in the Borg days, pre-Flume. As I remember, we had at least one Borg cluster to ourselves for indexing. Our quota was a cluster.
We cared because indexing latency is important. Engineers get paged and woken up in the middle of the night when indexing latency gets high, and doubling CPU usage meant doubling indexing latency. The pre-Percolator indexing system ran roughly 1,000 threads per box on roughly 1,000 Warp18 boxes. Of course, most of those threads were blocked on I/O. (Warp18 was something like 16 cores per box... I think those were the RevE Opterons that had the memory fence bug that caused the glibc semaphore implementation to not actually synchronize operations.) I personally would have written things using more async and/or non-blocking I/O, but the main reasoning at Google at the time was that massively multithreaded blocking I/O is a simpler programming model and closer to what was being taught in school in those days.
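For anyone who hasn't worked in that style, here's a minimal, self-contained sketch (nothing like the actual indexing code; the names and numbers are made up) of the thread-per-work-item, blocking-I/O model: you run far more threads than cores, and each thread simply parks in the kernel while it waits for its read to come back.

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <vector>

constexpr int kNumThreads = 1000;  // far more threads than cores; most are blocked at any moment
constexpr int kNumDocs = 10000;

// Stand-in for a blocking RPC or distributed-filesystem read: the calling
// thread is parked until the "data" arrives.
std::string BlockingFetch(int doc_id) {
  std::this_thread::sleep_for(std::chrono::milliseconds(50));  // simulated I/O wait
  return "doc-" + std::to_string(doc_id);
}

// Stand-in for the CPU-bound work done once the document is in memory.
void Process(const std::string& doc) {
  volatile size_t sink = doc.size();  // pretend to do something with the bytes
  (void)sink;
}

int main() {
  std::vector<std::thread> workers;
  for (int t = 0; t < kNumThreads; ++t) {
    workers.emplace_back([t] {
      // Static sharding of documents across threads, purely for illustration.
      for (int doc_id = t; doc_id < kNumDocs; doc_id += kNumThreads) {
        Process(BlockingFetch(doc_id));  // block on I/O, then compute
      }
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}
```

The appeal is exactly what's described above: each worker is straight-line code, and the kernel does the multiplexing that an async design would make you do by hand.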
Also, the indexing system's quota is huge. Someone had made a mistake in the benchmark program that was supposed to get maximum utilization out of the Warp18 hardware. The indexing system actually got higher utilization than the hardware stress test, so cooling ended up under-provisioned in the Atlanta datacenter (this was more than a decade ago; I'm sure indexing has moved since). There was a couple-day stretch one hot summer when we had to stay on the phone with staff inside the Atlanta DC to keep track of the temperature. When it started to get above something like 80 F, we had to dial back the number of workers in our Borg jobs.
While testing the prototype for the Percolator indexing system, we happened to be sitting right next to an engineer who planned datacenters, and we overheard him saying he couldn't figure out why the power usage seemed to be swinging so wildly. It was literally the power usage of a decent-sized town suddenly ramping up and down. We asked him for the exact times power usage was ramping up and down, and figured out it was when Frank and Daniel were starting and stopping the Percolator prototype.
My best guess is that the indexing system right before Percolator was a bit more energy efficient than Percolator, but maybe not. In any case, doubling CPU usage was a big deal. Real money, and in Atlanta, probably real carbon footprint. (The Dalles was hydro, but I don't think much of Atlanta's power was renewable at that time.)
Regarding the Percolator prototype's energy usage, that's a really interesting scheduling and coordination problem. I'm curious how you solved that - perhaps by teaching Borg to correlate CPU load with physical energy usage and stagger task initiation.
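To make the staggering idea concrete, here's a tiny sketch (purely hypothetical; the function, the ramp window, and the numbers are all made up, and this isn't anything Borg actually does) where each task sleeps a random delay within a ramp window before starting real work, so the aggregate power draw of thousands of tasks rises over minutes instead of stepping all at once:

```cpp
#include <chrono>
#include <random>
#include <thread>

// Hypothetical per-task startup delay: spread task starts uniformly over a
// ramp window so the cluster's power draw ramps instead of stepping.
void StaggeredStart(std::chrono::seconds ramp_window) {
  std::mt19937_64 rng(std::random_device{}());
  std::uniform_int_distribution<int64_t> jitter(0, ramp_window.count());
  std::this_thread::sleep_for(std::chrono::seconds(jitter(rng)));
}

int main() {
  // E.g. spread this task's start somewhere inside a 5-minute window.
  StaggeredStart(std::chrono::minutes(5));
  // ... the real worker loop would begin here ...
  return 0;
}
```

Doing it properly would presumably mean the scheduler pacing task starts (and stops) centrally rather than every task jittering itself, but even per-task jitter turns a step into a ramp.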
It sounds like it would be even cooler to coordinate with the local electricity suppliers to spin up and shed extra load a few minutes in advance, but I'm not sure if this would actually have a practical benefit.
> As I remember, we had at least one Borg cluster to ourselves for indexing. Our quota was a cluster.
> Also, the indexing system's quota is huge.
Well, it might no longer be the case, but for a number of years the indexing cluster was the largest in the entire fleet, especially when it came to memory. I don't know if you were around in the days of yl-i (I think), but it was edifying to rank it against the Top500...
Typically you run Flume over docjoins in your project’s quota. If you screw up, all that should happen is that your Flume job is out of quota.