
The drama with Google has to be mentioned as a major part of their company history.

They also had a famous fight with Heroku:

https://genius.com/James-somers-herokus-ugly-secret-annotate...




Genius's investigation and analysis of Heroku's load balancing claims vs their actual practices was anything but "dickish", and it wasn't a "fight".

They produced an intelligent, thorough, and fair exposé of poor engineering practices by Heroku that were otherwise undocumented (or worse, misleadingly documented), and that also benefited Heroku financially, since users would have to purchase many more dynos to get the expected performance - up to 50 times more, if I recall correctly.

Genius didn't have to share any of their analysis. They could have simply shown it to Heroku and probably received a bunch of "hush" benefits to forget about the whole thing. The fact that they bothered to polish it and open-source it allowed others to benefit from their work, and means that even 8 years later, someone like me can better plan hosting requirements and avoid falling into the trap they did.

Yes, Heroku's load balancing works the same way to this day, resulting in much higher costs for medium-to-large apps as well as huge degradation of end-user experience, since some users will randomly be routed to dynos with long queues while others sit completely idle! This is a huge waste of energy and bad for the environment too.

As a heavy Heroku user, I'm incredibly grateful for Genius's work on this issue!


That’s one interpretation of the events, motivations, and hypothetical alternative situations.

Or it’s simply that Heroku didn’t give customers great visibility into the issue at hand, that the best-in-class third-party add-on of the time that did provide that visibility was misleading, and that the architectural changes required to support a broader set of languages in a more distributed fashion introduced a change in behaviour that was not documented. The engineering best practice remains the same, which is why it is unchanged 8 years later: push long-running requests into an async flow. It’s both more scalable and more durable.
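
To illustrate the "async flow" suggestion (a minimal sketch only; the queue, worker, and function names here are hypothetical, not anything Heroku provides): instead of holding the web dyno for the whole 3000ms, the handler enqueues the slow work and responds immediately:

  # Minimal sketch of the async-flow pattern: the web handler enqueues the
  # slow work and returns at once; a background worker processes it later.
  # generate_report/handle_request are illustrative names, not a real API.
  import queue
  import threading
  import time

  jobs = queue.Queue()

  def generate_report(user_id):
      time.sleep(3)                          # stands in for ~3000ms of work
      print(f"report ready for user {user_id}")

  def worker():
      while True:
          fn, args = jobs.get()
          fn(*args)
          jobs.task_done()

  def handle_request(user_id):
      # Don't block the web dyno for 3 seconds; enqueue and respond now.
      jobs.put((generate_report, (user_id,)))
      return "202 Accepted: report is being generated"

  threading.Thread(target=worker, daemon=True).start()
  print(handle_request(42))                  # returns immediately
  jobs.join()                                # demo only: wait for the worker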


It's a problem for any application on Heroku that uses more than a single web dyno (i.e. any software other than tiny applications), and certainly not limited to applications with long-running requests.

Consider Genius's checkout analogy: busy markets are efficient because customers sort themselves into the shortest queues. If customers were instead randomly allocated to queues, some would wait in lengthy queues while other queues remained empty!

Critically, the same logic holds whether customers have lots of items (long-running requests) or very few - the former is worse, but either way customers are waiting unnecessarily while other queues sit free (but unused).

So it's a problem even for applications with no long-running requests. Suppose a user's request lands in a queue with 10 requests ahead of it while other queues sit idle, and assume each request takes 3000ms: they wait 10 x 3000ms = 30 seconds before their own "3000ms" request is even served, 33 seconds in total (!!) - and I think that will error anyway, because Heroku times out requests at 30000ms. This is a disaster for UX.

Someone made a simulation to show how incredibly wasteful it is: https://gist.github.com/toddwschneider/4947326
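
A much cruder version of the same idea (illustrative Python, not the gist above; all numbers are made up): route a steady stream of fixed-length requests either to a random dyno or to whichever dyno will be free soonest, and compare the waits:

  # Toy comparison of random routing vs routing to the soonest-free dyno.
  # Fixed 300ms requests arriving at ~80% of total capacity; numbers are
  # illustrative only, not Heroku's.
  import random

  DYNOS = 20
  REQUESTS = 5000
  SERVICE_MS = 300.0
  ARRIVAL_MS = SERVICE_MS / DYNOS * 1.25     # steady arrivals, ~80% utilisation

  def simulate(pick):
      free_at = [0.0] * DYNOS                # when each dyno clears its queue
      waits, now = [], 0.0
      for _ in range(REQUESTS):
          d = pick(free_at)
          waits.append(max(0.0, free_at[d] - now))
          free_at[d] = max(free_at[d], now) + SERVICE_MS
          now += ARRIVAL_MS
      return sum(waits) / len(waits), max(waits)

  random_pick  = lambda free_at: random.randrange(DYNOS)
  least_loaded = lambda free_at: free_at.index(min(free_at))

  print("random       (avg, max wait ms):", simulate(random_pick))
  print("least loaded (avg, max wait ms):", simulate(least_loaded))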

Note that while it would be great for this to be improved, I in no way demand or expect that; I just don't think it's fair that Genius aren't given credit for their analysis - analysis that Heroku ought to have provided. It shouldn't have fallen to Genius or other customers to expose that Heroku had misled them, even if that happened accidentally and without intent.


I agree on most of this. My biggest disappointment is Heroku not implementing a reasonable form of autoscaling to mitigate the worst aspects of this if it occurs. Though I’d still contend that for a site like Rap Genius at the time, 3000ms server response times were still wholly unsatisfactory and they needed to get it under 100ms. Because the likelihood of this occurring compounds significantly in proportion to request time _and_ request volume. They knew what the required solution was.

The original “intelligent” routing was possible because the router effectively had a single shared state to track all of this, which was also a single point of failure. And because the state of the art in Ruby web servers when it was implemented was Mongrel, which was single-threaded.

Add in support for nodejs (with other concurrent runtimes soon to follow) and suddenly each dyno can process multiple requests concurrently, so the request accounting isn’t as valuable. Make the router more fault tolerant and maintaining that accounting at scale becomes almost impossible without adding significant overhead to every customer request and degrading performance for everyone.
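
A toy contrast of the two strategies being described (illustrative Python, not Heroku’s router): random routing is a stateless decision any router instance can make on its own, while “intelligent” least-connections routing needs an accurate shared count of in-flight requests per dyno - exactly the accounting that gets expensive once the router itself is distributed:

  # Toy contrast, not Heroku's router. Random routing needs no coordination;
  # least-connections routing needs shared, mutable per-dyno state.
  import random
  import threading

  DYNOS = ["dyno-1", "dyno-2", "dyno-3"]

  def pick_random():
      # Stateless: any router instance can decide alone, nothing to sync.
      return random.choice(DYNOS)

  class LeastConnectionsRouter:
      def __init__(self):
          self.in_flight = {d: 0 for d in DYNOS}   # shared accounting
          self.lock = threading.Lock()             # coordination point

      def pick(self):
          with self.lock:
              d = min(self.in_flight, key=self.in_flight.get)
              self.in_flight[d] += 1
              return d

      def done(self, dyno):
          with self.lock:
              self.in_flight[dyno] -= 1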

I bristle when people seem to think there was some nefarious revenue incentive from Heroku here. These were well-reasoned platform changes to improve availability and increase performance for more efficient languages. In the process, the marketing got disconnected from the implementation.


If you are capable of making this post, you are capable of self-hosting Rails on something closer to the metal.



