
Nothing is wrong with the k8s scheduler!

Really, nothing is wrong with k8s at all, beyond our more general problem of "our users want to run Linux apps, not k8s apps".

K8s, Borg, Omega, Flynn, Nomad, and to some extent Mesos all share a common high-level scheduler architecture: a logically centralized, possibly distributed server process that functions like an allocator and is based on a consistent view of available cluster resources.
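Roughly, that shared architecture in miniature (a toy sketch with invented types, not actual k8s or Borg code): one allocator loop over a consistent inventory of free resources, updating the inventory as part of each placement decision.

    package main

    import "fmt"

    // Node and Job are invented stand-ins, not real k8s/Borg types.
    type Node struct {
        Name    string
        FreeCPU int // millicores
        FreeMem int // MiB
    }

    type Job struct {
        Name string
        CPU  int
        Mem  int
    }

    // schedule is the "logically centralized allocator" pattern in miniature:
    // one consistent view of every node's free resources, one placement
    // decision at a time, inventory updated as part of the decision.
    func schedule(inventory []*Node, job Job) (*Node, error) {
        var best *Node
        for _, n := range inventory {
            if n.FreeCPU < job.CPU || n.FreeMem < job.Mem {
                continue // doesn't fit
            }
            // One of many possible scoring policies: most free CPU wins.
            if best == nil || n.FreeCPU > best.FreeCPU {
                best = n
            }
        }
        if best == nil {
            return nil, fmt.Errorf("no node fits %s", job.Name)
        }
        best.FreeCPU -= job.CPU
        best.FreeMem -= job.Mem
        return best, nil
    }

    func main() {
        nodes := []*Node{
            {Name: "a", FreeCPU: 4000, FreeMem: 8192},
            {Name: "b", FreeCPU: 2000, FreeMem: 4096},
        }
        if n, err := schedule(nodes, Job{Name: "web", CPU: 1000, Mem: 512}); err == nil {
            fmt.Println("placed on", n.Name)
        }
    }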

It's a straightforward and logical way to design an orchestrator. It's probably the way you'd decide to do it by default. Why wouldn't you? It's an approach that works well in other domains. And: it works well for clusters too.

The point of the post is that it's not the only way to design an orchestrator. You can effectively schedule without a centralized consistent allocator scheduler, and when you do that, you get some interesting UX implications.

They're not necessarily good implications! If you're running a cluster for, like, Pixar, they're probably bad. You probably want something that works like Borg or Omega did. You have a (relatively) small number of (relatively) huge jobs, you want optimal placement†, and you probably want to minimize your hardware costs.

We have the opposite constraints, so the complications of keeping a globally consistent real-time inventory of available resources and scheduling decisions don't pay their freight in benefits. That's just us, though. It's probably not anything resembling most k8s users.

In fact, going back even before Borg but especially once Borg came on the scene, mainstream schedulers have been making this distinction --- between service jobs and batch jobs, where batch jobs are less placement sensitive and more delay sensitive. So one way to think about the design approach we're taking is, what if the whole platform scheduler thought in terms of a batch-friendly notion of jobs, and then you built the service placement logic on top of it, rather than alongside it?




I wonder if you ever looked at the Sparrow paper[1], which came out of Ion Stoica's lab. It's also decentralized in nature, focusing on low-latency placement.

[1] https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf


Lot of similarities!

At a low level, there are distinctions like Sparrow being explicitly p2c based (choose two random workers, then use metrics to pick the "best" between them). We don't p2c at all right now; part of our idea is being able to scale-from-zero to serve incoming requests with sleeping Fly Machines, rather than keeping everything online. And, of course, we run whole VMs (/Docker containers), not jobs with fixed definitions running on long-running executors --- but that's just a detail.
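For anyone who hasn't read the paper, p2c ("power of two choices") is roughly this; a toy sketch, where the Worker type and the queue-length metric are invented for illustration rather than taken from Sparrow's actual probes:

    package main

    import (
        "fmt"
        "math/rand"
    )

    // Worker is an invented stand-in for a Sparrow-style executor.
    type Worker struct {
        Name     string
        QueueLen int // pending tasks; the "metric" used to break the tie
    }

    // pickP2C implements power-of-two-choices: sample two workers at random,
    // keep the one that currently looks less loaded.
    func pickP2C(workers []*Worker) *Worker {
        a := workers[rand.Intn(len(workers))]
        b := workers[rand.Intn(len(workers))]
        if b.QueueLen < a.QueueLen {
            return b
        }
        return a
    }

    func main() {
        workers := []*Worker{
            {Name: "w1", QueueLen: 3},
            {Name: "w2", QueueLen: 0},
            {Name: "w3", QueueLen: 7},
        }
        w := pickP2C(workers)
        w.QueueLen++ // the task is enqueued on the chosen worker
        fmt.Println("chose", w.Name)
    }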

My sense is that the big distinction, though, is that you'd run Sparrow alongside a classic scheduler design like Mesos or Borg or Omega. You'd schedule placement-sensitive servers with the Borglike, and handle Lambda-style HTTP request work with the Sparrowlike.

What we're working on instead is whether you can build something sufficiently Borg-like on top of something Sparrow-like, using "Sparrow-style" scheduling as a low-level primitive. This doesn't quite capture what we're doing (the scheduling API we export implicitly handles some constraints, for instance), but it's maybe a good way to think about it.

I should have cited this paper! Thanks for linking it.


I have a hard time being on board with the logic. Neither k8s nor Mesos guarantees optimal placement (which, as you pointed out, is unrealistic given the NP-hardness of the problem); in fact, they explicitly trade placement quality for lower scheduling latency and are mostly designed to schedule really fast.

Mainstream schedulers make no distinction between service & batch jobs out of the box - you need to explicitly go out of your way to implement differential preferences when it comes to scheduling latency for example.

I'm surprised cost isn't a concern for you guys; I had actually assumed you went all in on bin-packing + Firecracker oversubscription to maximize your margin.


Nothing guarantees optimal placement, but all the mainstream schedulers attempt an approximation of it. The general assumption in mainstream schedulers is that servers are placement-sensitive, and batch jobs aren't. If you assume there's only one kind of job (all servers, or all batch jobs), the design of a scheduler gets a lot simpler; like, most of what the literature talks about with respect to Mesos and Borg is how to do distributed scheduling for varying scheduling regimes.

I like how I worded it earlier: instead of having a complex scheduler that keeps a synchronized distributed database of previous schedulings and available resources, what if you just scheduled everything as if it was a placement-insensitive batch job? Obviously, your servers aren't happy about that. But you don't expose that to your servers; you build another scheduling layer on top of the batch scheduling.

That's not precisely what we're doing (there are ultimately still constraints in our service model!) but it sort of gets at the spirit of it.
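A toy sketch of that layering, to make the shape concrete (this is not our actual code; the "service constraint" here is just "N distinct hosts", purely for illustration):

    package main

    import (
        "fmt"
        "math/rand"
    )

    // placeAnywhere is the "batch" primitive: pick a host with no global,
    // consistent inventory, just a cheap local/random choice. (Toy version.)
    func placeAnywhere(hosts []string) string {
        return hosts[rand.Intn(len(hosts))]
    }

    // runService is the layer built on top: it keeps asking the batch
    // primitive for placements until the service's own constraint (here, N
    // distinct hosts, assuming replicas <= len(hosts)) is satisfied. Real
    // constraints would be richer; this is just the shape of the idea.
    func runService(hosts []string, replicas int) []string {
        placed := map[string]bool{}
        for len(placed) < replicas {
            placed[placeAnywhere(hosts)] = true
        }
        out := make([]string, 0, replicas)
        for h := range placed {
            out = append(out, h)
        }
        return out
    }

    func main() {
        hosts := []string{"fra-1", "fra-2", "fra-3", "fra-4"}
        fmt.Println(runService(hosts, 3))
    }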

As for cost: we rent out server space. We rack servers to keep up with customer load. The more load we have, the more money we're making. If we're racking a bunch of new servers in FRA, that means FRA is making us more money. Ideally, we want to rack enough machines to have a decent amount of headroom past what we absolutely need for our current workload --- if our business is going well, customer load is consistently growing. So if ops is going well, we've generally got more machines than a mainstream scheduler would "need" to schedule on. But there's no point in having those machines racked and doing nothing.

At some point we'll reach a stage where we'll be more sensitive to hardware costs and efficiency. Think of a slider of sensitivity; we're currently closer to the insensitive side, because we're a high-growth business.


> Nothing guarantees optimal placement, but all the mainstream schedulers attempt an approximation of it. The general assumption in mainstream schedulers is that servers are placement-sensitive, and batch jobs aren't.

I don't know the open source schedulers well, but modern Borg batch scheduling works quite differently than (my understanding of) your description. Batch jobs are still often placement-sensitive (needing to run near a large database, for example, for bandwidth reasons). The big distinction I see is that where serving tasks get tight SLOs about scheduling latency, evictions/preemptions, and availability of resources on the machine it's scheduled on, batch tasks basically don't. They can take a while to schedule, they can get preempted frequently, and they cram into the gap between the serving tasks' current usage and limit. E.g., if a serving job says it needs 10 cores but is only using 1, a batch job might use 8 of those then just kinda suffer if the serving job starts using more than 2, because CPU is "compressible", or it will be evicted if things get really bad. In the same situation with RAM (mostly "incompressible"), the batch job gets evicted ASAP if the serving job needs the RAM, or the system involves some second-class RAM solution (cross numa node, Optane, zramfs, rdma, page to SSD, whatever). Batch doesn't get better service in any respect, but it's cheaper.
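Crudely, in code (invented types and made-up numbers, not Borg's actual accounting), the asymmetry looks like this:

    package main

    import "fmt"

    // A serving task reserves a limit but often uses less; batch work crams
    // into the gap. Invented types, not Borg's actual bookkeeping.
    type Machine struct {
        ServingCPULimit int // cores reserved by serving tasks
        ServingCPUUsed  int // cores they actually use right now
        ServingRAMLimit int // GiB
        ServingRAMUsed  int
    }

    // batchHeadroom is the space a batch task may borrow right now.
    func batchHeadroom(m Machine) (cpu, ram int) {
        return m.ServingCPULimit - m.ServingCPUUsed,
            m.ServingRAMLimit - m.ServingRAMUsed
    }

    // onServingGrowth sketches the asymmetry: when serving usage grows back
    // into space batch was borrowing, CPU is "compressible" (batch just
    // slows down), RAM is not (batch gets evicted).
    func onServingGrowth(resource string) string {
        if resource == "cpu" {
            return "throttle batch"
        }
        return "evict batch"
    }

    func main() {
        m := Machine{ServingCPULimit: 10, ServingCPUUsed: 1, ServingRAMLimit: 64, ServingRAMUsed: 20}
        cpu, ram := batchHeadroom(m)
        fmt.Printf("batch may borrow %d cores, %d GiB\n", cpu, ram)
        fmt.Println("serving grows into cpu:", onServingGrowth("cpu"))
        fmt.Println("serving grows into ram:", onServingGrowth("ram"))
    }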

> As for cost: we rent out server space. We rack servers to keep up with customer load. The more load we have, the more money we're making. If we're racking a bunch of new servers in FRA, that means FRA is making us more money.

and in the article, you wrote:

> It was designed for a cluster where 0% utilization was better, for power consumption reasons, than < 40% utilization. Makes sense for Google. Not so much for us.

IIUC, you mean that whoever you're renting space from doesn't charge you by power usage, so you have no incentive to prefer fully packing 1 machine before scheduling something on every machine available. Spreading is fine. Makes sense economically (although I'm a little sad to read it environmentally because the power usage difference should still be real).

I think another aspect to consider is avoiding "stranded" resources. Avoiding situations in which say a task that needs most/all of a machine's remaining RAM but very little CPU gets scheduled on a machine with a whole bunch of CPU available, effectively making that CPU unusable until something terminates. You've got headroom, but I presume that's based on forecasted need, and if that gets higher because you'll still have stranded resources when that need comes, the stranding is costing you real money.
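To make "stranded" concrete with made-up numbers (nothing here is from Fly's actual fleet):

    package main

    import "fmt"

    // Toy capacities to illustrate stranding; all numbers invented.
    type Capacity struct{ CPU, RAM int } // cores, GiB

    func main() {
        free := Capacity{CPU: 30, RAM: 10}
        task := Capacity{CPU: 2, RAM: 10} // RAM-heavy, CPU-light

        after := Capacity{CPU: free.CPU - task.CPU, RAM: free.RAM - task.RAM}

        // With zero RAM left, the remaining cores can't be sold to anyone
        // until something on this machine terminates: they're stranded.
        fmt.Printf("free after placement: %d cores, %d GiB RAM\n", after.CPU, after.RAM)
        if after.RAM == 0 && after.CPU > 0 {
            fmt.Printf("%d cores stranded until a task on this machine exits\n", after.CPU)
        }
    }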

Maybe this problem is avoided well enough just by spreading things out? or maybe you don't allow weird task shapes? or maybe (I'm seeing your final paragraph now about growth) it's just not worth optimizing for yet?


> Makes sense economically (although I'm a little sad to read it environmentally because the power usage difference should still be real).

Does fully loading 4 cores in one server save power over fully loading 2 cores in 2 servers? If you turn off the idle server, probably yes? If not, I'd have to see measurements, but I could imagine it going either way. Lower activity means less heat means lower voltage means less power per unit of work, maybe.

You're likely to get better performance out of the two servers though (which might not be great, because then you have a more variable product).
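Back-of-envelope, with purely made-up numbers (not measurements), the question mostly reduces to how idle draw compares to marginal per-core draw:

    package main

    import "fmt"

    // Hypothetical numbers, just to show the shape of the comparison; real
    // idle and per-core draws vary a lot by platform.
    const (
        idleWatts    = 100.0 // machine on, doing nothing
        perCoreWatts = 15.0  // marginal draw per fully loaded core
    )

    func draw(loadedCores int) float64 {
        return idleWatts + float64(loadedCores)*perCoreWatts
    }

    func main() {
        packedOff := draw(4)          // 4 cores on one server, second powered off
        packedOn := draw(4) + draw(0) // 4 cores on one server, second idling
        spread := draw(2) + draw(2)   // 2 cores on each of two servers

        fmt.Printf("packed, idle box off: %.0f W\n", packedOff) // 160
        fmt.Printf("packed, idle box on:  %.0f W\n", packedOn)  // 260
        fmt.Printf("spread across both:   %.0f W\n", spread)    // 260
        // Under a naive linear model, packing only wins if you actually power
        // the second box off; nonlinear effects (boost clocks, fans, sleep
        // states) are what make the real answer "it could go either way".
    }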


In modern CPUs with modern thermal management, it's probable that fully loading two cores in each of two servers is more efficient than fully loading four cores in one server, even if you power the second server off, because in each machine the primary delta in power draw between idle and max load is thermal management (fans), distributing the load increases passive cooling, and the cores that are in use can stay in more energy-efficient modes.

That said, I haven't done the actual math here; I've just seen power draw benchmarks that show idle -> single-core -> all-core draw as a curve where the idle and single-core numbers sit well below a proportional share of the all-core draw, without even accounting for the fact that each core is more performant under single-core workloads.


> Does fully loading 4 cores in one server save power over fully loading 2 cores in 2 servers?

That's the premise, and I have no particular reason to doubt it. There are several levels at which it might be true, from using deeper sleep states (core level? socket level?) to going wild and de-energizing entire PDUs.

> You're likely to get better performance out of the two servers though (which might not be great, because then you have a more variable product).

Yeah, exactly, it's a double-edged sword. The fly.io article says the following...

> With strict bin packing, we end up with Katamari Damacy scheduling, where a couple overworked servers in our fleet suck up all the random jobs they come into contact with. Resource tracking is imperfect and neighbors are noisy, so this is a pretty bad customer experience.

...and I've seen problems along those lines too. State-of-the-art isolation is imperfect. E.g., some workloads gobble up the shared last-level CPU cache and thus cause neighbors' instructions-per-cycle to plummet. (It's not hard to write such an antagonist if you want to see this in action.) Still, ideally you find the limits ahead of time, so you don't think you have more headroom than you really do.
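If you want to see the LLC effect on your own hardware, an antagonist can be as dumb as this (a sketch; buffer size, stride, and pass count are guesses you'd tune for the machine):

    package main

    import "fmt"

    // A crude last-level-cache antagonist: repeatedly stream through a buffer
    // much larger than any LLC, so nearly every access misses and evicts
    // neighbors' cache lines. Run it next to your real workload and watch
    // that workload's instructions-per-cycle drop.
    func main() {
        const size = 256 << 20 // 256 MiB, comfortably larger than typical LLCs
        buf := make([]byte, size)
        var sink byte
        for pass := 0; pass < 1_000_000; pass++ {
            for i := 0; i < size; i += 64 { // 64 B = one x86 cache line
                sink += buf[i]
            }
        }
        fmt.Println(sink) // keep the work observable so it isn't optimized away
    }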


No, it's not that power usage for us is free, it's that the business is growing (like any startup), so there is already a constant expansionary pressure on our regions; for the foreseeable future (years), our regions will tend to have significantly more servers than a scheduler would tell us we needed. Whatever we save in power costs by keeping some of those servers powered off, we lose in technical operational costs by keeping the rest of the servers running artificially hot.


Actually, on thinking about it a bit more, it has a lot of similarities to a whole bunch of coordination problems that I worked on many years ago in the mobile phone industry. One problem was coordinating resource usage between a set of mobile phone base stations.

In the OP, you talked about a payoff curve for an individual server taking on a job based on its current loading. This is effectively a metric for the desirability of that server taking on another job, and that in turn would allow the supervisor to auction off jobs, assigning each job to the server which will benefit most from taking it.
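Concretely, the auction part might look like this toy sketch (the payoff curve and numbers here are invented, just to show the mechanism):

    package main

    import "fmt"

    // Server bids for a job based on an invented "payoff curve" over its
    // current load; the exact curve is made up for illustration.
    type Server struct {
        Name string
        Load float64 // 0.0 (idle) .. 1.0 (saturated)
    }

    // bid is higher when taking the job is more desirable for this server.
    // Here: moderate load is preferred over both idle and nearly-full machines.
    func (s Server) bid() float64 {
        return (1 - s.Load) * (0.2 + s.Load)
    }

    // auction assigns the job to the highest bidder, as the supervisor would.
    func auction(servers []Server) Server {
        best := servers[0]
        for _, s := range servers[1:] {
            if s.bid() > best.bid() {
                best = s
            }
        }
        return best
    }

    func main() {
        servers := []Server{{"a", 0.05}, {"b", 0.45}, {"c", 0.92}}
        winner := auction(servers)
        fmt.Printf("job goes to %s (bid %.2f)\n", winner.Name, winner.bid())
    }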

You could even go one step further and implement a genetic algorithm classifier system. Each server would have a rule set for bidding for jobs coded into the GA genotype. Every once in a while, a lower-priority coordination process runs a generation of the GA, replacing the rule sets on some of the less efficiently running servers with the offspring of the more efficiently running servers. In this way the population of servers evolves towards rule sets for bidding on jobs that drive efficient use of your server resources.


> Think of a slider of sensitivity; we're currently closer to the insensitive side, because we're a high-growth business.

If the big-tech layoffs are any indication of a general trend toward belt-tightening, then reducing headroom, and even over-subscribing servers shared-hosting-style, must be tempting.


Even in a case where they might, for some reason, want to oversubscribe, or at least keep the number of servers smaller, the scheduling decision they want seems to be different.

It's the difference between "fit this into the smallest number of nodes so that we can shut down some nodes [possibly rented by the hour] if possible" vs "these nodes are sitting there whether or not they're doing anything, so spread the nodes across them".
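In scheduler terms, that's roughly the difference between two opposite scoring policies; a toy sketch, not any real scheduler's code:

    package main

    import "fmt"

    // The struct and numbers are invented to illustrate the contrast.
    type Node struct {
        Name string
        Used float64 // fraction of capacity in use, 0..1
    }

    // binPackScore prefers the fullest node that still fits: pack tight so
    // other nodes can be drained or powered off.
    func binPackScore(n Node) float64 { return n.Used }

    // spreadScore prefers the emptiest node: the hardware is paid for either
    // way, so buy headroom and fewer noisy neighbors instead.
    func spreadScore(n Node) float64 { return 1 - n.Used }

    func pick(nodes []Node, score func(Node) float64) Node {
        best := nodes[0]
        for _, n := range nodes[1:] {
            if score(n) > score(best) {
                best = n
            }
        }
        return best
    }

    func main() {
        nodes := []Node{{"a", 0.70}, {"b", 0.20}, {"c", 0.45}}
        fmt.Println("bin-pack picks:", pick(nodes, binPackScore).Name)
        fmt.Println("spread picks:  ", pick(nodes, spreadScore).Name)
    }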

In the long run there may come a time when packing tighter matters, e.g. if they become big enough to run their own data centres, where shutting down servers dynamically to save power might be worth it, but typical colo contracts for a small number of racks rarely make it worth your while to turn servers off.


Not for us, right now. We have if anything the opposite problem.


The whole thing really comes down to a tension between, on the one hand, globally optimal solutions, which require lots of state coordination, and, on the other, locally optimal solutions, which are unlikely to be as good but don't carry the coordination overhead.



