A funny and somewhat off-topic story -- back in 2003, before the Google IPO, Google was doing a recruiting event at Berkeley. They brought a few of their folks with them: their founder Larry, one of their female engineers, Marissa, and some others. They did a little talk, and during the Q&A, Professor Brewer told Larry that there was an opening in the PhD program and he was welcome to it. Larry politely declined.
Afterwards I asked Larry, "so, do you think you'll ever finish your PhD, either here or at Stanford?". He said, "If this Google thing doesn't work out I might, but I have a feeling it will work out ok."
It amuses me that Professor Brewer is now working for Larry. :)
I worked as a contractor at Google in 2013 and loved their infrastructure. It was amazing to fire off a Borg job that used hundreds to thousands of servers, and the web-based tools for tracking jobs and the fantastic logging made it easy to drill into problems.
And, Borg was two generations ago!
Even though I am very happy doing what I am now, sometimes I literally wake up in the morning thinking about Google's infrastructure. I now use lesser but public services like AppEngine, Heroku, and nitrous.io (a bit like Google's web-based IDE Cider), but it is not the same.
BTW, not to be negative, but while Google is a great home for someone like Eric Brewer, it is a shame that many hundreds of future students at UC Berkeley will not have him as a professor.
How hard was it to learn these tools as a contractor? I hear that Google is more aggressively making their tools open source to help the portability of their engineers' skills. Specifically, friends have told me that onboarding at Google is difficult because every tool is a proprietary one built on top of more proprietary systems. In addition, people who leave Google have trouble interviewing because they rely heavily on tools that are unavailable outside of Google. By contributing more to open source, I think Google has the potential to make interviewing there and joining a smoother experience.
Good question. I found onboarding at Google really fun and interesting. I took several great classes, the best being the end-to-end class that in 8 hours let you write code that used most of Google's infrastructure - that class was so awesome that I would have paid Google for that day at work :-)
Also, they had code labs, which are self-paced modules for learning specific tech. I didn't spend much time at work doing code labs, but I could access them at home with my corporate laptop, and I went through about a dozen of them there. One other nice thing: just mentioning that you couldn't figure something out from the documentation or code labs would cause someone to jump in to help you.
Also, I don't think that people leaving Google have problems getting other jobs :-) The retention rate at Google is surprisingly low, given the pleasant atmosphere there. People leave to go elsewhere, start their own companies, etc. I was 63 when I worked there, and although it was probably not the most awesome place I worked, it was really great.
Apply for a job if you are interested, or go the easier route and get a contractor position.
Onboarding at Google was great; unlike similarly-sized and oft-mentioned competitors, Google has a set of really good classes that take you through the various systems, showing their design, reasoning, historical anecdotes, etc. Each of these has a more in-depth version, as well as a couple of classes called "Searching Shakespeare" that are designed as a two-day full-stack quickie to build something that uses a little of every piece.
I heard from a Googler that many of the public tools at other companies (databases, job schedulers, etc.) were created by ex-Googlers who missed what they were using at Google.
One thing that bothers me about the article is that it shows a recurring problem: IT not knowing what it knows. The NoSQL movement didn't notice that NonStop Architecture scaled linearly to thousands of cores with strong-consistency, five 9's, and SQL support. In the mid-80's. Instead of making a low-cost knockoff, as the cluster movement did for NUMAs, they ditched consistency altogether and launched the NoSQL movement. Now I see the man who invented the CAP theorem discussing it while referencing all kinds of NoSQL options to show us the tradeoffs. Yet there are Google services in production, and tech such as FoundationDB, doing strong consistency with distribution, high throughput, and availability.
So, why aren't such techs mentioned in these discussions? I liked his explanation of the partitioning problem. Yet he and NoSQL advocates seem unaware that numerous companies surmounted much of the problem with good design. We might turn the CAP theorem into barely an issue if we can get the industry to put the same amount of innovation into non-traditional, strong-consistency architectures as it did into weak-consistency architectures. There is hope: Google went from a famous, NoSQL player to inventing an amazing, strong-consistency RDBMS (F1). Let's hope more follow.
> The NoSQL movement didn't notice that NonStop Architecture scaled linearly to thousands of cores with strong-consistency, five 9's, and SQL support.
Also incredibly expensive. Take this case: Google took off because they were able to scale out with off-the-shelf hardware, compared to the millions banks were pouring into scale-up configurations that handled much less load. Scale-up can quickly hit hard limits, and before that it becomes exponentially expensive to continue down the path. This is true even today.
NoSQL movement, if you wanna call it that, took off because most apps (including Google, including Financial Services, even Health Care) don't need some types of consistency offered by relational databases. Many large apps are significantly de-normalized and have many foreign-key-less tables, often filled up by scheduled jobs. That's fine for most apps; NoSQL architectures recognize that and users consider that in design.
For the majority of use-cases out there, NoSQL databases offer enough consistency. For the remaining use-cases, there are tools available in NoSQL databases to make them work, though it requires a bit of work.
> For the majority of use-cases out there, NoSQL databases offer enough consistency.
For applications with a nontrivial data model, ensuring that each logical operation only does a single document update (or that multiple nontransactional updates cause no conflict in maintaining consistency in the presence of other logical operations) is actually really challenging - and it adds a substantial design overhead to every new feature added. I think you're being extremely optimistic in your assertion that NoSQL systems are that widely safely applicable. My experience has been that NoSQL-based systems stay mostly-consistent because they happen to experience very little concurrent activity, not through understanding/design on the users' part.
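To make that concrete, here's a toy sketch of the kind of cross-document invariant that's easy to break without transactions (plain Python dicts standing in for a document store; all names are made up for illustration):

    # A "transfer" that spans two documents in a store with no
    # multi-document transactions. The invariant (total balance stays
    # constant) only holds if nothing fails or interleaves between the
    # two writes.
    accounts = {
        "alice": {"balance": 100},
        "bob": {"balance": 50},
    }

    def transfer(src, dst, amount):
        # Write 1: debit the source document.
        accounts[src]["balance"] -= amount
        # A crash or a concurrent reader here sees money that has
        # vanished: the debit is visible, the credit is not.
        # Write 2: credit the destination document.
        accounts[dst]["balance"] += amount

    transfer("alice", "bob", 30)
    assert accounts["alice"]["balance"] + accounts["bob"]["balance"] == 150

Real stores add failures, retries and concurrency on top of that, which is exactly where the per-feature design overhead I'm talking about comes from.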
This is not to make light of the situations where NoSQL systems shine, but the idea that higher levels of consistency are rarely useful does not match my experience at all.
> NoSQL movement, if you wanna call it that, took off because most apps (including Google, including Financial Services, even Health Care) don't need some types of consistency offered by relational databases.
I'd say that Mongo (for example) took off because:
- They really nailed the setup experience for new users (which RDBMSs historically sucked at).
- The data model is much easier for simple apps.
- They had some fairly creative techniques for making their system look good - unacknowledged writes, and claims of support for replication which didn't really fully work.
- Most programmers don't really understand the ins-and-outs of maintaining consistency.
In startup mode, I expect companies to do whatever works. Google's initial decisions were correct. Later, they dedicated expensive teams to inventing brand new solutions for about every aspect of their operation. That's when my critique applies. For instance, they might have looked at expensive NonStop, identified what they could copy with COTS tech, and made a cheap knockoff. They could've done the same with prior MPP or cluster tech that already scaled to thousands of nodes with open-source management tools and microsecond latency. Much more efficient than web stacks, too. The one example I saw of them copying and improving past methods was Google File System. As I predicted, that good choice laid the foundation for awesome stuff such as BigTable, Spanner, and F1.
The NoSQL movement's origins are highly debatable. Here's what I saw in its beginning: an explosion of articles on the subject after a few success stories about big companies doing massive scale on cheap servers. Zealots argued that mature, strong-consistency solutions had problems. So, instead of fixing those, we should just ditch them and strong consistency for The New, Great Thing. The only time I even saw a real analysis of cost-benefits was a few articles focused on a narrow class of applications where data consistency didn't really matter. So, my conclusion was that the movement was two things: 95% a social phenomenon that happens with each IT fad; 5% a result of weaknesses of the relational model and tools. That last part makes sense and is why I've always opposed RDBMS's.
On your last point, do you have a citation that shows weak-consistency databases offer enough integrity for the "majority of use-cases"? I thought the majority of use cases for databases were protecting data that's important to a business: orders, billing, HR, inventory, customer information, and so on. I usually just tell them to use PostgreSQL, and that covers it with high reliability. If you really can back your claim, though, I'll be glad to start transitioning my R&D toward designs that trade away data integrity.
>> Many large apps are significantly de-normalized and have many foreign-key-less tables, often filled up by scheduled jobs. That's fine for most apps; NoSQL architectures recognize that and users consider that in design.
There's a boatload of assumed knowledge in this quote. How likely is it that someone not familiar with RDBMSs would know what a foreign key is, for example? Not very likely, I think. I posit you give developers too much credit.
> There is hope: Google went from a famous, NoSQL player to inventing an amazing, strong-consistency RDBMS (F1). Let's hope more follow.
The problem is that at large distributed scale, forcing consistency is basically fighting the laws of physics. Is something in Australia always consistent with something in the US? Well, "same time" is a funny thing in physics because it doesn't actually exist. So we have to work really hard to keep consistency going.
To illustrate, here is how Spanner (F1) works:
---
Spanner's 'TrueTime' API depends upon GPS receivers and atomic clocks that have been installed in Google's datacentres to let applications get accurate time readings locally without having to sync globally.
---
I call that a kludge in the general sense, not amazing future technology. Yes, if you can afford to install GPS antennas on your datacenters you can handle it, and it is OK. But it is a crutch. "Well, but there is NTP," one might say. Yeah, there is, and connectivity to that fails as well.
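To be fair, here's roughly what that tight clock bound buys them. This is a toy sketch, not the real TrueTime API, and the uncertainty value is invented; the point is just that the wait is proportional to how good your clocks are:

    import time

    # Hypothetical TrueTime-style clock: returns an interval
    # [earliest, latest] guaranteed to contain the true time. GPS plus
    # atomic clocks keep EPSILON small; NTP over a flaky WAN would not.
    EPSILON = 0.007  # assumed worst-case clock uncertainty, in seconds

    def tt_now():
        t = time.time()
        return (t - EPSILON, t + EPSILON)

    def commit(write):
        # Pick a timestamp no earlier than the latest possible "now"...
        commit_ts = tt_now()[1]
        # ...then wait until that timestamp is guaranteed to be in the
        # past everywhere, so timestamp order matches real-time order.
        while tt_now()[0] < commit_ts:
            time.sleep(EPSILON / 2)
        write["ts"] = commit_ts
        return write

    print(commit({"key": "x", "value": 1}))

The whole trick lives or dies on EPSILON staying small, which is why the antennas and atomic clocks are load-bearing.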
The one interesting research area though is CRDTs. These are datatypes that know how to auto-converge to a known value even if they experience temporary inconsistency. So you basically experience temporary inconsistency but it fixes itself.
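For a flavor of how that works, here's a minimal grow-only counter, one of the simplest CRDTs (my own toy code, not from any particular library):

    class GCounter:
        """Grow-only counter: per-replica counts, merged by element-wise max."""
        def __init__(self, replica_id):
            self.replica_id = replica_id
            self.counts = {}

        def increment(self, n=1):
            self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            # Merge is commutative, associative and idempotent, so replicas
            # converge no matter how often or in what order they sync.
            for rid, c in other.counts.items():
                self.counts[rid] = max(self.counts.get(rid, 0), c)

    # Two replicas diverge, then converge after exchanging state.
    a, b = GCounter("a"), GCounter("b")
    a.increment(3); b.increment(2)
    a.merge(b); b.merge(a)
    assert a.value() == b.value() == 5

The temporary inconsistency is still there; the data type just guarantees it resolves the same way on every replica.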
I was pleased when I read the Spanner paper to see that they had gone to locally authoritative stratum-0 clocks. A long time ago, in a different universe it seems, I struggled with the notion of replicated naming logs that needed to converge to a common and consistent result. It wasn't until I got to play with the combinator technology at Blekko that I felt there was a good answer to the problem. Clouds of idempotent expressions of execution don't need clocks at all; if you have them all, you have your answer. That was a neat result.
Fighting the laws of physics is a great comparison. Every chip designer trying to create the illusion of a sequential (or multicore) machine with a specific memory model fights the laws of physics. So do people building bridges, designing planes, and trying to maintain space stations. The greatest weapon we have after human ingenuity is using good engineering principles built on what we've tried and learned.
So, back to this current example. Synchronizing time was a huge problem for consistency. They had expensive, highly-custom datacenters throughout the world. Yet, there wasn't even a solution worth buying that wouldn't cost a fortune or need new infrastructure. An engineer noticed that a time-source existed which all datacenters could sync with using affordable, COTS equipment. One among others. So, they used that to solve the critical problem, solved other problems with other technologies, and integrated the resulting components into a solution to their real problem (F1).
What I've just described is not a hack: it's Engineering 101. Identify the problem(s), look for known solutions to it, adapt them to your needs, and deliver the solution. Their use of GPS to solve a problem that otherwise would cost millions of dollars to solve is exemplary engineering. Given datacenter costs, using this at each one would barely be a blip on the budget sheet and will get easier to deploy as adoption increases (network effect).
I don't see how using GPS receivers for a time index is "a kludge". It seems like an extremely sensible and practical solution to the problem. Hey, turns out we've already got this satellite constellation broadcasting an extremely accurate time signal!
It's a hack, in all the positive senses of the word.
It is a kludge because it adds the whole GPS system into the mix of software components. So now Spanner/F1 is not just an "apt-get install" away; it is an apt-get install + buy a server GPS receiver + make sure to install the antenna on the roof + make sure it is not bent or knocked down.
So what? Google are already managing thousands or hundreds of thousands of different hardware and software components. IMO, using GPS is a good solution to a hard problem.
You have some interesting points; however, Eric Brewer isn't just 'some Google guy discussing CAP'. He actually invented the theorem... It's also known as 'Brewer's Theorem'.
FWIW, CAP is not about NoSQL or the 'NoSQL movement', it's about distributed systems and distributed shared memory, which applies to a whole range of computing problems.
I've edited my comment to give him that credit. I understand the CAP theorem applies to many things. The reason I tied NoSQL in is that it's often cited as the reason people traded away strong consistency. Yet, there were strongly-consistent setups with the desirable properties in production and in academia. He and others rarely mention them in such discussions. It's why I tied them together.
> NonStop Architecture scaled linearly to thousands of cores with strong-consistency, five 9's, and SQL support.
Admittedly so, but at a very, very high price. Similarly, Sun and SGI had amazing technology in the server and workstation space (after Solaris 2.3, anyway), but over time Linux became "good enough" and we became willing to sacrifice Sun's niceties to save millions per data center.
The mere existence of technology isn't enough; it has to be affordable - and rational managers will have to make cost/benefit decisions that suit their goals.
True. I always expect this. Yet, Google doesn't fit that profile because they weren't just buying something that works: they were investing countless sums into geniuses and their projects trying to invent new things. It's at that point I expect companies to learn from the past and do better.
NonStop and its ilk involve complex hardware engineering, not merely software. But software is Google's forte, not hardware. I wouldn't expect Google to reinvent legacy HA systems, if only because engineering HA systems like those isn't their business.
Software is their main skill. Yet Google and Facebook have both been doing custom hardware for datacenters for years now. They're also funding academic R&D on chip design and such. Such companies' needs also created an ecosystem offering things like OpenFlow switches, which are custom hardware implementations of radically different software. Intel and AMD are also both doing custom work for unnamed datacenter companies. It's hearsay, but I'd expect top players to be involved. So, they're already all well into the hardware business.
The simple route, as I indicated, would be to copy, contract, or even buy an MPP vendor. In academia, MIT Alewife showed one custom chip was all that was necessary to build a NUMA-style, 512-node machine with COTS hardware. Existing, shared-nothing clusters already scaled higher than Google using custom chips for the interconnect. One can buy cards or license the IP for FPGAs, structured ASICs, etc. Much software for compute and management was open source thanks to groups such as Sandia. And so on. Plenty to build on that's largely been ignored except in academia.
Instead, they've largely invested in developing hardware and software to better support the inherently inefficient legacy software. So, they're doing the hardware stuff but just not imitating the best-of-breed solutions of the past or present. The only exception is OpenFlow: a great alternative to standard Internet tech that major players funded in academia and are putting in datacenters. Another success story is Microsoft testing a mid-90's approach of partitioning workloads between CPUs and FPGAs. So, they're... slowly... learning.
That's a pretty good comment, actually. There's quite a bit of similarity.
We generally tend to jump into these as "omg awesome new tech" with a very narrow view. But it also helps boost more thought-out techs (even though it feels less efficient to go through that route first, it's perhaps the only route that works with humans: try, fail, try again, etc.).
Yeah, we do. My only guess is it's two things: (a) our industry is horrendous at communicating the previous generation's wisdom in a usable way; (b) a social phenomenon. Quick examples of the first are industry pros locking up their good advice in obscure, expensive books and cutting-edge research silo'd into ACM, IEEE, etc.
The other is a social thing that leads to the "network" effect. People flock to something for whatever reason. This builds a community (or network) that entices others to join. They also tend to forget about other things and reinvent the wheel. Example: much of current work in Web applications aims to solve problems already solved in client-server apps with better efficiency, security, reliability, and portability. Even Facebook went back to that model for mobile, IIRC. Good luck convincing most Web technologists to switch to client-server, though.
Whoever solves both these problems will create ripple effects that grow innovation at a heightened, maybe exponentially better, pace. The reason will be a combination of avoiding wasted effort plus visibility into best efforts. I got ideas on Problem 1 but the best minds need to get on Problem 2: it's a gold mine if it's solved.
"Computing spread out much, much faster than educating unsophisticated people can happen. In the last 25 years or so, we actually got something like a pop culture, similar to what happened when television came on the scene and some of its inventors thought it would be a way of getting Shakespeare to the masses. But they forgot that you have to be more sophisticated and have more perspective to understand Shakespeare. What television was able to do was to capture people as they were. So I think the lack of a real computer science today, and the lack of real software engineering today, is partly due to this pop culture." [1]
Thanks for that awesome article! Interestingly, he used the same example I've been using in INFOSEC discussions online: Burroughs B5000. It was simply brilliant and still better at the core than anything I use today. Let me explain.
There have been countless reliability and security problems that occur due to buffer, pointer, data-becomes-code, and interface errors. These account for about 99% of the worst problems. They happen because the underlying Intel/IBM/RISC architectures treat all data the same... mostly. Plus, the systems languages (C/C++) are fundamentally broken as far as preventing errors goes. The Burroughs team saw this [in 1961] and solved the problems at their source: CPU protection of pointers; the CPU could tell code and data apart for security purposes; a hardware-managed stack; CPU bounds-checked arrays; a high-level language (Algol) for system code; interface types checked at compile and function-call time; hardware and software isolation of apps from the OS. Good luck crashing or hacking that!
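To make the idea concrete, here's a toy simulation of tagging plus descriptor-based bounds checking. Real tagged hardware does this in the CPU, not in software, and all the names below are invented for illustration:

    # Every word carries a tag; the simulated "CPU" refuses to execute data
    # words and bounds-checks every array access through a descriptor.
    class Word:
        def __init__(self, tag, value):
            self.tag = tag      # 'code' or 'data', set by the system, not the program
            self.value = value

    class ArrayDescriptor:
        """Checked array reference: base plus length, no raw pointers."""
        def __init__(self, words):
            self.words = words

        def load(self, index):
            if not 0 <= index < len(self.words):
                raise RuntimeError("bounds violation trapped in 'hardware'")
            return self.words[index]

    def execute(word):
        if word.tag != 'code':
            raise RuntimeError("attempt to execute data trapped in 'hardware'")
        word.value()  # run the instruction

    buf = ArrayDescriptor([Word('data', 0) for _ in range(4)])
    buf.load(2)               # fine
    try:
        buf.load(9)           # classic buffer overread: trapped, not silent
    except RuntimeError as e:
        print(e)
    try:
        execute(buf.load(0))  # "data becomes code": trapped
    except RuntimeError as e:
        print(e)

The point is that the check happens below the program, so a sloppy application can't opt out of it.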
So, I've read thousands of hardware, firmware, and software solutions to these problems. Yet very few will straight up fix the problem at its source. That's despite the existence of a proven solution since 1961 that costs a mere two bits of tagging. I'll give up a single-digit percentage of memory with a single-digit performance hit to stop 99% of attacks. I'll do it today. Yet the industry's latest solutions are detection of this little tactic, hardware extensions for that, and no solution to the actual problem.
The failure of modern industry to do what Burroughs did, fix the underlying problem, is the source of most of our IT headaches. Aside from social reasons, backward compatibility with legacy is a big contributor. It's why heuristic-driven software transformation systems such as Semantic Designs' toolkit or Racket need a huge boost in R&D. Such tech seems like our only hope for getting legacy software onto better underlying platforms, as nobody will pay for a human to understand and rewrite each codebase line for line, bug for bug.
The "whatever reason" tends to be one's resume. Some people think (and I'm not necessarily disagreeing here) the only way to stay employable is to have the absolute "blogging"edge tech on your resume.
Re (a): it's not just a communication problem. A lot of solutions from the previous generation are dramatically less easy than newer offerings. IMHO NoSQL and cloud compute offerings were like this.
The problem with folks hating on new technology "x" is that they often fail to see that the new thing might only offer one significant advantage over the prior tech, or they see it but discount it too heavily to be motivated to try it.
For (a), it's hard to say where the problem is: is it developers of new solutions who threw the baby out with the bathwater instead of fixing the problems of proven methods? Or is it the demand side that's quick to jump on the new thing? Or is it the demand side in the sense that they rarely fund solutions to problems with their almost-good-enough software?
I'd like a real explanation for why containers are better than unikernels. Yes, unikernels are still early, and containers are convenient because you have all of Linux there... but it seems that running several Linuxes on a Linux machine is a bit much. One operating system plus Xen plus several applications in unikernels seems more efficient, and more exciting.
But it's the less common choice.
I am guessing convenience is more important than the better solution, which would ultimately be just as convenient and more efficient if it got enough eyeballs?
I'd like a real explanation for why containers and unikernels are better than regular run-of-the-mill applications running on dedicated servers. It's almost as if the wild west of the web isn't quite enough and we now need to add another explosion of layers of abstraction, but this time on the server, in order to pretend we have infinite hardware, which then becomes its own reason for existence rather than simply running efficient software configured properly on properly utilized hardware. As if regular virtualization alone doesn't give enough headaches in trying to figure out why some subsystem does not perform.
At this rate we'll end up shipping containers as 'apps' to the clients machines with a suitable emulator at some point.
All this luxury comes at (considerable) cost and not everybody seems to be doing the math before deployment which more often than not leads to terrible efficiency.
While VMs and containers in VMs don't utilize the hardware as efficiently as a dedicated server they make your people much more efficient and happy. Unless you are running at a fairly large scale making your people more efficient gives you a much better ROI than making your servers more efficient.
Containers have actually been around for a long, long time -- and have well-known operational efficiencies. So this isn't a "shiny new layer-of-abstraction", it's a tried-and-true abstraction that has been operating in production and at scale for the better part of the last decade.[1] That said, the developer fascination with containers (which is to say, Docker) is new, and there is a bit of a wild west around up-stack abstractions -- but that confusion shouldn't be conflated with the abstraction of OS-based virtualization, which remains a clear improvement over HW-based virtualization and the next logical step function in infrastructure deployment.
How do you see this delivering on the security component of the isolation? (Not that VMs are perfect in this respect but it seems to me that containers are a lot less solid)
Well, you need to get specific. Speaking for SmartOS[1], we've been running containers in production for over a decade; while security is never solved per se (that is, there is always the possibility that defects will result in future vulnerabilities), the reality is that there is a lot of experience running this system in multi-tenant, internet-facing production, and CVEs against the Solaris-based zones technology have been few and far between -- and I would imagine that the same can be said of FreeBSD Jails. These two technologies stand in sharp contrast to the Linux "container" technology (which is to say: namespaces), which is relatively much more immature and doesn't necessarily share the same design constraints as zones and jails. So if by containers you mean zones or jails, the security component of the isolation is well understood and in hand; if by containers you mean Linux namespaces, then yes, "a lot less solid" is probably phrasing it generously.
This is definitely true, and to be honest, it's something of a mystery to me why the OpenVZ work has been essentially a second-class citizen for that decade. If Linux had taken the path lit by OpenVZ (which is to say, if Linux had taken back the OpenVZ changes), the security gap between Linux and FreeBSD/SmartOS/illumos might have been closed much more quickly -- but as it stands (with the OpenVZ work essentially discarded in favor of the much more immature namespaces), Linux isn't on a trajectory to offer multi-tenant security via containers in the foreseeable future...
Do you want to design for failover of a complex system? Then containers are your friend.
Every dedicated server that you set up has an opportunity to not be duplicated perfectly. Every system that knows about your dedicated server has an opportunity to hard code what it shouldn't. Which makes these a potential point of failure. Add enough of those, and you're statistically guaranteed that the careful architecture that you have for failover is a pipe dream.
If everything is deployed with containers and discovery and the correct provisioning, then dealing with the fact that containers move around forces you to solve all of your other problems. And containers provide an abstraction layer that makes the rest of it straightforward.
Let me illustrate with an example.
When I worked at Google in 2010, I remember reading an article from eBay about how they finally managed to transition everything off of a running data center, without interrupting live traffic, and how much planning it took them. And they were congratulating themselves on what a heroic feat they had managed. Most companies today would still consider that a pretty amazing feat, and would find it challenging.
At the same point in time, I was learning how things were set up at Google so that you could drop any data center at random with barely any interruption of live traffic, and with no manual intervention required. And Google occasionally does this without warning to important data centers just to be sure that it works.
I'll definitely concede the 'mistakes are made' point because I recently spent a month untangling 8 supposedly identical VMs which really weren't but it took a lot of sleuthing to spot the differences. Oh, and to make life more interesting, not all those differences were made to the configuration files (making the systems not quite reboot safe), quite a few of them were only made to certain kernel settings in /proc and /sys (who thought that we should have two of those systems anyway...?).
I like the test-the-panic-button attitude that Google brings to these things. I've yet to get someone to accept my challenge to power down their supposedly automated fail-over solution; it's supposed to work, but they usually can't be sure, management would surely not allow such a rash thing as a live test, and it's bad form for me to then walk up to the switchboard and trip the breaker. Very tempting...
I'd strongly recommend reading the Borg paper (http://research.google.com/pubs/pub43438.html). It outlines in clear and persuasive detail why Google started using containers. Their reasons may or may not be convincing for smaller companies, but I'd expect the advantages to start accruing above ~10 developers.
(I am reading through the comments trying to work out if I should learn Docker. I already know how to use a virtualenv and I already know how to use a VM.)
While I'm not familiar with Python & virtualenv, the problem that would be solved by putting Rails in a container (isolating the Ruby environment in a fairly easily deployable way) was solved many years ago with Gemfiles/Bundler[1], which pin your dependencies, plus the various ways of deploying multiple rubies. RVM does indeed suck hard, but rbenv[2] is very simple (I've used it personally to develop and deploy a few projects).
One of the preferred ways to deploy Rails is (was? I did this ~2 years ago) to check it in with git; the name/version of the Ruby environment to use is stored in the file ".rbenv-version", with dependencies managed by Bundler (which you point at a local gem server). Install is then: 1) use rbenv/ruby-install to install a basic Ruby, 2) install the app with "git clone", and 3) run Bundler. Many tools exist to do this in one step over ssh/etc. automagically.
Even better than rbenv, you can just use chruby[3] to point to any rubies you want; just check one into your project itself (or whatever) and configure the siteruby/etc load paths to point to project directories. Really, chruby just fixes up your dev environment to point to a specific ruby; you set the actual project to be self contained with known paths, just like you would do de facto in a container.
While dependency issues were a problem back in the ruby 1.8 / rails 2.x days, this
There's Bundler, which allows you to install all gems to a local path (e.g. `vendor/`) and then one simply runs `bundle exec <command to run in virtual environment>` to do the needful.
Primarily because dedicated servers are a lot less efficient. The more different things you can pack on a machine while still ensuring that the high-priority/low-latency jobs get prompt access to the resources that they've reserved, the higher overall utilization you can achieve (and hence bring costs down).
At least with the Borg containers (I'm not familiar with how Omega and Kubernetes do things), there wasn't any additional layer of abstraction - the fact that there were multiple jobs running on the same kernel wasn't hidden from those jobs (although they didn't have to be aware of it). The containers were purely used for in-kernel resource accounting and control.
> Primarily because dedicated servers are a lot less efficient
Assuming you're operating at scale I don't see why that would be the case. And if you're not, what's the point?
> The more different things you can pack on a machine while still ensuring that the high-priority/low-latency jobs get prompt access to the resources that they've reserved, the higher overall utilization you can achieve (and hence bring costs down).
Yes, that's the theory. But in practice you're assuming better static control over the situation than the operating system running multiple jobs will have over the dynamic situation. So you'll need to over-provision and then you're back to square one with your utilization or alternatively you'll under-provision and then you will run into performance issues. TANSTAAFL.
(For instance, what's to stop each container from shipping another implementation of the same library as a dependency, say SSL?)
Yes, I think it's safe to say that Google operates at scale.
A lot of user-facing services at Google have to be over-provisioned in order to handle the cyclical usage patterns (the daily query peak is far higher than the average for most services) and to be able to survive the loss of a datacenter or two. This results in a lot of under-utilized servers for a big fraction of the time. So by packing lots of medium and low priority jobs on those same servers (and over-committing the resources on the server), you can soak up the slack resources; in the event that the resources are needed by the user-facing service the kernel containers ensure that the all the less latency-sensitive jobs on the machine don't compete for resources with the user-facing services.
It's true that the performance isolation when there are tens of jobs running on the same machine isn't going to completely match the performance isolation of running a service on a dedicated server, even with kernel resource isolation via containers, but you have to make cost trade-offs somewhere. The number of Borg services that could justify requesting dedicated machines was very small.
And to address your other concern about the OS not having so much insight into what's going on - Borg containers consisted generally of a single process, running on the machine's normal kernel. The containerization was just for in-kernel resource accounting/isolation. (Using Linux control groups, rather than anything fancier like LXC or Xen)
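To give a sense of how thin that layer is, here's a rough sketch using stock Linux control groups (not Borg code; it assumes a cgroup-v1 style hierarchy mounted under /sys/fs/cgroup, the paths and knobs differ under cgroup v2, and it needs root):

    import os

    # "Containerization" as pure resource accounting: put a process into a
    # named control group with CPU and memory limits, nothing else. The
    # process still runs on the host kernel and shows up in plain `ps`.
    def confine(pid, name, cpu_shares=256, mem_bytes=512 * 1024 * 1024):
        for controller, knob, value in [
            ("cpu", "cpu.shares", cpu_shares),
            ("memory", "memory.limit_in_bytes", mem_bytes),
        ]:
            group = "/sys/fs/cgroup/%s/%s" % (controller, name)
            os.makedirs(group, exist_ok=True)
            with open(os.path.join(group, knob), "w") as f:
                f.write(str(value))
            # Moving the pid into the group is all the isolation there is
            # here: no namespaces, no disk image, just accounting and limits.
            with open(os.path.join(group, "tasks"), "w") as f:
                f.write(str(pid))

    # Example (run as root): confine(os.getpid(), "batch-lowprio")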
I think it's a fairly safe assumption to say that Google (and FB and a bunch of other extremely large web properties) run into different problems than those that are faced on a day-to-day basis by most run-of-the-mill web companies.
Thank you for the insight into the number of processes inside a typical Borg container. So that was basically a kind of 'heavy process' rather than a complete application with all dependencies (including other processes the main one depended on) packaged in; this is something I wasn't expecting at all.
Borg containers consisted generally of a single process...
Really?
My impression was that typically you'd have a process for the service, a borgmon process for monitoring, and maybe another process to ship logs off in the background.
Developers would only think about the service process (which itself typically was a fairly thin shim in front of other services), but a borg container would have more than that going on in it.
The borgmon process would be a completely separate job on separate machines (generally with a lot fewer instances).
The logsaver would also be a separate job, although typically running co-located 1:1 with instances of the actual service job. The service and the logsaver would have access to the same chunk of disk (where the logs were generated) but otherwise they were separate as far as the kernel was concerned. (As far as Borg was concerned they were very much related, but that was at a much higher level than the kernel).
I would say containers are primarily an improvement over hypervisors / VMs in performance, not over dedicated hardware. However, they still allow for some of the capabilities of VMs, in terms of elastic reshaping of a cluster. So, if you have a static environment, it's probably not useful to you. If you have a dynamic environment, where you need to frequently repurpose particular machines, but want to do it more efficiently than with a VM, it can be very useful.
In another view, it's another approach to what many look to Chef, Ansible, and Puppet to do. Combined with something like Mesos or Kubernetes, you can quickly deploy to a heterogeneous cluster without a lot of install scripts running.
Some of the other use cases, such as running multiple containers simultaneously on the same hardware, make less sense to me.
Your questions are valid, and you are correct that one size does not fit all. Frankly, the Google use case ('run lots and lots of as-yet-unwritten stuff on lots and lots of identicalish infrastructure') demands this style of automation, but sacrifices other aspects of the final solution. This is a close use case to public cloud computing providers. However, in many cases (performance, security, etc.) it is not the most desirable model. This dichotomy is, in my view, one of the key points being overlooked by the dual champions of containerization: Docker (non-profitable silly-valley hypetrain for LXC) and Kubernetes (a rethink of Google's internal container utilization, seemingly in response to Docker and EC2).
Here's my answer for "why": DRY. Once you've deployed hundreds of servers using the same exact Ubuntu 12.04 LTS kernel base, why not just completely abstract the OS away and focus the attention on scaling the OS services that matter? Why is that when I decide that I need to scale out, I need to copy every library of the OS and every line of code for the kernel and redeploy it every time I add a node?
> At this rate we'll end up shipping containers as 'apps' to the clients machines with a suitable emulator at some point.
That's exactly the point. Care to elaborate on the downside of such a promise?
> why not just completely abstract the OS away and focus the attention on scaling the OS services that matter?
Because it adds a layer that makes no sense unless you have very specific use cases. Though I see the point regarding people efficiency, that one makes good sense (see other comment in this sub-thread)
> Why is it that when I decide that I need to scale out, I need to copy every library of the OS and every line of code for the kernel and redeploy it every time I add a node?
If you're doing it that way then you are simply doing it wrong. See: chef, configuration management and various deployment services (of which you could argue containers are one off-shoot, but they focus (imo) on the wrong level for all but the largest companies). Containers are like sandboxes with significant overhead for applications that focus on ease of deployment (but that's strange to me because I see that as a one-time cost for most of my own use cases, though I can see how that equation would change if you deploy lots of things configured by lots of different people to a single set of servers, especially if there are conflicting requirements between those deployments).
> Care to elaborate on the downside of such a promise?
That's my personal view of hell, if you don't see any downside there please ignore my vision and continue as if nothing was said.
From open, text based standards to shipping arbitrary binaries in a couple of decades. And I thought GKS was about as bad as it got ;)
> See: chef, configuration management and various deployment services
A chef script is basically the automation of "I need to copy every library of the OS and every line of code for the kernel and redeploy it every time I add a node?" I'm sorry if you didn't pick up on my implied remark. Two problems are then introduced when automating those actions: (1) it doesn't negate the fact that I need to store and deploy a 700M sized OS layer every time I want to add a node (which takes minutes, not seconds with non-containerized configs) and (2) maintaining config scripts can (not always) be painful (version control, rollbacks, etc)
> unless you have very specific use cases.
> Containers are like sandboxes with significant overhead for applications
Again, do you have experience using containers? You seem awfully dismissive ("you're doing it wrong!") in a way that suggests that you might not entirely understand how they actually work...
I'm not so dismissive that I haven't already committed elsewhere in this thread to re-doing a bit of experimentation I did about a year ago on using containers. At the time the performance overhead was such that I failed to see the use case, but since this is a fast-moving field it won't hurt me one bit to update my knowledge.
In a nutshell: running a 'standard' combo of Apache and a DB server, as well as some auxiliary bits and pieces, inside 'containers' a year ago gave significant overhead compared to running those without the containers. I'll re-do this and I'll probably do a write-up because the subject is interesting. This comment and follow-up (https://news.ycombinator.com/item?id=9567623) are by people using this tech in production right now, and their experience echoes mine (but they're very far down the line compared to where I stopped).
Besides that particular use case (where performance and isolation are the key components to be looked at), some interesting points have been made in this thread which have shifted my stance on container use depending on what the situation is. So I don't think it is valid to classify me as 'awfully dismissive'.
FWIW I have not used containers in production (yet) but I'll be more than happy to if I can figure out where and how they can bring me an advantage, which is pretty much how I approach all tools.
What I do on our app servers is create a new user for each app (via configuration management).
That gives me environment separation (A needs Ruby 1.9, B needs Ruby 2.0), resource accounting on a per-app basis, and a repeatable foundation in case I need to re-deploy the server or spin up new instances.
That is only true if you use no parts (e.g. shared libraries etc.) of the host environment. That guarantee that you're not inadvertently depending on something on the host system that might change is what containers give you.
Containers are simply a generalisation of python virtualenvs or ruby rbenvs, to encapsulate the entire environment. If you asked a Python programmer "well, why don't you just buy a new laptop for each project?" you might get strange looks.
Why do you bother with an OS or a container if you want to abstract the OS away? Just PXE boot all the machines and have the application set to run as init. No need to worry about copying every library for the OS to every machine and you only require the minimal kernel that you tell it to boot with.
Sorry, but what costs do containers incur? From my understanding, the resource overhead should be exceedingly minimal (disk space would ostensibly be the largest drawback, if you don't spend time cutting out the fat. Personally, I see this as a tooling issue since fat containers are completely orthogonal to how a container really executes). I understand some of the situation with IO isn't perfect yet, but I haven't heard anyone suggest that it cannot be corrected.
Containers, by themselves? Very little. Containers, as implemented by Docker, Rkt and Systemd? Quite a bit.
Disk: the technologies used for disk isolation (save chroots) perform very poorly, and in some cases can cause resource contention between what would otherwise appear to be unrelated containers. As an example, using AUFS with Node creates a situation where any containers running on the same file system can only run one at a time, regardless of the number of cores. It's silly. Device mapper, on the other hand, is just plain slow (and buggy, when used on Ubuntu 14.04).
Network: the extra virtual interfaces, NATting, and isolation all come with a performance penalty. For small payloads, this manifests as a few milliseconds of extra latency. For transferring large files, it can result in up to half of your throughput lost. Worse, if you have two Docker containers side by side but, due to your discovery mechanisms, one container uses the host device to talk to the other container, you create what is known as asymmetric TCP, which can cut your performance by a fifth or more. Try it out sometime; it's entertainingly frustrating to figure out.
Security: my favorite. What's the point of creating a container for your application if you're going to include the entire OS (and typically not even bother to update it with security patches)? A real simple DoS on Docker boxes would be to get the process to fill the "virtual" disk with cruft. You'll impact all running processes and the underlying OS (/var/lib/ is typically on the same device as /), and create such a singularly large file that it's usually easier to drop the entire thing and re-pull images instead of trying to trim it down.
Sorry if I sound down on the tech, but I've been fighting to make this work for production, and all of these little niggles are driving me batty.
I'm having much the same experience with Docker, and talking to other ops folks who get to actually put it into production, they usually have similar experiences.
Docker is fun and great when it's running on your workstation and coddled by your fingers at the terminal, but there's a lot of gotchas and missing parts when it comes to putting things into production, to be taken care of in a hands-off manner. There still isn't an easy way to centralise logs from a container app's STDOUT. Yes, there are other containers you can install to ship logs (which work for the author's use-case, not necessarily yours) or you can hack together something horrible. If you want to look at container logs, you have to have root rights. You can be in the docker group and have full control over the daemon, but the container log location is root only, and is made afresh with every container. (and don't forget to rotate those logs!)
My latest fun with docker is that one of my docker servers, built from the same source image and running on the same configuration plan in ansible as my other docker servers, fails to start docker on boot. Some sort of race condition, I assume. Basically it fails to apply its iptables rules and dies. People talk about making problems go away with docker, but it's a trope in my team that any day I'm working with docker, I'll be spamming chat with problems I'm finding in it from an ops point of view. And I'm just a midrange sysadmin :) But the point is that adding Docker adds an extra layer of debugging. The app stack still needs to be debugged, and now there's an extra abstraction layer that needs debugging.
Plus, in my particular case, there's the irony of using single-function VMs to run a docker container, which is running the same OS version as the VM :) (my devs bought into docker before I arrived...)
An OS has some pretty extensive insight into the processes that it executes, a container is an OS with a single application, so containers (assuming they take basic precautions for isolation) are not going to be able to schedule with anywhere near the efficiency that multiple processes on a single OS will.
I can see some (mostly potential at this point) security advantages but that's about it (and maybe those advantages will be enough to justify the performance overhead but containers are mostly treated as a silver bullet by the adherents and I'd like to see a bit more balance).
No. A container is just a tarball of user-space code run with some isolation. The kernel is still the kernel. Run multiple containers on a machine, and the OS manages all of their processes at once.
I'm not sure what this means, but at least in the case of Docker containers, you can see the running processes from the host OS. I.e., top shows running node.js processes, etc. If that's the case, maybe it does indeed still have some control.
And I just verified, you can kill a running process from outside a docker container. So the OS does see it and probably can do all its scheduling magic.
What happens when you install two containers, each of which contains the same application? Does that install only a single instance of the application binaries, and are those still executed with the same efficiency (shared binary) as normal, or will that thrash the cache? (I understand that if multiple applications install different versions of some dependency that will lead to trouble, but in a way that's something 'unfixable'.)
How does this perform in practice when they start talking to the outside world at or near capacity? How does it perform when they start talking to each other using some defined interface? (But presumably, no longer regular IPC).
Do you know where I can read more about this? This is the first I have heard that processes running inside containers interfere with the kernel's scheduler. Or, sorry again, but maybe I'm not following your meaning in some way?
Each 'container' is in effect an application plus all its dependencies.
So where one or more active containers could share resources, they won't, which leads to inefficiencies: you'll be running a much larger number of processes than you would otherwise (because of duplication), requiring a larger memory footprint and probably less efficient cache and/or IO utilization.
The deployment of the apps will be easier (which is a definite plus) but machine utilization will be lower and the amount of software running on a single machine will be far larger than otherwise, especially if multiple versions of dependencies are present on the same system.
A container is very much not a single process, it can contain many processes and some of those processes will likely duplicate components in other containers but without the resource optimizations that a kernel can normally perform.
Are you talking about shared libraries, that kind of deduplication?
Although true, that probably isn't really very significant compared to the vast wasted resources of idle dedicated machines. Which is hard to avoid without the vast wasted resources of a highly paid somebod(y|ies)
Yes, but then rather than shared libraries it's the kind of de-duplication the kernel will do when it runs multiple instances of the same binary. This is (normally) very cache- and IO-efficient since it is done at the VM page level.
I also don't quite understand how one can reserve CPU cycles, memory and deliver IO guarantees without the same over-provisioning that you'd have to do using regular virtualization. After all, as soon as you make a guarantee nobody else can use that which is left over, so in that respect I see little difference between virtualizing the entire OS+app versus re-using the kernel (ok, that does save you the overhead of the kernel itself but that's not a huge difference unless you run a very large number of VMs on a single machine).
But you can make guarantees to the critical jobs (up to the total size of the machine) and then let batch jobs (with less time-sensitive requirements) run best-effort in the slack.
In the event that there ends up being no best-effort resources available on a machine for a significant period of time (because all the user-facing jobs are busy and using their guaranteed resources) Borg will shift the starving batch jobs to other machines that aren't so busy.
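As a toy sketch of that policy (invented numbers and job names, and nothing like Borg's actual implementation):

    MACHINE_CPU = 32.0

    def evict_for_slack(machine_jobs):
        """Reservations for 'prod' jobs are inviolable; 'batch' jobs run
        best-effort in the slack and get pushed off first when it's gone."""
        reserved = sum(j["cpu"] for j in machine_jobs if j["tier"] == "prod")
        slack = MACHINE_CPU - reserved
        evicted = []
        for job in sorted((j for j in machine_jobs if j["tier"] == "batch"),
                          key=lambda j: j["cpu"], reverse=True):
            if job["cpu"] <= slack:
                slack -= job["cpu"]  # keep running in the leftover capacity
            else:
                evicted.append(job)  # reschedule on a less busy machine
        return evicted

    jobs = [{"name": "web-frontend", "tier": "prod", "cpu": 24},
            {"name": "logsaver", "tier": "batch", "cpu": 6},
            {"name": "mapreduce-worker", "tier": "batch", "cpu": 12}]
    print([j["name"] for j in evict_for_slack(jobs)])  # ['mapreduce-worker']

The guarantees only ever apply to the reserved tier, so the batch work is what absorbs the variance.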
Isn't this only true if your container build process pulls in the same version of a library in multiple different virtual filesystems? That is, if you are using the same base image for a number of applications and the libraries are installed in the base image rather than the image in which the application resides, shouldn't the kernel recognize the shared libraries being used as coming from the same place and be able to perform deduplication as normal?
Presumably you could arrange things in such a way that several container images shared libraries and such but that would likely interfere with the (desirable) isolation properties and versioning will play havoc with that anyway (since all dependencies are part-and-parcel of a container and nothing stops multiple containers from shipping different versions of the same package).
Where regular virtualization runs multiple kernels (which in turn will run whatever applications you assign to them), containers appear (to me, feel free to correct me) to be a way to 'share a single kernel' across multiple applications, dividing each into domains that are as isolated as possible with respect to CPU, memory, namespaces and IO (including network) provisioning, and allowing multiple versions of the same software to be present at the same time without interference.
The CPU, memory and IO provisioning can be thought of as a kind of 'virtualization light' and the namespaces partitioning should (in theory) help to make things a bit harder to mess up during deployment.
Leakage from one container to another will probably put a dent in any security advantages but should (again, theoretically) be a bit more robust than multiple processes on a single kernel with shared namespaces.
So I see them as a 'gain' for deployment but a definite detriment for performance because it appears to me we have all (or at least most) of the downsides of virtualization but of course you can expect both virtualization and containers to be used simultaneously in a single installation with predictable (messy) results.
I'm really curious if there is an objective way to measure the overhead of a setup of a bunch of applications on a single machine installed 'as usual' and the same setup using containers on that same machine. That would be a very interesting benchmark, especially when machine utilization in the container-less setup nears the saturation point for either CPU, memory or IO.
Seems like you'd have to construct a pretty weird situation in order to blow up your cache, particularly if your load is enough that you can max out a server. Like, if you're running a lot of instances of the same app, deduplication should work fine, right? It only would show up if you've built a bunch of different applications that use the same libraries and consume roughly similar amounts of CPU; if you're running multiple copies of the same application, the deduplication should work just fine.
And that's assuming that it'd work exactly the way you're thinking.
I feel like the win over running VMs (which incur something like a 12% overhead compared to both Docker and running right on the machine for a single application), plus flexibility, plus ease of deployment is worthwhile. I mean, the current situation is running VM images anyway, right? This is a step in the right direction over that, even you must admit.
I wish VMs only incurred a 12% overhead (that assumes absolutely optimal configuration and a fairly static load); it can be substantially more than that, especially when people go 'enterprise' on you for what would otherwise be a relatively simple setup.
But you've made me curious enough that I'll do some benchmarks to see how virtualization compares to present-day containers for practical use cases faced by mid-size and small companies. My fooling around with this about a year ago led to nothing but frustration, it's always a risk to argue from data older than a few months in a field moving this fast, and more measurements are the preferred way to settle stuff like this anyway.
The linux kernel knows what is happening in the containers. There is little performance overhead, and security is not a big advantage of the approach, VMs are better.
The Linux kernel knows what's happening inside the containers, but it cannot de-duplicate any components that are present in multiple individual containers, and it cannot ensure that only one version of a package is present. This will potentially lead to cache thrashing and more IO than is strictly required for a given workload.
Of course it does make it easier to package and deploy applications (and to ensure their correct application) but to pretend that there is no cost associated with this is simply not true.
Why can't it dedup? Presumably, containers share base images, which are read-only, and could thus be deduped in memory by the overlay filesystem implementation.
I can see how there will be less duplication than with virtualization (because you share the kernel, rather than running multiple instances of the kernel) but I can't see how duplication will be less than with dedicated hardware.
It won't - for any single given service, running on a dedicated system will probably be able to squeeze out a tiny bit more performance than running on a shared system with containers, for whatever fraction of the time that you're close to maxing out that server's performance.
But in practice (at Google-scale, anyway), that's dwarfed by the efficiency gains you can get by squeezing lots of things on to the same machine and increasing the overall utilization of the machine. Prior to adding kernel containers to Borg to allow proper resource isolation between the different jobs on a machine, the per-machine utilization was really embarrassingly low.
Another point to consider is that not all jobs are shaped the same as the machines - some jobs need more memory (so if you put them on a number of dedicated machines adding up to the total amount of memory needed, there will be lots of wasted CPU), and other jobs use a lot more CPU and less memory (so if you put them on a number of dedicated machines adding up to the total amount of CPU needed, there will be lots of wasted memory).
By breaking each job up into a greater number of smaller instances and bin-packing on to each machine, you could take advantage of the different resource shapes of different jobs to get better overall utilization.
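To make the bin-packing argument concrete, here's a toy sketch of it. This is emphatically not how Borg's scheduler actually works, just the shape of the utilization win you get from mixing CPU-heavy and memory-heavy jobs on the same machines:

    // Toy first-fit-decreasing bin packing over two resource dimensions
    // (CPU and memory). Only an illustration of the utilization argument.
    package main

    import (
        "fmt"
        "sort"
    )

    type Job struct {
        Name string
        CPU  float64 // cores
        Mem  float64 // GiB
    }

    type Machine struct {
        FreeCPU float64
        FreeMem float64
        Jobs    []string
    }

    func pack(jobs []Job, cpuPerMachine, memPerMachine float64) []Machine {
        // Place "big" jobs first; a common first-fit-decreasing heuristic.
        sort.Slice(jobs, func(i, j int) bool {
            return jobs[i].CPU+jobs[i].Mem > jobs[j].CPU+jobs[j].Mem
        })
        var machines []Machine
        for _, j := range jobs {
            placed := false
            for i := range machines {
                if machines[i].FreeCPU >= j.CPU && machines[i].FreeMem >= j.Mem {
                    machines[i].FreeCPU -= j.CPU
                    machines[i].FreeMem -= j.Mem
                    machines[i].Jobs = append(machines[i].Jobs, j.Name)
                    placed = true
                    break
                }
            }
            if !placed {
                machines = append(machines, Machine{
                    FreeCPU: cpuPerMachine - j.CPU,
                    FreeMem: memPerMachine - j.Mem,
                    Jobs:    []string{j.Name},
                })
            }
        }
        return machines
    }

    func main() {
        // A memory-heavy job and a CPU-heavy job fit together on one
        // 16-core/64 GiB machine; on dedicated machines each would strand
        // the other resource.
        jobs := []Job{
            {"cache", 2, 48},   // memory-heavy
            {"encoder", 12, 8}, // CPU-heavy
            {"web-1", 2, 4}, {"web-2", 2, 4},
        }
        for i, m := range pack(jobs, 16, 64) {
            fmt.Printf("machine %d: %v (free: %.0f cores, %.0f GiB)\n",
                i, m.Jobs, m.FreeCPU, m.FreeMem)
        }
    }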
Hang on there, are you saying you can squeeze in more applications on any given server using containers, rather than just running them in the regular filesystem? That doesn't make sense.
No, you use containers despite the fact that your hardware utilization goes down (mainly because there are no shared pages between applications), because your huge sprawling environment is too hard to change via flag days.
There are multiple definitions of the word 'container'. In the case of Borg, 'container' referred to the kernel resource isolation component, which is completely orthogonal to how the apps were packaged. Borg aggressively shared packages between jobs on the machine where possible.
Being able to strictly apportion resources between the different jobs on a machine (and decide who gets starved in the event that the scheduler has overcommitted the machine) means you can squeeze more out of a given server (by safely getting its utilization closer to 100%).
There are other definitions of the word 'container' that are closer to 'virtual machine' and include things like a disk image which is much harder to share, but that's not what's being discussed in the context of Borg. (Not sure about Kubernetes, that's after my time)
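For anyone curious what the kernel resource isolation component looks like from the outside, here's a minimal sketch using the cgroup v2 interface (it assumes a v2 hierarchy mounted at /sys/fs/cgroup, root privileges, and the cpu/memory controllers enabled via the parent's cgroup.subtree_control; the group name and limits are made-up examples):

    // Minimal sketch of kernel resource isolation with cgroup v2: create a
    // group, cap it at half a core and 256 MiB, then move a process into it.
    // Names and numbers are arbitrary examples.
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "strconv"
    )

    func must(err error) {
        if err != nil {
            panic(err)
        }
    }

    func main() {
        group := "/sys/fs/cgroup/demo-job" // arbitrary example group
        must(os.MkdirAll(group, 0755))

        // cpu.max is "<quota> <period>" in microseconds: 50ms of CPU per
        // 100ms, i.e. half a core.
        must(os.WriteFile(filepath.Join(group, "cpu.max"),
            []byte("50000 100000"), 0644))

        // memory.max is a byte limit; past it the kernel reclaims and, if
        // that fails, OOM-kills the job instead of starving its neighbours.
        must(os.WriteFile(filepath.Join(group, "memory.max"),
            []byte(strconv.Itoa(256<<20)), 0644))

        // Move this process (and its future children) into the group.
        must(os.WriteFile(filepath.Join(group, "cgroup.procs"),
            []byte(strconv.Itoa(os.Getpid())), 0644))

        fmt.Println("now confined to", group)
        // ... exec the actual workload here ...
    }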
It's kind of like ice cream in that it looks beautiful when acquired, but once you have it in your hands it tends to melt into an unrecognizable puddle of sticky shit really quickly.
I think the answer may be twofold: 1) resource utilization: it is harder to achieve with physical servers, which cost more to run and have a larger environmental footprint; 2) it saves developer time not to have to worry about resources, OS upgrades, etc.
I like your main point, however: given that virtualization has X% overhead, I would like to know what X actually is.
I'm completely unfamiliar with unikernels, but I just skimmed a brief description. My first question is: Do they still incur the overhead of running against a Hypervisor like common virtual machines?
Also, possibly a clarification: unless I'm misunderstanding what you mean by "running several linuxes on a linux machine", I believe you may be mistaken about the way containers work. Only one Linux is really running, and that is the Linux the kernel comes from. The other stuff doesn't run unless you tell it to (so, no init, no daemons you don't specify, et cetera). Yes, the image size can be a little fat if you don't trim it down, but if you statically compile, you can have a container nearly as small as your code, with very little overhead beyond the binary itself.
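To make that concrete, here's roughly what the statically-compiled case looks like in practice, a tiny sketch where the container ends up holding little more than the binary itself (the port and names are arbitrary examples, and the build step lives in the comments):

    // Tiny static HTTP server to illustrate the "container nearly as small
    // as your code" point. Built with:
    //
    //   CGO_ENABLED=0 go build -o hello .
    //
    // and dropped into an empty ("FROM scratch") image, the resulting
    // container holds essentially just this binary -- no init, no daemons,
    // no distro userland.
    package main

    import (
        "fmt"
        "net/http"
    )

    func main() {
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "hello from a nearly empty container")
        })
        // Listen on all interfaces; the port is an arbitrary example.
        if err := http.ListenAndServe(":8080", nil); err != nil {
            panic(err)
        }
    }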
I would argue that it's not just the userland: you only have the functionality you need, plus your bits. For example, an HTTP server would have the HTTP stack + your app. There is no userland and there are no drivers; your app is the only thing running on the machine. Your app basically merges with the "kernel".
The kernel has the drivers you need, and the HTTP stack needs libc or an equivalent. Basically it's the same thing, except statically linked and without debug/troubleshooting tools.
yes, really!
It does reduce the attack surface and the sheer amount of stuff involved.
The problem containers are trying to solve is not isolation of environments (though, that's a tremendously good outcome).
The problem is how you develop microservices that are self-contained, easily composable and inherently portable. It doesn't matter if you're running Java 1.1 on AIX 3.0 or Node on Ubuntu 26: if I have a ball of computing providing a service in Dubai and want to move it to Ireland to take advantage of capacity that just opened up, containers make that trivial (and about a million other scenarios).
No matter what, we will want more isolation, not less.
I sort of agree, in that we're just starting to make the transition from whole-OS VMs to app containers, with a rare few going further. But cgroup/jails-type isolation is lightweight enough that we can easily apply it at a much finer-grained level.
Google has 40 programmers dedicated to that project. It's still very beta, btw, and all programmed in Go.
There's also Mesos, and I think you can use both in tandem since they're targeting different things.
Anyway, if anybody is doing or thinking about containers, check Kubernetes and Mesos out. Also, of course, Docker and Rocket. Kubernetes officially supports Docker and will be supporting Rocket.
There are also articles about how rump kernels are better than containers. Just FYI.
I have been using Amazon Web Services and other cloud platforms for over a year now, and I never really felt that VMs were the bottleneck in any way. Can someone explain to me the advantage of containers here?
I know that containers are faster because they don't virtualize the hardware; however, that comes at the cost of security.
But in summary, the basic thing that containerized software offers is inherent portability and simple composition. VMs aren't going anywhere; running a container on top of them just makes everything more powerful.
Can someone explain, or provide an educated guess about, what Google's strategy with Kubernetes is here? Sure, containers are hot now and it is nice to have a stake in the game, but Borg has been one of their key competitive advantages. What is the profit in making an open-source alternative?
I have been wondering the same thing. I suppose it's partly because it was only going to be a competitive advantage as long as containers were not commoditized, which seems to be the direction things are now heading.
Disclaimer: I work at Google on Kubernetes & Containers
I mentioned earlier - https://news.ycombinator.com/item?id=9570639 - but the biggest thing is not isolation, it's portability. You do not care what you're running on, just that you have a certain amount of disk, CPU and network available to you, and the rest you handle yourself. That gives your organization a crazy amount of flexibility in moving that ball of stuff around.
> You do not care what you're running on, just that you have a certain amount of disk, CPU and network available to you
Does Kubernetes provide those guarantees to the containers? And, if so, what does it do once those guarantees can no longer be met, say, due to neighbor growth?
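Not the person you're replying to, but for context: in Kubernetes you declare those expectations as per-container requests and limits. The scheduler only places a pod where the requests fit, and past the limits you get CPU throttling or an OOM kill rather than silent degradation of your neighbours. A rough sketch using the Go API types (assuming the k8s.io/api and k8s.io/apimachinery modules; the values are arbitrary examples):

    // Rough sketch of declaring a container's resource expectations with the
    // Kubernetes Go API types. The scheduler places the pod on a node with
    // room for the requests; the limits bound what it can consume there.
    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    func main() {
        c := corev1.Container{
            Name:  "web",           // example name
            Image: "example/web:1", // placeholder image
            Resources: corev1.ResourceRequirements{
                Requests: corev1.ResourceList{
                    corev1.ResourceCPU:    resource.MustParse("500m"), // half a core
                    corev1.ResourceMemory: resource.MustParse("256Mi"),
                },
                Limits: corev1.ResourceList{
                    corev1.ResourceCPU:    resource.MustParse("1"),
                    corev1.ResourceMemory: resource.MustParse("512Mi"),
                },
            },
        }
        fmt.Printf("%s requests %s CPU / %s memory\n",
            c.Name,
            c.Resources.Requests.Cpu().String(),
            c.Resources.Requests.Memory().String())
    }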
Linux has copy-on-write block devices that make it possible to efficiently layer container filesystems. FreeBSD has no such thing as far as I know; the best you can do involves hard links (correct me if I'm wrong).
You are incorrect: FreeBSD has the ZFS file system as a first-class citizen, and that allows all of these things (and much more). Check out tools like iocage[1] that use it in a very user-friendly manner.
Additionally, a hardlink solution (as you point out) really doesn't sound too outrageous to me. It would be a trivial tool to write for managing immutable things like binaries. Or is there some trouble I'm failing to see? (Possible.)
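For reference, the ZFS equivalent of the Linux layering trick is a snapshot plus copy-on-write clones: each jail root is a clone of a shared base snapshot, so common blocks are stored (and cached) once. A rough sketch shelling out to the zfs tooling (dataset names are placeholders; iocage essentially wraps this workflow for you):

    // Rough sketch of ZFS-based jail/container filesystem layering on
    // FreeBSD: snapshot a prepared base dataset once, then give every jail a
    // copy-on-write clone of it. Dataset names are placeholders.
    package main

    import (
        "fmt"
        "os/exec"
    )

    func run(args ...string) error {
        out, err := exec.Command("zfs", args...).CombinedOutput()
        if err != nil {
            return fmt.Errorf("zfs %v: %v: %s", args, err, out)
        }
        return nil
    }

    func main() {
        base := "tank/jails/base"  // placeholder: dataset holding the base userland
        snap := base + "@gold"     // snapshot shared by all jails
        clone := "tank/jails/web1" // placeholder: one jail's root

        if err := run("snapshot", snap); err != nil {
            panic(err)
        }
        if err := run("clone", snap, clone); err != nil {
            panic(err)
        }
        fmt.Println("created copy-on-write jail root at", clone)
    }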
You're right. My personal reference for unionfs is NetBSD, which doesn't include such a warning [0]; apparently FreeBSD's implementation needs some love.
IMO, if containers are the future (very plausible), then things like AWS Lambda are just as plausible, only a bit further out.
I think this is the case due to the granularity of workloads and what appears to be a continuum of workload containers from metal > VM > containers > lambdas (as first-class workloads).
True, Lambda is DEFINITELY an advance, but it's more like the salmon at a buffet that includes steak and chicken. Some applications will need just the ability to run code (Lambda), some will need defined environments (containers) and some will need total isolation (VMs/bare metal). You'll see a mix of all of these in every mature environment - some things do not fit. For example, it's super unlikely that you'd be able to run a trading app on Lambda (it needs 10GB/sec of direct network access & memory); similarly, it'd be totally unnecessary to run a thumbnail processor on bare metal (though you could, of course).
Good points, but I'm not sure anything about the fundamental Lambda architecture prevents it from being applied to high-bandwidth or low-latency uses.
Especially if Lambda functions are orchestrated to run on the metal... think: a kernel module implements the lambda, then orchestration glue is added up in userland.
With all this container talk, people forget Solaris zones, which were a pretty advanced sort of container in Solaris. However, Sun was honest enough not to present zones as the solution for everything. The most important problem with containerization is that, because multiple containers depend on the same kernel of the host (or VM) OS, any kind of upgrade, especially security patching, is virtually impossible to do without taking full downtime.