Reflections on 10k Hours of DevOps

dboreham · on March 31, 2023

Going to dispute this one:

> The value of a CI/CD Pipeline is inversely proportional to how long the pipeline takes to run.

Priority #1 is : test your product properly. This means "if it is supposed to do something, then you should have a test to check that it does that thing".

Sometimes it just takes some time to put together and run such tests.

In my experience some people (usually my perception is they are younger, more impatient people) who have read somewhere that tests should run really quickly, will argue that Priority #1 above doesn't matter, in the interest of their patience being assuaged.

Long ago we had product builds that took hours and tests that often ran overnight, and sometimes over days. And we shipped sh*t that worked!

mjr00 · on March 31, 2023

The author isn't saying that long-running tests aren't valuable, though. They're asking the question: how much more valuable would they be if they ran in minutes instead of days?

Slow CI/CD and slow releases are a big problem because they change your entire workflow. Critical bugfix that needs to be shipped? If your CI/CD takes minutes, not a problem: fix it, test it, ship it and have a 5 whys. If it takes days, well, you either have a side-channel process where all your senior engineers have to gather around and speculate about the impact of the change, then you have them sign off, and get the VP to sign off to approve the hotfix, then hope for the best. Or you just wait for days while your customers complain on Twitter.

Tests should, ideally, run really quickly. Not at the expense of correctness or completeness, of course, but in most cases I've seen, slow tests happen because nobody bothers optimizing test code, not because they're inherently slow.

Sometimes CI/CD is slow for even dumber reasons. At one place I was at (1000+ SV-funded hot startup in the mid-10s), tests got relatively slow enough (~40 min) that VPs got involved and a "test speed task force" with a mandate of reducing CI/CD time to less than 20 minutes. This was announced at a company all-hands with a ton of fanfare. In the first week, one of the engineers did some quick glancing at the logs and noticed that the very old build scripts were still hardcoded to deploy to two servers which no longer existed... and the DNS resolution for those hosts was taking 10 minutes to time out. Needless to say, the task force accomplished its goal, though in a bit of a hollow victory.

bb88 · on March 31, 2023

The bottom line is that developers experience higher cognitive load (and therefore more stress) when writing code in untested/undertested code bases. This is where burnout starts. Unit tests and testing suites ideally will lighten cognitive load and stress, while increasing confidence.

Instead of fretting about the unknown sides effects of the code you are writing, you can be confident the unit tests will flag it if something goes wrong.

Arguably any "hotfix" (other than 3rd party zero day security patches) required is a testing escape, and ideally those should be rare. And usually there's a subset of tests that can give you 95% to 99% confidence the hotfix will work.

mjr00 · on March 31, 2023

> The bottom line is that developers experience higher cognitive load (and therefore more stress) when writing code in untested/undertested code bases. This is where burnout starts. Unit tests and testing suites ideally will lighten cognitive load and stress, while increasing confidence. > Instead of fretting about the quality of the code your writing, you can be confident the unit tests will flag it if something goes wrong.

I agree with all this. I'm not in any way saying you should do less testing because it makes CI/CD faster. I'm saying that there's a 95%[0] chance your tests are taking hours or days because they are doing something suboptimal that can be improved. Generally, it's a combination of

1. lack of overall test ownership and structure -- for example, every team maintaining their own UI integration tests and repeating an expensive login process unnecessarily, or repeatedly writing new test data to a database rather than working with an existing dataset where possible, or (worst of all) relying on shared resources like databases and servers rather than having ephemeral environments per-branch, resulting in queueing of tests;

2. people who write tests not giving a damn, and this not-giving-a-damnness accumulating over time. If you sneak in a unit test that takes 300ms when it should have taken 10ms into the build, it's likely nobody will notice. If you have a suite of 1000 tests and everyone does that, your test suite now takes 5 minutes instead of 10 seconds;

3. management and engineering leadership also not giving a damn unless it becomes really untenable. Usually devs just grumble about a slow build and nobody does anything, especially in larger corps where there's less feeling of ownership. It's very hard to have a business justification as to why you would pay people to investigate speeding up tests. By the time the test suite is taking 4+ hours, it feels "easier" to just add processes to mitigate this, like making the build server run nightly, rather than address the fundamental issue of tests being slow.

[0] There are likely specific domains where this is untrue, but rarely in the generic "CRUD software with background data processing & storage" case.

pydry · on March 31, 2023

>The author isn't saying that long-running tests aren't valuable, though.

They kind of are though, since they don't acknowledge the trade offs inherent in reducing the duration of the integration tests.

>In most cases I've seen, slow tests happen because nobody bothers optimizing test code

IME they typically they come about because realism comes with a price tag. An actual UI is more expensive to load than a mock, an actual webserver is slower than a fake web server, a postgres database spins up slower than an SQLite, a real API call takes longer than a mocked one, etc.

mjr00 · on March 31, 2023

> IME they typically they come about because realism comes with a price tag. An actual UI is more expensive to load than a mock, an actual webserver is slower than a fake web server, a postgres database spins up slower than an SQLite, a real API call takes longer than a mocked one, etc.

Sure, but spinning up a postgres database or running a Selenium test should change your test times from being executed in seconds to minutes, not seconds to hours or days.

> They kind of are though, since they don't acknowledge the trade offs inherent in reducing the duration of the integration tests.

As I say in another comment in this thread, there is very, very, very rarely a legitimate reason for integration tests to need multiple hours to run. Even in the worst case, you can just parallelize the tests on different hardware; if you can't parallelize, that is a problem with your tests and not a real excuse.

pydry · on March 31, 2023

I agree that parallelization is often a good way to speed up your CI pipeline, but the OP's suggestion of shifting integration tests to unit tests is frequently the worst thing you can do because it very often changes a test that tests into a test that is just a pale imitation of the code under test - a mimetic test.

I worked on a big old ball of mud once where business logic was smeared across 6 or 7 different microservices. The ONLY way to gain confidence that the app did the right thing was to test it end to end which, of course, took ages, and meant hours long CI pipeline (longer if not parallelized).

My coworker tried to replace some of those tests with "faster" unit tests around bits of those microservices but all they did was "lock down" the horrible architecture such that it became impossible to refactor and consolidate the business logic without breaking a bunch of unit tests.

rakoo · on March 31, 2023

How many times have we complained the state of the infrastructure is bad and we should spend some time to fix it, but business imperatives prevent us ?

In your story the team was given time and resources to actually do it, at last. If it wasn't, the problem might have continued for months or years still unnoticed. It's not that hollow a victory.

lisasays · on March 31, 2023

Long ago we had product builds that took hours and tests that often ran overnight, and sometimes over days.

And I miss those days as well. I think the author's point was not a wholesale dismissal of the value of extensive testing, as such.

But rather - whole point of a CI/CD pipeline is that you're supposed to get the feedback ... reasonably fast, right? That's what the "C" is supposed to stand for, pretty much (along with adjacent matters of being explicitly tied to your commit/merge process). So they aren't saying there's no value of all that testing -- but rather, if it takes hours and hours (when it was supposed to be nearly synchronous with your commit process) -- its values does attenuate.

Or if you're spending resources making it faster - rather than figuring out what's wrong with your code or models in the first place - that's another tax to pay.

rickette · on March 31, 2023

100% this. The number of discussions I've had about the duration of a build/deploy pipeline... Like you said: having proper tests is prio #1, execution speed is second.

Great so your build finishes in 3 minutes but you have no idea if the product works. While my build takes over 30 minutes but I fully confident the product works. Which one would you rather have.

aliasxneo · on March 31, 2023

I suspect this is the result of an overbearing importance on cycle time. I know a colleague who is founding a company solely based on reducing CI cycle time because business owners seem to view it as a critical measurement.

justin_oaks · on March 31, 2023

> Reproducibility matters

> Immutable infrastructure removes a whole class of bugs.

Story time. Back around 2010 I was tasked to figure out why a server application written in Java was running out of memory. It happened often enough that the application had to be restarted daily. I was not the one who maintained the application, and I wasn't the one maintaining the servers it ran on either. I was just someone good at solving these kinds of problems.

The odd thing was that the problem only existed on one server in production, even though the other servers should be running the exact same thing. I ran load tests on a staging server, also running the same code, and was unable to reproduce the problem.

With VisualVM and some heap dumps handy, I narrowed down the problem to database-related objects, specifically objects from the Oracle JDBC driver were taking the most memory. I tried checking what was different between the servers that would cause one to have an issue and not the others. Was one getting more load than the others? Were the processors different? Memory? Disk speed?

To make a long story short, I spent entirely too long debugging the problem and eventually found out that one of the servers had a different version of the Oracle JDBC driver than the others. And it seemed that particular version of the driver had a memory leak bug. It took me so long because I assumed that nobody would have been dumb enough have different library files on different computers.

I could kill whoever put that file on one server and not the others! It must have been done manually. I never did find out who did it or why.

Immutable infrastructure would have prevented this problem, but this was back in 2010 where we ran our own servers and immutable infrastructure wasn't a possibility.

If we had sufficiently reproducible deployments then we could have just redeployed the application and the problem would have gone away. Alas, the Oracle library wasn't part of the application code, but instead was part of library code that was installed on the computer and never changed (or so I thought).

It's a godsend to be able to deploy containers that not only contain your application code, but also your server software and their dependent libraries. Reproducibility is so much better that way.

mjr00 · on March 31, 2023

Yep, similar story as well from me, as I'm sure anyone who was doing software in the pre-cloud, pre-Docker era has...

Worked at a consulting firm doing enhancements to a legacy hotel reservation system, and we had just finished our latest feature, which was integration with another, even more legacy reservation system (Visual FoxPro, yum). This was old-school consulting style, so we developed internally with our test systems and no access to the real systems, then shipped over an installer .exe and let the client, who ran ops, handle the installation and deployment into IIS. We built it, they deployed with no issue, test reservation worked, easy.

A month later we got a report that about 20% of online reservations using this system were failing. Everything was working on our test systems, and it was working 80% of the time on their systems, so we assumed that it was a weird race condition with our booking service, which ran mostly asynchronously. We added a ton of additional logging to that service to try to figure out what was going on, sent them an updated version to deploy, and waited a bit.

A week or so later I went over to the client's site to look at their production systems with their ops person. They had 5 servers servicing requests. We looked at servers 1-4, everything seemed fine. Open up server 5... wait, where's the log file? Process was running, IIS was responding, but nothing was coming out. Check the permissions on the folder, and lo and behold: the fifth server had a typo in the permission ACL entry for a folder needed for the legacy integration, which by fortunate coincidence was also where we had set the log files to write. Fixed the typo, the problem instantly went away completely.

And being a consultant/client relationship, of course there was a lengthy legal dispute afterwards about whose fault this bug was, as that determined if it was covered under the original statement of work or we could charge more for it... one of the experiences that made me realize I needed to be working for a product company, not a consultancy.

ysleepy · on April 1, 2023

I've seen similar stuff, for me -XX:+HeapDumpOnOutOfMemoryError is a great thing, especially if set up to keep at max one dump to avoid out of disk errors.

I feel managing dpendency jars separately was weird even in 2010, I always do a shaded jar or at least an maven assembly folder as a deployable to make sure the server runs the same combination as on my machine.

ilovecaching · on March 31, 2023

The Makefile one is important. Make is still the best tool for managing a dependency graph, adding project commands, building artifacts of any kind, whether it be building docker images, binary objects, or spinning up integration environments. You can use make anywhere and everywhere, it's powerful but it's also simple on the surface and battle tested.

For DevOps in general I have learned to KISS, always know how your tools work, and focus on observability and low hanging automation. If you can't observe or understand how your systems works, you're screwed. You need metrics, logs, statistics, in one place that you can easily build queries with. You should be able to see everything at the fleet level down to the innards of each machine with no troubles.

pydry · on March 31, 2023

>Code is better than YAML.

This is too broad a backlash against it. Where YAML is used in problem domains where the configuration complexity is strictly limited and "etched in stone" it works very well. Nobody tends to champion YAML in this use case because it becomes somewhat invisible. It Just Works and nobody complains, but nobody raves either.

In domains where you have to start templating YAML, it fails horribly of course. This is YAML being used in a domain where it either shouldn't ever have been used or where APIs of some kind should have been provided alongside to accommodate dynamic use cases.

What "works" as configuration is often very hard to know up front, but the above attitude is throwing the baby out with the bathwater rather than seeking the correct balance.

Turing complete code is inherently more difficult to parse, understand and modify clearly than a config file and much more susceptible to technical debt.

>Your Integration Tests are Too Long

This is also a wrong attitude, I think. Faster feedback loops are great, but an obsession with speed typically leads to a lot of very quick tests that test very unrealistically - often missing entire classes of bug.

I've worked on a bunch of projects which had test suites that completed in under 10 seconds and never caught any bugs. Meanwhile, the team would end up leaning on manual QA to replace "slower" integration tests and the feedback loop on that made 5 hour CI runs look entirely worthwhile.

brodouevencode · on March 31, 2023

Seconded on YAML. I often wonder if the people who say this are attempting to build their own tooling or have such sufficiently complex systems (which I would also question as to why they are so complex). I've spent the past 10 years in very large media and retail establishments and cloudformation/terraform/k8s manifests have been useful *enough*. I couldn't imagine trying to template/macro yaml files without first looking at some sort of modularity native to the platform.

sleepybrett · on March 31, 2023

Have a fast suite and a long suite. Run the fast suite per commit, require a successful of the long suite before the merge.

Aperocky · on March 31, 2023

What is DevOps?

I've always thought dev ops was where one is both a dev and ops, but formally dev, and then also responsible for operations.

Apparently that was not the case, there is dev, and then there are devOps. Apparently the ops part is more important for devOps, and dev is actually geared towards the ops end and not actually developing features.

I was confused because I design and develop features, I build ops tools, I build and deploy CI/CD pipelines and I go oncall for bugs that happen. I thought I was devops, but apparently this is not how devops is defined in the broader industry. And apparently everyone else have much stricter role separation.

Jtsummers · on March 31, 2023

It has at least 3 meanings:

1. DevOps: A dev who can do ops and will do ops (because their employer won't hire actual operations experts)

2. DevOps: Using more development concepts and tools in operations (i.e., an operator who can do dev, but often in more of a support/deployment than product development capacity)

3. DevOps (original): A way of working that promotes collaboration, rather than separation, between development and operations (among other things).

hdjjhhvvhga · on March 31, 2023

4. A cloud engineer.

This one seems more and more common, even though CI/CD has no inherent relationship with development or deployment in any public cloud.

Sahbak · on March 31, 2023

5. YAML engineers :)

0xbadcafebee · on March 31, 2023

I agree with most of his points, but disagree on some:

- The value of a CI/CD pipeline is the value of its output. Doesn't matter if it took 5 seconds or 5 days. I have worked on pipelines that were shitty and long but delivered enormous value, and quick ones that were noisy and useless.

- The "code is better than yaml" post misunderstands what YAML is, and then says you should use a general purpose programming language for configuration. But if you're programming, you're not configuring, you're programming. My point is, both code and YAML are completely wrong solutions to the problem. Look at configuration file formats for 20+ year old programs as an example of good configuration. (Hint: none of them are either YAML or code)

- "Release early, Release often" only works for certain products/services. It's not a good idea to deploy 10x a day to an airplane in flight. (No, contrarian who's about to reply, it's not a good idea, just shut up.)

- Declarative configuration isn't a thing. It's just configuration. Non-declarative configuration isn't a thing either, that's just instructions. Declarative doesn't mean anything. It's a red herring people repeat because they heard someone describe it and it sounded smart.

- Do not 'set -o pipefail' by default. It will only waste your time when your scripts start failing because some command in the middle of a pipe returned non-zero while still outputting the result you wanted, and you spend 3 hours adding debugging to figure out where the error was coming from and write some extra error handling code. Just check if the result of the pipe looks accurate. Same result, fewer failures.

- If you have to do a simple task more than 3 times, check if (T*N)>A, or if it's a value chain bottleneck. If neither is true, don't automate yet.

lazyant · on April 1, 2023

> My point is, both code and YAML are completely wrong solutions to the problem. Look at configuration file formats for 20+ year old programs as an example of good configuration. (Hint: none of them are either YAML or code)

I take configuration as custom values for an environment, read at start time by the executable code, that are easily readable for humans (a non-programmer ops person could change) and easy to produce/read/manipulate.

From that point of view, coding is different as you say, while it gives you more power it's just not very readable.

All the config files I've worked on for 20+ years are a way of expressing KEY=value, with some nesting allowed to not repeat yourself (.ini, Yaml, json, even XML basically) so in my mind they are all basically equivalent.

0xbadcafebee · on April 3, 2023

Some examples of config files that are not KEY=VALUE: Apache, X11, resolv.conf, sudoers, syslog.conf, mtab, inittab/gettytab, ntp.conf, passwd/group/shadow, ldap.conf, PAM, openssh, CUPS... just a few on my desktop.

These files support a wide variety of functionality beyond what a data-serialization format like YAML support. The functionality is designed for humans to input data germain to the configuration of the application in an easy way. That is to say, it both makes the admin's job easier, and tells the program what it wants to know.

Data serialization formats (JSON, YAML) are particularly poorly suited for configuration. JSON doesn't even support comments. YAML does, but then forces you to correctly format the document in a non-human-intuitive way, making the admin's job harder, and not necessarily telling the program what it wants to know. It has been well established how many problems humans encounter editing YAML files, to the point that people just take out features to make it less problematic, though of course some problems remain. They could eliminate these problems by using an established robust configuration format, or even making up their own, but modern programmers have a poor education in software design and avoid anything that isn't trendy or popular.

.INI is a decent enough format, with the caveat that it was never a standard, so parsing is always application-specific. At that point you might as well just make your own configuration format that serves the admin and programmer better than trying to force complex configuration into simple KEY=VALUE pairs.

XML is sometimes actually useful, but time has shown that it is not human friendly, and often not even program friendly.

nikau · on April 2, 2023

> - Do not 'set -o pipefail' by default. It will only waste your time when your scripts start failing because some command in the middle of a pipe returned non-zero while still outputting the result you wanted, and you spend 3 hours adding debugging to figure out where the error was coming from and write some extra error handling code. Just check if the result of the pipe looks accurate. Same result, fewer failures.

Exactly how do you "check if the result of the pipe looks accurate" without having to get a human to check the log file?

The whole point of automation is to get rid of human interaction even if it requires some upfront work to implement error handling where appropriate...

0xbadcafebee · on April 3, 2023

Well it's context-dependent, but there are many different strategies.

Most well designed command-line programs output to STDERR if they have an error, and STDOUT if they don't. So the simplest possible method is just to see if STDOUT was 0-length or not. If it's not 0-length, then assume the command completed successfully. grep can help validate the output. Depending on the command, it may resist partial failures; for example, many commands that output a JSON blob will wait until they have all their data before outputting the blob.

If the script you're writing depends on always working 100%, then you can do a lot more work to check for errors. First you would enable set +e, because if you exit on error with pipefail, you can't even check what the error was from PIPE_STATUS. Instead you can inspect every pipe result and tell your user what part failed, and decide if you should continue, print only potentially partial results, retry the operation, or fail immediately.

Most programmers don't handle error conditions properly and just immediately fail on any error, which itself causes errors which needs a human to inspect a log. Worst case, the shell script without pipefail just keeps working, when with pipefail it would have failed a lot.

theden · on March 31, 2023

>Makefiles are unreasonably effective.

I'm torn, I like Makefiles, but I've seen a lot of unreadable Makefiles with a lot of interpolation and ugly escaping, including cases where what should have been a proper shellscript (his point 25) crammed in a Makefile. I think Makefiles should be dead simple to see what a target does

Overall I agree, but IMO people ought to drop the 10k hours Gladwell term, to be frank it's bullshit.

spacetime_cmplx · on March 31, 2023

After using https://github.com/casey/just in a medium sized project, I wholly recommend anyone even slightly frustrated with Makefiles to switch to this.

C++ : C :: Just : Make

More sensible defaults, fewer footguns, just as expressive (pardon the pun), but still nearly the same syntax and paradigms so you don't have to learn a completely new scripting language.

The biggest downside, however, is that it's not installed by default on many systems and it's not in the official repos of some major distros. If that's a dealbreaker, stick with Make.

hiepph · on March 31, 2023

Makefile is like Bash. It's effective, ubiquitous and pleasant to work with simple logic.

In my case, if I need to scale to a more complicated problem, I would use Rakefile and Ruby (in replace of Makefile and Bash).

silverwind · on March 31, 2023

Every tool can be misused and Make has a ton of common pitfalls for novices to fall into. Properly written Makefiles do feel like magic (in a good way) when they run.

justin_oaks · on March 31, 2023

> Linear git history makes rollbacks easier.

Not only that but it makes it easier to read the history. I haven't yet been convinced of the benefit of merging pull requests instead of rebasing them. The history ends up so convoluted.

lta · on March 31, 2023

I'm likely to share this doc to a few of my customers. I think I'd well summarized and reasonably balanced. I like very much they it explains well trade offs I tend to select naturally and sometimes direct have the energy/patience to explain

I'd be a bit harsher on lock-ins, which are rarely worth the loss of control in the long run, but otherwise <3

d_sem · on March 31, 2023

I think the comment "The value of a CI/CD Pipeline is inversely proportional to how long the pipeline takes to run." is a good starting position to have, that is then supplanted by expertise in what actually adds value in a given projects workflow.

brodouevencode · on March 31, 2023

> Immutable infrastructure removes a whole class of bugs.

I'd like to see some examples of this.

mdaniel · on March 31, 2023

Ask and ye shall receive: https://news.ycombinator.com/item?id=35389882

We run Bottlerocket (previously Flatcar, nee CoreOS) which itself is running SSM for "break glass," so: only about 3 directories (and children) can physically be written to, no package manager, no sshd, and in Bottlerocket's case even getting to a shell on the host is so complicated I have it pinned in our slack channel. I haven't tolerated "general purpose" OSes for production machines in over 8 years because jokers just love to "just this one thing to troubleshoot..." exactly like the story above said

justin_oaks · on March 31, 2023

Immutable infrastructure could prevent an idiot from 1) making changes in production 2) without performing automated testing first and 3) without adding the changes into source control.

While I don't use immutable infrastructure myself, I can see it being of use in some organizations.

fknorangesite · on March 31, 2023

> prevent an idiot

Doesn't have to be an idiot - just a normal, fallible human.

justin_oaks · on March 31, 2023

Quite right. I initially thought of someone who edits things in production as an idiot. But it could be someone who doesn't know any better, or whose management has them to do unwise things.

We normal, fallible humans still benefit a lot from using good methods that take human error out of the equation.

lazyant · on April 1, 2023

any manual undocumented change may lead to hard to debug problems, an example in these comments https://news.ycombinator.com/item?id=35389882

esjeon · on March 31, 2023

This is gold. Really. Well balanced, and much details.

dudefaux · on March 31, 2023

Really good read. Well balanced. // someone in faang