Why should I do production support?

crossroadsguy · on Sept 22, 2020

You don't get anything out of burning out. You just burn out. And that time is not coming back. This post seems like an apologist talking.

Production support alone is not that much of a problem. What the author skipped (conveniently? or forgot to mention?) is - it's really the "on call" phenomenon that's the problem.

The "typical" on-call - where when you are on-call you are magically on-call 24x7. Yes, during your sleeping hours as well; as if that's less important and the company can avoid spending money to hire dedicated support for those hours and instead make you suffer (yes, it's just that - there's no other name for it like "satisfaction", "learning", "growing" or any of those buzzwords).

You want engineers to do production support? Well, let them do it during normal office hours and only a few times a month. Or heck, let them do it for weeks but let them punch in and punch out at normal office hours. Let them choose to do only one half of the day and have someone else willing to do the other half.

There's no excuse for burning out engineers (esp. unsuspecting youngsters) by pushing them into ungodly hours of work ruining their health among other things while trying to constantly tell them - "do you even realise what a service to humanity you are doing!".

It's just exploitation.

dopylitty · on Sept 22, 2020

The point about on-call is really critical and really on-point.

If a company thinks an application is important enough to run 24x7 then it should staff for 24x7 support. Stealing wages from workers by expecting them to be available 24x7 (on-call) is an absolute abuse.

It also leads to burn out, poor performance during the day (how is a dev's development ability when they were up at 2:30am on an incident call the night before?), and clouded thinking causing mistakes or impacting recovery time during incidents.

nineteen999 · on Sept 22, 2020

> If a company thinks an application is important enough to run 24x7 then it should staff for 24x7 support

And where it really matters, they do. My team and I build and manage a large Emergency Services telecommunication network. We have Tier 1/2 operators on shift work 24/7. Tier 3 staff (programmers, system integrators and administrators) are their escalation point for critical issues outside of business hours.

> Stealing wages from workers by expecting them to be available 24x7 (on-call)

The Tier 3s who are on-call in our environment are on a rotating roster and are compensated nicely for being prepared to answer the phone outside business hours. Frequently they don't get called during their week at all and it's free money.

> how is a dev's development ability when they were up at 2:30am on an incident call the night before?

Easy, as well as the financial compensation, we give them time in lieu. Two hours callout in the middle of the night, two (paid) hours given back on their next working day, or whenever they prefer, subject to availability of other staff.

There are simple solutions to these problems, and where they matter, they are applied. Granted things are very black and white for us as lives are potentially at stake, but any company that wants to have 24/7 engineers available needs to pay for that kind of support.

piokoch · on Sept 22, 2020

"Frequently they don't get called during their week at all and it's free money."

That, I think, is the wrong perspective. When people are on call, they have to be somewhere near their computer/internet connection and be ready to work (so it's not as if you can go to a party and, if a call happens, do some quick fix in a toilet).

On-calls cannot do what they want with their time, so they don't get money for free.

nineteen999 · on Sept 22, 2020

Yeah this is true although our compensation for their time is pretty generous. As a team we are pretty flexible too and if somebody has an event they want to go to they can always find someone else on the team to cover or swap with them for the night.

nemetroid · on Sept 22, 2020

> Two hours callout in the middle of the night, two (paid) hours given back on their next working day

This still sounds cheap to me. I have never worked on-call (and I never plan to), but the exhaustion cost of working two hours in the middle of the night is not equal to two hours of uninterrupted sleep. I would expect to get at least a half day off (paid) for any amount of middle-of-the-night work.

nineteen999 · on Sept 22, 2020

The thousand bucks extra they make a week whether they get called or not makes up for it. It's not like I clock watch my team anyway, if they took an extra hour here or there I don't bat an eyelid, they're all hard workers and they get the job done without complaining.

marcellus23 · on Sept 22, 2020

> Easy, as well as the financial compensation, we give them time in lieu. Two hours callout in the middle of the night, two (paid) hours given back on their next working day, or whenever they prefer, subject to availability of other staff.

Every company should do this but none I've worked at do. To be honest, I just take the makeup time myself.

sokoloff · on Sept 22, 2020

Most companies don't watch the clock for their software engineers. If you get called in the middle of night and take the makeup time yourself, does the company give you time in lieu or not? (By outcome, I would say that they do.)

marcellus23 · on Sept 22, 2020

That's fair. It would be nice if it was an explicit policy though. Otherwise I feel like the company is just exploiting engineers who may not know any better.

kolla · on Sept 22, 2020

Well, two callout hours in the night should be at least 4 hours paid during the day.

fatnoah · on Sept 22, 2020

>Easy, as well as the financial compensation, we give them time in lieu. Two hours callout in the middle of the night, two (paid) hours given back on their next working day, or whenever they prefer, subject to availability of other staff.

I didn't note this in my post above, but I always gave time-in-lieu for any late night activity. However, the thing that REALLY worked best was allowing the Engineers to prioritize reliability. I had to fight to make it happen, but going from nightly incidents to one every couple of months was worth it.

HenryBemis · on Sept 22, 2020

The companies do this for the money. And the people who work in those companies have no real sense of the risks, or they just care more for the numbers and want to roll the dice.

I would not trust someone who I just woke up at 2am to do something. He/she is mid-sleep. They will be prone to errors, they will be super tired, and I just ruined their next 1.5 days that it will take them to recover from that.

This is not a job where you live boxes where intellect is not needed as much, (strength and stamina will also be affected by a mid-night alarm). You want your folks to be 100% on par, otherwise they may make things worse.

HenryBemis · on Sept 22, 2020

Typo: live = "lift boxes" (just spotted that, apologies I can't edit and correct)

MattGaiser · on Sept 22, 2020

They should also allow time and budget for building an application that can run 24/7 without too many errors.

"We need [insert thing manager asks for here] immediately" has consequences.

mtberatwork · on Sept 22, 2020

It doesn't have consequences though, which is why it keeps happening.

burnt_husk · on Sept 22, 2020

Having worked in the tech sector for a while now, having burned out once, and having been on-call 24x7 far more than is healthy, I would say exploitation is the name of the game to a lot of managers.

You are less of a person and more of a means to an end. A tool to achieve something, and some tools are disposable. It can be of career advantage to a manager to burn out engineers. Maybe instead of spreading 24x7 on-call across 3 teams in three timezones, you put it on 1 team in 1 timezone. By doing so a manager can achieve a lot with fewer resources, and hopefully secure their own elevation up the corporate ladder before the cost of their strategy becomes evident.

The cost of burn out I think remains hidden. In technology there's a constant flux of staff anyway, teams being created and dissolved, and in all the noise a few people being exhausted and bailing from the company is hardly noticed. Perhaps they said something before they left, but it's best for everyone in middle management if the burnt out individual is labeled the problem: they were a bad culture fit you see, a grumbler who didn't have what it took.

danielheath · on Sept 22, 2020

Burning out from on-call comes from not being able to fix the underlying causes.

I'm happy to hold the pager if I've also got the right to block/rollback deploys until the system is stable - my current job has had two out-of-hours pages in the past year, and we're in the alexa top 10k so it's not like there's no traffic.

taylodl · on Sept 22, 2020

Bingo! There's nothing more motivating to "prioritize" fixing a problem than being woken up at 2 a.m. being affected by it. I've worked in places where one team writes the code and another team supports it and it's always the same: the code is absolute garbage. It ends up being too complex and having way too many moving parts, and is impossible to diagnose. I've worked in places where the team who writes it, supports it - 24x7. Their code is always super simple, easy to diagnose, and easy to maintain.

I've been at this for 35 years at many companies and working with many teams and it's always the same: if you want good software then make the team creating it also support it. In every case I've experienced it leads to software requiring little to no support, easy to maintain, and easy to extend. Why? Because nobody wants to get up in the middle of the night or work weekends and moreover, they'd rather be adding features than limping along with existing features.

mjayhn · on Sept 22, 2020

I'm in ops and this is a huge, huge headache for those of us on our end at companies where we're not given Google SRE powers to control release trajectory based on failure budgets of some sort.
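For readers unfamiliar with the idea of a failure (error) budget gating releases, here is a minimal sketch. It is not Google's implementation or any particular tool; the class name `ErrorBudget`, its fields, and the 99.9% target are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error-budget tracker for one service over one time window."""

    slo_target: float     # assumed SLO, e.g. 0.999 means 99.9% of requests succeed
    total_requests: int   # requests served in the window
    failed_requests: int  # requests that failed in the window

    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    def remaining(self) -> float:
        # Positive: budget left to spend. Negative: the SLO has been blown.
        return self.allowed_failures() - self.failed_requests

    def releases_allowed(self) -> bool:
        # Policy sketch: once the budget is spent, feature releases stop
        # and the team works on reliability instead.
        return self.remaining() > 0


healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
burned = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)
print(healthy.releases_allowed(), burned.releases_allowed())
```

The point of the policy is that the people paged at night get a lever: when the budget is exhausted, the release pipeline stops until reliability work brings the service back within its target.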

I've worked at too many places that had no SWEs on-call for the on-call alerts that I get, which the vast majority of the time involves throwing a bandaid (as in redirecting traffic, etc) in front of an internal bug that I hope eventually gets fixed once the RFO/etc has been submitted before it hits my NEXT on-call rotation or my poor coworkers.

Without SWEs on my rotation they don't understand the immediacy. They aren't the ones getting their Christmas week interrupted every 4 hours while ops keeps the house running. In Ops having your entire day ruined by various on-call alerts usually feels like you're working without any breaks and nobody even cares.

Anyone want a bad golang developer, wannabe ex-ops person who knows a lot about platform reliability and o11y and wants to focus on the golang end finally? I'll make your team's automation and o11y purr no matter where it is (bm, cloud, global pops, serverless..)..

afberg · on Sept 22, 2020

I think there's only one way to solve this — which I've been unsuccessfully advocating for at my current company — and that is paid voluntary on call schedules.

It creates an actual market for on call work where engineers can simply say no to the extra cash if they don't like work taking up their nights and weekends. If the company is having trouble with no engineers wanting to be on call the pay is simply too low and needs to be increased. It's a job like any other and should be compensated as such.

In the end I honestly believe it will be beneficial for the company not having engineers burn out so quickly. Compensation also clearly sets the expectations — if you're being paid to do it you'll take it more seriously.

Just my 2 cents

nimrody · on Sept 22, 2020

But does it give engineers a good incentive to improve product quality and reduce the number of production incidents?

Where I work we are not on-call. Nevertheless, I try to help the ops team when they encounter issues. This does make you improve logging and error handling since you know it takes a lot more time when it's difficult to filter logs for the interesting events.

Engineers not exposed to production issues and customers will never understand why you need these extra measures.

tjpnz · on Sept 22, 2020

Our on-call system gives everyone an allowance per service plus additional time off if you're on-call on weekends and public holidays. You get both regardless of whether you get paged. Getting paid to do nothing is a great incentive for pushing quality, especially when you're on-call for more than one service in a given week.

tybit · on Sept 22, 2020

I work somewhere that does this and it works surprisingly well. It’s rare to find an employer that’s willing to look after their staff though.

hinkley · on Sept 22, 2020

> "do you even realise what a service to humanity you are doing!".

I don’t know who needs to hear this besides me twenty years ago, but if you want to do charity then go home at 6pm and volunteer at a real charity. Don’t do it for a wannabe robber baron who will not share with you. Don’t do it for someone where even an emotional payoff is years away or may never come.

Find something else you care about and help some people just because. Not because you’re getting under-paid and over-guilted to do it.

tasogare · on Sept 22, 2020

Well, even "real charities", including very famous ones, are quite shady too: volunteers do work for free while donations go towards C-exec 5-6 digit salaries.

m463 · on Sept 22, 2020

Various places I've worked put certain people on-call. They were salaried but somehow paid extra for the time they were on-call. It seemed to be voluntary and the people who did it liked the extra cash.

On the other hand I had a friend at a very large and well known company. He got a job offer and was hired into one department, but he wanted to take a little time off between jobs before he started. They somehow convinced him to start saying there was a holiday coming up and he could take the time off then.

and as soon as he came in he started getting calls 1am 2am 3am etc...

so he left.

And they cajoled him back saying things were different and he finally bought it and went back.

same thing happened again, and he quit for a final time.

One part of the problem was that he was a US citizen working with a bunch of H1B visa folks and the company could get away with that sort of stuff. H1B folks will say yes sir, no sir because their dreams of living in the US are tied to keeping their job at all costs. and then the bad work culture festers.

throwaway0a5e · on Sept 22, 2020

I'm not gonna advocate for having engineers directly field customer calls at 3am (because that's a recipe for unhappy customers) but as someone who fields customer support calls/tickets as a large part of my job I feel very comfortable saying that a "sizeable enough to be a big problem" share of engineers will not build systems that are supportable without engineering escalations unless they get support trying to escalate an issue or conferencing them in at 3am from time to time.

fatnoah · on Sept 22, 2020

>The "typical" on-call - where when you are on-call you are magically on-call 24x7.

I ran the Engineering org for a startup and we had a small, 3 person Ops team that handled initial triage of events. About 75% of these issues were Engineering-related. My solution was to a) create an on-call rotation for Engineering and b) allow the Engineers to prioritize reliability work.

It sounds like a no-brainer, but I had to fight with the rest of the exec team to allow b) to happen, since it came at the expense of the product roadmap. I eventually won the fight and our nightly on-call volume went from 1-2 incidents per day to 1-2 every few months.

Vindication came about a year later when we were acquired by a large company. As part of the due diligence process (including 18 hours with me going over technical details in front of 30 senior folks from the acquiring company) we got major kudos for having a level of reliability that far exceeded what they typically saw for a company our size.

hinkley · on Sept 22, 2020

It’s usually a case of taking responsibility for things you have no power over.

When they ask it’s usually after a giant hole has been dug, and patterns have been set. If you knew from the outset that you would be the support team then you’d have prioritized some other tickets. You’d have increased the estimates on others. You would have refused to work on these three, you would have argued vigorously about these four decisions, and you would have insisted your boss fire “That Guy” months ago because his code is garbage and his only real skill is articulate deflection.

This group of folks wants several somethings for nothing. One of them is labor, another is somewhere to assign blame. They are grooming you for failure and we all deserve better.

mtberatwork · on Sept 22, 2020

> There's no excuse for burning out engineers (esp. unsuspecting youngsters) by pushing them into ungodly hours of work ruining their health among other things while trying to constantly tell them - "do you even realise what a service to humanity you are doing!".

In the US, this is happening across the board and not just in tech. The expectation to always be available [often without monetary compensation] is sadly the new normal. Without strong labor laws in place, this implicit form of exploitation will never cease.

rdtwo · on Sept 23, 2020

Everyone is doing it is the excuse

lifeisstillgood · on Sept 22, 2020

The cure - Unionisation

seriously

throwaway0a5e · on Sept 22, 2020

Police officers have pretty much the strongest unions out there and the junior folks on the force (i.e. exactly the kind of people who get put on call in tech) generally wind up stacking absurd combinations of shifts in order to be paid competitively. Rail workers, another strongly unionized profession, have it no better.

I get that you like unions but just because you have a hammer doesn't make every problem a nail.

Galanwe · on Sept 22, 2020

The cure - proper labor law.

Seriously

Ma8ee · on Sept 22, 2020

Which is one of the things that Unions are good at pushing for. Who do you think creates laws? A single guy walking up to his Member of Congress?

Galanwe · on Sept 22, 2020

I guess it depends on everyone's personal experience with unions.

From where I come from, unions are mainly a way for lazy employees to get immunity while doing nothing all day long.

I would still maintain that if you live in a country where it's legal to be called at any time, any day, then you have a third world class labor law - go downvote US :>

lifeisstillgood · on Sept 22, 2020

Yeah this is like two hands drawing each other.

dtoma · on Sept 22, 2020

Historically, I don't think we've made much progress in terms of pro-worker labor laws without said workers first having to get together to create bargaining power. By default the power is in the hands of the ruling class, and they have no interest in handing out paid holidays, reasonable working hours, or sick leave.

Galanwe · on Sept 22, 2020

Well, there is still that old thing called universal election, where basically people can vote for their leader.

If other things are considered more valuable than proper labor law (like, say, building a wall), then I guess voters get what they deserved.

bowmessage · on Sept 22, 2020

heh, I started searching on Spotify...

hinkley · on Sept 22, 2020

The album you’re looking for is called Disintegration, not Unionization.

shoo · on Sept 22, 2020

playing it in reverse should do the trick

hanniabu · on Sept 22, 2020

You're talking as if they care about the exploitation. It's done purposely. They simply don't care when it means greater profits. They also believe there's an endless line of devs they can burn out and throw away.

zeckalpha · on Sept 22, 2020

If the engineers aren’t oncall, who is? Is it okay to exploit non-engineers? If anything, it is less exploitative to have those who are empowered to improve their situation oncall.

detaro · on Sept 22, 2020

Given that the parent comment is clear about the problem being the 24x7 expectation: Someone paid to work or be available that shift, engineer or not?

gertrunde · on Sept 22, 2020

There are too many variables and factors involved really.

Is someone covering the full 24 hours, or is it just out of hours?

Are they under a one week in X rota, or expected to be permanently available?

Are they expected to cover anything/everything, or are they just the escalation point for their specialist area?

What other support is available? (i.e. if the shit really hits the fan, are you left to deal with it alone?)

On average, how many times are callouts expected? There's a big difference between half a dozen times a week and half a dozen times a year.

How are the extra duties remunerated/compensated? Is there time off in lieu?

There's a massive spectrum there ranging from hugely unpleasant and not worth the money to not particularly onerous and helpful extra cash/time off.

zeckalpha · on Sept 23, 2020

No, they said “normal office hours” are for engineers. Someone needs to take the other shifts.

magicalhippo · on Sept 22, 2020

Our support folks handle the on-call support. They do one week each on rotation, they get some extra compensation and the following friday off.

If it's a serious issue they can't handle they might wake up one of us programmers, but usually they can find some temporary fix or workaround until the next morning.

zeckalpha · on Sept 23, 2020

Do permanent fixes get applied or do the mitigations just get re-applied again and again?

Are the support folks considered engineers?

MattGaiser · on Sept 22, 2020

People seem to get stuck in support work. That would be my aversion to doing too much of it. At some of my prior workplaces, there have been people who have been so good at support that they never got assigned to do new development work. Engineers far more experienced and senior than I was (or currently am) were dealing with trivial issues as they were good at it while I got the nice greenfield project.

They had to quit to get out of support.

x87678r · on Sept 22, 2020

This has been a huge problem for me. I've always loved working with customers and dealing with real problems under pressure. So I have enjoyed my time on support rota - but the L1 guys quickly learn I'm better at support than other team members, so everyone contacts me directly. The grumpy devs who pretend they don't know anything about the live system get interesting projects, no interruptions and a much better resume. I've learned to say no. It's also a good reason to move teams, as you're not the super experienced guy who has to fix the urgent gnarly support issues.

jasonlotito · on Sept 22, 2020

Of course, if they were the ones who originally built those products, they should support them. Why should people that create software that requires so much support they quit over it be entrusted with yet another greenfield project without fixing the stuff they built in the first place?

runawaybottle · on Sept 22, 2020

Some people have a switch that turns them into bug fixing demons. I know I get a rush out of fixing prod issues. People notice that, especially when you just jump into foreign code that you never touched and come out with a fix within an hour or two.

woutr_be · on Sept 22, 2020

I work in finance, where we have clearly defined processes for production support, simply because developers don’t have access to production environments.

However, production support teams don’t have a real understanding of our application and how it’s built. So most of the time you have engineers on call with production support, telling them how to debug the problem and come up with relevant logs.

It’s incredibly infuriating and time consuming, and I absolutely hate doing it this way.

90% of the time you also get incredibly vague bug reports with irrelevant logs, and a description of what they think the problem is. Most of the time you need to spend another day finding correct logs and somehow debugging it. Most teams log every single request with all parameters and payloads because they can just replicate the problem locally instead of relying on production support.

We’ve long advocated for either having dedicated support or having engineers on some sort of schedule that can do support.

pawelmi · on Sept 22, 2020

I'm curious, why can't the people that created the system take part in production support? You've mentioned finance, which I presume requires a high level of security, but at the same time there are also people on the other side, just not knowing the system first hand and perhaps having skills different from knowing how to debug software. They can see all the data and in theory modify system behavior, e.g. modify/install any binary. Why are developers less trusted? Is it some kind of logic or regulation or just "the way it has always been done" in finance?

x87678r · on Sept 22, 2020

> just "the way it has always been done" in finance

Twenty years ago, for many systems, devs could do whatever they wanted in production. There were insider trading scandals and, combined with SOX, regulators cracked down on it, so now devs have lost at least write access. If you have an old system that relied on knowledgeable devs to fix stuff, it's a terrible situation where people just quit and no one can support it.

woutr_be · on Sept 22, 2020

Part of it probably stems from that it's always been done this way, and the processes haven't evolved over time. Another part is also to protect the bank against rogue employees, it wouldn't be the first time a developer made changes against a production database.

You're right that production support has access to those systems, and could potentially make changes and install different binaries, but the number of people that can do that is extremely limited. Every change also requires a change request that needs several approvals; to request data you need another data request.

HenryBemis · on Sept 22, 2020

Adding to that (because only when I use this example people understand clearly):

CompanyA is using ITS OWN assets, funds, IP, etc. You own it, you can burn it to the ground.

BankB is holding other people's money. You can't go make a mistake, a bank losing 100m of OUR money and say "oops my dev made a mistake".

Edit: similar expectations are in publicly traded companies (aka companies where they use OUR money - we give them our cash and they give us stocks). This is why external auditors (e.g. Big4) do not like when they see "poor change management processes", such as inconsistent SoD.

woutr_be · on Sept 22, 2020

> BankB is holding other people's money. You can't go make a mistake, a bank losing 100m of OUR money and say "oops my dev made a mistake".

Not only that, but once that happens, regulators will come in, and everybody involved can be held liable. Not only will the bank be fined, but depending on how bad your fuck up was, you'll probably end up losing your job and might face further penalties.

So in the interest of everyone, it's best to just avoid it altogether.

pawelmi · on Sept 22, 2020

Thanks for the answers in this thread. I didn't really mean to have YOLO-type random access to production. I was hoping there are ways in between to bridge that gap between dev and ops in those systems, similar to how it has been done e.g. with SRE in applications with more relaxed security. I was hoping that some solutions along that spectrum were being adopted more, like the mentioned centralized logs stripped of private data or granting temporary audited access. But it seems with legacy systems this is much harder to implement. I believe there is an optimum balance where actually fewer mistakes could be made if both the people developing and the people operating the system had more visibility into each other's field. As for willful fraud attempts, well, you can't rule out that devs would do it, so of course there should be various barriers preventing that and proper change management, but, my sampling bias aside, when I look at some recent scandals in finance, take e.g. Wirecard as the last one, there is more often higher management involved than devs.

woutr_be · on Sept 23, 2020

At least where I work, every team can essentially decide what they do, as long as they follow a few basic guidelines. So newer projects usually have centralised logging and automated deployments. But sadly there’s still a wall between development environments and production, for good reason I think. Not everyone should have access to production data, so only a limited number of people have access. Data is of course anonymised when sent over.

But yeah, some legacy systems could be 5 years old, and that’s a long time in tech.

You’re right on the visibility part, but sadly that’s an organisational issue, you need higher ups to change this.

goatinaboat · on Sept 22, 2020

> I'm curious, why can't people that created the system take part in production support. You've mentioned finance, that I presume require high level of security, but at the same time there are also people on the other side, just not knowing the system first hand and perhaps having skills different from knowing how to debug software

They can, they just can’t have direct access to live systems due to separation of duties. But there are methods for dealing with this, like centralised logging so a developer never needs to see the original log file on the problematic box.
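As a rough illustration of centralised logging with private data stripped out, here is a minimal sketch of a redaction step that could run before a log line leaves the production box. The function name `redact` and the two patterns are assumptions for illustration only; a real deployment would need a reviewed and far more complete set of rules.

```python
import re

# Hypothetical redaction rules; these two patterns are illustrative assumptions,
# not a complete or audited list of what counts as private data.
_PATTERNS = [
    # Long digit runs that could be card or account numbers.
    (re.compile(r"\b\d{13,19}\b"), "[REDACTED_NUMBER]"),
    # Simple email-like strings.
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[REDACTED_EMAIL]"),
]


def redact(line: str) -> str:
    """Return the log line with every matching pattern replaced by a placeholder."""
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line


print(redact("payment failed for card 4111111111111111 user=bob@example.com"))
```

Developers would then read only the redacted copy in the central store, so separation of duties is preserved while the logs stay useful for debugging.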

woutr_be · on Sept 22, 2020

> But there are methods for dealing with this, like centralised logging so a developer never needs to see the original log file on the problematic box.

Indeed, most of our newer services have been doing that, so developers have direct access to logs, which makes our lives a bit easier.

But sadly a lot of systems are outdated, and nobody wants to invest time and money into implementing things there.

HenryBemis · on Sept 22, 2020

Seems like we were writing in parallel.

"We don't trust" a dev. The change management processes demand the existence of 1) Dev, 2) Librarian (we used to call them that)(that would review and transfer the code, or review and compile the code), 3) the prod sys admin.

Some orgs may have a slightly different setup, but in some form or another these (general) rules apply.

Today, with tools like CyberArk, it is easier to grant temporary privileged access to a dev for production support. We also have the tools to trace/monitor/record access, so it makes the process auditor-friendly.

user5994461 · on Sept 22, 2020

Also in finance with dedicated support teams. Our support was great, some of our guys in Asia were outright fantastic at debugging.

To be fair, being great wasn't enough, their job was only possible because the company had unified tooling. A single deployment solution that was deploying near 1M tasks a day in the company, allowing all employees to lookup what is running where and see logs.

This made me appreciate just how useful it is to have both dedicated support AND unified tooling. The average company couldn't benefit from having folks on rota because it's impossible to figure out where anything is running.

woutr_be · on Sept 22, 2020

We still don't have a single deployment solution; we don't even have a single hosting solution. We can choose between 3 different cloud providers, or dedicated servers. (They're pushing for cloud now)

The thing is, this all is pushed down from management. In my previous project, we tried to automate as much as possible, but at the end of the day, our production support still wanted to deploy manually. Our business still wanted to see manual end-to-end tests with screenshots.

Then there's also different regulations in certain countries where you need to host your application and database in the country itself, so that's another solution.

Working in finance can be a real eye opener sometimes.

arminiusreturns · on Sept 22, 2020

>However, production support teams don’t have a real understanding of our application and how it’s build.

This reeks of bad documentation to me (which finance is notorious for). If a dev has to be on to support normal prod ops, that's largely due to errors in documentation and often to poor tooling. Sometimes those errors aren't so much the devs' fault because of management decisions, usually related to understaffing, but I hate how prod support gets shit on so often for failing to fix an issue when it's not really their fault.

woutr_be · on Sept 22, 2020

You're not wrong, the entire thing is because of poor management decisions and poor processes. I don't really shit on specific people, more on the entire process.

> This reeks of bad documentation to me

Not necessarily, you can document your entire application, but production support only looks at the logs, and does a data extract based on what they see. It would be far more beneficial if you had someone who has a clear understanding of the application so that they can help with debugging and actually solving the problem.

At the end of the day, production support are teams who help with 10-20 applications; it's impossible for them to truly understand specific applications. They receive a bug report from the business, investigate and extract logs, then pass it to the relevant development teams. If you need extra info, well, tough luck: you can reply to the ticket and wait for it to be picked up again. It's no surprise companies like this move so slowly.

HenryBemis · on Sept 22, 2020

Adding to the above 100% spot-on comment: "we" in finance/banking do this because segregation of duties is mandatory, none of that DevOps nonsense ;)

In the off chance that a dev has the unique knowledge to solve a problem, they may get the firefighter/temporary elevated access needed, but will have to document the reason and the dev's actions very very well, because both internal and external auditors will zero in on that.

sdevonoes · on Sept 22, 2020

I like to write software but I don't want to be on-call if the software I wrote breaks at 3am. I do take my job with professionalism, I do write tests for most of it (not 100% coverage, but 100% coverage of the critical parts), I do monitoring (and answer and fix alerts if they happen during working hours) and I don't deploy on Fridays (and don't allow people to deploy on Fridays).

My code will crash sooner or later. I already know that. I don't write 100% bug-free code. But I cannot accept to give 100% of my time one week per month or so to a company in exchange for money. I just don't understand why people can't understand that I can be a professional only during 8 hours per day, but not more.

0xbadcafebee · on Sept 22, 2020

This attitude almost always turns into the following:

On-call: "Hey devs, I'm being woken up at 3AM because your app sucks. Please fix it." Devs: "Sure, no problem."

4 months go by

On-call: "These alerts are still coming in at 3AM. Did you fix the issue?" Dev: "We have a lot of work, we can't dedicate all our time to some minor problems, we have a deadline."

Next week, Devs are put on-call.

The alerts are fixed in two weeks. Site reliability goes up. Apps suddenly become more resilient to failure.

Honestly, the whole attitude of not wanting to work more than 8 hours is privilege. Most of the rest of the world works long hours. As a dev, you get a good salary and a job you don't have to break your body to do. The least you can do is be completely responsible for your own code.

And it helps you as an engineer. Like the article points out, it creates empathy for the users and product support engineers, it helps you improve architecture and app design, and it helps you understand different failure domains. You won't learn all that on your own time, especially without the scale of production.

cabraca · on Sept 22, 2020

Putting the blame on the devs is too easy. There are a bunch of management layers between the on-call team and the devs in your example. If management does not prioritize those alerts, it's not the fault of the devs and it's wrong to punish the devs for it.

> Honestly, the whole attitude of not wanting to work more than 8 hours is privilege. Most of the rest of the world works long hours. As a dev, you get a good salary and a job you don't have to break your body to do. The least you can do is be completely responsible for your own code.

Unless I signed a contract that states I will do on-call, I'm not gonna do on-call. It doesn't matter how long the rest of the world works.

0xbadcafebee · on Sept 23, 2020

Say a plumber installs a new sink in a house. A year later, a weak joint fails and springs a leak, and the plumber is called to fix it. They could ignore the error, charge the person for the repair, and go on with their day. Or they could accept that it failed due to their own mistake and not charge the person.

Now let's say the plumber works for a contractor. The contractor tells the plumber not to tell the customer about the mistake, because it would make the contractor look bad. The plumber can choose to own up to their mistake to the customer and right the wrong, or they can do what the contractor wants and charge the customer.

On the one hand, the plumber might decide to charge the customer. They keep their job, the contractor makes money. On the other hand, maybe the customer is poor and can't really afford the repair. If it doesn't get fixed, the customer'll have to deal with the brokenness themselves, even though the plumber knows they caused it. But then again, maybe the plumber is broke and really needs this subcontracting gig.

There is no simple answer there. But I think that in the context of software development, in most cases, the answer is simple. Most of us are fortunate enough to have the extra time and money to spend fixing our own bugs, regardless of what we're told to do during the regular 9-to-5. When we have the opportunity to take responsibility for our actions, we oughta.

sdevonoes · on Sept 23, 2020

> This attitude almost always turns into the following: [...]

Well, that's another problem (the dev not being able to solve a bug that reappears at 3AM).

> The alerts are fixed in two weeks. Site reliability goes up. Apps suddenly become more resilient to failure.

I always wondered why DevOps has the "Dev" in its title. At least, in most of the companies I have worked at, it was the DevOps people who were on call (paid), but they were very picky regarding what they could touch/work on (they almost never touched application code... we should call them "Ops" then, no?).

> Honestly, the whole attitude of not wanting to work more than 8 hours is privilege.

And it's a privilege I'm thankful for. What's wrong with that?

> As a dev, you get a good salary and a job you don't have to break your body to do.

We do break our body to do software engineering (our brains, to be more specific). If you think physical work >>> brain work, well, that's relative. Every person is different, and for me, brain work is as taxing as physical work.

> And it helps you as an engineer. Like the article points out, it creates empathy for the users and product support engineers, it helps you improve architecture and app design, and it helps you understand different failure domains

I know I can become better by working harder and smarter (it's obvious), but I just want to be the best version of myself by putting at most 40h/week. Isn't that something honourable in itself? Or does that make me a "bad engineer"?

0xbadcafebee · on Sept 23, 2020

DevOps isn't a person or a role. People who are called a 'DevOps Engineer' simply have a bad title. Our industry is very stupid.

But yes, they are often Ops, and in general they shouldn't be troubleshooting the application. They can isolate system issues by looking at and correlating metrics and events in the high level systems, but only a developer of the code in question can efficiently and effectively diagnose a specific bug in code in a reasonable amount of time.

You've seriously not heard the phrase 'check your privilege'? In general it means not to take it for granted, but specifically to not take advantage of it at the expense of others. Just because you aren't forced to care for others doesn't mean you shouldn't. If you have privilege, your moral imperative is to ensure it's used for good, not evil (indifference is often the latter when affected by those with privilege).

How is software dev breaking your body? Are you typing with your face?

If you want to be the best version of yourself within 9-5, please fix the bugs that alert on-call the first morning you hear about them. Most developers never do this, which is why they are put on call.

Last week I was on-call. The vast majority of the time, when it's the application at fault, I don't know who the hell wrote it or where to begin with troubleshooting it, so I need a dev to look at it, because we don't have time to waste if the product is down. This doesn't seem like a controversial idea to me: you are hired by a company to make a product that works, so if your product isn't working, you need to fix it. I can't always fix it. I need help sometimes. It's literally your responsibility as a professional adult to help.

If the product you work on has to work 24 hours a day, you have implicitly agreed to support it during that time. Otherwise you can get a different job where if the product breaks, it can wait until morning. I've had those too, and not being on-call was great! But with my current role I knew it would require some on-call time, because those are the products I'm helping to build and run.

So I spend part of my time improving the product support team so that everything is as resilient as possible, but also so that devs can understand how their code affects the products. This means being as involved in architecture and design decisions as with deploying infrastructure. And I stop at 40hr/week too, but once every 8 weeks, I get a few calls about broken shit, and I put in the time to help prevent those from re-occurring, because most people I work with don't.

sdevonoes · on Sept 23, 2020

> How is software dev breaking your body? Are you typing with your face?

I wrote about it. I consider my brain just like any other muscle of my body, so after 8 hours of work, my brain is quite tired. Perhaps my wording was wrong.

> If you want to be the best version of yourself within 9-5, please fix the bugs that alert on-call the first morning you hear about them. Most developers never do this, which is why they are put on call.

Agree.

> you are hired by a company to make a product that works, so if your product isn't working, you need to fix it

Agree as well, but my contract states "40 hours per week". I'm being plain straightforward here, I'm not willing to give anything for free to any company (does that happen the other way around? Never). Not sure what's "wrong" with this.

> And I stop at 40hr/week too, but once every 8 weeks, I get a few calls about broken shit, and I put in the time to help prevent those from re-occurring, because most people I work with don't.

I have nothing against that, and I respect it. I guess the other way around should work as well, right? Like, if someone is not willing to give more than what's stated in their contract, that should be fine for everyone. What you call "help" sure it's help, but companies are taking it as free labor. I have nothing against companies making money, but I do not support companies making money without having to pay employees for that. But the main point of my first comment was: even though companies are paying for being on-call, one should have the option to say 'No, thanks. I don't want to give you my free time in exchange for more money. I already have enough with my 40h/week schedule', and that should be fine for everybody.

samatman · on Sept 24, 2020

> How is software dev breaking your body? Are you typing with your face?

Desk work comes with significant bodily stresses which ergonomics can only partially ameliorate.

I'll take it over digging ditches, sure. But I won't do more than six hours of it a day, that's as long as I can healthily stand, and several hours longer than I can sit without pain.

condercet · on Sept 23, 2020

This attitude is terrible and exploitative. Employees have a duty of fidelity and good faith to their employer, sure - but highly skilled and in-demand engineers should demand better of their employers.

stronglikedan · on Sept 22, 2020

> Most of the rest of the world works long hours.

Most of the world works labor jobs, and studies have shown that the body can work longer than the mind without burnout.

robmsmt · on Sept 22, 2020

enter devops

ocdtrekkie · on Sept 22, 2020

If you aren't doing production support, you don't actually know your product. You aren't connected to the pain points your users experience and you miss what is, to your support team and your customers, the glaringly obvious.

I would argue all developers should be required to do some support work.

ed25519FUUU · on Sept 22, 2020

Having skin in the game will make you a better engineer. You’ll get better at the non-sexy things: monitoring, alerting, testing, etc. You better believe a person who is paged for software at 3:00 am has an incentive to make that software more reliable.

ocdtrekkie · on Sept 22, 2020

I'm not even talking 24/7 pager work. But just tackling some support tickets as part of your job so you see where people are having issues with your product.

Too often I see BigCorp development teams seeming blatantly oblivious to where their pain points are, and it's because they aren't forcing their developers to do support. They're pushing code, but they aren't pushing code that solves real problems for people.

MattGaiser · on Sept 22, 2020

If you work for a Bigcorp development team, you are probably just trying to keep a product owner happy. The user being happy or not will not ever impact your life, especially since you might well have moved on before it ever goes to prod.

lostdog · on Sept 22, 2020

But the directors and VP's aren't getting paged at 3am, so the incentives of the organization still go to lower quality software.

johnbellone · on Sept 22, 2020

If the directors and VPs aren’t getting paged at 3am you need better leadership. The first thing that I did was subscribe to all outages. And I let the operations center call me anytime if they need help resolving a production issue.

cgrealy · on Sept 22, 2020

Sorry, this is a terrible argument.

No one expects customer support people to write code. Why? Because they don't have the skillset.

Yet people who make this argument seem to think any moron can do support.

The skillset for an engineer is not a superset of a customer support person.

Have your engineers sit in on support, by all means, but actually making them DO support will result in unhappy engineers and sub-par support.

Do not undervalue a good support person. They have a whole suite of skills engineers often don't have.

TuringNYC · on Sept 22, 2020

Don't most developers do level III support? Ultimately if the first and second lines of defense cannot solve the problem, from everything I've seen, it goes to an engineer. If it is your product, I'd assume it comes to you. This has been the case at almost every company I've been at.

ocdtrekkie · on Sept 22, 2020

Don't dismiss the value of tier 1 and 2 support requests. Maybe 30% of your tier 1 requests are some confusion that support knows how to alleviate, but an engineer could make the product more straightforward to eliminate those support issues entirely.

You aren't looking for the hardest problems, you're looking for the problems your users hit the most that an engineer could reduce in the product.
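As a rough illustration of "the problems your users hit the most", here is a minimal sketch that tallies tier 1/2 tickets by category and surfaces the most frequent ones. The function name `top_issues` and the category labels are hypothetical; it assumes each ticket has already been tagged with a category by support.

```python
from collections import Counter


def top_issues(ticket_categories, n=3):
    """Return the n most frequent ticket categories with their counts.

    `ticket_categories` is assumed to be an iterable of category labels,
    one per support ticket (a hypothetical tagging scheme).
    """
    return Counter(ticket_categories).most_common(n)


# Example with made-up ticket tags.
tickets = ["login", "login", "billing", "export", "login", "billing"]
for category, count in top_issues(tickets, n=2):
    print(f"{category}: {count} tickets")
```

The point is the ordering by frequency rather than by difficulty: the top of this list is where a product change might remove the most support load.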

jjirsa · on Sept 22, 2020

> You aren't looking for the hardest problems, you're looking for the problems your users hit the most that an engineer could reduce in the product.

This is the hard part of software.

jjirsa · on Sept 22, 2020

Why should a company hire two lines of support to shield an engineer from bugs they introduce?

octodog · on Sept 22, 2020

1. Not every support request is the result of a bug.

2. Not every bug is directly related to the product owner/team. Bugs can be introduced from different teams/processes.

3. Requiring senior engineers to do L1 and L2 support is likely a misallocation of resources when less senior people can handle those issues.

ocdtrekkie · on Sept 22, 2020

1. The best team I've ever worked with considered every support request to be a bug. Is the user confused? At minimum, that's a documentation bug. It's also possible the product itself should be doing a better job stepping the user through a process or self-diagnosing.

3. https://news.ycombinator.com/item?id=24554399

hinkley · on Sept 22, 2020

4. Engineers have a bad habit of renegotiating scope in front of the customer without realizing they’ve done it.

imhoguy · on Sept 22, 2020

As usual, "that should be easy to fix, just a line of code here". At least from the SWE perspective the fix may be easy to make, but the architecture or product management team may not be so happy about such a pitch, especially when the customer responds with "oh great, when?".

jjirsa · on Sept 22, 2020

3 is a common mistake, like the one engineers make when they pretend that the hard part of programming is typing

inerte · on Sept 22, 2020

I am not sure most of the calls are bug reports. There's a customer support level, they triage. Sometimes what the customer thinks is a bug is a design decision (or lack of a feature) and the customer support knows another way of accomplishing the same task. Or they know it's not supported at all.

silisili · on Sept 22, 2020

I largely agree. My company has dedicated CS, a second tier 'triage', and well...me. I always prefer if customers just email me directly. CS is a frustrating, cost saving measure. And often I'll get tickets, 3 weeks later, like 'customer said they have an issue.' What's the point again?

seanwilson · on Sept 22, 2020

Another pro for consultancy work: you can choose contracts that don't require you to be on call for production problems so this isn't forced upon you.

I'm not saying you wouldn't learn from working on production, but whether it's worth the stress is another question. In terms of software development, it's hard to think of a worse feeling than when you do a production deploy, you hit refresh on the website or whatever it is, and it shows a fatal error, then there's a mad scramble to roll back the change and figure out quickly what went wrong before the consequences grow too great. Most of the time bosses + coworkers aren't that understanding about it either and get into finger-pointing.

user5994461 · on Sept 22, 2020

Consultants charge 200% for out of hours work, if not more. I don't think they mind working extra hours.

They're never offered extra work though. Companies are always willing to wait for Monday when they are asked to put money on the table.

efitz · on Sept 22, 2020

I really like the author's first point, that one of his learnings was empathy for customers. Being on support for a product you didn't develop really calibrates you to understand what a product needs in order to be supportable, and where customers have problems with products. My time years ago in support was invaluable in shaping me as an engineer; I regularly push back against features that I know will be difficult to support or difficult for customers to understand.

AkshatM · on Sept 22, 2020

It sounds very much like the poster is describing an on-call rotation rather than "production support", which is a very different thing altogether.

Production support is customer support: responding to chat messages or communications from users.

An on-call rotation, on the other hand, involves responding to production incidents and mounting a proper incident response.

The Google SRE workbook has a great chapter on the subject: https://landing.google.com/sre/workbook/chapters/on-call/

johnbellone · on Sept 22, 2020

I don’t know why you’re being downvoted; these are two entirely different roles.

brailsafe · on Sept 22, 2020

I feel like the author is referring more specifically to troubleshooting issues when production goes down, which I'm fine with if I have no other people asking me for updates. But, I burnt out at my last job trying to do support, because there was individual support baked into every contract for our SDK and not remotely enough people to handle it all. I was hired as a software developer, not a customer support person, and they are not the same thing. It's unfortunate, because it was my first and highest paying gig after realizing that I also have ADHD, and it was a good company. Thing is, if I have a problem to solve or task to complete, I'm not going to think about how long it's been since I replied to whoever about their pet issue. I'm just going to zone in on my thing, and if the guy next to me doesn't break me out of it by cracking eggs on the desk and burping, then I'll stay on that thread till it's done. That's how my brain works, and expecting otherwise is naive. Anyway, this constant context switching and battling my apparent insufficiency killed my spirit for the work and I turned into a blob of productivity. It's as stupid as expecting me to cook while programming, because either the food, myself, or the code will get burnt.

greesil · on Sept 22, 2020

At some large tech companies, production support as a software engineer does not seem to be a path to promotion, unless you are a junior level engineer. And yet, solving some of the bugs encountered in production environments, especially with a heterogeneous set of users, requires expert level knowledge of a particular software library. It is probably a great way to coast, so I hear.

goatinaboat · on Sept 22, 2020

> It is probably a great way to coast, so I hear.

Or to stagnate, depending on how you look at it

hinkley · on Sept 22, 2020

Coasting is just stagnation in a different reference frame.

greesil · on Sept 22, 2020

One part of your life stagnates while the other parts move forward.

aprinsen · on Sept 22, 2020

Teams I've worked on have not had dedicated support staff, but rather engineers rotate 24 hour support duties on a weekly basis.

I have always had mixed feelings about "on call". I dread my turn on the rotation because the imminent threat of a prod issue has a psychological impact on my entire week, even off hours, and usually for a day or two after.

If everybody on the team feels that way, maybe it can act as a forcing function for product quality. I've seen this work on teams that already cultivate a strong sense of ownership.

On the flip side, it really stresses me out, and I sometimes resent that I'm not getting paid overtime for 24hr on call days. Maybe that's just baked into an engineer's salary these days, though...

MattGaiser · on Sept 22, 2020

The engineers at the organization I just departed (at least the ones in the support rotation which did not include me) got paid in both money and extra time off for their time spent on support tasks.

hinkley · on Sept 22, 2020

I find myself both applauding that and shrinking away thinking “perverse incentives”.

What I want is to run an engineering organization as if you should never have to call us. And if you do you either get chewed out for making a frivolous call, or we’re falling all over ourselves because that thing that is happening should definitely not be happening and we’ll be looking at how to keep that from ever happening again, again.

kevinmchugh · on Sept 22, 2020

Yeah, I had a job where only certain teams had on call rotations. Anyone in those rotations got an extra half day off per week on call.

I've also seen folks spotted extra time off for really gnarly oncall shifts. Folks should push to have such accommodations standardized.

user5994461 · on Sept 22, 2020

A half day off for doing 7 full days of extra work. It's peanuts. Better not do the rotation and have your 2 days of weekend. (I assume they got zero extra pay and got called occasionally).

kevinmchugh · on Sept 22, 2020

On call shifts were very quiet and typically didn't involve any pages or interruptions. It was certainly not even a full day of extra work typically.

jonpurdy · on Sept 22, 2020

Anecdotal of course, but a couple of jobs ago I did on-call rotations for a week every few weeks. I got paid some additional money and time off, but the psychological impact was too great for me and it just wasn't worth it.

wisecoder · on Sept 22, 2020

99% of companies don't pay for on-call rotation / production support. Exploiting H1Bs for on-call support is a very common practice in the IT industry.

jake_morrison · on Sept 22, 2020

A lot of the pain around production support is easily solved by having staff in multiple time zones.

A good structure is to have first line support be relatively generic ops people. They can handle problems related to infrastructure, e.g. hardware failures, network problems, or issues that can be handled by adding resources. The deployment process should be consistent enough across applications that they can e.g. roll back to a previous release.

This covers the majority of production problems. After that, it's time to bring in someone who understands the details of how the application works. If the dev team is geographically distributed, then someone is available during working hours. Otherwise, we have to get someone out of bed.

If the dev team has done their job right, this should be a rare occasion. Making the dev team fully responsible for the reliability of the application means that they are motivated to make it reliable. Otherwise there is a tendency to have an underclass of ops people who get abused.

A fundamental mindset here is taking responsibility for the user experience, including reliability. If this is not owned by the product development team, then who?
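The tiers described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a real paging system: the category names in `OPS_CATEGORIES`, the `Developer` class, the 9-to-17 working hours, and the rule of paging the first developer in the list when nobody is at work are all hypothetical. It also assumes the developer list is non-empty.

```python
from dataclasses import dataclass

# Hypothetical categories that generic first-line ops can handle,
# loosely based on the examples in the comment above.
OPS_CATEGORIES = {"hardware", "network", "capacity", "rollback"}


@dataclass
class Developer:
    name: str
    utc_offset: int        # whole hours relative to UTC
    work_start: int = 9    # local working hours, start inclusive
    work_end: int = 17     # end exclusive

    def is_working(self, utc_hour: int) -> bool:
        local_hour = (utc_hour + self.utc_offset) % 24
        return self.work_start <= local_hour < self.work_end


def route_incident(category: str, utc_hour: int, devs: list[Developer]) -> str:
    """Decide who handles an incident, following the tiers sketched above."""
    # Tier 1: infrastructure-style problems stay with first-line ops.
    if category in OPS_CATEGORIES:
        return "first-line ops"
    # Tier 2: prefer a developer who is currently in working hours.
    for dev in devs:
        if dev.is_working(utc_hour):
            return f"{dev.name} (in working hours)"
    # Nobody is at work, so someone has to be woken up.
    return f"{devs[0].name} (paged out of hours)"


team = [Developer("alice", utc_offset=0), Developer("bob", utc_offset=8)]
print(route_incident("network", 3, team))
print(route_incident("app-bug", 3, team))
print(route_incident("app-bug", 20, team))
```

With a team spread across time zones, the last branch is reached less often, which is the argument for geographic distribution in the first place.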

mtberatwork · on Sept 22, 2020

Convincing folks at the top of the food chain that more staff is needed is one of the most difficult things to do.

g051051 · on Sept 22, 2020

Our company is trying to move to "devops" and having dev teams on pager duty, doing manual reporting, etc. They seem surprised at the amount of pushback from the devs.

comeonseriously · on Sept 22, 2020

On the one hand, a lot of devs see production support as beneath them. On the other, I think managers seem to think they'll get a net productivity increase doing this, but there isn't one. Development gets much harder when you have to context shift several times per day. That being said, I do feel devs should do prod support. It gives them a better feel for the apps they build, how they're used, and where the customer pain points are. But, I also feel there should be layers of support below the devs; devs should only get the ticket when they're the only ones left who can figure it out.

g051051 · on Sept 22, 2020

It's not "beneath" them, but it's an entirely different skill set. If you want me involved, approach me like a partner and we can work together, don't try to suddenly give me a job I can't do.

hermitcrab · on Sept 22, 2020

I am an independent developer living off 3 software products I created and sell. I have done all my own support over the last 15 years. While it can be frustrating (especially for B2C) it is also my superpower, as it gives me so much more insight into how I can improve my products. I only do support via email (not phone or chat).

jp0d · on Sept 22, 2020

It depends on what and how much one is learning from it. I once interviewed a guy for an ETL developer role. He was supporting an ETL application. In four years, all he had done was restart the application in case of any issues and it somehow worked for him. I've seen my fair share of such cases.

bcbrown · on Sept 22, 2020

This seems very telling:

> I no longer work at Gojek

MattGaiser · on Sept 22, 2020

Why? Engineers switch jobs all the time.

momokoko · on Sept 22, 2020

> Engineers switch jobs all the time.

This also seems very telling

MattGaiser · on Sept 22, 2020

Raises are nice?

hyko · on Sept 22, 2020

...because your company wants you to work two jobs for the price of one?

comeonseriously · on Sept 22, 2020

Engineers absolutely should do prod support. But... There should be layers below. It should only come to engineering when nobody else can figure it out.