> in theory you're only going to do this exercise twice in a decade.
So I've seen things like this in corporations many times and it typically works like this...
Well trained team sets up environment. Over time team members leave and only less senior members remain. They are capable of patching the system and keeping it running. Eventually the number of staff even capable of patching the system diminishes. System reaches end of life and vendor demands upgrading. System falls out of security compliance and everything around it is an organizational exception in one way or another. Eventually at massive cost from outside contractors the system gets upgraded and the cycle begins all over again.
Not being able to upgrade these systems is about the lack of and loss of capable internal staff.
Fossilization and security risk are the cost. I'm dealing with one of these systems that's been around for about five and a half years. It no longer gets security updates, so it has risk exceptions in the organization. But the damn thing is like a spider, woven into dozens of different systems, and migrating to a newer version is going to take, I'm estimating, hundreds to thousands of hours of work on updating those integrations alone. Then you have the primary application and its multitude of customizations: what would have been a stepped upgrade, changing a little bit of functionality at a time, now requires massive rewrites.
The cost either way was likely millions and millions of dollars. But now they are having to do it all at once and risk breaking workflows for tens of thousands of people in a multitude of different ways.
Just upgrading the kernel on one of those "LTS" systems so that developers could start being ready for a kernel that wasn't 3.10 (and it turned out that a core component of our app crashed due to... a memory layout bug that accidentally worked on old kernels)...
I had to start by figuring out all the bits necessary to build not just the kernel, but also the external modules and attendant tools, using a separate backported compiler because the then-current LTS kernel wouldn't compile using the distro-supplied GCC.
I've recently worked at a high-profile company where the move from CentOS 6 to 7 was long and painful (an effort of over a year, IIRC, finished for prod in 2021? but with some crucial corp infra still on 6 in 2022).
In 2022 they had to start a new huge effort to deal with the migration off CentOS 7, and the problems were so painful that it was considered reasonable to build a Linux distro from scratch and remove all traces of distro dependency from the product (SaaS).
that sounds really interesting, can you elaborate on the challenges? why was it so important for them to move off CentOS7, and why didn't they move to RHEL or Alma or Rocky or whatever similar?
The US Government woke up to the fact that granting vendors waivers on upgrade requirements ends up with nothing ever happening. CentOS 7 is EOL'd next year.
Additionally, there was the fun of FIPS-140 and OpenSSL older than 3.0.
Alma and Rocky were considered, but that would still involve (possibly similarly painful) migration as with CentOS 6 -> 7.
Have you seen the pricing for RHEL? We're talking hundreds of thousands of systems. I've never seen raw stats, but I would have been totally unsurprised to see them hit a million instances across all the clouds used, at least occasionally.
Decoupling the software from distro dependencies was seen as a way to future-proof the deployment story and avoid situations like the one we had with CentOS 7, where they really, really would have liked to upgrade some stuff for newer APIs, but couldn't due to the mess with OS-provided dependencies.
decoupling meant something like using "distroless" or static builds (musl?) or simply shipping everything on an alpine/ubuntu/debian/whatever image? (and previously there was no containerization, but now there is)