Thanks for posting this, I favorited it - having carved out a weird niche in my career as an "infra" guy, I inevitably deal with a lot of IAC. I run into this attitude from devs a lot - they are indeed annoyed by managing infrastructure, because it is innately not like software! I know I'm reiterating what you said, but it is so important to understand this.
Here is a thing I run into a lot:
"Our infra is brittle and becoming a chore to manage, and is becoming a huge risk. We need IAC!" (At this point, I don't think it's a bad idea to reach for this)
But then -
"We need to manage all our IAC practices like dev ones, because this is code, so we will use software engineering practices!"
Now I don't entirely disagree with the above statement, but I have caveats. I try to treat my IAC like "software" as much as I can, but as you pointed out, this can break down. Example: managing large terraform repositories that touch tons of things across an organization can become a real pain once you combine state management, automation, and normal CI/CD practices. I can push a terraform PR, get approved, but I won't actually know whether what I did was valid until you try to push it live. Compare that to software, where you can be reasonably confident that the code is going to mostly work how you intend before you deploy it. Often in infra, the only way to know is to try/apply it. Rollback procedures are entirely different, etc.
It also breaks down, as others have noted, when you try to use terraform to manage dynamic resources that aren't supposed to be immutable (like Kubernetes). I still do it, but it's loaded with foot guns, and I wouldn't recommend it to someone who hasn't spent years doing this kind of thing.
> I can push a terraform PR, get approved, but I won't actually know whether what I did was valid until you try to push it live
Our concession to this risk was that once a merge request was approved, the automation was free to run the apply pipeline step, leaving open the very likely possibility that TF shit itself. However, since it wasn't actually merged yet, we could push fixes until TF stopped shitting itself.
I'm cognizant that this solution doesn't "scale," in that if you have a high-throughput repo those merge requests will almost certainly clash, but it worked for us because it meant less merge request overhead (context switching). It also, obviously, leveraged the "new pushes revoke merge request approval" setting, which I feel is good hygiene, but some places are "once approved, always approved."
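For anyone who wants the flow spelled out, here is a rough sketch of that gating logic in Python. It is only a toy model under my own assumptions: `MergeRequest`, `run_apply`, and `fake_apply` are made-up names for illustration, not any forge's or Terraform's real API, and the real thing lived in pipeline config rather than code like this.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MergeRequest:
    """Hypothetical model of a merge request; not a real forge API."""
    commits: List[str] = field(default_factory=list)
    approved: bool = False
    applied_commit: Optional[str] = None
    merged: bool = False

    def push(self, commit: str) -> None:
        # New pushes revoke approval, so every applied commit was reviewed.
        self.commits.append(commit)
        self.approved = False

    def approve(self) -> None:
        self.approved = True

    def run_apply(self, apply: Callable[[str], bool]) -> bool:
        # The apply step only runs against an approved head commit.
        if not self.approved or not self.commits:
            return False
        head = self.commits[-1]
        if apply(head):
            self.applied_commit = head
            return True
        return False

    def merge(self) -> bool:
        # Merge only if the approved head is the commit that applied cleanly.
        head = self.commits[-1] if self.commits else None
        if self.approved and head is not None and self.applied_commit == head:
            self.merged = True
        return self.merged


def fake_apply(commit: str) -> bool:
    # Stand-in for the real apply step: pretend one commit is broken.
    return commit != "broken"


mr = MergeRequest()
mr.push("broken")
mr.approve()
first_try = mr.run_apply(fake_apply)       # apply fails, nothing merged yet
mr.push("fix")                             # the fix revokes the approval
unapproved_try = mr.run_apply(fake_apply)  # refused until re-approved
mr.approve()
second_try = mr.run_apply(fake_apply)      # apply succeeds on the fix
did_merge = mr.merge()
```

The point of the design is that a failed apply never lands on the main branch, and a fix can't be applied without a fresh approval. The clash problem is visible here too: nothing in this model stops two open merge requests from applying against the same state.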