Because the source might no longer exist, just as the current state might no longer exist. You can't diff something you don't know about, and thus you cannot create a transaction to deal with it.
Three examples, but first context:
You create some .tf files that create an S3 bucket on AWS, and then create an object inside the bucket. This produces a graph data structure that describes the AWS provider, the S3 bucket and the S3 object. You then apply this desired state, which causes Terraform to create some resources in AWS and then record the state as it was last known in the state file.
Scenario 1: You delete the object from AWS but don't tell Terraform about it. You re-run apply, and Terraform notices that it is gone and re-creates it. So far so good; even Ansible and SaltStack could do this.
Scenario 2: You delete the object from the desired state in your .tf files. You re-run apply, and Terraform notices that you used to have an S3 object in your desired state (because your state file says so). Since it is no longer there, Terraform talks to the AWS API to see if the object still exists, and if it does, it deletes it (or, rather, plans to delete it, and then it is up to you to apply or discard that plan). This is not possible if you don't record your previous intent. One-way tools fail here.
Scenario 3: You make a mess where you remove the object from your .tf files and also manually edit the state file to remove the object there as well. Terraform now knows nothing about what the world used to look like. You then decide that the S3 bucket can be removed, and since it's all managed resources anyway (no dirty ClickOps tricks in the console) you tell Terraform to destroy the bucket. Terraform doesn't see any dependencies and asks AWS to nuke the bucket. But since the bucket isn't empty, you get an error: Terraform doesn't know anything about any objects in the bucket, and thus it can't clean them up.
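The three scenarios can be sketched with a toy reconciler (hypothetical Python, not Terraform's actual planner); the point is that the delete in scenario 2 is only computable because the last-known state still records the object:

```python
# Toy three-way reconciliation (illustrative sketch, not Terraform's real algorithm).
# Each state is a dict mapping a resource address to its attributes.

def plan(desired, last_known, current):
    """Compute the actions needed to move `current` toward `desired`."""
    actions = []
    for addr, attrs in desired.items():
        if addr not in current:
            actions.append(("create", addr))      # scenario 1: drift, re-create
        elif current[addr] != attrs:
            actions.append(("update", addr))
    for addr in last_known:
        if addr not in desired and addr in current:
            actions.append(("delete", addr))      # scenario 2: needs last-known state
    return actions

desired    = {"aws_s3_bucket.b": {"bucket": "demo"}}          # object removed from .tf
last_known = {"aws_s3_bucket.b": {"bucket": "demo"},
              "aws_s3_object.o": {"key": "hello.txt"}}        # state file remembers it
current    = {"aws_s3_bucket.b": {"bucket": "demo"},
              "aws_s3_object.o": {"key": "hello.txt"}}        # it still exists in AWS

print(plan(desired, last_known, current))  # [('delete', 'aws_s3_object.o')]
```

Drop `aws_s3_object.o` from `last_known` as well (scenario 3) and the delete vanishes: the tool simply no longer knows the object was ever its responsibility.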
Now, the scenarios here are oversimplified, and you would probably not set up an IaC tool to manage one bucket and one object if that is all you'll ever configure. But when you start creating multiple environments that each need their own configuration drift control and controlled resource destruction, you really can't function without it.
Is there a functional difference between that approach and using tags on existing resources? I know Terraform doesn't use tags for this because not everything supports them, but in K8s, scenario 2 is handled by just saying "make everything using this tag look exactly like this". If there are extra resources with that tag that aren't in the YAML, they'll get cleaned up.
In general, I like that approach better, because it just uses the state that already exists. It would be kind of nice if Terraform had a mode where it avoided creating extra state where possible.
Yes, because tags do not express where in a graph a node might be. It also gives you a new problem: to find out what the previously known state was, you now have to read all tags on all resources.
In Kubernetes, state reconciliation happens in controllers based on information in etcd; in a way, etcd is to Kubernetes what the state file is to Terraform. The apiserver is the 'current' state, including metadata and event fields, and any on-disk manifests you might have would be the desired state. The only big difference between the Kubernetes reconciler and Terraform is the event-driven nature: Terraform tends to be a series of one-shots, whereas Kubernetes (the controller manager and controllers) is a constant stream of reconciliation events.
The overarching theme in both is 'desired state reconciliation via declarative configuration'.
As for client-side apply and server-side apply in K8s, you essentially have the same deal as Terraform: you feed it a manifest of what you want, and it will figure out whether each resource is new, old, needs to be updated or needs to be destroyed.
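The one-shot vs. continuous contrast can be reduced to a toy sketch (illustrative Python only; the real controller manager watches the apiserver, not a list):

```python
def reconcile(desired, current):
    """One reconciliation pass: drive current toward desired."""
    current.update(desired)
    return current

# Terraform-style: one explicit pass per `apply`
state = reconcile({"replicas": 3}, {"replicas": 1})

# Kubernetes-style: the same pass, re-run on every watch event
events = [{"replicas": 3}, {"replicas": 5}, {"replicas": 5}]
for desired in events:          # in reality, an endless stream from the apiserver
    state = reconcile(desired, state)

print(state)  # {'replicas': 5}
```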
"the desired state, the last known state and the current state"
Why do you need to know the last known state? You have the desired state and the current state: you run your code and you reconcile the current state to the desired state.
I get why it is impossible/expensive to fetch the current state from AWS, which doesn't expose a simple "show state" API, but managing a state store is extra work and thus extra fragility.
However, ideally I'd like to list credentials and providers to manage, then say "manage all resources with some form of tag 'managedByTerraform=12345'" (this could even be in the resource name -- for my FortiGate management, my scripts manage firewall rules, addresses, etc. which start with AUTO_, and ignore the other objects).
It would then run, generate a lock resource, find the current state, compare it with the desired state, create/delete elements to reconcile actuality with desire, apply the changes, and free the lock.
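A rough sketch of that flow (hypothetical function names; `list_all_resources` stands in for paging through every list/describe API the credentials can reach, which is the expensive part):

```python
# Hypothetical sketch of tag-based discovery instead of a state file.

MANAGED_TAG = ("managedByTerraform", "12345")

def discover_managed(list_all_resources):
    """Find the current managed state by scanning tags on *every* resource."""
    return {r["id"]: r for r in list_all_resources()
            if r.get("tags", {}).get(MANAGED_TAG[0]) == MANAGED_TAG[1]}

def reconcile(desired, list_all_resources):
    current = discover_managed(list_all_resources)
    create = [rid for rid in desired if rid not in current]
    delete = [rid for rid in current if rid not in desired]  # extras get cleaned up
    return create, delete

fake_cloud = [
    {"id": "bucket-a", "tags": {"managedByTerraform": "12345"}},
    {"id": "bucket-b", "tags": {"managedByTerraform": "12345"}},  # no longer desired
    {"id": "bucket-c", "tags": {}},                               # unmanaged, ignored
]
print(reconcile({"bucket-a": {}}, lambda: fake_cloud))  # ([], ['bucket-b'])
```

Note that every run pays the full enumeration cost, and the flat tag set carries no dependency ordering between the resources it finds.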
Then all I need to do is keep that human-readable file, and I can get back to where I am.
If I lose the state, or it gets corrupt, I'm in for a world of hurt with terraform.
What you propose is computationally rather expensive, and on an API level really hard to implement uniformly. And thus the state file was born.
Doing it without state has been tried plenty of times, and it's bad (Chef, SaltStack, Ansible, CFEngine, Puppet, to name a few).
It only works somewhat okay if you're simply writing to a single managed unit, e.g. a file. You know what all the contents of the file have to be, so you can replace it wholesale. This does not work for a graph of resources that interact with each other.
Separately, this also doesn't work with a shared responsibility model and doesn't work with deleted code.
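The "single managed unit" case really does work statelessly, because the desired contents fully determine the unit; a minimal sketch:

```python
import tempfile
from pathlib import Path

# Replacing one file wholesale needs no memory of previous runs:
desired = "server { listen 80; }\n"           # hypothetical config contents
path = Path(tempfile.mkdtemp()) / "site.conf"

for _ in range(3):                            # idempotent: any number of runs converge
    path.write_text(desired)

print(path.read_text() == desired)  # True
```

There is no equivalent wholesale "overwrite" for a graph of cloud resources, which is exactly where this model breaks down.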
Terraform has a state because that is required to solve the problem in a reasonable manner. It doesn't have state just because it was thought to be a fun exercise ;-) If the problem being solved is not _your_ problem, it is highly likely that using terraform at all is also not going to help you.
> If I lose the state, or it gets corrupt, I'm in for a world of hurt with terraform.
And that's why you secure and version your state, e.g. using S3 with versioning and a restrictive policy.
Because you can't guarantee that everything (or even anything) in the source was actually applied, whether because applying errored out partway or because there simply was no run.
Additional potential issues: how are you generating your primary keys before resource creation? Probably by some hashing of attributes known at creation time? And how are you avoiding collisions? How are you dealing with modules that may have gone through multiple commits/versions since the last time you ran an apply?
There are just too many places where you really want a mapping of at least logical IDs -> physical IDs to ensure consistency. The one big advantage CloudFormation has here is that it handles the state for you, but there's definitely still state.
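That mapping is conceptually tiny (a simplified sketch; a real state file also records attributes, dependencies and provider data), but it must survive between runs:

```python
import json

# Hypothetical logical-ID -> physical-ID mapping, the core of what a state file keeps.
state = {"aws_instance.web": "i-0abc123def4567890"}

# Persisted between runs: the physical ID only exists after the cloud assigns it,
# so without this file the next run can't tell which instance "aws_instance.web" is.
saved = json.dumps(state)
print(json.loads(saved)["aws_instance.web"])  # i-0abc123def4567890
```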
Because it's incredibly difficult to do. Look at something as simple as launching an instance: how do you know the state of the instance ahead of time? The instance ID doesn't exist, and you can't use search filters for it because it's not there yet. Yes, you could use some half-baked conditional logic to search for and load values only if your instance exists, and create it if the search comes up empty, but Terraform really isn't good at that. And that's just the very simple case. You could use something like the CDK or Pulumi and avoid state, but it's still work, and it's useful work.
I think a simpler example is how you know to delete something. It won't be in the config anymore, so it won't be queried, unless you have some tag on managed resources for this module -- but then you'd need to persist that somewhere.
This is the biggest problem that I think the author left out.
For those who don't know, the way you delete a resource in Terraform is to remove the reference to it in your code.
So, for example, to create an EC2 instance, I write an EC2 resource block (or use a module that itself references one), then run Terraform. Terraform looks and sees that this EC2 instance isn't in the state but is in the code, so that must mean it is new and needs to be created. So it creates it and then adds the metadata about that resource to the state file.
Now to delete it, I simply delete the resource block from my code and run Terraform again. Terraform looks at the state file and sees that there is supposed to be an EC2 instance, but doesn't see it in the code. So it deletes it from the cloud provider and then deletes it from the state file. That's how you delete.
So if you remove the state file, how does Terraform know that you deleted something? It isn't in the code, so Terraform doesn't even "think" to check on it. You need the state file to remind Terraform that it used to exist. Again, creating an object is easier because you have it defined in the code, so Terraform can try to reconcile it with an existing resource in the cloud provider, and if it can't reconcile a pre-existing resource then it can assume it should create one. But if you delete it without a state file, Terraform can't know it ever existed. The author seems to know this, which is why they suggest using Git to check the last time the code was run, see that there used to be a resource block and that it has been deleted; then Terraform can conclude that it should delete that resource from the cloud provider.
My problem with this is that in order for Terraform to work you must preserve your Git history. That does generally happen, but in enterprise environments we have had to do some nasty voodoo with Git on occasion, and I fear that this could significantly mess up my cloud infrastructure as a result.
Trust me, I have spent a lot of time reconciling and repairing state files in my time as an SRE; I know tf state very intimately. It is a hell of a lot easier to repair a state file (which is just JSON) than it is to repair a Git commit history. And I live in a world where you have to assume the day will come when you need to perform this task. When that day comes, I will be happier repairing a JSON state file than a Git commit history.
Only if the infrastructure you target has no concept of versions, and if it only has monolithic environments (which still holds true for much of the older infrastructure).
But if we take just the GitOps branches problem: they are indeed not handled, but that's because you generally configure your GitOps pipeline to only allow interaction from a specific ref, not any branch you might have lying around ;-) The SVN version would be either trunk-based deployments or revision-based ones, where you have to tell your reconciler which revision you want.