Terraform taint is bad, and here's why

Ned Bellavance
8 min read

The terraform taint command marks an existing resource in state data for replacement. On its surface, this seems like a useful feature. However, it’s actually a ticking time bomb that can sabotage your environment. In this post, we’ll explore why taint is bad, and what you should do instead.

If you prefer a video version of this post, click here.

What Taint Does

Sometimes you have a resource in your Terraform configuration that just didn’t provision quite right, or you need to force a replacement for reasons outside of Terraform. Maybe it’s a VM whose setup script bombed and you want to replace it. Maybe it’s a storage bucket you need to empty out and recreate. Whatever the reason, you want to replace an existing resource without changing the configuration. That’s why terraform taint was created.

The taint command marks a resource in the Terraform state data as tainted. This means that the next time you run terraform apply, that resource will be destroyed and recreated. The configuration for the resource will not change, but the resource will be replaced. Let’s take a look at an example.

Taint Example

If you’re following along at home, the code for this example is in my Terraform Tuesdays repository.

In the below configuration, I have a single resource, an Azure resource group.

terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "main" {
  name = "tainted-love"
  location = "eastus"
}

Let’s assume I’ve already run terraform apply and the resource group has been created. Now I want to replace it. The command is terraform taint followed by the identifier for the resource. In this case it’s azurerm_resource_group.main.

$ terraform taint azurerm_resource_group.main

Resource instance azurerm_resource_group.main has been marked as tainted.

Before we run a plan, let’s take a look at what Terraform actually did. Running a terraform state show against the resource shows that it is tainted.

$ terraform state show azurerm_resource_group.main

# azurerm_resource_group.main: (tainted)
resource "azurerm_resource_group" "main" {
    id       = "/subscriptions/4d8e572a-3214-40e9-a26f-8f71ecd24e0d/resourceGroups/tainted-love"
    location = "eastus"
    name     = "tainted-love"

If we cat out the state file, the resource group has a property called status and it’s set to tainted. This is how Terraform knows to replace the resource.

$ cat .\terraform.tfstate

{
  "version": 4,
  "terraform_version": "1.6.6",
  "serial": 10,
  "lineage": "45c5895a-1117-8a59-8ae9-0a61d1b388db",
  "outputs": {},
  "resources": [
    {
      "mode": "managed",
      "type": "azurerm_resource_group",
      "name": "main",
      "provider": "provider[\"registry.terraform.io/hashicorp/azurerm\"]",
      "instances": [
        {
          "status": "tainted",
    # ...
}

Now let’s run a terraform plan:

$ terraform plan

Terraform will perform the following actions:

  # azurerm_resource_group.main is tainted, so must be replaced
-/+ resource "azurerm_resource_group" "main" {
      ~ id       = "/subscriptions/4d8e572a-3214-40e9-a26f-8f71ecd24e0d/resourceGroups/tainted-love" -> (known after apply)
        name     = "tainted-love"
      - tags     = {} -> null
        # (1 unchanged attribute hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Terraform comes back and lets us know that the resource group will be deleted and recreated, and it also tells us that the reason is that the resource is tainted. Good information to have.

If you want to undo the taint on a resource, the corresponding command is terraform untaint. And the syntax is the same as taint:

$ terraform untaint azurerm_resource_group.main

Resource instance azurerm_resource_group.main has been successfully untainted.

If you look at the state file again, the status property has been removed entirely.

Which makes me wonder what else the status property is used for? I think it’s used with create before destroy actions to flag an older instance of the resource for deletion. More research is needed for that one.

Now if you run a terraform plan, Terraform says that no changes are necessary because the taint has been removed and the target environment matches the config.

$ terraform plan
azurerm_resource_group.main: Refreshing state... [id=/subscriptions/4d8e572a-3214-40e9-a26f-8f71ecd24e0d/resourceGroups/tainted-love]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

All this seems pretty copacetic, so why is taint bad?

Why Taint Is Bad

If you’ve been following the changes to Terraform over the last couple years, you know that HashiCorp is trying to move away from imperative commands and towards a declarative model for all operations that affect state. They are also trying to ensure that only the terraform apply command is used to make changes to state.

That’s why there is now a moved block and an import block, and soon there will be a removed block too. These replace the imperative commands terraform state mv, terraform import, and terraform state rm with a declarative counterpart. The idea is that you should be able to make all changes to state declaratively through the configuration, preview those changes, and make them with the apply command.

Why the change? Well, it’s all about state data and being able to preview changes without impacting other team members. If you run a terraform state mv or terraform taint command, you are altering the state data without making a change to the configuration. In a collaborative environment, this can cause problems.

For a simple example, let’s say that I need to replace a VM in my environment, but I can’t do it until after hours. So I run a terraform taint command to mark the VM for replacement. But I forget to tell my team members that I did this. One of them is making other changes to the configuration, and they run a terraform plan.

In the execution plan review, they completely miss my VM change, since there’s no difference in the code, and they aren’t looking for it. They approve the plan and run terraform apply.

Now my VM has been recreated during regular business hours, and I’m getting a call from my boss. Not good.

The tainted status is like a ticking time bomb in the state data, waiting to go off at an unexpected time whenever someone decides to run an apply. You are also altering state before you can preview the changes it might make to your environment, including other impacted resources. The normal cycle of plan, apply, alter state is flipped to alter state, plan, apply. That’s why taint is bad.
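One way to catch a forgotten taint before it goes off is to scan the plan text for the reason line Terraform prints, the same "is tainted, so must be replaced" message shown in the plan output earlier. The following is a minimal sketch, not a built-in feature. The function name tainted_in_plan is made up for illustration, and it assumes the human-readable plan wording matches what this post shows.

```shell
#!/bin/sh
# tainted_in_plan: read human-readable `terraform plan` output on stdin and
# print the address of every resource that is planned for replacement
# because it is tainted.
# Hypothetical helper: it relies on the reason line
#   "# <address> is tainted, so must be replaced"
# and assumes that wording does not change between Terraform versions.
tainted_in_plan() {
  sed -n 's/^ *# \(.*\) is tainted, so must be replaced.*/\1/p'
}

# Example usage (assumes terraform is installed and the directory is
# initialized; -no-color keeps ANSI escape codes out of the text):
#   terraform plan -no-color | tainted_in_plan
```

An empty result means the plan contains no taint-driven replacements. Anything printed is worth a conversation with the team before anyone runs terraform apply.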

What To Do Instead

The alternative is simple! In Terraform 0.15.2, the -replace flag was added to the terraform plan and apply commands. This flag allows you to replace a resource without changing the configuration. It’s the same as running a terraform taint followed by a terraform apply, but it’s all done in one command. You can also repeat the flag to replace multiple resources.

Since the replace flag can be used with the plan command, you can preview the changes before you make them, and more importantly, you can preview the changes without altering state data. This is a much safer way to replace resources.

If you save the execution plan, and someone else changes the configuration and applies it before you do, Terraform will reject your saved plan as stale when you try to apply it, and you’ll know that you need to re-run the plan. The circle of infrastructure life is maintained.
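Repeating the flag and saving the plan can be combined. The sketch below is one way to assemble such a command, assuming a POSIX shell. The helper name plan_replace_cmd and the file name replace.tfplan are placeholders, and the second resource address in the example is made up for illustration. The helper only prints the command so it can be reviewed before anything runs.

```shell
#!/bin/sh
# plan_replace_cmd: print (but do not run) a `terraform plan` command that
# saves its plan to replace.tfplan and carries one -replace flag per address.
# Hypothetical helper; assumes simple addresses with no quotes or spaces.
plan_replace_cmd() {
  cmd="terraform plan -out=replace.tfplan"
  for addr in "$@"; do
    cmd="$cmd -replace=\"$addr\""
  done
  printf '%s\n' "$cmd"
}

# Example (the storage account address is made up for illustration):
#   plan_replace_cmd azurerm_resource_group.main azurerm_storage_account.example
# prints:
#   terraform plan -out=replace.tfplan -replace="azurerm_resource_group.main" -replace="azurerm_storage_account.example"
```

After running the printed command, terraform show replace.tfplan displays the saved plan for review, and terraform apply replace.tfplan applies exactly what was previewed.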

Let’s try the replace flag with our previous example.

Replace Example

The flag exists for both the plan and apply commands, so I’ll run terraform plan -replace="azurerm_resource_group.main".

$ terraform plan -replace="azurerm_resource_group.main"
azurerm_resource_group.main: Refreshing state... [id=/subscriptions/4d8e572a-3214-40e9-a26f-8f71ecd24e0d/resourceGroups/tainted-love]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # azurerm_resource_group.main will be replaced, as requested
-/+ resource "azurerm_resource_group" "main" {
      ~ id       = "/subscriptions/4d8e572a-3214-40e9-a26f-8f71ecd24e0d/resourceGroups/tainted-love" -> (known after apply)
        name     = "tainted-love"
      - tags     = {} -> null
        # (1 unchanged attribute hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Terraform comes back and tells us that the resource group will be replaced, and it also tells us that the replace flag was used. If we run a terraform apply with the same flag, we get the same result, with a prompt asking us to confirm the changes. Only then will Terraform make the changes to the target environment and our state data.

It’s just that easy!

What About Automation?

Now you might be thinking, “Ned, you said that HashiCorp was trying to move away from imperative commands and doing everything through the configuration. But the replace flag is still part of an imperative command. What gives?”

First off, I’m impressed that you somehow added code formatting to speech. Second, you’re right.

The replace flag is still part of the imperative plan and apply commands, and if you’re running everything through an automation pipeline, there’s no easy way to use it. You’d have to kludge something together to make it work. Maybe using commit messages or PR comments? More research is required here as well.
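As a starting point for that kind of kludge, here is a rough sketch that pulls resource addresses out of a commit message and turns them into -replace flags. Everything in it is an assumption: the "Terraform-Replace:" trailer is a made-up convention that neither Terraform nor any CI system recognizes, and the function name is hypothetical.

```shell
#!/bin/sh
# replace_flags_from_message: read a commit message on stdin and print one
# -replace=ADDRESS flag for each line of the form
#   Terraform-Replace: ADDRESS
# The trailer name is an invented convention for this sketch only.
# Addresses containing spaces are not handled.
replace_flags_from_message() {
  sed -n 's/^Terraform-Replace: *\([^ ][^ ]*\) *$/-replace=\1/p'
}

# Example usage in a pipeline step (assumes git is available):
#   git log -1 --format=%B | replace_flags_from_message
```

A pipeline would still have to pass the resulting flags to terraform plan and decide who is allowed to add such a trailer, so treat this as an experiment rather than a finished workflow.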

Do I love this? No. It would be great to have a solid alternative for the replace flag in an automated setting. Hopefully, you don’t need to use the replace flag except in rare, break-glass circumstances, and you can use the declarative model for most of your changes.

Personally, I tend to use replace when I’m working on a new configuration and debugging some issue with a particular resource. Most of the time, it’s a one-off thing and I don’t need to worry about the automation workflow since I’m still running everything at the terminal. By the time it gets rolled out to a collaborative environment with automation, the configuration is working as expected and I don’t need to use the replace flag.

Conclusion

And that, my friends, is why terraform taint is bad, actually. It makes changes to state data outside of the plan and apply loop, and it can cause problems in a collaborative environment. The replace flag is the preferred method for replacing resources without changing the configuration, but it’s not perfect either since it can clash with your existing automation workflows.