
Data transformation in Terraform

Ned Bellavance


I was attempting to do something I thought was relatively simple, and it broke my brain. The simple thing? Assigning permissions to namespaces in an Azure Kubernetes Service cluster through Azure RBAC using Terraform. Okay, well that part sounds complicated, but it’s not really important what exactly I was trying to do. The important part is that I was trying to do some data transformation in Terraform and the struggle is real. Which got me thinking about Terraform’s place in a workflow and the need for a general purpose programming language.

The Problem

Let’s start with the core problem. Imagine that you have an AKS cluster and you are using Azure RBAC to control permissions on namespaces. That is an actual thing you can do. Azure AD takes care of authentication and authorization, and constructs the resulting permissions for the Kubernetes cluster. Now let’s assume you want to manage the AKS cluster using IaC and Terraform. You can create role assignments with the azurerm_role_assignment resource, passing it the role, the scope, and the principal the role applies to:

resource "azurerm_role_assignment" "namespace_admin" {
  role_definition_name = "Azure Kubernetes Service RBAC Admin"
  scope                = "${azurerm_kubernetes_cluster.cluster.id}/namespaces/${var.namespace}"
  principal_id         = var.principal_id
}

The above resource block grants the principal stored in var.principal_id the role “Azure Kubernetes Service RBAC Admin” on the namespace stored in var.namespace.

So far, so good. But you probably want to grant permissions to more than one group and more than one namespace. You also want to make it dynamic, meaning you can submit a list of namespaces and admins, and then create those assignments with a loop. What would that structure look like? Well, why don’t we start with what seems like the simplest structure: a map with keys equal to the namespace names and values holding a list of admins for each namespace.

namespace_admins = {
  namespace1 = ["admin1", "admin2", "admin3"]
  namespace2 = ["admin1", "admin3", "admin4"]
  namespace3 = ["admin3", "admin5", "admin2"]
}
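For completeness, a minimal declaration for this variable would be a map of lists of strings (a sketch; the description text is just illustrative):

variable "namespace_admins" {
  type        = map(list(string))
  description = "Map of namespace names to the list of admin principal object IDs."
}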

You might think you could use this variable value with a for_each meta-argument on the azurerm_role_assignment resource, but there’s a problem. The for_each meta-argument creates one instance per key in the map, and each key holds three values. We need to create nine total resources, not three!
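To see why, here’s a sketch of the naive attempt. Terraform will reject it, because each.value is the entire list of admins rather than a single principal:

resource "azurerm_role_assignment" "namespace_admin" {
  # Creates only three instances, one per map key.
  for_each             = var.namespace_admins
  role_definition_name = "Azure Kubernetes Service RBAC Admin"
  scope                = "${azurerm_kubernetes_cluster.cluster.id}/namespaces/${each.key}"
  principal_id         = each.value # a list(string), not a string -- this fails
}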

The Solution

What we need is a way to transform this data structure into a format that will iterate nine times, not three. We can do this with nested for expressions: the outer expression expands the map keys, and the inner expression expands the values in each key. The expression looks like this:

[ for ns, adms in var.namespace_admins :
    [ for adm in adms : {
        namespace = ns
        admin = adm
      }
    ]
]

The resulting data structure will be a nested tuple with maps as values. For our example input, it looks something like this (abbreviated):
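[
  [
    {
      "admin" = "admin1"
      "namespace" = "namespace1"
    },
    {
      "admin" = "admin2"
      "namespace" = "namespace1"
    },
    {
      "admin" = "admin3"
      "namespace" = "namespace1"
    },
  ],
  # ...plus one inner tuple each for namespace2 and namespace3
]

We can then apply the flatten() function to get rid of the nested tuple structure, and the end result is this tuple of maps, each representing the inputs for our azurerm_role_assignment resource!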

[
  {
    "admin" = "admin1"
    "namespace" = "namespace1"
  },
  {
    "admin" = "admin2"
    "namespace" = "namespace1"
  },
  {
    "admin" = "admin3"
    "namespace" = "namespace1"
  },
  {
    "admin" = "admin1"
    "namespace" = "namespace2"
  },
  {
    "admin" = "admin3"
    "namespace" = "namespace2"
  },
  {
    "admin" = "admin4"
    "namespace" = "namespace2"
  },
  {
    "admin" = "admin3"
    "namespace" = "namespace3"
  },
  {
    "admin" = "admin5"
    "namespace" = "namespace3"
  },
  {
    "admin" = "admin2"
    "namespace" = "namespace3"
  },
]

The updated structure is still not going to work with a for_each argument, because for_each only accepts a map or a set of strings, and we now have a tuple of maps. We could use the toset() function to convert the tuple to a set, but it would still contain maps instead of strings. The solution is to switch to the count argument and use the length() function on the data structure to determine how many role assignments to create. The updated code looks like this:

locals {
  namespace_admins = flatten([ for ns, adms in var.namespace_admins :
    [ for adm in adms : {
        namespace = ns
        admin = adm
      }
    ]
  ])
}

resource "azurerm_role_assignment" "namespace_admins" {
  count                = length(local.namespace_admins)
  role_definition_name = "Azure Kubernetes Service RBAC Admin"
  scope                = "${azurerm_kubernetes_cluster.cluster.id}/namespaces/${local.namespace_admins[count.index].namespace}"
  principal_id         = local.namespace_admins[count.index].admin
}
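As an aside, if you’d rather avoid count entirely, one option is to project the flattened tuple into a map with unique string keys and keep for_each. This is a sketch, not part of the solution above, and it assumes each namespace/admin pair is unique; the upside is that it avoids the resource churn count causes when the list order changes:

resource "azurerm_role_assignment" "namespace_admins" {
  # Build a map keyed by "namespace-admin" so for_each gets a map, not a tuple.
  for_each = { for na in local.namespace_admins : "${na.namespace}-${na.admin}" => na }

  role_definition_name = "Azure Kubernetes Service RBAC Admin"
  scope                = "${azurerm_kubernetes_cluster.cluster.id}/namespaces/${each.value.namespace}"
  principal_id         = each.value.admin
}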

If you’re looking at that count-based code and you don’t find it intuitive, you are not alone. I can’t even take credit for the solution. The truth is that I was banging my head against a wall for the better part of a morning trying to figure out the proper structure to use for my code, and finally I posted the question on the HashiCorp Ambassador Slack. My fellow Ambassadors swooped in to help, and within half an hour I had a solution that would work.

My primary challenge in manipulating data structures in Terraform is that I am trying to apply imperative logic to a declarative format. I’ve spent years writing for loops in imperative languages like PowerShell. For expressions in Terraform, just like count and dynamic blocks, simply don’t work in a way I find intuitive. If I wanted to parse the initial data structure in PowerShell, I would simply use a nested for loop that looks something like this:

# Enumerate the hashtable entries; $entry.Key is the namespace, $entry.Value is the admin list
foreach($entry in $namespace_admins.GetEnumerator()){
  foreach($admin in $entry.Value){
    New-AzRoleAssignment -Scope "$($aksClusterId)/namespaces/$($entry.Key)" -ObjectId $admin -RoleDefinitionName "Azure Kubernetes Service RBAC Admin"
  }
}

Or something similar to that. (I didn’t test the above, so don’t expect it to actually work). For whatever reason, my brain has a much easier time understanding imperative scripting than declarative code, and I don’t think I’m the only one to find this challenging.

The Need for a General Purpose Programming Language

Terraform is awesome for automating infrastructure deployment and management. That’s what it excels at. It was never meant to deal with complex data structures, and I don’t think it should have to. Many of the more recent updates to Terraform added features like richer data types, expanded functions, and complex objects. While that is appreciated, I think it’s also a double-edged sword. Those features make Terraform more of a full-fledged programming language instead of a representation of infrastructure in code. We’re adding complexity and potentially making our code harder to parse properly. What’s the alternative?

Consider CloudFormation for a moment, and I am not talking about the engine that parses CloudFormation. I am simply talking about the JSON or YAML templates. AWS has made a deliberate decision to avoid overloading their template language with functions and data structures. If it doesn’t fit into JSON, it doesn’t fly. There are a handful of functions for convenience, but you’d be hard pressed to call it a Domain Specific Language (DSL). It’s closer to a template format with a sprinkling of logic.

The result is that CloudFormation is extremely verbose, almost to a fault. If you want to do something programmatic in a template, you need to farm that work out to a custom resource that invokes Lambda. I’ve groused about this in the past. Like, why isn’t there a simple function to set a string to all lowercase so my S3 bucket creation doesn’t fail? Why do I need to list out each EC2 instance individually instead of using a count or loop on a single resource? And the answer is that CloudFormation is simple on purpose. Could AWS bake all that stuff in? Sure. Would that add a lot of complexity to the template language and parsing engine? Yup.

If you look at something like the CDK, which emits CloudFormation templates, the solution is to have a general purpose programming language (GPPL) create the templates as artifacts. All the logic and functions you might want to use in a template are instead evaluated as part of the GPPL code that emits the template. The fact that the template is verbose or lacking functions and complex data structures is beside the point. The template as artifact is not going to be read or edited by a human; rather, it is going to be fed into the CloudFormation engine for processing.

Terraform is taking a different tack: the Terraform code is not being generated by another program; the HCL is instead being written by a human being. Since that is the case, it requires all the bells and whistles of a GPPL, without actually being a GPPL. In a sense, since Terraform is written in Go, the functions and data structures expressed in HCL are lifted directly from Go. Why not simply use Go as your IaC platform and have the result be valid Terraform code that is ingested by the Terraform executable for provisioning?

That’s exactly what the Terraform CDK aims to do. You can still leverage all the providers and processing from Terraform, but rather than writing directly in HCL, you can use a programming language you’re already comfortable with. Of course, that assumes you are comfortable with a programming language beyond basic PowerShell and Bash scripting.

Looking into my crystal ball, I see Terraform as having a long and fruitful life in the IaC space. I also see CDKs becoming more prevalent, with Terraform becoming the backend engine to get the actual provisioning done. Some projects, like Pulumi, are taking it one step further by leveraging some of the Terraform providers while using their own platform to generate and process artifacts.

Essentially what we have are two diverging paths for Terraform. One is the expansion of HCL to support more complex programming logic and data structures, slowly making it something closer to a GPPL instead of a DSL. The second is the use of CDK to programmatically generate HCL and feed it to Terraform for processing. Which one will win out? Probably neither. There’s always going to be some contingent of Ops folks who don’t want to learn a GPPL, and they will stick with just Terraform. On the other hand, there will be Devs who would rather use their knowledge of common programming languages, along with the attendant bells and whistles of a GPPL.

For my part, I plan to spend more time learning about the CDK for Terraform and Pulumi in the next year. In the long run, I believe that will be more beneficial to my career than simply doubling down on pure Terraform.