
What I Learned Building an Azure DevOps Pipeline with Terratest

Ned Bellavance

I am in the process of creating a series of liveProjects for Manning Publications, one of which involves deploying and managing an Azure Kubernetes Service (AKS) cluster using Terraform and Azure DevOps Pipelines. As I tried to build a CI/CD workflow that uses several pipelines, I ran into a bunch of roadblocks, some of which were my own doing and others that stemmed from a lack of clear documentation. While I won’t give away the contents of the liveProject - that would sort of defeat the point - I did want to call attention to things I stumbled over in hopes of helping others on a similar journey.

The Basic Premise

The basic premise could be a little tricky to grok, so I’ll start there. We are defining an AKS cluster using Terraform code. We can verify the functionality of the cluster by using Terratest to check both the infrastructure and applications running in the cluster. Terratest is pretty cool like that. My goal was to follow a GitOps workflow, where each stage of the overall pipeline matched an event in a GitHub repository. There are three events that should trigger a pipeline - spoiler alert, I didn’t necessarily grasp how GitHub and Azure Pipelines interact.

  • Push to a feature branch
  • Pull request to merge feature branch to main
  • Merge feature branch to main

My thought process was that any push to a feature branch should have the Terraform code checked for formatting and validity. Ideally, developers - yes, that includes you infrastructure people too - would properly format and validate their Terraform code before committing and pushing to origin. In reality, probably not so much. Any code pushed to the GitHub repo that isn’t formatted and valid should fail basic Continuous Integration (CI) tests.

No one should be committing directly to main, so the next step would be to verify the code in a testing environment when a pull request (PR) is created to merge to main. We know the code is formatted and validated, but that doesn’t mean it will produce functional infrastructure. The PR pipeline will spin up a testing instance, validate functionality, and tear it down when complete. We could call this Continuous Delivery (CD), as the code should now be ready for deployment to production once it is merged to main.
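To make that concrete, here’s a minimal sketch of the shape that PR-stage test takes in Terratest. The directory path, output name, and assertion are placeholders, not the actual liveProject code:

package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestAksEnvironment(t *testing.T) {
	// Point Terratest at the Terraform configuration (placeholder path)
	terraformOptions := &terraform.Options{
		TerraformDir: "../terraform",
	}

	// Tear the test instance down when the test function completes
	defer terraform.Destroy(t, terraformOptions)

	// Spin up the testing instance
	terraform.InitAndApply(t, terraformOptions)

	// Validate functionality, e.g. assert on a Terraform output (placeholder name)
	clusterName := terraform.Output(t, terraformOptions, "cluster_name")
	assert.NotEmpty(t, clusterName)
}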

Assuming the code passes muster in the PR, the last step in the workflow is to merge the branch to main and deploy the updated code to a production environment. This could actually be a multistage pipeline where it flows to staging or QA first, but I’m trying to keep things “simple”.

That’s the basic premise and I created three pipelines to execute it. Let’s examine the three pipelines a bit and walk through issues I ran into.

Continuous Integration

The first pipeline is CI, and it checks Terraform formatting and validity using terraform fmt and terraform validate. The trigger should be any push to a non-main branch, which is done by simply excluding the main branch in the trigger block of the YAML file:

trigger:
  branches:
    exclude:
    - main

The formatting check is easy and doesn’t even require initializing Terraform. But validate does, which is where we run into the first roadblock.

You kind of need to use a remote backend for state data in a CI/CD pipeline. The hosted runner executing your pipeline is ephemeral, and if you store your state data there, poof! It’s gone. That means you need to set up a remote backend and pass credentials to access it. Now you wouldn’t put the whole backend config in your Terraform code, right? You’re not an animal. So we go with a partial config using the azurerm backend, and pass the additional information, including credentials, with environment variables and the -backend-config switch. But where to store those values? Azure Key Vault of course! And wouldn’t you know it? There’s an option in the Pipelines GUI to create a variable group and link it directly to a Key Vault. How convenient! Unfortunately, that option doesn’t exist in the Azure CLI or Terraform provider for Azure DevOps. Yup, you read that correctly, there is an option in the GUI that is not available in the CLI, API, or Terraform provider. Grrrrr…

There is an undocumented API call you can make to create it, but “Ewwww!” Undocumented APIs are hot garbage. It roughly translates into “an internal-facing API that could be changed at a moment’s notice or removed completely, and we probably won’t notify you because you shouldn’t be using it anyway”. If you want to complain loudly about this injustice, or at least upvote this issue on GitHub, I highly encourage you to do so.

The solution in this case was to use the AzureKeyVault task in the pipeline to load secrets as pipeline variables. Not the biggest stumbling block, but 100% annoying. Every interface in Azure DevOps should be API-first.
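Here’s a rough sketch of what the two steps look like together. The service connection, vault, and secret names are placeholders, not the actual values from the liveProject; the AzureKeyVault task makes each secret available as a secret pipeline variable, which the init step then maps into environment variables:

steps:
- task: AzureKeyVault@2
  displayName: Load backend secrets
  inputs:
    azureSubscription: my-service-connection  # placeholder service connection name
    KeyVaultName: my-key-vault                # placeholder vault name
    SecretsFilter: '*'

- script: |
    terraform init \
      -backend-config="resource_group_name=$(state-resource-group)" \
      -backend-config="storage_account_name=$(state-storage-account)" \
      -backend-config="container_name=$(state-container)" \
      -backend-config="key=terraform.tfstate"
  displayName: Terraform Init
  env:
    # Secret variables must be mapped explicitly into the environment
    ARM_CLIENT_ID: $(arm-client-id)
    ARM_CLIENT_SECRET: $(arm-client-secret)
    ARM_TENANT_ID: $(arm-tenant-id)
    ARM_SUBSCRIPTION_ID: $(arm-subscription-id)

The Terraform code itself carries only an empty backend "azurerm" {} block; everything else arrives at init time.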

Continuous Delivery

Now we move on to the next pipeline, which should be triggered by a pull request. And here is where I ran into some difficulty. Part of this is poor documentation on Microsoft’s part, and part of this is my lack of understanding when it comes to Git and GitHub.

There’s a special pr block you can define which will trigger a pipeline when a PR event happens in GitHub. Here’s what my block looks like:

pr:
  branches:
    include:
    - main

There are a few things you should know. The include statement refers to the target branch of the merge, not the source branch. This confusion led to a non-insignificant amount of head-to-brick-wall interactions for me. If you want the pipeline to fire when you create a pull request to merge a feature branch to main, the include statement should reference main.

The second thing to know is that the pr block stands alone in the pipeline; it does not get nested in a trigger block. When I was trying to figure out why my pipeline wasn’t firing, I thought maybe it needed to be nested, and that was definitely wrong.

The third thing to know is what happens when you create a PR on GitHub. Every time I created a new PR, both my CI and PR pipelines would fire, and I didn’t understand why. I wasn’t pushing code or creating a commit, right? I’m just creating a PR! Why are both pipelines triggering?

Well, here’s the thing. When you create a pull request on GitHub, it actually creates a new ref and a commit for that ref. My PR pipeline is looking for pull requests and firing like it should. The CI pipeline is in a different file, and it is looking for any commits that aren’t on main. The PR event in GitHub produces a commit on a non-main branch, and so my CI pipeline fires every time. In fact, the Microsoft Docs say so in a roundabout sort of way:

If no pr triggers appear in your YAML file, pull request validations are automatically enabled for all branches, as if you wrote the following pr trigger. This configuration triggers a build when any pull request is created, and when commits come into the source branch of any active pull request.

https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#branches

If you had to read that paragraph like three or four times, you are not alone. Because I have a separate pipeline with no pr trigger, it fires on every PR event. The way to avoid this is to add a conditional statement to the stages in your pipeline, like this:

condition: eq(variables['Build.Reason'], 'IndividualCI')

The Build.Reason for the PR will be PullRequest and so the stage will not run when you create a new PR. The pipeline itself will still run, but it will skip all the stages and come back green. The stages will still run if you make new commits to an open PR, and that’s probably what you want!

You might be wondering if you can combine the pr and trigger blocks in one pipeline definition, and the answer appears to be yes. You can combine both the CI and PR pipelines into a single pipeline and use a condition on each stage, job, or task to filter when it runs.
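As a sketch, a combined definition might look like the following, with placeholder stage, job, and step contents; each stage uses a Build.Reason condition so the right work runs for the right event:

trigger:
  branches:
    exclude:
    - main

pr:
  branches:
    include:
    - main

stages:
- stage: CI
  # Only runs for direct pushes to feature branches
  condition: eq(variables['Build.Reason'], 'IndividualCI')
  jobs:
  - job: Validate
    steps:
    - script: terraform fmt -check -recursive
      displayName: Check formatting

- stage: Test
  # Only runs for pull request events
  condition: eq(variables['Build.Reason'], 'PullRequest')
  jobs:
  - job: IntegrationTest
    steps:
    - script: go test -v -timeout 90m ./...
      displayName: Run Terratest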

Terratest and Authentication

Another issue I ran into while developing my PR pipeline was using Terratest to validate the AKS cluster and cluster components. The problem arises when you attempt to use a Terratest module that leverages the Azure Client to connect. For instance, consider the following code snippet to get information about an AKS cluster:

cluster, err := azure.GetManagedClusterE(t, resourceGroupName, clusterName, subscriptionId)

The code needs to create an Azure Client to connect to the Azure API and get the AKS cluster information. If you were running this locally on your desktop, you probably would already be logged into the Azure CLI. Terratest will use that cached login to connect. But if you’re running this in a pipeline, the hosted runner doesn’t have the cached login.

The same goes for the Terraform commands that leverage the AzureRM provider in a Terratest module. By default, those will also use the cached credentials from your Azure CLI login. The Terratest documentation calls out that you can use the standard environment variables: ARM_CLIENT_ID, ARM_CLIENT_SECRET, and ARM_TENANT_ID for authentication. Great! But hold on a second, that doesn’t seem to work for the Azure Client.

Turns out that while the Terraform AzureRM provider uses one set of environment variables, the Azure Client from the Azure SDK uses another. Here are the equivalent variables:

AzureRM Provider Environment Variables    Azure SDK Environment Variables
ARM_CLIENT_ID                             AZURE_CLIENT_ID
ARM_CLIENT_SECRET                         AZURE_CLIENT_SECRET
ARM_TENANT_ID                             AZURE_TENANT_ID

Table of environment variables in Terraform and Azure SDK

Confused yet? Why the two decided to use completely different variables is a mystery to me, but ultimately it’s not a big deal. Since I’m already setting environment variables for the AzureRM provider, I can simply populate the Azure SDK variables from them in my Go code by adding the following block:

// Create env variables for Azure client communication
os.Setenv("AZURE_CLIENT_ID", os.Getenv("ARM_CLIENT_ID"))
os.Setenv("AZURE_CLIENT_SECRET", os.Getenv("ARM_CLIENT_SECRET"))
os.Setenv("AZURE_TENANT_ID", os.Getenv("ARM_TENANT_ID"))

You could also use self-hosted runners with a Managed Identity associated with them (assuming you’re running them on Azure). Then both the AzureRM provider and Azure Client would be able to use MSI authentication. Or at least, I think that would work. I haven’t actually tried it.

Total Destruction

Azure’s API can be… problematic when it comes to fully destroying all the infrastructure Terraform has stood up. The PR pipeline builds a test instance of the environment, runs the tests in Terratest, and then tears everything down. Due to the eventually consistent nature of Azure’s APIs, sometimes the destruction fails because of a sequencing error. In my particular case, the AKS cluster was utilizing the Application Gateway Ingress Controller, which spawns a new Application Gateway in a dedicated subnet when the cluster is created.

That subnet requires a set of Network Security Group rules that must be in place before the Application Gateway will be generated, and the rules cannot be deleted until the Application Gateway is removed. Therein lies the problem. Terraform will attempt to destroy the cluster and the rules at the same time, and sometimes destroying the rules fails because the cluster destruction process has not yet torn down the Application Gateway.

If you run terraform destroy a second time, the process will complete successfully, since now the Application Gateway is gone. The whole apply and destroy process is happening inside my Go code, so I needed to write a function that would run destroy a second time. The original code looked like this:

defer terraform.Destroy(t, terraformOptions)

Where the defer keyword means that the command runs when the rest of the test function completes, regardless of outcome. This will run destroy a single time, so I wrote a new function called DestroyDouble:

func DestroyDouble(t terratest.TestingT, options *terraform.Options) string {
	// Attempt the destroy, capturing any error rather than failing the test
	out, err := terraform.DestroyE(t, options)
	if err != nil {
		// Try once more; this time a failure will fail the test
		out = terraform.Destroy(t, options)
	}

	return out
}

And then in the main testing function I changed the defer line to this:

defer DestroyDouble(t, terraformOptions)

Now the destroy process runs a second time if the first attempt fails, and if that still fails it errors out. I suppose I could have done a loop up to x number of tries, but really, if the destroy fails after the second go-around, I’d rather fail the whole thing and see what happened.
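For what it’s worth, the bounded-loop version is only a few more lines. Here’s a sketch of that alternative (DestroyWithRetries is a name I just made up), in case two attempts aren’t enough for your environment:

// DestroyWithRetries runs terraform destroy up to maxAttempts times,
// returning the output of the first successful run. The final attempt
// uses Destroy, which fails the test if it errors.
func DestroyWithRetries(t terratest.TestingT, options *terraform.Options, maxAttempts int) string {
	for i := 0; i < maxAttempts-1; i++ {
		out, err := terraform.DestroyE(t, options)
		if err == nil {
			return out
		}
	}
	return terraform.Destroy(t, options)
}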

Conclusion

There were plenty of other headaches I encountered while trying to build the pipeline, and I don’t want to bore you with them now. I think this post is already sufficiently long. If you are planning to build an Azure Pipeline to automate Terraform deployments, I hope this information helped you. I’d also recommend checking out my videos on YouTube regarding building with Azure DevOps.