<p><em>Julie Ng’s Blog - writing about web development and fitness from an American expat in Germany (<a href="http://julie.io/writing">julie.io/writing</a>)</em></p>
<h1 id="setup-git-multiple-gpg-and-yubikeys">Setup git commits and authentication with multiple GPG keys and YubiKeys</h1>
<p><em>Julie Ng · 2024-01-13 · last updated 2024-01-20</em></p>
<p class="lead">Since I work as an architect in the compliance-driven financial industry, I have been <a href="https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification">signing my git commits so that people cannot impersonate me</a> in source code. I had always defaulted to a single personal GPG key that I used for <em>both</em> personal and work. But suddenly I needed to juggle two keys.</p>
<div class="article-image">
<img src="/assets/images/2024/yubikeys.jpg" alt="Separate YubiKeys for personal and work" style="max-width:600px">
<span>Separate YubiKeys for personal and work use</span>
</div>
<p>Please note that technically, it is entirely possible to juggle multiple GitHub users for commits without any GPG keys or YubiKeys. This article is <em>only applicable</em> if you have multiple GitHub accounts set up with signed commits.</p>
<h4 id="result-preview">Result preview</h4>
<p>All the steps below describe how to ultimately end up with a green “Verified” badge on my work commits, using this setup:</p>
<pre><code class="language-bash">❯ gpg --list-secret-keys
[keyboxd]
---------
sec>  rsa4096 2018-05-27 [SC]
      121E4BXXXXXXXXXXXXXXXXXXXXXXXXXXX
      Card serial no. = 0006 121XXXX
uid           [ultimate] Julie Ng <redacted>
ssb>  rsa4096 2018-05-27 [E]
ssb   rsa2048 2019-09-28 [A]

sec>  rsa4096 2024-01-13 [SC] [expires: 2026-01-12]
      5F9DE7XXXXXXXXXXXXXXXXXXXXXXXXXXX
      Card serial no. = 0006 1095XXXX
uid           [ultimate] Julie Ng <redacted@microsoft.com>
ssb>  rsa4096 2024-01-13 [E] [expires: 2026-01-12]
ssb>  rsa4096 2024-01-13 [A] [expires: 2026-01-12]
</code></pre>
<p>Note that the two keys have different <code>Card serial no.</code>s, and that a <code>></code> after <code>sec</code> or <code>ssb</code> indicates the key is on a smartcard, i.e. not on my computer. See the <a href="https://www.gnupg.org/documentation/manuals/gnupg24/gpg.1.html">gpg manual</a> for details.</p>
<h2 id="use-case-and-problem">Use case and problem</h2>
<h3 id="increased-security">Increased security</h3>
<p>The basic use case is that I have two (2) accounts I need to juggle:</p>
<ul>
<li>a personal account for personal open-source work that is public on GitHub.com</li>
<li>a work account that is managed by Microsoft for any internal repositories</li>
</ul>
<p>The problem is that my work account is managed by <a href="https://docs.github.com/en/enterprise-cloud@latest/admin/identity-and-access-management/understanding-iam-for-enterprises/about-enterprise-managed-users">GitHub Enterprise Managed Users (EMU)</a>, which means the identities are synced to an external identity provider.</p>
<p>The account is completely managed in our internal Entra ID tenant, which of course does not include my personal email. Because the account is externally managed (and merely synced to GitHub enterprise), there is no possibility for me to <a href="https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-email-preferences/verifying-your-email-address">“verify” my personal email</a>, so commits signed with my personal email always show up as “unverified”:</p>
<div class="article-image">
<img src="/assets/images/2024/ghe-gpg-unverified.png" alt="Unverified commits with personal email" style="max-width:600px" />
<span>"Unverified" commit when signed with personal email</span>
</div>
<p>This is clearly visible via the yellowish “Unverified” badge, as well as my username not having a photo. It is a problem for pedantic Julie to have “unverified” associated with my work. So I fixed it by adding a second key.</p>
<h4 id="why-a-second-yubikey">Why a second YubiKey?</h4>
<p>I need a second physical YubiKey because my existing key is already full. <a href="https://support.yubico.com/hc/en-us/articles/4404456942738-FAQ">YubiKey’s OpenPGP application can only hold up to three private keys</a>: separate private keys for encryption, signing, and authentication.</p>
<h4 id="why-a-usb-type-a-key">Why a USB Type-A key?</h4>
<p>I have two different YubiKey types because originally I bought the Type-A version for use with a work computer, a Windows Surface without USB-C ports. I have since abandoned the PC and use my personal Mac, which is company managed for work. Now I need a USB adapter to use this key. But I’m too cheap to buy another YubiKey.</p>
<h2 id="generate-gpg-keys-with-work-email">Generate GPG keys with work email</h2>
<h3 id="step-1---generate-new-keys">Step 1 - Generate new keys</h3>
<p>As noted above, I bought the Type-A key ages ago. I had a private key on it. But it was borked. So I just generated new ones following <a href="https://developer.okta.com/blog/2021/07/07/developers-guide-to-gpg">Okta: Developers Guide to GPG and YubiKey</a>.</p>
<pre><code class="language-bash">gpg --full-generate-key
</code></pre>
<p>I created the new key using my work email address. See the <a href="https://developer.okta.com/blog/2021/07/07/developers-guide-to-gpg">Okta guide</a> for full steps.</p>
<h3 id="step-2---move-private-keys-to-yubikey">Step 2 - Move private keys to YubiKey</h3>
<p>See the <a href="https://developer.okta.com/blog/2021/07/07/developers-guide-to-gpg">Okta guide</a> for full steps.</p>
<pre><code class="language-bash">gpg --list-keys
gpg --edit-key <KEY-ID>

# then at the interactive prompt:
gpg> keytocard
</code></pre>
<h3 id="step-3---export-public-key-and-add-to-github">Step 3 - Export public key and add to GitHub</h3>
<p>Now the private keys are stored on the physical key, which we’ll need to sign our commits. We want to share the public key with GitHub so they can verify our signatures. First we export it.</p>
<pre><code class="language-bash">gpg --armor --export USER@COMPANY.com > public.key
</code></pre>
<p>Then copy and paste the contents of <code>public.key</code> into GitHub, following this documentation: <a href="https://docs.github.com/en/authentication/managing-commit-signature-verification/adding-a-gpg-key-to-your-github-account">Adding a GPG key to your GitHub account</a>.</p>
<p>Then delete the file <code>rm public.key</code> for good housekeeping.</p>
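<p>Step 4 below needs the long key ID. A self-contained sketch of pulling it out of a <code>gpg --list-secret-keys --keyid-format=long</code> listing with <code>sed</code> (the <code>sec</code> line and its fingerprint here are made-up placeholders, and the pattern is keyed to the <code>rsa4096</code> algorithm from the example):</p>
<pre><code class="language-bash"># sample "sec" line as printed with --keyid-format=long (placeholder ID)
sample='sec   rsa4096/5F9DE7ABCDEF1234 2024-01-13 [SC]'

# the long key ID is the hex string after the algorithm prefix "rsa4096/"
keyid=$(printf '%s\n' "$sample" | sed -n 's#.*rsa4096/\([A-F0-9]*\).*#\1#p')
echo "$keyid"    # 5F9DE7ABCDEF1234
</code></pre>
<p>The same value works for both <code>git config user.signingkey</code> and <code>gpg --edit-key</code>.</p>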
<h3 id="step-4---configure-repository-with-work-user">Step 4 - Configure repository with work user</h3>
<p>The majority of my work is public open source or personal, so my global git settings use that email.</p>
<p>So I have to configure each internal repository manually, by going to the internal project folder and configuring git with the <code>--local</code> flag:</p>
<pre><code class="language-bash">git config --local user.email <WORKEMAIL@microsoft.com>
git config --local user.signingkey <KEYID>
git config --local commit.gpgsign true
</code></pre>
<p>My name is the same, so I only need to configure the email and specify the work-specific <code><KEYID></code>. Voilà, you’re done: whenever you make a <code>git commit</code>, you will be prompted to insert your YubiKey and unlock it with the PIN.</p>
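<p>You can sanity-check the local override in a throwaway repository. A minimal sketch, assuming git is installed; the email and key ID are placeholders:</p>
<pre><code class="language-bash"># create a scratch repo and set the work identity locally
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config --local user.email "work@example.com"
git -C "$repo" config --local user.signingkey "ABCDEF1234567890"
git -C "$repo" config --local commit.gpgsign true

# --local settings win over --global ones inside this repo
git -C "$repo" config --local user.email    # work@example.com
</code></pre>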
<h2 id="configure-authentication-with-multiple-accounts">Configure authentication with multiple accounts</h2>
<p>Now we can sign commits on our local workstations. But our multiple accounts will also need multiple authentication mechanisms. How do we juggle that?</p>
<p>First, you need to <a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens">create personal access tokens</a> for <strong><em>both</em></strong> your personal and work accounts on the GitHub website. After that there are multiple options to juggle the tokens.</p>
<h4 id="warning---never-store-your-personal-access-token-in-your-repositorys-remote-url">Warning - NEVER store your personal access token in your repository’s remote url</h4>
<p>Never store tokens in a URL. Although it works, the token is visible in plain text and not secure. Many bad articles on the internet suggest this awful method:</p>
<pre><code class="language-bash"># Do NOT do this - insecure!
git remote add origin https://<USER>:<PAT>@github.com…
</code></pre>
<p>I have also seen this in my work with customers. Do yourself a favor and take some time to understand how configuration, authentication and security work for git. It is one of those technology/company agnostic skills that’s valuable for life. So invest the time.</p>
<h4 id="option-1---git-credential-manager">Option 1 - <code>git-credential-manager</code></h4>
<p>The most straightforward way, and one that works for all git providers (not just GitHub), is to use <a href="https://github.com/git-ecosystem/git-credential-manager">git-credential-manager (GCM)</a>. It works and integrates well with operating system password managers, but be sure to configure it to use encrypted stores, also for caching. The real challenge is managing multiple users. See this <a href="https://github.com/git-ecosystem/git-credential-manager/blob/release/docs/multiple-users.md">GCM documentation on how to manage multiple users</a>, which is fairly complicated.</p>
<p>I do not use git-credential-manager because I want to use my YubiKey and there are better options for Mac users. So this option is not for me.</p>
<h4 id="option-2---github-cli">Option 2 - GitHub CLI</h4>
<p>If you only need GitHub, a newer and better way is to use the <a href="https://cli.github.com/manual/gh_auth_login">GitHub CLI</a>, which will <em>automatically</em> cache your credentials securely (if possible). See <a href="https://docs.github.com/en/get-started/getting-started-with-git/caching-your-github-credentials-in-git">Caching your GitHub credentials in Git</a>.</p>
<p>I need more than GitHub. And I want to use my YubiKey. So this option is also not for me.</p>
<h4 id="option-3---permanent-authentication-with-encrypted-netrc">Option 3 - Permanent authentication with encrypted <code>.netrc</code></h4>
<p>From a great developer experience perspective, I do not want to juggle multiple credentials. <strong>It should just work</strong> <em>and</em> be secure. This setup requires understanding that under the hood, git uses curl. And curl supports <a href="https://www.gnu.org/software/inetutils/manual/html_node/The-_002enetrc-file.html">netrc</a>. The <a href="https://github.com/git/git/tree/master/contrib/credential/netrc"><code>git-credential-netrc</code> helper is built-in</a> and does not require additional software like the other options described above.</p>
<p>Although I do not need additional software, I do need to encrypt the <code>.netrc</code> file:</p>
<pre><code class="language-bash"># Encrypt the .netrc file (using personal key in example)
gpg --encrypt --recipient <user@email> -o .netrc.gpg .netrc
</code></pre>
<p>and then configure git to use this file:</p>
<pre><code class="language-bash">git config --global credential.helper 'netrc -f ~/.netrc.gpg -v'
</code></pre>
<p>Read on to learn about what’s in that <code>.netrc</code> file.</p>
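<p>If you want to verify the encryption round trip before touching real tokens, here is a sketch using a throwaway keyring and a disposable key (gpg ≥ 2.1 assumed; <code>demo@example.com</code> and the empty passphrase exist only for this demo):</p>
<pre><code class="language-bash"># throwaway keyring so the demo never touches your real keys
work=$(mktemp -d)
export GNUPGHOME="$work/gnupg"
mkdir -p "$GNUPGHOME" && chmod 700 "$GNUPGHOME"

# disposable, passphrase-less key for the demo only
gpg --batch --pinentry-mode loopback --passphrase '' \
  --quick-generate-key "Demo <demo@example.com>" default default 1y

# encrypt a toy .netrc to that key, then decrypt to verify
printf 'machine github.com\nlogin demo\npassword pat\n' > "$work/.netrc"
gpg --encrypt --recipient demo@example.com -o "$work/.netrc.gpg" "$work/.netrc"
gpg --decrypt --quiet "$work/.netrc.gpg" | head -1
</code></pre>
<p>For the real file you would of course encrypt to your own key, as shown earlier, and delete the plaintext afterwards.</p>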
<h4 id="managing-multiple-accounts">Managing multiple accounts</h4>
<p>Identity-based authentication is always complicated. If you read the <a href="https://git-scm.com/docs/gitcredentials">official git doc on configuring credentials</a>, you’ll understand what GCM and the GitHub CLI are doing under the hood - setting custom configurations with hostname or path matching. If you are using GitHub Enterprise Server, you’ll have a separate host, e.g. <code>github.mycompany</code>, which is easier to configure. If you have GitHub Enterprise Cloud however, you will need path matching because both accounts use <code>github.com</code>.</p>
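<p>Under the hood, git describes each credential lookup to its helper as <code>key=value</code> lines on stdin, terminated by a blank line. With path matching enabled, the request includes a <code>path</code> line, which is what makes per-organization matching possible. A sketch of such a request (the org and repo names are placeholders):</p>
<pre><code class="language-bash"># what git writes to a credential helper's stdin for a work repo
printf 'protocol=https\nhost=github.com\npath=work-org/repo.git\n\n'
</code></pre>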
<div class="alert alert-yellow">
<p><strong>Updated 20 January 2024</strong><br />
This article has been updated to fix problems with my workflow. Be sure to include <code>machine github.com</code> <em>twice</em> in the <code>.netrc</code> file and configure git default credentials as described below.</p>
</div>
<pre><code class="language-bash"># .netrc (when decrypted)
machine github.com
login <personal-user>
password <personal-pat>
machine github.com
login <work-user>
password <work-pat>
</code></pre>
<h2 id="week-later-more-complicated-than-thought">1 week later… more complicated than thought</h2>
<p>I did not thoroughly test. I discovered it worked with an unencrypted <code>.netrc</code>, but <strong>not</strong> with the encrypted <code>.netrc.gpg</code>, which was maddening to debug.</p>
<p>Ultimately I added the <code>-d</code> flag to the netrc credential helper and saw it could not pick up the user when I only listed <code>machine github.com</code> once. Based on that I made a few changes to get all of this working.</p>
<h3 id="auto-toggle-users">Auto-toggle users</h3>
<ol>
<li>Added second <code>machine github.com</code> line to the <code>.netrc</code> file</li>
<li>Specified default users in the global <code>~/.gitconfig</code>:</li>
</ol>
<pre><code class="language-bash"># .gitconfig - comments for article only
[credential]
  helper = netrc -f ~/.netrc.gpg -v
  # default to personal user
  user = julie-ng
  # credential helpers should match paths, not just hosts
  useHttpPath = true

# specify work user for work specific repos
[credential "https://github.com/<WORK_ORG>/*"]
  user = <WORK_USER>
</code></pre>
<p>Finally in my desperate debugging, I had also specified the user explicitly in the git remote URLs, for example:</p>
<pre><code class="language-bash"># if not using .gitconfig above
git remote set-url origin https://<WORK_USER>@github.com/<WORK_ORG>/…
</code></pre>
<p>I have since removed the users from my remote URLs now that everything is in the global git configuration. Finally 😅</p>
<h2 id="conclusion">Conclusion</h2>
<p>So we jumped through all the hoops of…</p>
<ul>
<li>creating multiple GPG keys</li>
<li>storing our private keys on multiple YubiKeys</li>
<li>configuring our git clients to handle multiple authentication credentials</li>
</ul>
<p>…just so I can get a green “Verified” badge and see my photo 😅</p>
<div class="article-image">
<img src="/assets/images/2024/ghe-gpg-verified.png" alt="Verified commits with work email" style="max-width:600px" />
<span>"Verified" commit when signed with work email</span>
</div>
<p>In all seriousness, the hoops are worth it <em>to me</em> to ensure no one can impersonate me. And if someone got access to my computer, it would be impossible for them to get at my credentials without the physical YubiKeys and the PINs to unlock them.</p>
<h1 id="infra-as-code-monorepo">Infrastructure as Code and Monorepos - a Pragmatic Approach</h1>
<p><em>Julie Ng · 2022-01-05 · last updated 2024-01-20</em></p>
<p class="lead">As engineers move beyond “hello world” samples, they can struggle to extend the code to multiple deployment targets and to create automation pipelines. How can we structure code for re-use and automation <em>and</em> ensure we won’t accidentally deploy to production?</p>
<figure class="figure-center">
<img src="/assets/images/2021/iac-monorepo-how.png" alt="Infrastructure as Code monorepo for multiple environments" width="540">
</figure>
<p>There are many ways to do this. In this article I will share one solution that uses a monorepo to deploy and manage multiple Kubernetes clusters. The source code is public, maintained and available at <a href="https://github.com/julie-ng/cloudkube-aks-clusters">github.com/julie-ng/cloudkube-aks-clusters</a>.</p>
<p><strong>Disclaimer:</strong> this article is a mix of best practices and a walkthrough of a specific high-trust use case. Your requirements may differ.</p>
<h2 id="what-is-a-monorepo">What is a Monorepo?</h2>
<p>In the context of cloud infrastructure automation, a monorepo approach refers to a single repository that holds <strong>both</strong>:</p>
<ul>
<li>deployment templates</li>
<li>deployment configuration</li>
</ul>
<p>which has the following consequences.</p>
<h4 id="advantages">Advantages</h4>
<ul>
<li>Easier to understand</li>
<li>Faster to debug when configuration is next to template code</li>
</ul>
<h4 id="disadvantages">Disadvantages</h4>
<ul>
<li>Urge to “copy & paste” code, because debugging in one repository is easier than correlating code across two repositories with separate git histories</li>
<li>A single repo means only 1 security boundary in git</li>
</ul>
<p>Because you cannot use folders as a security boundary in git, anyone with write-access to the monorepo can trigger deployments, incl. to production. It is possible to introduce a <em>soft</em> boundary by using a combination of <a href="https://www.atlassian.com/git/tutorials/making-a-pull-request">Pull Request workflow</a> and <a href="https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/defining-the-mergeability-of-pull-requests/about-protected-branches">protected branches</a>. But organizations with stricter requirements to remove write-access from developers will adopt the multi-repo approach.</p>
<h4 id="my-use-case">My Use Case</h4>
<p>In my <a href="https://github.com/julie-ng/cloudkube-aks-clusters">cloudkube-aks-clusters</a> project, I do not need such a security boundary because it’s just me and thus a high trust scenario.</p>
<h2 id="leverage-software-modules-for-multiple-environments">Leverage Software Modules for Multiple Environments</h2>
<p>Do not use copy and paste <em>ever</em>. For work-in-progress code, leverage <a href="https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell">git branches</a>. If you are not experienced with creating software modules, start with a single giant file to make progress. Once you can deploy the infrastructure you need (but do not wait until it’s perfect), refactor into modules to follow software programming best practice and <a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself">DRY your code</a>.</p>
<p>Once you DRY your code, you will have an <strong>abstraction</strong> for your environment, which will include all the compute and data infrastructure for your workloads. Your abstraction will have different syntax depending on the language you choose. If you’re using Pulumi and JavaScript, your abstraction may look something like this:</p>
<pre><code class="language-javascript">// example IaC Module pseudo-code
const AppEnvironment = require('custom-module')

const dev = new AppEnvironment({
  name: 'dev',
  postgresVersion: '14.1'
})

const prod = new AppEnvironment({
  name: 'prod',
  postgresVersion: '13.5'
})
</code></pre>
<p>I prefer <a href="https://www.terraform.io/">Terraform</a>, but the concepts of software modules and parameters are generic and will also apply to <a href="https://docs.microsoft.com/en-us/azure/azure-resource-manager/bicep/modules">Azure Bicep modules</a> and the modules ecosystem of the <a href="https://www.pulumi.com/">Pulumi</a> language you choose, e.g. <a href="https://docs.npmjs.com/about-packages-and-modules">npm packages or npm modules</a> for JavaScript.</p>
<h3 id="why-are-custom-abstractions-necessary">Why are Custom Abstractions necessary?</h3>
<p>Official modules, e.g. the <a href="https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster">official Microsoft-managed Terraform resource</a> for an Azure Kubernetes cluster, are bare-bones by design. Many small modules allow for the greatest flexibility in customizing your architecture. To get started you may use the official provider to create and deploy a resource, e.g. the Kubernetes cluster. In real life, you will eventually need to add your own specific requirements.</p>
<h4 id="my-kubernetes-cluster-requirements">My Kubernetes Cluster Requirements</h4>
<p>For example, my <a href="https://github.com/julie-ng/cloudkube-aks-clusters/tree/main/modules/aks-cluster">aks-cluster</a> module adds some security and automation resources on top of my Kubernetes cluster:</p>
<ul>
<li><a href="https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview">Virtual Networks</a> for cluster integration</li>
<li>Headless <a href="https://docs.microsoft.com/en-us/azure/role-based-access-control/overview#security-principal">security principal</a> to be used by cluster <a href="https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/">ingress controller</a> to fetch TLS certificates</li>
<li>Headless <a href="https://docs.microsoft.com/en-us/azure/role-based-access-control/overview#security-principal">security principal</a> to be used in CI/CD automation</li>
<li>An <a href="https://docs.microsoft.com/en-us/azure/key-vault/general/basic-concepts">Azure Key Vault</a> for Kubernetes <a href="https://docs.microsoft.com/en-us/azure/aks/csi-secrets-store-driver">secrets integration</a></li>
<li>etc.</li>
</ul>
<p>Note these resources are created <em>per environment</em> to follow <a href="https://en.wikipedia.org/wiki/Principle_of_least_privilege">Principle of Least Privilege</a>, which is one of <strong><em>my</em></strong> specific requirements. Your requirements may differ.</p>
<h3 id="separate-configuration-files-per-environment">Separate Configuration Files per Environment</h3>
<p>Once we’ve created an IaC module, we can re-use the same code for multiple deployment environments. This is done using separate config files per environment. The IaC configuration only needs to know which parameters to set, for example this excerpt from my <a href="https://github.com/julie-ng/cloudkube-aks-clusters/blob/main/environments/dev/dev.cluster.tfvars">dev.cluster.tfvars</a>:</p>
<pre><code class="language-hcl"># module config (excerpt)
name = "cloudkube-dev"
env = "dev"
hostname = "dev.cloudkube.io"
kubernetes_version = "1.20.9"
</code></pre>
<p>Treat your module like software and provide documentation so engineers know how to use it. At a minimum you should document required parameters and default values.</p>
<p><strong>Pro Tip</strong> - if you are using Terraform you can autogenerate module documentation with <a href="https://terraform-docs.io/">terraform-docs.io</a>. See this generated <a href="https://github.com/julie-ng/cloudkube-aks-clusters/blob/main/modules/aks-cluster/README.md">README.md</a> summary, which saves me the trouble of having to open and read multiple Terraform files. Be aware the docs are only as good as your coding.</p>
<h2 id="use-subfolders-per-environment-configuration">Use Subfolders Per Environment Configuration</h2>
<p>Now that we have re-usable IaC modules, the next challenge is to setup automation pipelines that do not <em>unintentionally</em> deploy to production. This is a common fear for engineers getting started with DevOps and CI/CD.</p>
<p>The hurdle here is to understand that while most beginner pipeline documentation focusses on <a href="https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions#onpushpull_requestbranchestags">branch triggers</a>, pipelines also have <a href="https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions#onpushpull_requestpaths">path triggers</a> and you will need <em>both</em>. Unlike application pipelines however, your deployment target will be determined by <strong>paths</strong> not branches.</p>
<p>To better understand this, let’s walk through an example.</p>
<h3 id="leverage-path-based-pipeline-triggers">Leverage Path based Pipeline Triggers</h3>
<p>Given the following file tree structure (with example multi-region production scenario)…</p>
<pre><code class="language-text">environments/
├── dev/
├── prod-northeurope/
├── prod-westeurope/
└── staging/
</code></pre>
<p><em>and</em> given the following pipeline triggers…</p>
<pre><code class="language-yaml"># azure-pipelines/production.yaml
trigger:
  branches:
    include:
      - main
  paths:
    include:
      - 'environments/prod-northeurope/*'
      - 'environments/prod-westeurope/*'
</code></pre>
<p>I could then create pipelines that <strong><em>only</em></strong> run against production environments <strong>IF</strong>…</p>
<ul>
<li>a commit is pushed to the <code>main</code> branch</li>
<li>a change is made to configuration files inside <code>prod-northeurope</code> and <code>prod-westeurope</code> subfolders.</li>
</ul>
<h3 id="work-in-progress-changes">Work in Progress Changes</h3>
<p>In this scenario, I can actively make changes to the Terraform module code in the <a href="../modules/"><code>modules/</code></a> folder, but automated deployments using the triggers above will <strong>not run against production until</strong> changes are made to the <code>environments/prod…</code> folders.</p>
<p>To better illustrate the various triggers, let’s map the corresponding deployments into a table.</p>
<table class="has-border">
<thead>
<tr>
<th style="text-align: left">Pipeline</th>
<th style="text-align: left">Branch</th>
<th style="text-align: left">Path</th>
<th style="text-align: left">Deployment Target</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><code>ci.yaml</code></td>
<td style="text-align: left"><code>*</code></td>
<td style="text-align: left"><code>*</code></td>
<td style="text-align: left">-</td>
</tr>
<tr>
<td style="text-align: left"><code>cd.yaml</code></td>
<td style="text-align: left"><code>dev</code></td>
<td style="text-align: left"><code>modules/*</code></td>
<td style="text-align: left">DEV</td>
</tr>
<tr>
<td style="text-align: left"><code>cd.yaml</code></td>
<td style="text-align: left"><code>dev</code></td>
<td style="text-align: left"><code>environments/dev/*</code></td>
<td style="text-align: left">DEV</td>
</tr>
<tr>
<td style="text-align: left"><code>cd.yaml</code></td>
<td style="text-align: left"><code>main</code></td>
<td style="text-align: left"><code>modules/*</code></td>
<td style="text-align: left">Staging</td>
</tr>
<tr>
<td style="text-align: left"><code>cd.yaml</code></td>
<td style="text-align: left"><code>main</code></td>
<td style="text-align: left"><code>environments/staging/*</code></td>
<td style="text-align: left">Staging</td>
</tr>
<tr>
<td style="text-align: left"><code>cd-production.yaml</code></td>
<td style="text-align: left"><code>main</code></td>
<td style="text-align: left"><code>environments/prod-northeurope/*</code></td>
<td style="text-align: left">Production (North Europe)</td>
</tr>
<tr>
<td style="text-align: left"><code>cd-production.yaml</code></td>
<td style="text-align: left"><code>main</code></td>
<td style="text-align: left"><code>environments/prod-westeurope/*</code></td>
<td style="text-align: left">Production (West Europe)</td>
</tr>
</tbody>
</table>
<h3 id="leverage-resource-tagging-and-iac-versioning">Leverage Resource Tagging and IaC Versioning</h3>
<p>Sometimes changes may be under-the-hood improvements, e.g. refactoring the infrastructure as code. But you should still deploy to production to confirm that the infrastructure does not change. You can test this <em>without</em> changing the infrastructure by using tags. See this <a href="https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/decision-guides/resource-tagging/?toc=/azure/azure-resource-manager/management/toc.json#resource-tagging-patterns">Azure documentation</a> for examples of common tagging patterns. Resource tagging is a generic concept also offered by other cloud providers.</p>
<p>Using tags is straight-forward and a general good practice. Just as I tag my resources <code>env:staging</code>, I could also tag them <code>iac-version:1.28</code> and bump the versions according to your schema. I prefer <a href="https://semver.org/">semantic versioning</a>.</p>
<h2 id="infrastructure-as-code-rollbacks">Infrastructure as Code Rollbacks</h2>
<p>Rollbacks are a part of real life cloud engineering. And they ARE scary. But over time and with experience, it is straight-forward to rollback configuration changes <em>with confidence</em>.</p>
<h4 id="dev-and-staging-rollbacks">Dev and Staging Rollbacks</h4>
<p>Non-production rollbacks are expected to be messy because they are not versioned.</p>
<p>Personally, I generally only track these by the git branch heads. So if I need to undo a change, I need to change the code. Because I don’t like waiting minutes for CI builds, I tend to apply Terraform changes locally and check in the code <em>after</em> I have the result I want. So when the pipeline runs against the remote backend, it won’t find any configuration changes and will not execute <code>terraform apply</code>.</p>
<p>The trade-off here is the risk of the “it works on my machine” effect. In the <a href="https://github.com/julie-ng/cloudkube-aks-clusters">cloudkube-aks-clusters</a> repo, I’m the only contributing engineer and thus the risk is low. Your mileage will vary.</p>
<h4 id="production-rollbacks">Production Rollbacks</h4>
<p>The key here is <em>discipline</em> when using git. In general for both application and infrastructure workloads, I tend to track production code with <em>both</em></p>
<ul>
<li><code>production</code> or <code>main</code> branch heads, i.e. tips</li>
<li>git tags in <a href="https://semver.org/">semantic versioning</a> format. See example <a href="../CHANGELOG.md">CHANGELOG.md</a> created with <a href="https://github.com/conventional-changelog/standard-version">standard-version</a> from this <a href="https://github.com/julie-ng/cloudkube-aks-clusters">cloudkube-aks-clusters</a> repo.</li>
</ul>
<p>Using tags, I have a clearer overview of intended deployments. In the simplest scenario, if I am at v0.3.0 but want to roll back to v0.2.1, I would return the code to that previous point (preferably without a force push) and re-deploy.</p>
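<p>That force-push-free return trip can be done with <code>git revert</code> over the range of commits after the tag. A toy sketch, assuming git is installed (the file contents and tag names are placeholders):</p>
<pre><code class="language-bash"># toy repo with two tagged "releases"
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.email demo@example.com
git -C "$repo" config user.name Demo
echo 'cluster_version = "0.2.1"' > "$repo/main.tf"
git -C "$repo" add . && git -C "$repo" commit -qm "release: 0.2.1"
git -C "$repo" tag v0.2.1
echo 'cluster_version = "0.3.0"' > "$repo/main.tf"
git -C "$repo" commit -qam "release: 0.3.0"
git -C "$repo" tag v0.3.0

# revert everything after v0.2.1 - history moves forward, content moves back
git -C "$repo" revert --no-edit v0.2.1..HEAD
cat "$repo/main.tf"    # cluster_version = "0.2.1"
</code></pre>
<p>History stays linear and auditable, which matters when a pipeline (or an auditor) replays it later.</p>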
<h4 id="deployments-outside-of-pipeline-runs">Deployments outside of Pipeline Runs</h4>
<p>Most engineers understand that for example the v0.3.0 deployment might be numbered deployment #86 and the rollback is deployment #87. But there can also be a deployment gap, for example:</p>
<table class="has-border">
<thead>
<tr>
<th style="text-align: left">Deployment #</th>
<th style="text-align: left">Trigger</th>
<th style="text-align: left">Details</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">86</td>
<td style="text-align: left">git push</td>
<td style="text-align: left">Pipeline Deploy of v0.3.0</td>
</tr>
<tr>
<td style="text-align: left">87</td>
<td style="text-align: left">Nightly Scheduled Run</td>
<td style="text-align: left">Resolve configuration drift that someone did in Cloud Provider UI</td>
</tr>
<tr>
<td style="text-align: left">88</td>
<td style="text-align: left">Manual or git push</td>
<td style="text-align: left">Pipeline Deploy or Manual Rollback to v0.2.1</td>
</tr>
</tbody>
</table>
<p>In this example an engineer intentionally triggers deployments #86 and #88. But deployment #87 runs in between and may be triggered outside the normal engineer workflow. It can be an engineer who configured their own scheduled nightly runs. It can also be a change that is enforced centrally, for example if an organization uses <a href="https://docs.microsoft.com/en-us/azure/governance/policy/concepts/effects">Azure Policy</a> to strictly enforce governance of their cloud real estate.</p>
<p>To have predictable and reliable infrastructure, you need to be aware of all the ways deployments can happen, including the ones outside of your control.</p>
<h2 id="when-should-you-not-use-a-monorepo">When should you <em>not</em> use a Monorepo?</h2>
<p>Starting with a monorepo is the quickest way to deployment. It’s simple but also very versatile for the experienced engineer.</p>
<p>So when should you not use a monorepo? That will be a future article ;-) Follow me on <a href="https://twitter.com/jng5">Twitter</a> and <a href="https://www.youtube.com/c/JulieNgTech/">YouTube</a> to be notified when it gets published.</p>
<hr />
<p><em>P.S. Props to anyone who went through the <a href="https://github.com/julie-ng/cloudkube-aks-clusters">julie-ng/cloudkube-aks-clusters</a> code and noticed it does <strong>not</strong> have any pipelines. That’s another way to approach security - remove automation altogether ;-)</em></p>
<p><em>In all seriousness, if you’re looking for infrastructure as code pipelines, check out the <a href="https://github.com/Azure/devops-governance/tree/main/azure-pipelines">azure/devops-governance</a> repo, which follows a very similar pipeline structure to the one described above. That project includes pipelines because they deploy to a Visual Studio Enterprise subscription on my personal Azure AD tenant. The Kubernetes clusters project described in this article is deployed to a Microsoft internal Azure subscription. So there are no automation pipelines in this public repository - better safe than sorry.</em></p>
CI/CD Review - How DevOps in Real Life & Mature Organizations workshttp://julie.io/writing/ci-cd-review/2021-03-01T01:00:00+01:002024-01-20T10:41:06+01:00Julie Ng<p class="lead">People love checklists because they give the illusion of an easy success. But DevOps is not straight-forward and looks different for each team and application. That is why I conduct reviews with Azure customers as an engineer at Microsoft in an interview-style discussion. And like an interview, I’m not challenging your answers, but your thought process.</p>
<p><em>Want some more context and answers? Watch me walkthrough some of these questions and share examples from real life.</em></p>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/e4lJmgd_4DA" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h3 id="goal-of-the-review">Goal of the Review</h3>
<p>This exercise focuses on DevOps in practice, not in theory. After going through the questions, you should be able to better gauge <em>your confidence</em> in <em>your practices</em> meeting the requirements of <em>your use case</em>. It will also help you figure out what is the next practice you want to improve upon.</p>
<p>Most of us don’t meet every requirement and do everything listed below all the time for every project. Re-visit this exercise every now and then and continuously improve.</p>
<p>Keep in mind this conversation is <strong>cloud-agnostic and therefore for everyone</strong>, not just Microsoft Azure customers. In fact, if you know me personally you will know my favorite CI/CD tool is Jenkins.</p>
<p>The questions are organized into the following categories, loosely structured around Microsoft’s Well Architected Framework:</p>
<ul>
<li><a href="#release-management">Release Management</a></li>
<li><a href="#pipelines">Pipelines</a></li>
<li><a href="#security">Security</a></li>
<li><a href="#governance">Governance</a></li>
<li><a href="#cost-optimization">Cost Optimization</a></li>
</ul>
<h2 id="release-management">Release Management</h2>
<h4 id="what-is-your-versioning-scheme">1. What is your versioning scheme?</h4>
<ul>
<li>Can you tell me which versions of your code are on Dev vs QA vs Production?</li>
<li>The <a href="https://semver.org/">Semantic Versioning</a> format of <code>MAJOR.MINOR.PATCH</code> is the most common in Open Source Software.</li>
</ul>
<h4 id="do-you-use-naming-conventions">2. Do you use naming conventions?</h4>
<ul>
<li><strong>Branch names</strong>
Common examples include:
<ul>
<li><code>feat/*</code></li>
<li><code>fix/*</code></li>
<li><code>main</code></li>
<li><code>production</code></li>
<li><code>qa</code></li>
</ul>
</li>
<li><strong>Commit messages</strong>
<a href="https://www.conventionalcommits.org/en/v1.0.0/">Conventional Commits</a> is a common standard in Open Source. Examples include:
<ul>
<li><code>docs(readme): add instructions</code></li>
<li><code>chore(deps): update</code></li>
<li><code>chore(release): 0.7.0</code></li>
<li><code>feat(signup): add new button</code></li>
<li><code>fix(ui): misaligned header</code></li>
</ul>
</li>
</ul>
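<p>A lightweight way to enforce such a convention is a git <code>commit-msg</code> hook. The sketch below hard-codes a sample message so it runs standalone; in a real hook the message comes from the file git passes as <code>$1</code>, and the regex covers only a slice of the full Conventional Commits grammar:</p>
<pre><code class="language-shell">#!/bin/sh
# Illustrative .git/hooks/commit-msg (simplified, not the full spec)
msg="feat(signup): add new button"      # in a real hook: msg=$(cat "$1")
if echo "$msg" | grep -Eq '^(feat|fix|docs|chore|refactor|test)(\([a-z-]+\))?(!)?: .+'; then
  echo "commit message OK"
else
  echo "commit message does not follow Conventional Commits"
  exit 1
fi
</code></pre>
<p>Tools like commitlint do the same job with a complete parser, but a one-line regex is often enough to get a team started.</p>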
<h4 id="do-you-have-a-change-log">3. Do you have a Change Log?</h4>
<ul>
<li>Is it automated?</li>
<li>It is absolutely OK to start with a manual change log.</li>
</ul>
<p>This is an example changelog from my <a href="https://github.com/julie-ng/azure-nodejs-demo/blob/main/CHANGELOG.md">azure-nodejs-demo</a> project, which is generated with <a href="https://www.npmjs.com/package/standard-version">Standard Version</a>:</p>
<p><img src="/assets/images/2021/auto-changelog-example.png" alt="Automated Changelog" class="has-border" width="450" /></p>
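<p>For a Node.js project, the release step behind such a changelog is typically a one-line script in <code>package.json</code> (a sketch; the script name is a convention, not a requirement):</p>
<pre><code class="language-json">{
  "scripts": {
    "release": "standard-version"
  }
}
</code></pre>
<p>Running <code>npm run release</code> then bumps the version based on your Conventional Commits, updates <code>CHANGELOG.md</code> and creates a tagged release commit.</p>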
<h4 id="are-you-linking-commits-to-features-bugs-etc-in-your-dev-planning-tool-eg-azure-boards-or-github-issues">4. Are you linking commits to features, bugs, etc. in your dev planning tool (e.g. Azure Boards or GitHub Issues)?</h4>
<p>Note how in the example change log above the features and bug fixes are linked to specific commits. It’s easier than it looks. For more info, see your provider’s documentation:</p>
<ul>
<li><a href="https://docs.github.com/en/github/writing-on-github/autolinked-references-and-urls">GitHub: Autolinked references and URLs</a></li>
<li><a href="https://docs.gitlab.com/ee/user/project/issues/crosslinking_issues.html">GitLab: Crosslinking Issues</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/devops/notifications/add-links-to-work-items?view=azure-devops#link-to-work-items-from-pull-requests-commits-and-comments">Azure DevOps: Link to work items from pull requests, commits, and comments</a></li>
</ul>
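<p>For example, on GitHub a commit message that references an issue number is linked automatically, and closing keywords resolve the issue once the commit reaches the default branch (the issue number here is made up):</p>
<pre><code class="language-plaintext">fix(ui): misaligned header

Fixes #123
</code></pre>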
<h4 id="can-you-describe-your-git-branching-workflow">5. Can you describe your git branching workflow?</h4>
<p>There is no single “correct” answer, even for the same use case. A developer team must decide together and <em>commit</em> to following it. One of the most frustrating periods in my career was trying to force my co-workers to work in the pedantic way I do. Unsurprisingly, I was not very popular. We were working to port a legacy application to the cloud and eventually the team learned to appreciate git submodules, after they gained experience with <em>how</em> to use them. It was my mistake to not let them learn at their own pace.</p>
<h5 id="pro-tips">Pro Tips</h5>
<ul>
<li>Your branch workflow should be documented. Consider also drawing this out.</li>
<li>To test your mastery, see if you can explain your workflow <em>without</em> notes and sketch the workflow from scratch. Start with a simple monolithic project, then do the same with more complex situations, e.g. with:
<ul>
<li>dependencies on other services</li>
<li>distinct environments, e.g. <code>staging</code>, <code>uat</code> and <code>production</code></li>
<li>infrastructure - if you own it and have infrastructure as code</li>
</ul>
</li>
</ul>
<h5 id="resources-getting-started-with-git-workflows">Resources: Getting Started with Git Workflows</h5>
<ul>
<li>This <a href="https://www.atlassian.com/git/tutorials/comparing-workflows">comparing workflows article</a> from Atlassian is a good place to start.</li>
<li><a href="https://www.endoflineblog.com/oneflow-a-git-branching-model-and-workflow">OneFlow</a> is also popular, more recent and worth mentioning.</li>
</ul>
<h2 id="pipelines">Pipelines</h2>
<p>Please note these questions will be very <strong>workload specific</strong>. If you are trying to measure your own expertise, try mapping out answers for both simple and complex workloads.</p>
<h4 id="do-your-pipelines-generate-assets-eg-binaries-builds">6. Do your pipelines generate assets, e.g. binaries, builds?</h4>
<ul>
<li>How are they archived?</li>
<li>How are they distributed? How many people in your organization have access?</li>
<li>Have you built artifacts that contain secrets or certificates? Have you secured them? Note: obviously you should not do this. But sometimes you have to deal with a legacy application.</li>
</ul>
<h4 id="when-your-pipeline-runs-how-many-environments-does-it-deploy-to">7. When your pipeline runs, how many environments does it deploy to?</h4>
<p>One push should trigger deployment(s) for a <em>single</em> environment. Confirm that you have used <code>condition</code>s and triggers properly to ensure production is not accidentally deployed to.</p>
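<p>In Azure Pipelines, for example, such a guard could look roughly like this (the branch and stage names are illustrative, and a real stage would also contain jobs):</p>
<pre><code class="language-yaml">trigger:
  branches:
    include:
      - main                # only pushes to main start this pipeline

stages:
  - stage: deploy_production
    # belt and suspenders: never run this stage from another branch or a PR build
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
</code></pre>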
<h4 id="do-you-schedule-your-pipelines-to-run-regularly-to-ensure-it-still-works">8. Do you schedule your pipelines to run regularly to ensure it <em>still</em> works?</h4>
<ul>
<li>Are you just running unit tests?</li>
<li>Are you also deploying to (non-prod) environments?</li>
</ul>
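<p>In Azure Pipelines, such a scheduled verification run can be declared alongside the push trigger; the schedule below is just an example:</p>
<pre><code class="language-yaml">schedules:
  - cron: "0 3 * * *"       # every night at 03:00 UTC
    displayName: Nightly verification run
    branches:
      include:
        - main
    always: true            # run even when there are no new commits
</code></pre>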
<h4 id="how-do-you-re-use-pipeline-code">9. How do you re-use pipeline code?</h4>
<p>If you are just starting with DevOps, ignore this. An additional abstraction layer will not help you master the one measure that matters: how often you deploy. If you choose to go this route, I would ask you:</p>
<ul>
<li><em>WHY</em>? What do you hope to achieve?</li>
<li>What is your versioning model?</li>
<li>Is this a public or private library? If private, how do you secure it?</li>
<li>Who owns and <em>maintains</em> this code?</li>
</ul>
<p>If you want to pursue knowledge transfer in your organization, I can tell you based on first-hand experience at Allianz Germany that this is more daunting than it appears. If you don’t create and communicate your ownership and collaboration model correctly from the beginning, you’ll end up with dozens of forks, trying to support outdated versions - and maybe worse off than if you didn’t have libraries to begin with.</p>
<h5 id="vendor-documentation">Vendor Documentation</h5>
<ul>
<li><a href="https://docs.microsoft.com/en-us/azure/devops/pipelines/process/templates?view=azure-devops">Azure DevOps Pipeline Templates</a></li>
<li><a href="https://www.jenkins.io/doc/book/pipeline/shared-libraries/">Jenkins Pipeline Libraries</a></li>
<li><a href="https://docs.github.com/en/actions/creating-actions/creating-a-composite-run-steps-action">GitHub Actions - Composite Run Steps Action</a></li>
</ul>
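<p>As one answer to the versioning question above, Azure DevOps lets a consuming pipeline pin a shared template repository to a tag, so teams opt in to upgrades instead of being surprised by them (the repository and file names here are made up):</p>
<pre><code class="language-yaml">resources:
  repositories:
    - repository: templates
      type: git
      name: MyProject/pipeline-templates
      ref: refs/tags/v1.2.0           # pinned version of the shared library

steps:
  - template: steps/build.yml@templates
</code></pre>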
<h4 id="pull-requests---do-they-trigger-pipelines-which-ones">10. Pull Requests - do they trigger pipelines? Which ones?</h4>
<ul>
<li>
<p>As I explain in <a href="https://www.youtube.com/watch?v=lCCRmxYZUs8">this YouTube video, Pull Requests are a security backdoor</a>. Therefore, make sure you go through all your pull request workflows and pipeline code to ensure they only run when you intend them to run.</p>
</li>
<li>
<p>Are you <em>sure</em> production is not accidentally deployed?
This is an important sanity check question. I often ask myself this too to ensure I verify my assumptions and work before moving on.</p>
</li>
</ul>
<h3 id="deployment-strategies">Deployment Strategies</h3>
<h4 id="what-is-the-difference-between-your-dev-and-prod-environments-how-does-it-affect-your-confidence-to-deploy-to-production">11. What is the difference between your dev and prod environments? How does it affect your confidence to deploy to production?</h4>
<ul>
<li>Some people are comfortable with just a dev and production environment. Other teams want a more stable “staging” environment before production. Which group do you belong to? Why?</li>
<li>Most people think about source code when it comes to pre-production. What about your data? Do you have test data that is as close to production as possible? How?</li>
</ul>
<h4 id="what-is-your-production-rollout-strategy">12. What is your production rollout strategy?</h4>
<p>It is <strong>very much OK to deploy manually</strong> to production, regardless of whether your organization is new to CI/CD. Some organizations disallow automatic deliveries (to production) for compliance reasons.</p>
<p>If you practice continuous delivery (and most of us are not Netflix), here are the most common options:</p>
<ul>
<li>Rolling Updates</li>
<li>Blue/green deployments</li>
<li>Canary deployments</li>
</ul>
<p>If you choose automatic deliveries, I would challenge you further on the following questions 13-15 that also relate to deployment.</p>
<h4 id="how-do-you-update-your-database-when-you-release-a-new-feature-to-your-data-models">13. How do you update your database when you release a new feature to your data models?</h4>
<ul>
<li>Do you migrate the database first and then release the new code? Or vice versa? Why?</li>
<li>Is this done via your software Framework, e.g. <a href="https://guides.rubyonrails.org/active_model_basics.html">Active Model</a> or <a href="https://docs.microsoft.com/en-us/ef/">Entity Framework</a>? Or are you writing SQL scripts?</li>
<li>What happens if you have 2 versions of your application running against the same database?</li>
<li>How do you <em>revert a database migration</em>? Also a part of Question 15.</li>
<li>Do you have model validations in your software? Do you know if existing production data is still valid? How?</li>
</ul>
<h4 id="how-do-you-know-if-a-deployment-succeeded">14. How do you know if a deployment succeeded?</h4>
<ul>
<li>Do you have automated end to end tests? What is your coverage percentage?</li>
<li>Are you testing by hand?</li>
<li>Sometimes a deployment is successful but the server returns a 50x. How would you catch this? What role does monitoring play here?</li>
</ul>
<h4 id="how-do-you-perform-rollbacks">15. How do you perform rollbacks?</h4>
<p>Let’s assume a security bug was deployed in your last release…</p>
<ul>
<li>How do you rollback code <em>and</em> the database if needed?</li>
<li>Will your users notice? In what way?</li>
<li>How will you document and version this rollback?</li>
</ul>
<p>Think about consequences of just overwriting existing code. Can you really just do a simple <code>git revert</code>?</p>
<h4 id="do-production-deployments-need-to-be-approved-manually">16. Do production deployments need to be approved manually?</h4>
<p>If so, how are you achieving this? Examples include:</p>
<ul>
<li>Pull Requests</li>
<li>Approvals and Release Gates</li>
</ul>
<h2 id="security">Security</h2>
<h4 id="credentials-and-secrets">17. Credentials and Secrets:</h4>
<ul>
<li>Where are your credentials stored?</li>
<li>Can they be exposed as plain text in any way? What happens if a developer tries to <code>echo $SECRET</code> in a pipeline?</li>
</ul>
<p>It’s important to understand that giving access to run a pipeline is giving access to the secret. Once in the build job (via pipeline as code), a rogue developer could send the credential off to another location if she wanted to. Therefore it’s important to discuss the role of pull requests here and how to separate credentials across environments.</p>
<h4 id="how-are-you-separating-and-storing-configuration">18. How are you separating and storing configuration?</h4>
<ul>
<li>What is saved in git?</li>
<li>What is configured in environment variables?</li>
<li>Which credentials are stored in the build server?</li>
<li>Which credentials are stored in a secret management service, e.g. <a href="https://azure.microsoft.com/en-us/services/key-vault/">Azure Key Vault</a> or <a href="https://www.vaultproject.io/">HashiCorp Vault</a>?</li>
<li>How do you ensure development environments only have access to development credentials and ditto with production?</li>
</ul>
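<p>As one concrete pattern for the last two questions: in Azure Pipelines, secrets can be pulled at runtime from Key Vault through a service connection that is scoped per environment, so a dev pipeline physically cannot read production secrets (all names below are illustrative):</p>
<pre><code class="language-yaml">steps:
  - task: AzureKeyVault@2
    inputs:
      azureSubscription: arm-connection-dev   # dev pipelines only get the dev connection
      KeyVaultName: kv-myapp-dev
      SecretsFilter: DbPassword               # fetch only what this job needs
</code></pre>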
<h2 id="governance">Governance</h2>
<h4 id="are-you-using-a-single-identity-plane-across-cicd-and-the-cloud">19. Are you using a single identity plane across CI/CD and the cloud?</h4>
<p>Basically, I am asking whether you have the same RBAC for direct access to the cloud API as for CI/CD, starting with git. If not, you may have a back door somewhere, because you would need to keep your RBAC rules in sync manually.</p>
<p>I have seen scenarios where developers were not allowed to access production environments from Azure Portal and Azure CLI. But if they knew how to trigger the pipelines, they could potentially take down production anyway.</p>
<h4 id="how-have-you-documented-rbac-and-acls">20. How have you documented RBAC and ACLs?</h4>
<p>Governance is complex, even for smaller teams. Maybe you can answer my questions today - but what about next month? That’s why you need to document.</p>
<p>See <a href="https://docs.microsoft.com/en-us/azure/devops/organizations/security/permissions-access?toc=%2Fazure%2Fdevops%2Fsecurity-access-billing%2Ftoc.json&bc=%2Fazure%2Fdevops%2Fsecurity-access-billing%2Fbreadcrumb%2Ftoc.json&view=azure-devops">“Default permissions and access for Azure DevOps”</a> as an example.</p>
<h4 id="how-are-you-ensuring-only-authorized-developers-can-deploy-to-production">21. How are you ensuring only authorized developers can deploy to production?</h4>
<p>This is an open ended question designed to test how well you understand your workflow. If you are in my customer session, I would ask you to share your screen and show access controls. I don’t go over them line by line but look at:</p>
<ul>
<li>Branch Protection configuration, e.g. require pull requests, are force pushes allowed?</li>
<li>Pull Request configuration, e.g. who can approve, passing build requirements, etc.</li>
</ul>
<h4 id="how-do-you-handle-access-to-shared-protected-resources-if-applicable">22. How do you handle access to shared protected resources? (if applicable)</h4>
<p>In larger organizations, there may be shared resources, e.g. an artifact registry that is managed outside the developer team.</p>
<ul>
<li>Who has write access? To which scope?</li>
<li>Which resources must be shared and how do you ensure that developers have read-only access?</li>
</ul>
<h4 id="are-you-signing-your-commits-to-verify-identity-if-applicable">23. Are you signing your commits to verify identity? (if applicable)</h4>
<p>Note: git only checks integrity, not authentication. The only way to verify authorship of a commit is to sign commits. Unfortunately Azure DevOps does not support this. But GitHub does.</p>
<p><img src="/assets/images/2021/github-verified-commit.png" alt="GitHub shows verified commits" class="has-border" /></p>
<h2 id="cost-optimization">Cost Optimization</h2>
<h4 id="do-you-clean-up-artifacts">24. Do you clean up artifacts?</h4>
<p>Some build jobs will produce an artifact for every run. How do you clean up the ones that never make it to production and store the ones that do?</p>
<h4 id="do-you-use-a-different-environment-for-development-that-is-sized-accordingly">25. Do you use a different environment for development that is sized accordingly?</h4>
<p>To save costs, your non-production environments should be configured for less performance.</p>
<h2 id="conclusion">Conclusion</h2>
<p>So how did you do? After going through this list you should be able to measure your own personal confidence in your CI/CD workflow to meet <em>your</em> requirements. Not every question in this list may be (or should be) relevant to you. If you are unsure, do not worry. Sometimes you just need to stop for a second and document what you are currently doing.</p>
<p>The most important thing is to realize where you stand now, and where you want to be.</p>
<p><em>List of questions last updated 28 February 2021</em>.</p>
ARM Templates vs Terraform vs Pulumi - Infrastructure as Code in 2021http://julie.io/writing/arm-terraform-pulumi-infra-as-code/2021-01-26T01:00:00+01:002024-01-20T10:41:06+01:00Julie Ng<!-- After I published my article about [Azure Pipelines and Terraform Best Practices](/writing/terraform-on-azure-pipelines-best-practices/) last week, a friend asked me - what about [Pulumi](https://pulumi.com)? If I chose Terraform for its DSL, wouldn't I love Pulumi's JavaScript code even more? Well it's 2021 and Microsoft and HashiCorp have been watching and copying. Let's take an updated look at Infrastructure as Code in 2021.
{:.lead} -->
<div>
<p class="lead">A few years ago Pulumi introduced code-native programming language for Infrastructure as Code (IaC), bringing it closer to the developer and their existing skillset. Fast-forward to 2021 and Microsoft and HashiCorp are playing catch-up to Pulumi and to each other. To help you choose IaC technology, let’s look at IaC programming languages for short-term developer happiness and code re-use for long-term productivity.</p>
<!-- 1. [Optimizing for Developer Happiness](#optimizing-for-developer-happiness)
1. [Azure Resource Manager (ARM) Templates](#azure-resource-manager-arm-templates)
* [Biggest Pain Point - JSON](#arms-biggest-pain-point---json)
* [Bicep DSL](#arm-bicep-dsl)
1. [Terraform - Human Friendly IaC](#terraform---human-friendly-iac)
* [Terraform in TypeScript and Python - New since 2020](#terraform-in-typescript-and-python---new-since-2020)
1. [Pulumi - Code Native IaC](#pulumi---code-native-iac)
1. [Code Re-Use Comparison](#code-re-use-comparison)
* [ARM Templates - Linked and Painful](#arm-templates---linked-and-painful)
* [Terraform - Re-usable Modules](#terraform---re-usable-modules)
* [Pulumi - Re-usable Packages](#pulumi---re-usable-packages)
1. [Which IaC makes you most happy?](#which-iac-makes-you-most-happy) -->
<p><em>Just want a summary? Watch the Ask Me Anything (AMA) style answer</em></p>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/KHvVWdqvAvI" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h3 id="features-comparison-table">Features Comparison Table</h3>
<p>Although I have created a feature comparison table below, I discuss many of the features, but not all of them. This should be a good springboard to help you learn more about each technology.</p>
</div>
<table class="table is-comparison">
<thead>
<tr>
<th style="text-align: left">Feature</th>
<th style="text-align: center">ARM</th>
<th style="text-align: center">Terraform</th>
<th style="text-align: center">Pulumi</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Language</td>
<td style="text-align: center">JSON + <a href="https://github.com/Azure/bicep">Bicep</a></td>
<td style="text-align: center">HCL/DSL</td>
<td style="text-align: center">Code Native, e.g. JavaScript, Python</td>
</tr>
<tr>
<td style="text-align: left">Languages (in preview)</td>
<td style="text-align: center"><a href="https://github.com/Azure/bicep">Bicep</a> DSL</td>
<td style="text-align: center"><a href="https://www.hashicorp.com/blog/cdk-for-terraform-enabling-python-and-typescript-support">CDK for Terraform</a>, Python and TypeScript Support</td>
<td style="text-align: center">-</td>
</tr>
<tr>
<td style="text-align: left">Clouds</td>
<td style="text-align: center">Azure-only</td>
<td style="text-align: center">Agnostic + on-prem</td>
<td style="text-align: center">Agnostic + on-prem</td>
</tr>
<tr>
<td style="text-align: left">Preview Changes</td>
<td style="text-align: center"><a href="https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/template-deploy-what-if?tabs=azure-powershell"><code>az deployment … what-if</code></a></td>
<td style="text-align: center"><a href="https://www.terraform.io/docs/cli/commands/plan.html"><code>terraform plan</code></a></td>
<td style="text-align: center"><a href="https://www.pulumi.com/docs/reference/cli/pulumi_preview/"><code>pulumi preview</code></a></td>
</tr>
<tr>
<td style="text-align: left">Rollback Changes</td>
<td style="text-align: center"><a href="https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/rollback-on-error">Rollback</a></td>
<td style="text-align: center">Revert code & Re-deploy</td>
<td style="text-align: center">Revert code & Re-deploy</td>
</tr>
<tr>
<td style="text-align: left">Infrastructure Clean Up</td>
<td style="text-align: center">No</td>
<td style="text-align: center"><a href="https://www.terraform.io/docs/cli/commands/destroy.html"><code>terraform destroy</code></a></td>
<td style="text-align: center"><a href="https://www.pulumi.com/docs/reference/cli/pulumi_destroy/"><code>pulumi destroy</code></a></td>
</tr>
<tr>
<td style="text-align: left">Deployment History</td>
<td style="text-align: center"><a href="https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/deployment-history?tabs=azure-portal">Deployment History</a></td>
<td style="text-align: center">SCM + <a href="https://www.hashicorp.com/blog/hashicorp-terraform-cloud-audit-logging-with-splunk">Auditing</a>*</td>
<td style="text-align: center">SCM + <a href="https://www.pulumi.com/docs/intro/console/collaboration/auditing/">Auditing</a>*</td>
</tr>
<tr>
<td style="text-align: left">Code Re-Use</td>
<td style="text-align: center"><a href="https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/linked-templates#linked-template">Hosted JSON URIs</a></td>
<td style="text-align: center"><a href="https://learn.hashicorp.com/collections/terraform/modules">Modules</a> + <a href="https://learn.hashicorp.com/tutorials/terraform/module-private-registry">Registry</a>*</td>
<td style="text-align: center">Code-Native Packages, e.g. npm or pip</td>
</tr>
<tr>
<td style="text-align: left">State Files</td>
<td style="text-align: center">No State File</td>
<td style="text-align: center">Plain-text</td>
<td style="text-align: center">Encrypted</td>
</tr>
</tbody>
</table>
<div>
<p><em>* refers to a premium feature from vendor, i.e. Terraform Cloud or Pulumi Enterprise.</em></p>
<p>Instead I want to focus on optimizing your choice for developer happiness, which is strongly tied to productivity. People choose human-friendly Domain Specific Languages (DSLs) and code-native languages because if they can code faster and deploy more often, they are more productive - and thus happier.</p>
<p>So let’s do a comparison from these two perspectives:</p>
<ul>
<li><strong>Happiness Today</strong> - how quickly can I as an engineer work with each technology’s flavor of Infrastructure as Code?</li>
<li><strong>Happiness Tomorrow</strong> - as my application and company grows, how easily can I scale my IaC with re-usable components?</li>
</ul>
<h2 id="arm-templates">ARM Templates</h2>
<p>As a Microsoft engineer, I should point out the major reasons to use Azure Resource Manager (ARM) before I elaborate on why I personally don’t use it:</p>
<ul>
<li>
<p><strong>First Party Support</strong><br />
Because ARM is Azure exclusive, all Azure resources are supported, from the simple resource group to complicated policies and blueprints. And your deployments are most likely to work out of the box <em>as expected</em>.</p>
</li>
<li>
<p><strong>No state file required</strong><br />
ARM queries the Azure APIs directly for the current state, so you do not have to worry about securing a state file like with other IaC technologies.</p>
</li>
<li>
<p><strong>Deployment Histories included</strong><br />
<a href="https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/deployment-history?tabs=azure-portal">Deployment history</a> is included out of the box. While you have IaC with your <em>intended changes</em> in your git history, Azure can tell you the actual deployed changes.</p>
</li>
</ul>
<h3 id="arm-improvements-in-2021">ARM Improvements in 2021</h3>
<p>The following were gaps in ARM that existed before 2020 and the major reasons I never properly learned it. But Microsoft has caught on to the competition and is filling the following gaps:</p>
<ul>
<li>
<p><strong>Detect Drift with <code>what-if</code></strong><br />
Last year Microsoft implemented the <code>what-if</code> command, the equivalent of <code>terraform plan</code>, which lets you preview infrastructure changes before you deploy - including whether destructive changes will happen.</p>
</li>
<li>
<p><strong>JSON is for machines</strong><br />
If I want to author infrastructure, I don’t think in JSON, which is why it feels so unnatural. See below for more details, including new DSL <a href="https://github.com/Azure/bicep">Bicep</a>.</p>
</li>
</ul>
<h4 id="arms-biggest-pain-point---json">ARM’s Biggest Pain Point - JSON</h4>
<p>The main reason I don’t use ARM is because I don’t like writing JSON. When I write code I often use comments and the <code>/* */</code> syntax in ARM feels like a cheat. To illustrate, this is an <a href="https://github.com/Azure/bicep/blob/main/docs/examples/101/storage-blob-container/main.json">example ARM Template</a> for an Azure Storage Account:</p>
<pre><code class="language-json">{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "storageAccountName": {
      "type": "string"
    },
    "containerName": {
      "type": "string",
      "defaultValue": "logs"
    },
    "location": {
      "type": "string",
      "defaultValue": "[resourceGroup().location]"
    }
  },
  "functions": [],
  "resources": [
    {
      "type": "Microsoft.Storage/storageAccounts",
      "apiVersion": "2019-06-01",
      "name": "[parameters('storageAccountName')]",
      "location": "[parameters('location')]",
      "sku": {
        "name": "Standard_LRS",
        "tier": "Standard"
      },
      "kind": "StorageV2",
      "properties": {
        "accessTier": "Hot"
      }
    },
    {
      "type": "Microsoft.Storage/storageAccounts/blobServices/containers",
      "apiVersion": "2019-06-01",
      "name": "[format('{0}/default/{1}', parameters('storageAccountName'), parameters('containerName'))]",
      "dependsOn": [
        "[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName'))]"
      ]
    }
  ]
}
</code></pre>
<p>I’ve been at Microsoft for over 1.5 years and I still can’t write ARM templates. The reality is I will probably skip ARM and instead learn to <em>write</em> Bicep.</p>
<h3 id="arm-bicep-dsl">ARM Bicep DSL</h3>
<p><a href="https://github.com/Azure/bicep">Bicep</a> is a Domain Specific Language (DSL) that compiles to standard ARM template JSON. Looking at this <a href="https://github.com/Azure/bicep/blob/main/docs/examples/101/storage-blob-container/main.bicep">example from the GitHub project repo</a>, you may see similarities to Terraform’s HCL DSL:</p>
<pre><code class="language-clike">// Bicep 💪
param storageAccountName string
param containerName string = 'logs'
param location string = resourceGroup().location

resource sa 'Microsoft.Storage/storageAccounts@2019-06-01' = {
  name: storageAccountName
  location: location
  sku: {
    name: 'Standard_LRS'
    tier: 'Standard'
  }
  kind: 'StorageV2'
  properties: {
    accessTier: 'Hot'
  }
}

resource container 'Microsoft.Storage/storageAccounts/blobServices/containers@2019-06-01' = {
  name: '${sa.name}/default/${containerName}'
}
</code></pre>
<p>Although I personally would prefer <code>storageaccount</code> over <code>sa</code>, I am overall quite excited about Bicep.</p>
<h3 id="arm--bicep-summary---promising-future">ARM & Bicep Summary - Promising Future</h3>
<p>If we can get a DSL like Terraform’s, but with first-party support for new Azure features sooner, that could be an IaC game changer for Azure-only workloads. Azure has also filled the preview gap with the <code>az deployment… what-if</code> command, which was sorely missing.</p>
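<p>For reference, the preview from the Azure CLI looks roughly like this. A sketch with placeholder names; check the <code>az deployment</code> command group for the sub-command matching your deployment scope:</p>
<pre><code class="language-bash"># Preview changes to a resource group before deploying (names are examples)
$ az deployment group what-if \
  --resource-group my-resource-group \
  --template-file main.json
</code></pre>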
<p>The code re-use strategy with modules is still very experimental. See <a href="https://github.com/Azure/bicep/discussions/1170">this discussion about sharing references across modules</a>. This is the last major gap for me personally before I would consider using Bicep in production.</p>
<p>Everything is still experimental but very promising.</p>
<h2 id="terraform">Terraform</h2>
<p>Terraform is my favorite IaC technology and what I personally use because it’s so human-friendly, cloud-agnostic and solid. These are the major features of Terraform:</p>
<ul>
<li>
<p><strong>HashiCorp Language - Human Friendly DSL</strong><br />
Reading and writing HCL flows naturally and is a joy. More details below.</p>
</li>
<li>
<p><strong>Cloud Agnostic</strong><br />
Although the cloud vendor providers are rather specific, mastering Terraform helps you master IaC for <em>any cloud</em>.</p>
</li>
<li>
<p><strong>Preview Infrastructure Changes</strong><br />
Run <code>terraform plan</code> and check you don’t accidentally blow up your infrastructure. Also use the <code>-detailed-exitcode</code> flag so you can adjust your CI/CD builds based on whether or not configuration drift was detected.</p>
</li>
<li>
<p><strong>Clean Up Infrastructure</strong><br />
Run <code>terraform destroy</code> and easily remove any infrastructure, great for clean up after an experiment, or for starting over if something breaks beyond repair. This works because Terraform keeps a record of your infrastructure in a state file.</p>
</li>
<li>
<p><strong>Code Re-use with Modules</strong><br />
This is so easy that it’s fun to write modules. The DSL is easy to understand and I can have local and hosted modules, either in git or a Terraform Registry (public or private). This is the deciding factor and most important Terraform advantage over its competitors. See details in last section of this article.</p>
</li>
</ul>
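<p>The <code>-detailed-exitcode</code> flag mentioned above can be wired into a build script. A minimal sketch, using the exit codes documented by Terraform (0 = no changes, 1 = error, 2 = changes present):</p>
<pre><code class="language-bash"># Sketch: branch CI/CD behavior on drift detection
terraform plan -detailed-exitcode -out deployment.tfplan
case $? in
  0) echo "No changes - infrastructure matches configuration" ;;
  2) echo "Changes detected - continue to apply stage" ;;
  *) echo "terraform plan failed" && exit 1 ;;
esac
</code></pre>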
<h3 id="hashicorp-language----terraforms-dsl">HashiCorp Language - Terraform’s DSL</h3>
<p>Ok, let’s look at the main reason I chose Terraform: the HashiCorp Configuration Language (HCL), its human-friendly DSL. This is an <a href="https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_account">example</a> from the Terraform documentation:</p>
<pre><code class="language-hcl">resource "azurerm_resource_group" "example" {
  name     = "example-resources"
  location = "West Europe"
}

resource "azurerm_storage_account" "example" {
  name                     = "storageaccountname"
  resource_group_name      = azurerm_resource_group.example.name
  location                 = azurerm_resource_group.example.location
  account_tier             = "Standard"
  account_replication_type = "GRS"

  tags = {
    environment = "staging"
  }
}
</code></pre>
<p>It’s like reading English. I LOVE it.</p>
<h4 id="disadvantages-vs-arm">(Dis)advantages vs ARM</h4>
<p>These are the most common arguments I hear against Terraform when compared to ARM:</p>
<ul>
<li>
<p><strong>Not every Azure Resource exists outside ARM</strong> <br />
There isn’t a Terraform Provider for every ARM type. Or even if there is, e.g. <a href="https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/policy_definition">Azure Policy</a>, you’re still just writing ARM JSON <em>inside</em> another language.</p>
</li>
<li>
<p><strong>State File in plain text 🧐</strong><br />
If you create resources with credentials, e.g. a database or create service principals, these secrets are stored in <em>plain text</em> in your Terraform state file.</p>
</li>
</ul>
<p>State files as plain text scare many people. Personally I am less concerned and accept this trade-off because I have confidence in my code quality, CI/CD governance, and security practices, e.g. short-lived tokens and scoped permissions.</p>
<p>If your security team cannot live with this, <strong>then delete the state file after the resources are created</strong>. No file, no problem 🤷‍♀️ Some tasks, like creating scoped service principals at scale, are so much easier with Terraform because it can talk to both the ARM and the Azure Active Directory APIs. Create the credentials, immediately throw them in Key Vault and delete the state file afterwards. I’m pragmatic.</p>
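<p>That pragmatic workflow could look roughly like this. A sketch only; the vault name, secret name and output name are illustrative assumptions, not from a real project:</p>
<pre><code class="language-bash"># Sketch: create credentials, store them in Key Vault, then delete local state
terraform apply -auto-approve

# Copy the generated secret into Key Vault (example names)
az keyvault secret set \
  --vault-name my-keyvault \
  --name sp-client-secret \
  --value "$(terraform output -raw client_secret)"

# No file, no problem
rm terraform.tfstate
</code></pre>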
<h3 id="terraform-in-typescript-and-python---new-since-2020">Terraform in TypeScript and Python - New since 2020</h3>
<p>In July 2020 HashiCorp introduced <a href="https://www.hashicorp.com/blog/cdk-for-terraform-enabling-python-and-typescript-support">Cloud Development Kit (CDK) for Terraform</a>, which lets you write IaC in code native languages like TypeScript and Python.</p>
<p>This is an <a href="https://github.com/hashicorp/terraform-cdk/blob/master/examples/typescript/azure/main.ts">example</a> from their GitHub repo:</p>
<pre><code class="language-javascript">import { Construct } from 'constructs';
import { App, TerraformStack } from 'cdktf';
import { AzurermProvider, VirtualNetwork } from './.gen/providers/azurerm'
class MyStack extends TerraformStack {
  constructor(scope: Construct, name: string) {
    super(scope, name);

    new AzurermProvider(this, 'AzureRm', {
      features: [{}]
    })

    new VirtualNetwork(this, 'TfVnet', {
      location: 'uksouth',
      addressSpace: ['10.0.0.0/24'],
      name: 'TerraformVNet',
      resourceGroupName: '<YOUR_RESOURCE_GROUP_NAME>'
    })
  }
}

const app = new App();
new MyStack(app, 'typescript-az');
app.synth();
</code></pre>
<p>Because it’s TypeScript, it’s very familiar to JavaScript engineers like myself.</p>
<p>But <strong>I personally prefer HashiCorp Configuration Language (HCL) because it is meant for humans</strong>. As a human it is much easier for me to read and scan. It’s like <em>HCL speaks to me</em>, meeting me halfway. Even though I know JavaScript, I still have to read the code entirely.</p>
<p>That is my personal preference. Maybe JavaScript speaks more to you 🤓</p>
<h2 id="pulumi">Pulumi</h2>
<p>And finally we have Pulumi, the new kid on the IaC block who introduced the concept of code-native IaC. Pulumi’s largest value proposition is that engineers don’t have to learn a new programming language.</p>
<p>And looking at this Pulumi <a href="https://www.pulumi.com/docs/reference/pkg/azure/storage/account/">example from their documentation</a>, it looks much cleaner than the CDK for Terraform:</p>
<pre><code class="language-javascript">import * as pulumi from "@pulumi/pulumi";
import * as azure from "@pulumi/azure";
const exampleResourceGroup = new azure.core.ResourceGroup("exampleResourceGroup", {location: "West Europe"});
const exampleAccount = new azure.storage.Account("exampleAccount", {
  resourceGroupName: exampleResourceGroup.name,
  location: exampleResourceGroup.location,
  accountTier: "Standard",
  accountReplicationType: "GRS",
  tags: {
    environment: "staging",
  },
});
</code></pre>
<p>It probably looks cleaner because Pulumi has been around longer and has had ample time to fine-tune its abstraction to make it as close to a friendly DSL as possible. And <strong>this kind of friendly abstraction layer is an art form</strong>. So kudos to Pulumi for achieving this 👌</p>
<h3 id="encrypted-state-file">Encrypted State File</h3>
<p>Like Terraform, Pulumi also uses a state file to keep track of your infrastructure, which helps it do configuration drift detection and clean up resources.</p>
<p>Unlike Terraform, however, Pulumi’s state file is <a href="https://www.pulumi.com/docs/intro/concepts/state/"><em>encrypted</em></a> which is more secure.</p>
<h3 id="give-pulumi-a-chance">Give Pulumi a Chance</h3>
<p>Sorry I am not covering Pulumi further. I don’t use it so I am not going to pretend to be an expert. I did some research because one of my YouTube subscribers asked me to do this comparison. This does not mean I do not recommend Pulumi.</p>
<p>If you are still deciding which IaC technology is right for you, you should also consider Pulumi, especially if you want to write IaC in a code-native programming language like JavaScript, Python, etc.</p>
<h2 id="code-re-use">Code Re-Use</h2>
<p>So now you have had an introduction to the “flavors” of Infrastructure as Code. You may even have a favorite. We can imagine ourselves writing a bit of code. Now let’s imagine scaling that IaC to many environments and applications. How can we leverage code re-use?</p>
<h4 id="arm-template-links">ARM Template Links</h4>
<p>If you want to create a template for re-use you need to <a href="https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/linked-templates#linked-template">send a URI</a> to the main template. It is not possible to pass a local file. Even if you can send a protected link, you still have to publish it, which makes development and iteration of templates painfully slow.</p>
<p>This is what a <code>templateLink</code> looks like:</p>
<pre><code class="language-json">"resources": [
  {
    "type": "Microsoft.Resources/deployments",
    "apiVersion": "2019-10-01",
    "name": "linkedTemplate",
    "properties": {
      "mode": "Incremental",
      "templateLink": { // Painful 😖
        "uri": "https://mystorageaccount.blob.core.windows.net/AzureTemplates/newStorageAccount.json",
        "contentVersion": "1.0.0.0"
      },
      "parametersLink": { // Painful 😖
        "uri": "https://mystorageaccount.blob.core.windows.net/AzureTemplates/newStorageAccount.parameters.json",
        "contentVersion": "1.0.0.0"
      }
    }
  }
]
</code></pre>
<p>And don’t forget to append a SAS token to the URI to access the JSON file… now it’s clear why I don’t use ARM, right?</p>
<h4 id="terraform-modules">Terraform Modules</h4>
<p>As an engineer I need to be able to work with local code when I am initially experimenting or for quick debugging. In Terraform, it’s really easy to create <a href="https://www.terraform.io/docs/language/modules/develop/index.html">modules</a>, which can be local or published to an external <a href="https://registry.terraform.io/">registry</a>.</p>
<pre><code class="language-hcl"># Custom Module example
module "dev_cluster" {
  source              = "./../aks-cluster"
  name                = "dev-cluster"
  vm_size             = "Standard_D2s_v3" # ca. 68 EUR/mo.
  ssh_public_key      = "~/.ssh/id_rsa.pub"
  vnet_address_space  = ["10.100.0.0/25"]
  aks_subnet_prefixes = ["10.100.0.0/28"]
}
</code></pre>
<p>From the example it is clear how I can re-use infrastructure modules to easily create different deployment environments that vary slightly. For example, I can use the same custom <code>aks-cluster</code> module to create a cluster for production and choose more expensive Virtual Machines.</p>
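<p>A production variant might then look like this. A sketch, re-using the same hypothetical <code>aks-cluster</code> module from the example above:</p>
<pre><code class="language-hcl"># Same module, different environment and VM size
module "prod_cluster" {
  source              = "./../aks-cluster"
  name                = "prod-cluster"
  vm_size             = "Standard_D8s_v3" # larger VMs for production
  ssh_public_key      = "~/.ssh/id_rsa.pub"
  vnet_address_space  = ["10.200.0.0/25"]
  aks_subnet_prefixes = ["10.200.0.0/28"]
}
</code></pre>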
<p>You can also publish your modules to the <a href="https://registry.terraform.io/">public terraform registry</a> or a <a href="https://www.terraform.io/docs/cloud/registry/index.html">private registry</a> in Terraform Cloud.</p>
<h4 id="pulumi-packages">Pulumi Packages</h4>
<p>Because Pulumi uses code-native programming languages, you leverage the language’s own code re-use techniques. For example, in JavaScript you can create packages and publish them to a registry as node modules.</p>
<p>This is a piece of example code from a <a href="https://www.pulumi.com/blog/creating-and-reusing-cloud-components-using-package-managers/">Pulumi blog article</a> that describes re-use in detail:</p>
<pre><code class="language-javascript">/**
 * Static website using Amazon S3, CloudFront, and Route53.
 */
export declare class StaticWebsite extends pulumi.ComponentResource {
  readonly contentBucket: aws.s3.Bucket;
  readonly logsBucket: aws.s3.Bucket;
  readonly cdn: aws.cloudfront.Distribution;
  readonly aRecord?: aws.route53.Record;

  constructor(name: string, contentArgs: ContentArgs,
              domainArgs?: DomainArgs, opts?: pulumi.ResourceOptions);
}
</code></pre>
<p>Then you could use it like this:</p>
<pre><code class="language-javascript">// If you have published it to an NPM registry
import { StaticWebsite } from "static-website-aws";

// OR reference a local file
import { StaticWebsite } from "./static-website-aws";

// Then
const website = new StaticWebsite("browserhack", {
  pathToContent: "./browserhack",
  custom404Path: "/404.html",
});
</code></pre>
<h2 id="which-iac-makes-you-most-happy">Which IaC makes you most happy?</h2>
<p>So now you’ve seen how programming Infrastructure as Code in ARM Templates, Terraform and Pulumi compare to each other.</p>
<p>You know my opinions. Which one is your favorite? I’d love to know, especially if you are using Pulumi in production. Let me know via <a href="https://twitter.com/jng5">@jng5</a> on Twitter or on <a href="https://www.youtube.com/watch?v=KHvVWdqvAvI">YouTube</a>.</p>
<!-- ### Further Reading
- ARM Templates
- [Microsoft Docs - ARM Template Best Practices](https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/template-best-practices)
- [Project Bicep, an ARM DSL](https://github.com/Azure/bicep)
- Terraform
- [State](https://www.terraform.io/docs/language/state/index.html)
- [CDK for Terraform: Enabling Python & TypeScript Support](https://www.hashicorp.com/blog/cdk-for-terraform-enabling-python-and-typescript-support)
- Pulumi
- [Pulumi vs Terraform](https://www.pulumi.com/docs/intro/vs/terraform/)
- [State and Backends](https://www.pulumi.com/docs/intro/concepts/state/)
- [Inter-Stack Dependencies](https://www.pulumi.com/docs/intro/concepts/organizing-stacks-projects/#inter-stack-dependencies) -->
</div>
Terraform on Azure Pipelines Best Practices | http://julie.io/writing/terraform-on-azure-pipelines-best-practices/ | published 2021-01-14, updated 2024-01-20 | Julie Ng
<p class="lead">Azure Pipelines and Terraform make it easy to get started deploying infrastructure from templates. But how do you go from sample code to real life implementation, integrating git workflows with deployments and scaling across multiple teams? Here are 5 Best Practices to get you started on the right foot.</p>
<p>As an engineer in the Azure Customer Experience (CXP) organization, I advise customers with best practice guidance and technical deep dives for specific use cases. This article is based both on recurring themes with customers as well as my previous role as an <a href="/who/resume">Enterprise Architect</a> at Allianz Germany when we started our cloud migration in 2016.</p>
<h4 id="five-best-practices">Five Best Practices</h4>
<ol>
<li><a href="#tip-1---use-yaml-pipelines-not-ui">Use YAML Pipelines, not UI</a></li>
<li><a href="#tip-2---use-the-command-line-not-yaml-tasks">Use the Command Line, not YAML Tasks</a></li>
<li><a href="#tip-3---use-terraform-partial-configuration">Use Terraform Partial Configuration</a></li>
<li><a href="#tip-4---authenticate-with-service-principal-credentials-stored-in-azure-key-vault">Authenticate with Service Principal Credentials stored in Azure Key Vault</a></li>
<li><a href="#tip-5-create-a-custom-role-for-terraform">Create a Custom Role for Terraform</a></li>
</ol>
<p><em>TL;DR: Watch this 5-minute summary instead:</em></p>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/UaehcmoMAFc" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="tip-1---use-yaml-pipelines-not-ui">Tip #1 - Use YAML Pipelines, not UI</h2>
<p>The Azure DevOps service has its roots in <a href="https://docs.microsoft.com/en-us/azure/devops/server/tfs-is-now-azure-devops-server?view=azure-devops-2020">Visual Studio Team Foundation Server</a> and as such it carries legacy features, including Classic Pipelines. If you’re creating new pipelines, do not start with classic pipelines. If you have classic pipelines, plan on migrating them to YAML. Industry best practice is to author <strong>Pipelines as Code</strong> and in Azure Pipelines, that means <a href="https://docs.microsoft.com/en-us/azure/devops/pipelines/yaml-schema?view=azure-devops&tabs=schema%2Cparameter-schema">YAML Pipelines</a>.</p>
<p>If you use Classic Pipelines, do not panic. They will be around for a while. But as you can see from <a href="https://docs.microsoft.com/en-us/azure/devops/release-notes/features-timeline">public features timeline</a> and <a href="https://dev.azure.com/mseng/AzureDevOpsRoadmap/_workitems/recentlyupdated">public road map</a>, Microsoft is investing more in YAML pipelines. To be more future proof, choose YAML pipelines.</p>
<h2 id="tip-2---use-the-command-line-not-yaml-tasks">Tip #2 - Use the Command Line, not YAML Tasks</h2>
<p>I have a love-hate relationship with <a href="https://docs.microsoft.com/en-us/azure/devops/pipelines/process/tasks?view=azure-devops&tabs=yaml">Pipeline Tasks</a>. As an abstraction they lower the barrier to entry. They make tasks platform independent (Windows vs. Linux) and pass return codes so you don’t have to handle <code>stderr</code> and <code>stdout</code> by hand. See the <a href="https://github.com/microsoft/azure-pipelines-tasks">source repo on GitHub</a> for other advantages.</p>
<p>But as the <a href="https://github.com/microsoft/azure-pipelines-tasks">README</a> itself says:</p>
<blockquote>
<p>If you need custom functionality in your build/release, it is usually simpler to use the existing script running tasks such as the PowerShell or Bash tasks.</p>
</blockquote>
<p>And indeed, I find it <strong>simpler</strong> to use plain old CLI commands in Bash. Over time, as you iterate and create tailored pipelines beyond the “Hello World” examples, you may also find that tasks become yet another layer to debug. For example, I used the <a href="https://github.com/julie-ng/azure-nodejs-demo/blob/main/azure-pipelines.yml#L201">AzCopy task</a> only to wait a few minutes for the pipeline to fail because it’s <a href="https://docs.microsoft.com/en-us/azure/devops/pipelines/tasks/deploy/azure-file-copy?view=azure-devops">Windows only</a>.</p>
<h4 id="iterate-faster">Iterate Faster</h4>
<p>If I use the command line, I can figure out exactly which <code>-var</code> and other options I need to pass to <code>terraform</code> to achieve the results I want <em>from my local machine</em> without having to wait minutes for each pipeline job to run to know if it worked or not. Once I am confident in my CLI commands, I can put those in my YAML pipeline.</p>
<h4 id="master-the-technology-not-a-task">Master the Technology not a Task</h4>
<p>In general I recommend every engineer learn how to use a technology from the command line. Do not learn how to use the git extension in your code editor. If you learn something on the command line, be it <a href="https://git-scm.com/">git</a> or <a href="https://www.terraform.io/">terraform</a>, you learn <em>how it works</em>. Debugging will be far less frustrating as you can skip an abstraction layer (the YAML task) that does not necessarily make your life easier.</p>
<p>For example, I prefer to skip the verbose format found in this example from the <a href="https://docs.microsoft.com/en-us/azure/developer/terraform/best-practices-integration-testing">Azure documentation</a>:</p>
<pre><code class="language-yaml"># Verbose 😑
- task: charleszipp.azure-pipelines-tasks-terraform.azure-pipelines-tasks-terraform-cli.TerraformCLI@0
  displayName: 'Run terraform plan'
  inputs:
    command: plan
    workingDirectory: $(terraformWorkingDirectory)
    environmentServiceName: $(serviceConnection)
    commandOptions: -var location=$(azureLocation)
</code></pre>
<p>You can do the same using Bash and just pass the flags, e.g. <code>-var</code> or <code>-out</code> as is.</p>
<pre><code class="language-yaml"># Less noise 👌
- bash: terraform plan -out deployment.tfplan
  displayName: Terraform Plan (ignores drift)
</code></pre>
<p>Because I do not use tasks, I never need to look up in more documentation what <code>environmentServiceName</code> and other attributes do and expect. I only ever need to know Terraform, which lets me focus on <em>my</em> code instead of debugging a dependency - even if it’s provided by Microsoft.</p>
<h4 id="do-not-install-terraform---keep-up-with-latest">Do not install Terraform - keep up with “latest”</h4>
<p>There are many Azure Pipeline samples out there with “installer” tasks, including official examples. While dependency versioning is important, I find Terraform to be one of the more stable technologies that rarely has breaking changes. Before you lock yourself down to a version, consider always running with the latest version. In general it’s easier to make incremental changes and fixes than to have giant refactors later that block feature development.</p>
<p>You can see which version is installed on the Microsoft hosted build agents on GitHub, e.g. <a href="https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu1804-README.md">Ubuntu 18.04</a>. Note these build agents are used both by Azure Pipelines <strong>and</strong> <a href="https://github.com/features/actions">GitHub Actions</a>.</p>
<h4 id="cli-is-vendor-agnostic">CLI is vendor agnostic</h4>
<p>This preference for CLI mastery over YAML tasks is not Terraform specific. If you browse through <a href="http://github.com/julie-ng/">my various demos on GitHub</a>, I usually prefer <a href="https://www.docker.com/">Docker</a> and <a href="https://nodejs.org/en/">Node.js</a> on the command line over the equivalent YAML tasks.</p>
<p>The industry is fast-paced. Using the CLI also makes your migration path to new vendors easier. If in the future, when <a href="https://github.com/features/actions">GitHub Actions</a> have matured and you want to migrate from Azure Pipelines, you would not need to migrate the YAML task abstraction layer. Use the CLI and make your future life easier.</p>
<h4 id="but-then-how-do-i-authenticate-to-azure">But Then How do I Authenticate to Azure?</h4>
<p>That is a common question I get from customers. Keep reading. This is in the last section of this article which also discusses secret management in pipelines.</p>
<h2 id="tip-3---use-terraform-partial-configuration">Tip #3 - Use Terraform Partial Configuration</h2>
<p>This topic deserves its own article, but I will mention the most important points here. You will need a <em>remote</em> state file when collaborating with other engineers or deploying from a headless build agent.</p>
<h4 id="start-with-local-state">Start with Local State</h4>
<p>If you don’t know how your infrastructure <em>should</em> look, experiment locally, i.e. without a remote backend, so you avoid CI/CD wait times of several minutes per run.</p>
<p>As you try things out, you will probably break things. At this phase, instead of trying to fix it, I just tear everything down, do a <code>rm -rf .terraform</code> and start over.</p>
<p>Once your infrastructure architecture is stable, proceed to create a remote state file.</p>
<h4 id="create-a-storage-account-for-your-state-file">Create a Storage Account for your State File</h4>
<p>Terraform needs an Azure Blob Storage account. ProTip - create the Storage Account by hand using the Azure CLI:</p>
<pre><code class="language-bash">$ az storage account create \
    --name mystorageaccountname \
    --resource-group myresourcegroupname \
    --kind StorageV2 \
    --sku Standard_LRS \
    --https-only true \
    --allow-blob-public-access false
</code></pre>
<p>Because Terraform state files store everything, including secrets, in clear text, take extra precaution in securing yours. Confirm that you have <strong>disabled public blob access</strong>.</p>
<p>Please do not rely on a pipeline task to create the account for you! There is a task that does this, but the storage account is configured to allow public access to blob files by default. The individual state files themselves are secured. But the <em>defaults are not secure</em>, which is a security risk waiting to happen.</p>
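<p>You can sanity-check the setting afterwards with the Azure CLI. A sketch; the account name is a placeholder:</p>
<pre><code class="language-bash"># Should print "false" when public blob access is disabled
$ az storage account show \
  --name mystorageaccountname \
  --query allowBlobPublicAccess
</code></pre>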
<h4 id="do-not-use-default-configuration">Do not use Default Configuration</h4>
<p>When using a remote backend, you need to tell Terraform where the state file is. Example configurations from the <a href="https://www.terraform.io/docs/backends/types/azurerm.html">official documentation</a> look like this:</p>
<pre><code class="language-hcl"># Don't do this
terraform {
  backend "azurerm" {
    resource_group_name  = "StorageAccount-ResourceGroup"
    storage_account_name = "abcd1234"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"

    # Definitely don't do this!
    access_key = "…"
  }
}
</code></pre>
<h4 id="use-partial-configuration">Use Partial Configuration</h4>
<p>Further in the documentation Terraform recommends moving out those properties and using <a href="https://www.terraform.io/docs/backends/config.html#partial-configuration">Partial Configuration</a>:</p>
<pre><code class="language-hcl"># This is better
terraform {
  backend "azurerm" {
  }
}
</code></pre>
<h4 id="create-and-ignore-the-backend-configuration-file">Create and Ignore the Backend Configuration File</h4>
<p>Instead of using Azure Storage Account Access Keys, I use short-lived <a href="https://docs.microsoft.com/en-us/azure/storage/common/storage-sas-overview">Shared Access Signature (SAS) Tokens</a>. So I create a local <code>azure.conf</code> file that looks like this:</p>
<pre><code class="language-hcl"># azure.conf, must be in .gitignore
storage_account_name="azurestorageaccountname"
container_name="storagecontainername"
key="project.tfstate"
sas_token="?sv=2019-12-12…"
</code></pre>
<p>Triple check that your <code>azure.conf</code> is added to the <code>.gitignore</code> file so that it is not checked into your code repository.</p>
<h4 id="its-ok-to-use-a-file-in-local-development">It’s OK to use a File in Local Development</h4>
<p>On my local machine, I initialize Terraform by passing the whole configuration file:</p>
<pre><code class="language-bash">$ terraform init -backend-config=azure.conf
</code></pre>
<p>Side note: one of the reasons I use SAS tokens is that I usually only need to work with the remote state file in a project’s initial phase. Instead of leaving an access key lying around, I just have an expired SAS token on my local machine.</p>
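<p>For reference, a short-lived SAS token can be generated with the Azure CLI. A sketch with placeholder names; adjust the permissions and expiry to your needs:</p>
<pre><code class="language-bash"># Generate a SAS token for the state container, valid until the given date
$ az storage container generate-sas \
  --account-name mystorageaccountname \
  --name storagecontainername \
  --permissions acdlrw \
  --expiry 2021-02-01T00:00Z \
  --https-only \
  --output tsv
</code></pre>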
<h4 id="your-configuration-should-not-be-a-tfvars-file">Your configuration should NOT be a .tfvars File</h4>
<p>Files with a <code>.tfvars</code> extension can be automatically loaded by Terraform, which is an accident waiting to happen. This is how people unintentionally check credentials into git. Don’t be that person or company. Add a little bit of friction and use the <code>-backend-config=azure.conf</code> CLI option.</p>
<p>You can also give the file a <code>.hcl</code> extension for your editor to do syntax highlighting. I use <code>.conf</code> as a convention to signal a warning that this file may contain sensitive information and should be protected.</p>
<h4 id="use-key-value-pairs-in-cicd-builds">Use Key Value Pairs in CI/CD Builds</h4>
<p>Personally I do not use <a href="https://docs.microsoft.com/en-us/azure/devops/pipelines/library/secure-files?view=azure-devops">Secure Files</a> in Azure Pipelines because I don’t want my credentials in <em>yet another place I have to find and debug</em>. To solve the first problem, I use Key Vault (keep reading).</p>
<p>To solve the second problem I pass the configuration as individual variables to the <code>terraform init</code> command:</p>
<pre><code class="language-bash">$ terraform init \
    -backend-config="storage_account_name=$TF_STATE_BLOB_ACCOUNT_NAME" \
    -backend-config="container_name=$TF_STATE_BLOB_CONTAINER_NAME" \
    -backend-config="key=$TF_STATE_BLOB_FILE" \
    -backend-config="sas_token=$TF_STATE_BLOB_SAS_TOKEN"
</code></pre>
<p>If you are not using SAS Tokens, you can pass the Storage Account Access Key with <code>-backend-config="access_key=…"</code></p>
<p>By using key value pairs, I am being explicit, forcing myself to do sanity checks at every step and increasing traceability. Your future self will thank you. Note also that my variables are named with the <code>TF_</code> prefix to help with debugging.</p>
<p>So the complete step in YAML looks like this:</p>
<pre><code class="language-yaml"># Load secrets from Key Vault
variables:
  - group: e2e-gov-demo-kv

# Initialize with explicitly mapped secrets
steps:
  - bash: |
      terraform init \
        -backend-config="storage_account_name=$TF_STATE_BLOB_ACCOUNT_NAME" \
        -backend-config="container_name=$TF_STATE_BLOB_CONTAINER_NAME" \
        -backend-config="key=$TF_STATE_BLOB_FILE" \
        -backend-config="sas_token=$TF_STATE_BLOB_SAS_TOKEN"
    displayName: Terraform Init
    env:
      TF_STATE_BLOB_ACCOUNT_NAME: $(kv-tf-state-blob-account)
      TF_STATE_BLOB_CONTAINER_NAME: $(kv-tf-state-blob-container)
      TF_STATE_BLOB_FILE: $(kv-tf-state-blob-file)
      TF_STATE_BLOB_SAS_TOKEN: $(kv-tf-state-sas-token)
</code></pre>
<p>Continue reading to learn how the Key Vault integration works. We will also use this strategy to authenticate to Azure to manage our infrastructure.</p>
<h2 id="tip-4---authenticate-with-service-principal-credentials-stored-in-azure-key-vault">Tip #4 - Authenticate with Service Principal Credentials stored in Azure Key Vault</h2>
<p>We often celebrate when we finally have something working on our local machine. Unfortunately it may be too soon to party. Moving those same steps to automation pipelines requires more effort and involves concepts that are sometimes difficult to understand.</p>
<h4 id="why-does-az-login-not-work-in-cicd">Why does <code>az login</code> not work in CI/CD?</h4>
<p>In short, it does not work because a build agent is headless. It is not a human. It cannot interact with Terraform (or Azure for that matter) in an interactive way. Some customers try to authenticate via the CLI and ask me how to get the headless agent past Multi-factor Authentication (MFA) that their organization has in place. That is exactly why we will not use the Azure CLI to login. As the <a href="https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/guides/azure_cli">Terraform Documentation</a> explains:</p>
<blockquote>
<p>We recommend using either a Service Principal or Managed Service Identity when running Terraform non-interactively (such as when running Terraform in a CI server) - and authenticating using the Azure CLI when running Terraform locally.</p>
</blockquote>
<p>So we will authenticate to the Azure Resource Manager API by setting our service principal’s <strong>client secret</strong> as environment variables:</p>
<pre><code class="language-yaml">- bash: terraform apply -auto-approve deployment.tfplan
  displayName: Terraform Apply
  env:
    ARM_SUBSCRIPTION_ID: $(kv-arm-subscription-id)
    ARM_CLIENT_ID: $(kv-arm-client-id)
    ARM_CLIENT_SECRET: $(kv-arm-client-secret)
    ARM_TENANT_ID: $(kv-arm-tenant-id)
</code></pre>
<p>The names of the environment variables, e.g. <code>ARM_CLIENT_ID</code>, are documented in the <a href="https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/guides/service_principal_client_secret#configuring-the-service-principal-in-terraform">Terraform Documentation</a>. Some of you might be wondering: are environment variables secure? Yes. The official Azure CLI task does the same thing, as you can see on <a href="https://github.com/microsoft/azure-pipelines-tasks/blob/master/Tasks/AzureCLIV2/azureclitask.ts#L43">line 43</a> of the task source code.</p>
<p>To be clear we authenticate headless build agents by setting client IDs and secrets as environment variables, which is common practice. The best practice part involves <em>securing</em> these secrets.</p>
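<p>For completeness, here is a sketch of how such a service principal and its Key Vault secrets might be created with the Azure CLI. All names (<code>sp-terraform-demo</code>, <code>e2e-gov-demo-kv</code>) and the environment variables holding the credentials are placeholders; adjust the role and scope to your own requirements. This is illustrative and requires an authenticated Azure session, so treat it as a starting point rather than a recipe:</p>

```shell
# Sketch: create a service principal and store its credentials in Key Vault.
# All names below are placeholders.

# Create the service principal scoped to one subscription
az ad sp create-for-rbac \
  --name sp-terraform-demo \
  --role Contributor \
  --scopes "/subscriptions/$SUBSCRIPTION_ID"

# Store the resulting credentials as Key Vault secrets,
# matching the kv-* secret names used in the pipeline above
az keyvault secret set --vault-name e2e-gov-demo-kv \
  --name kv-arm-client-id --value "$CLIENT_ID"
az keyvault secret set --vault-name e2e-gov-demo-kv \
  --name kv-arm-client-secret --value "$CLIENT_SECRET"
```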
<h4 id="double-check-you-are-using-pipeline-secrets">Double Check You are Using Pipeline Secrets</h4>
<p>In Azure Pipelines, however, having credentials in your environment is only secure if you mark your pipeline variables as secrets, which ensures:</p>
<ul>
<li>the variable is encrypted at rest</li>
<li>Azure Pipelines will mask values with <code>***</code> (on a best-effort basis)</li>
</ul>
<figure class="figure-center">
<img src="/assets/images/2021/az-pipelines-secrets.png" alt="Use Secrets in Azure Pipelines" class="has-border" />
<figcaption>
Look for the lock icon to ensure you've marked your variables as secrets
</figcaption>
</figure>
<p>Note that once a variable is saved as a secret, switching it back to plain text will not reveal the value. Instead, you have to reset it.</p>
<p>The caveat to using secrets is that you have to <a href="https://docs.microsoft.com/en-us/azure/devops/pipelines/process/variables?view=azure-devops&tabs=yaml%2Cbatch#secret-variables"><em>explicitly</em> map every secret to an environment variable</a>, <em>at every pipeline step</em>. It may be tedious, but it is intentional and makes the security implications clear. It is also like performing a small security review every time you deploy. These reviews have the same purpose as the checklists that have been <a href="https://www.hsph.harvard.edu/news/magazine/fall08checklist">scientifically shown to save lives</a>. Be explicit to be secure.</p>
<h4 id="go-further---key-vault-integration">Go Further - Key Vault Integration</h4>
<p>Ensuring you are using Pipeline Secrets may be good enough. If you want to go a step further, I recommend <a href="https://docs.microsoft.com/en-us/azure/devops/pipelines/process/variables?view=azure-devops&tabs=yaml%2Cbatch#secret-variables">integrating Key Vault via secret variables</a> - not a YAML task.</p>
<figure class="figure-center">
<img src="/assets/images/2021/az-pipelines-kv-1.png" alt="Use Secrets in Azure Pipelines" class="has-border" />
<figcaption>
Use the "Link secrets…" toggle to integrate Key Vault.
</figcaption>
</figure>
<p>Note “Azure subscription” here refers to a service connection. I use the name <code>msdn-sub-reader-sp-e2e-governance-demo</code> to indicate that the service principal under the hood only has <strong>read-only</strong> access to my Azure Resources.</p>
<p>These are reasons large companies and enterprises may choose this route:</p>
<ul>
<li>
<p><strong>Re-use secrets</strong> across Azure DevOps projects <em>and</em> Azure DevOps organizations. You can only share Service Connections across projects.</p>
</li>
<li>
<p><strong>Stronger security</strong> with Azure Key Vault. Together with the proper service principal permissions and Key Vault access policy, it becomes impossible to change or delete a secret from Azure DevOps.</p>
</li>
<li>
<p><strong>Scalable secret rotation</strong>. I prefer short-lived tokens over long-lived credentials. Because Azure Pipelines fetches secrets at start of build run-time, they are always up to date. If I regularly rotate credentials, I only need to change them in 1 place: Key Vault.</p>
</li>
<li>
<p><strong>Reduced attack surface</strong>. If I put the credential in Key Vault, the client secret to my service principal is stored <em>only</em> in 2 places: A) Azure Active Directory where it lives and B) Azure Key Vault.</p>
<p>If I use a Service Connection, I have increased my attack surface to 3 locations. Putting on my former Enterprise Architect hat… I trust Azure DevOps as a managed service to guard my secrets. However, as an organization we can accidentally compromise them when someone (mis)configures the permissions.</p>
</li>
</ul>
<p>ProTip: the variables above are all prefixed with <code>kv-</code>, a naming convention I use to indicate that those values are stored in Key Vault.</p>
<h2 id="tip-5-create-a-custom-role-for-terraform">Tip #5 - Create a Custom Role for Terraform</h2>
<p>Security and <a href="https://docs.microsoft.com/en-us/azure/role-based-access-control/best-practices">RBAC best practice</a> is to grant <em>only as much access as necessary</em> to minimize risk. So which Azure role do we assign the Service Principal used by Terraform? <a href="https://docs.microsoft.com/en-us/azure/role-based-access-control/built-in-roles#owner">Owner</a> or <a href="https://docs.microsoft.com/en-us/azure/role-based-access-control/built-in-roles#contributor">Contributor</a>?</p>
<p>Neither. Because we are deploying infrastructure, we will probably also need to set permissions, for example create a <a href="https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/key_vault_access_policy">Key Vault Access Policy</a>, which requires elevated permissions. To see which permissions Contributors lack we can run this Azure CLI command:</p>
<pre><code class="language-bash">az role definition list \
--name "Contributor" \
--output json \
--query '[].{actions:permissions[0].actions, notActions:permissions[0].notActions}'
</code></pre>
<p>which will output the following:</p>
<pre><code class="language-json">[
  {
    "actions": [
      "*"
    ],
    "notActions": [
      "Microsoft.Authorization/*/Delete",
      "Microsoft.Authorization/*/Write",
      "Microsoft.Authorization/elevateAccess/Action",
      "Microsoft.Blueprint/blueprintAssignments/write",
      "Microsoft.Blueprint/blueprintAssignments/delete"
    ]
  }
]
</code></pre>
<p>To create a Key Vault Access Policy, our service principal will need <code>"Microsoft.Authorization/*/Write"</code> permissions. The easiest solution is to give the service principal the Owner role. But this is the equivalent of God mode.</p>
<h4 id="consequences-of-delete">Consequences of Delete</h4>
<p>The difference between write and delete permissions is subtle but important, and not just for large enterprises but also for compliant industries. So if you’re a small fintech startup, this applies to you too. Some data cannot be deleted by law, e.g. financial data needed for tax audits. Because of the severity and legal consequences of losing such data, it is a common cloud practice to apply <a href="https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/lock-resources">management locks</a> on a resource to prevent it from being deleted.</p>
<p>We still want Terraform to create and manage our infrastructure, so we grant it <code>Write</code> permissions. But we will not grant the <code>Delete</code> permissions because:</p>
<ul>
<li>
<p>Automation is powerful. And with great power comes great responsibility, which we don’t want to grant a headless (and therefore brainless) build agent.</p>
</li>
<li>
<p>It’s important to understand that git (even with signed commits) gives <em>technical</em> traceability, but in your organization that might not satisfy requirements for <em>legal</em> audit-ability.</p>
</li>
</ul>
<p>So even if you have secured your workflow with pull requests and protected branches, it may not be enough. Therefore, for auditability, we move the <code>Delete</code> action out of the git layer and into the cloud management layer, i.e. Azure, using management locks.</p>
<p>So <a href="https://docs.microsoft.com/en-us/azure/role-based-access-control/custom-roles">create a custom role</a> and make sure you have the following <code>notActions</code>:</p>
<pre><code class="language-json">{
  "notActions": [
    "Microsoft.Authorization/*/Delete"
  ]
}
</code></pre>
<p>Note that the <code>notActions</code> above do not include the <a href="https://docs.microsoft.com/en-us/azure/governance/blueprints/overview">Azure Blueprints</a> actions that the Contributor role excludes. Use the same reasoning as above to determine whether your use case needs that access and when to restrict it.</p>
<h2 id="summary">Summary</h2>
<p>In this long guide we covered a few general Azure Pipelines best practices: use Pipelines as Code (YAML) and use the command line, which helps you master Terraform and any other technology. We also walked through how to properly secure your state file and authenticate with Azure, covering common gotchas. Finally, we covered two more advanced topics: Key Vault integration and creating a custom role for Terraform.</p>
<p><strong>If there is too much security in this article for you, that’s okay.</strong> Do not implement every practice at the same time. Introduce them one at a time, and over time, think months, security best practices will become second nature.</p>
<p>This article focused specifically on Best Practices when using Azure Pipelines. Stay tuned for another article on generic best practices, where I explain how to use git workflows and manage infrastructure across environments.</p>
<h1 id="creating-monorepo-pipelines-in-azure-devops">Creating Monorepo Pipelines in Azure DevOps</h1>
<p><em>Julie Ng · published 2020-03-25 · updated 2024-01-20 · <a href="http://julie.io/writing/monorepo-pipelines-in-azure-devops/">julie.io/writing/monorepo-pipelines-in-azure-devops</a></em></p>
<div class="wrap">
<p class="lead">Although uncommon, there are valid reasons to have a monorepo - a single git repository for multiple projects, for example migration projects. Until yesterday, I thought this was not possible in Azure DevOps.</p>
</div>
<div class="wrap">
<p>A colleague informed me it’s possible to name the file something <em>other</em> than <code>/azure-pipelines.yml</code>. From there I figured out how to create multiple Azure DevOps YAML pipelines in a monorepo.</p>
<p>In this tutorial, you will learn how to:</p>
<ul>
<li>setup a root pipeline</li>
<li>setup 2 pipelines in subfolders and triggered by changes <em>in those folders.</em></li>
<li>rename pipelines in DevOps UI</li>
<li>use triggers</li>
<li>change working directories</li>
</ul>
<p>The full example is available here<br />
<a href="https://github.com/julie-ng/azure-devops-monorepo">https://github.com/julie-ng/azure-devops-monorepo →</a></p>
</div>
<p class="article-image"><img src="/assets/images/2020/devops-monorepo-goal.png" alt="" /></p>
<p class="article-photo-source">Note: these commits were pushed separately to generate distinct “Last run”s.</p>
<div class="wrap">
<h2 id="i-project-structure-and-yaml-files">I. Project Structure and YAML files</h2>
<p>Let’s imagine we have the following setup:</p>
<pre class="lang-tree"><code>.
├── README.md
├── azure-pipelines.yml
├── service-a
│   ├── azure-pipelines-a.yml
│   └── …
└── service-b
    ├── azure-pipelines-b.yml
    └── …
</code></pre>
<p>In a standard Azure DevOps project, you have a single <code>azure-pipelines.yml</code> file in your project root folder. In our project, we will have 3 different pipeline files:</p>
<ul>
<li>azure-pipelines.yml</li>
<li>service-a/azure-pipelines-a.yml</li>
<li>service-b/azure-pipelines-b.yml</li>
</ul>
<p>These files will be very similar to your standard YAML pipelines, with two small exceptions: triggers and working directories. We’ll cover those later. First we will add the pipelines.</p>
<h3 id="step-1---add-the-pipelines">Step 1 - Add the Pipelines</h3>
<p>When you create a new DevOps pipeline, select the repository and on the “Configure your pipeline” page, select <strong>“Existing Azure Pipelines YAML file”</strong>, which will open up this overlay on the right:</p>
<p class="article-image"><img src="/assets/images/2020/devops-yaml-file-path.png" alt="Choose existing YAML path" style="max-width:450px" /></p>
<p>You want to go through this process 3 times, each time selecting a different YAML file. In the image above, I have chosen <code>/a/azure-pipelines.yml</code>, which was the original filename before I renamed it later.</p>
<h3 id="step-2---rename-your-pipelines">Step 2 - Rename your pipelines</h3>
<p>By default, Azure DevOps names your pipelines per GitHub user/org and repository name, so you will end up with 3 pipelines named similar to this:</p>
<ul>
<li>julie-ng.azure-devops-monorepo</li>
<li>julie-ng.azure-devops-monorepo (1)</li>
<li>julie-ng.azure-devops-monorepo (2)</li>
</ul>
<p>Not very helpful. Find the <img src="/assets/images/2020/devops-three-dots.png" alt="3 Dots More Options Button" class="inline" /> more options button and select <strong>“Rename/move”</strong>.</p>
<p class="article-image"><img src="/assets/images/2020/devops-rename-pipeline.png" alt="Rename your pipeline" style="max-width:250px" /></p>
<p>I’ve chosen the following names:</p>
<ul>
<li>azure-devops-monorepo (root)</li>
<li>azure-devops-monorepo (Service A)</li>
<li>azure-devops-monorepo (Service B)</li>
</ul>
<h2 id="ii-triggers-and-how-this-works">II. Triggers and how this works</h2>
<p>Normally a pipeline runs when <em>anything</em> in the repository changes. These three pipelines are defined so they only build when their respective files change. We accomplish this with <em>trigger path</em> definitions.</p>
<h4 id="root-project-must-exclude-paths">Root Project must <code>exclude</code> paths</h4>
<p>Note the root project <em>excludes</em> our subdirectories. This means that a change to <code>service-a/readme.md</code> will not trigger our root pipeline.</p>
<pre class="lang-yaml"><code>trigger:
  paths:
    exclude: # Exclude!
      - 'service-a/*'
      - 'service-b/*'
</code></pre>
<h4 id="sub-projects-must-include-paths">Sub-projects must <code>include</code> paths</h4>
<p>We have two sub-projects with their own pipelines. We have to adjust each appropriately so it only runs when the sub-project’s code changes:</p>
<pre class="lang-yaml"><code>trigger:
  paths:
    include: # Include!
      - 'service-a/*' # or 'service-b/*'
</code></pre>
<p>Now you have your multi-pipeline monorepo setup! But you are not finished. There are reasons why the monorepo setup is not common. While it is acceptable to choose this path, you should understand the disadvantages and caveats.</p>
<h2 id="iii-caveats">III. Caveats</h2>
<h3 id="be-aware-of-other-triggers">Be Aware of Other Triggers</h3>
<p>There are many reasons for a pipeline to be built. In fact, the <a href="https://docs.microsoft.com/en-us/azure/devops/pipelines/build/triggers?view=azure-devops&tabs=yaml">official docs</a> name <em>four</em> different types of events that can trigger build pipelines:</p>
<table class="has-border">
<thead>
<tr>
<th style="text-align: left">Trigger</th>
<th style="text-align: left">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">CI triggers</td>
<td style="text-align: left">Git Push</td>
</tr>
<tr>
<td style="text-align: left">PR triggers</td>
<td style="text-align: left">Pull Requests</td>
</tr>
<tr>
<td style="text-align: left">Scheduled triggers</td>
<td style="text-align: left">Schedules defined in Cron format</td>
</tr>
<tr>
<td style="text-align: left">Pipeline triggers</td>
<td style="text-align: left">Pipelines can call each other</td>
</tr>
<tr>
<td style="text-align: left">Manual</td>
<td style="text-align: left">A human clicks a button</td>
</tr>
</tbody>
</table>
<p>I add manual runs to make it 5. Although we are limiting path triggers to our subfolder, the <em>when</em> is partially determined by external factors. This means that a pull request that conceptually only affects service B may unintentionally trigger a build of service A.</p>
<p>If you are used to committing and pushing incomplete changes, <strong>you may have an unusual number of broken builds</strong>. A common symptom is seeing multiple commits in a row with messages like “update…”. This danger also applies to the separate-repo use case, but in a monorepo it is made worse. The danger here is that a developer or team gets used to red or broken builds and stops reacting to them. So it’s important to be disciplined <em>across your entire team</em> when committing and pushing your changes.</p>
<p>I haven’t tried it, but theoretically you should be able to set up separate schedules for the pipelines.</p>
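<p>As a sketch, and with the caveat that I have not tested this in a monorepo, such a schedule might look like the fragment below in one of the sub-project pipeline files. The cron expression, display name, and branch name are illustrative:</p>

```yaml
# Sketch (untested): a nightly schedule for one sub-project pipeline.
schedules:
  - cron: '0 3 * * *'       # 03:00 UTC every day
    displayName: Nightly build of Service A
    branches:
      include:
        - master
    always: false           # only run if there were changes since the last run
```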
<h3 id="building-a-or-b-or-both">Building A or B or both?</h3>
<p>Let’s say you have a commit history that looks like this:</p>
<pre class="lang-text"><code>0883cf8 b: change number 3 (47 minutes ago) <=== git push
2896d9c a: change number 5 (49 minutes ago)
3fa6757 root: add newlines to readme (49 minutes ago)
</code></pre>
<p>First off, <em>both</em> pipeline A and pipeline B will run <em>and</em> they will run with the files from the working tree at <code>0883cf8</code>.</p>
<p>In this example, a developer first made changes to service A and then later to service B. Because the changes were pushed together, the <code>azure-pipelines-a.yml</code> pipeline runs with files <strong>not</strong> from <code>2896d9c</code> but from the future 🤯.</p>
<p>This means if you <em>actually</em> have dependencies outside of that include path in the <code>triggers:</code> property, you may experience unexpected build results. It seems unlikely in our example. But what if you had such a project structure?</p>
<pre class="lang-tree"><code>.
├── service-a
│   ├── pipeline-a.yml
│   └── …
├── service-b
│   ├── pipeline-b.yml
│   └── …
└── common-components
    ├── pipeline-c.yml
    └── …
</code></pre>
<p>Then you would be more concerned. This is a trade-off that comes with monorepos. Builds may be accidentally triggered and you should prepare for that. If you’re working in teams, make sure it’s very transparent what everyone is working on.</p>
<h3 id="keep-your-working-directory-in-mind">Keep your Working Directory in Mind</h3>
<p>To illustrate this caveat, service B is a Node.js project. Although our YAML file for service B sits in the correct subfolder, the working directory will still be the root. If you try to run <code>npm install</code> without changing directories, it will fail because there is no <code>package.json</code> in the root.</p>
<p>We can change this by using the <code>workingDirectory</code> key in the YAML:</p>
<pre class="lang-yaml"><code>- script: npm install
workingDirectory: service-b/
</code></pre>
<p>Unfortunately <code>workingDirectory</code> is only available under <code>steps:</code>, which means you cannot set it once on the whole pipeline, but rather for every task, script, etc. You can make this less painful by using a variable <a href="https://github.com/julie-ng/azure-devops-monorepo/blob/master/service-b/azure-pipelines-b.yml#L4">like in my code sample</a>. See the <a href="https://docs.microsoft.com/en-us/azure/devops/pipelines/yaml-schema?view=azure-devops&tabs=schema%2Cparameter-schema#script">official docs: YAML Reference</a> for details and further limitations in the YAML syntax.</p>
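<p>Putting the pieces together, a minimal sketch of <code>service-b/azure-pipelines-b.yml</code> might look like this. The variable name <code>workingDir</code> and the npm steps are illustrative, not taken verbatim from my sample repository:</p>

```yaml
# Sketch: minimal sub-project pipeline combining the path trigger
# with a working-directory variable (names are illustrative).
trigger:
  paths:
    include:
      - 'service-b/*'

variables:
  workingDir: service-b/

steps:
  - script: npm install
    workingDirectory: $(workingDir)
    displayName: Install Dependencies

  - script: npm test
    workingDirectory: $(workingDir)
    displayName: Run Tests
```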
<h2 id="iv-conclusion">IV. Conclusion</h2>
<p>If you have good reason to use a monorepo and want to set up multiple Azure DevOps pipelines, you can. But remember that you lose some sense of control over <em>when</em> and <em>what</em> you are building in your CI pipeline. So if you march down this path, over-communicate within your team, keep your commits squeaky clean, and carry on.</p>
</div>