
How it started

Like many start-ups, we started with a proof of concept that turned into a production system. The team was developer-driven, with neither system administrators nor “devops” people.
To host this production system we chose the cloud provider with the most appealing free tier. Services were created through the web console, rights and permissions were granted by hand, … If you are not familiar with cloud provider consoles, they look like this:

An overview of a cloud provider console. © Achim Pfennig (https://www.flickr.com/photos/apenny/)

When you use an interface like this without being a trained professional (or someone with an eidetic memory), you quickly lose track of your modifications and settings. We therefore had no way to keep our configuration consistent. Moreover, as our needs grew, this monolithic PoC had to be rewritten and split into several services.

It became time to use proper tools.

The right tool for our needs

Every cloud provider has its own solution, but we wanted to be as provider-agnostic as we could. In fact, we started to use the right provider for the right job: for object storage AWS S3 looks like a good choice, but when we want disposable servers to run stateless, memory-hungry daemons, Hetzner is a better fit.

We decided to use Terraform to configure our services. It lets you describe the desired state of your infrastructure in plain text and takes care of calling the providers' APIs to provision the corresponding cloud resources. Resources and data sources are defined in *.tf files, and the resulting state is stored by Terraform in a *.tfstate file (locally or in the cloud, as we will see later on). It also gives us an overview of the changes we are about to make through the plan subcommand.
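
To give an idea of what this looks like, here is a minimal sketch of a resource described in a *.tf file; the resource name, AMI id and instance type are made up for the example:

# A hypothetical EC2 instance described in plain text.
resource "aws_instance" "app_server" {
  ami           = "ami-0123456789abcdef0" # example image id
  instance_type = "t3.small"              # example size

  tags = {
    Name = "app-server"
  }
}

Running terraform plan against such a file shows what Terraform would create, change or destroy, before anything is actually applied.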

It suits us, because we love plain text, it offers a way of factorising configuration and it is widely used to benefit from various documentation and many providers. A provider is a sort of plugin that you use to interact with an external API. And writing or patching providers is also possible when you're not too reluctant to write Go code. Go is not part of our stack, but we decided we can get by, and in fact we did patch some existing public providers and even wrote some ourselves (even for some of our internal APIs which we can speak about in a future post).

Let's import what we have

We played with examples from the documentation, realised it was a good tool and then wanted to use it on our infrastructure. A cleaner way would have been to create a new project in a new account from scratch and then deploy our app on it. Due to some changes in the development team, IP allowlisting by not-so-responsive partners and the fear of losing information about assets used only by occasional tasks, we decided to go the other way around: import all the resources we already had and then factorise them step by step.

How to import existing resources in Terraform?

We started by importing our AWS infrastructure into Terraform using the Ruby gem terraforming. This gem provided us with all the resources in *.tf files and even an initial *.tfstate file. It was a great fit for our first import.
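
For reference, exporting a resource type with terraforming looks roughly like this (s3 is just an example; the exact subcommands and options for each resource type are documented in the gem's README):

terraforming s3 > s3.tf
terraforming s3 --tfstate > terraform.tfstate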

For other imports we decided to go with another strategy: generate *.tf files with a script and then update Terraform's state using the terraform import subcommand.
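
In practice that means generating a resource block in a *.tf file first, then telling Terraform which real-world object it corresponds to. The resource name and instance id below are illustrative:

# Once a resource "aws_instance" "app_server_f9b392bb" block exists in a *.tf file,
# attach the real instance to it in the state:
terraform import aws_instance.app_server_f9b392bb i-12345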

What do the generated *.tf files look like?

Let's be honest: even if it's more approachable than the web console, it's still a mess. Resources reference each other through meaningless internal ids, resources are grouped by type rather than by service… Even resource names are sometimes not that human-friendly.
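
To illustrate (with made-up names and ids), a freshly imported resource typically looks something like this:

resource "aws_instance" "app_server_f9b392bb" {
  ami                    = "ami-0f2e255ec956ade7f" # which image was this again?
  instance_type          = "t2.medium"
  subnet_id              = "subnet-9d4a7b6c"       # hard-coded id instead of a reference
  vpc_security_group_ids = ["sg-903004f8"]         # same here
}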

Dealing with imported files

Cleaning unused resources

As a first step, we removed unused resources: that old test S3 bucket, this stopped instance, … At each step of the cleanup, the plan command let us make sure we were not deleting anything else. The result was still not very readable, but it contained almost only useful items.

Introducing variables

This part, even if very useful, was quite tedious. We tried to replace all ids with references to the corresponding resources, for instance replacing every occurrence of "i-12345" with "${aws_instance.app_server_f9b392bb.id}". Once again the plan command was helpful to check that there was nothing left to change after our modifications. It's a bit like refactoring with a good compiler, with some guarantee that you are not going to break things.

We also used a lot of locals blocks to define values coming from outside the imported scope (our office IP address, other servers' IPs), as sketched below.
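
A minimal sketch of what this looks like; the IP address, the rule and the referenced security group are made-up examples:

locals {
  office_ip = "203.0.113.12/32" # example value, defined outside the imported scope
}

resource "aws_security_group_rule" "ssh_from_office" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["${local.office_ip}"]
  security_group_id = "${aws_security_group.app.id}" # hypothetical security group
}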

It's definitely more human friendly.

Renaming resources

We still have funny names created from the web console. It's time to rename this app_server_f9b392bb to a friendlier name like app_server_staging. However, the naive solution of just changing the resource name in the Terraform *.tf files is not enough: it will destroy the resource and recreate it under the new name. When you are cloud-ready it's no big deal, but sometimes it's the database server holding your live user data and you don't want to mess with it.

Terraform allows state manipulation with the state command. In our app server case, we can do the rename by editing the *.tf file and then issuing a state mv <old> <new> command:

terraform state mv \
	aws_instance.app_server_f9b392bb \
	aws_instance.app_server_staging

And sometimes you discover that a resource is defined twice for redundancy purposes. You then want to benefit from Terraform's count parameter: delete one of the definitions, add count = 2 to the remaining one, and then issue the following commands:

terraform state mv \
	aws_instance.reverse-proxy-31cdaa42 \
	aws_instance.reverse-proxy[0]

terraform state mv \
	aws_instance.reverse-proxy-5eb22398 \
	aws_instance.reverse-proxy[1]
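
For reference, the remaining resource definition then looks roughly like this (the attributes are illustrative):

resource "aws_instance" "reverse-proxy" {
  count         = 2
  ami           = "ami-0f2e255ec956ade7f" # example value
  instance_type = "t2.small"              # example value
}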

Splitting states

We now have one big state file. The terraform plan command refreshes all resources on each run; that's not too long when the APIs answer quickly, but as we keep growing this duration will increase, and sometimes external providers' APIs are slow.

Let's split the state by concern, for example one sub-state per provider and per environment. We need to create one folder per provider and, inside it, one per environment, define a new state in each folder and then start moving code from one place to another.
Exactly as when we renamed resources, running the terraform plan command in the previous root folder will propose to delete the resource, while running it in the new folder will propose to recreate it. The terraform state command helps us again, this time with the rm option (combined with an import in the new folder):

cd aws/production/ # Our brand new folder
terraform import aws_instance.web i-12345678 # attach the existing instance to the new state

cd - # back to the previous root folder
terraform state rm aws_instance.web # forget it there, without destroying anything


Terraform has an official way of splitting states through its workspaces feature. However, we decided not to use it, mostly because it depends on the backend storage used for your state, which itself depends on the provider's implementation. We also found our plain old file-system “directories” feature more than satisfactory for our needs.

For even more clarity we created a modules/ directory to share common configuration (when we want to share logic between two environments for example) and a providers/ directory to store the configuration and states. Our repository structure basically follows the recommendations from the Terraform documentation and looks like this:

.
├── modules/ # Used to share common logic
└── providers/
    ├── aws
    │   ├── production
    │   │   ├── terraform.tfstate
    │   │   ├── aws.tf
    │   │   └── *.tf
    │   └── staging
    │       ├── terraform.tfstate
    │       ├── aws.tf
    │       └── *.tf
    ├── hetzner
    │   └── production
    │       ├── terraform.tfstate
    │       ├── hetzner.tf
    │       └── *.tf
    └── ovh
        ├── production
        │   ├── terraform.tfstate
        │   ├── ovh.tf
        │   └── *.tf
        └── staging
            ├── terraform.tfstate
            ├── ovh.tf
            └── *.tf

Working in a team

As you can see in the previous paragraph, we are starting to have multiple state files (*.tfstate) and, most importantly, they are all local files on one person's computer. That is really not practical when working in a team (i.e. more than one person).

This is where Terraform's remote backends come in handy: we can store those state files remotely and also lock each of them to make sure we don't run resource changes in parallel.

We decided to move all those state files to AWS S3 object storage and configured remote locking with an AWS DynamoDB table. Now each member of the team can make changes safely and we are sure not to step on each other's toes.
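
The backend configuration in each folder looks roughly like this (the bucket, key and table names are made up for the example):

terraform {
  backend "s3" {
    bucket         = "our-terraform-states" # hypothetical bucket name
    key            = "providers/aws/production/terraform.tfstate"
    region         = "eu-west-1"            # example region
    dynamodb_table = "terraform-locks"      # hypothetical lock table
    encrypt        = true
  }
}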

Cleaning it in team

Introducing migration

As we saw in the previous part, renaming or moving resources implies touching the Terraform state. The state can easily be locked, as we already saw; however, if someone else runs Terraform with an outdated version of your code, they will destroy your changes or see a very complicated plan unrelated to their own changes. It is thus essential to keep your Terraform code in a version control system (most probably git) and to follow coding practices with an upstream master branch.

What happens now when someone moves resources around in your code and issues a non-idempotent command (such as the state mv or state rm commands explained previously) in their own branch? You have no guarantee that another team member is not planning changes on the same resources.

To handle this issue we took inspiration from the way database schema migrations are handled in code.

Hence, every pull request on our configuration code which needs manual changes on the remote state should now come with a migration to update that state. Please note that the terraform apply command naturally updates the state for us when creating or modifying resources, without the need for migrations. Migrations are only needed when refactoring our code base (basically each time you need to run a state mv, state rm or import subcommand).

Migration in practice

At Fretlink we love typed languages, test suites and clean code, but we also have a thing for bash (mainly in our so-called devoops team).
So we decided to write a small wrapper to deal with migrations.

Storing migration version

We added to our root folder (where we imported all our resources at the beginning) a one-line text file called version with the version number inside. And to make it visible to the Terraform state, we added a version.tf with the following content:

data "local_file" "version" {
  filename = "${path.module}/version"
}

Storing migration logic

We simply store migrations in a migrations/{version} folder. Each migration is made of two files: up.sh and down.sh.

Each migration script contains a set of terraform import or terraform state mv|rm calls.
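
For instance, the rename from earlier in this post could be captured as a (simplified) migration; the version number 001 is only an example:

migrations/001/up.sh:

#!/usr/bin/env bash
set -euo pipefail
# Rename the app server in the remote state.
terraform state mv \
	aws_instance.app_server_f9b392bb \
	aws_instance.app_server_staging

migrations/001/down.sh:

#!/usr/bin/env bash
set -euo pipefail
# Revert the rename.
terraform state mv \
	aws_instance.app_server_staging \
	aws_instance.app_server_f9b392bb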

Running migrations

We have a simple wrapper script written in bash, migrations/migrate.sh, which:

  • reads the version known from the Terraform state (with a simple terraform state show "data.local_file.version" command)
  • reads the local file version (basically cat version)
  • then executes migration steps (up or down migrations) depending on the result of comparing the two version numbers read before.
  • if everything went well, it refreshes the Terraform state with the latest version (a simplified sketch of this wrapper is shown after the examples below).

E.g.

  • if the Terraform state version contains number 003 and our local version file contains number 005, the migration script will execute the migrations/004/up.sh and migrations/005/up.sh scripts and update the Terraform state version to 005.
  • if the Terraform state version contains number 006 and our local version file contains number 005, the migration script will execute the migrations/006/down.sh script and update the Terraform state version to 005.
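
Our real script has a bit more error handling, but a simplified sketch of migrations/migrate.sh could look like the following (how the version is parsed from the state output is an assumption and depends on your Terraform version):

#!/usr/bin/env bash
set -euo pipefail

# Run from the root folder, where the version file and the Terraform code live.
cd "$(dirname "$0")/.."

# Version recorded in the Terraform state via the local_file data source.
# We grab the first three-digit run in the output; adjust to your Terraform version's format.
state_version=$(terraform state show data.local_file.version | grep -oE '[0-9]{3}' | head -n 1)

# Version tracked in the repository.
local_version=$(cat version)

if [ "$((10#$state_version))" -lt "$((10#$local_version))" ]; then
  # Migrate up, one step at a time.
  for ((v = 10#$state_version + 1; v <= 10#$local_version; v++)); do
    bash "migrations/$(printf '%03d' "$v")/up.sh"
  done
elif [ "$((10#$state_version))" -gt "$((10#$local_version))" ]; then
  # Migrate down, from the newest applied migration back to the target one.
  for ((v = 10#$state_version; v > 10#$local_version; v--)); do
    bash "migrations/$(printf '%03d' "$v")/down.sh"
  done
fi

# Re-read the local version file so the new version is recorded in the Terraform state.
terraform refresh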

Finally our repository structure for those migrations looks like the following:

…
├── migrations
│   ├── 001
│   │   ├── down.sh
│   │   └── up.sh
│   ├── 002
│   │   ├── down.sh
│   │   └── up.sh
│   ├── 003
│   │   ├── down.sh
│   │   └── up.sh
│   ├── 004
│   │   ├── down.sh
│   │   └── up.sh
│   ├── 005
│   │   ├── down.sh
│   │   └── up.sh
│   ├── 006
│   │   ├── down.sh
│   │   └── up.sh
│   └── migrate.sh
├── version
└── version.tf

Conclusion

With a bit of tooling (simple enough to be done in bash) we managed to get control over our infrastructure resources and, as a team, make them evolve without outages. Very few of the things we imported at the beginning are still there, which means we did quite a lot of work to earn our devoops medal.

It's not a state-of-the-art infrastructure with every trick and best practice that Terraform has to offer, as some of us may have seen at conferences, but we are confident making changes to it. We have continuous integration tests checking our code's validity automatically on every change, we have an auditable trace of every infrastructure change through our code, developers (outside the devoops team) are not afraid to write PRs for changes themselves (we did not mention it, but this is a huge bonus), and it has almost never failed us.

If you are unsure how to handle your infrastructure resources and worried about your continuously growing needs, if you are tired of logging in to your cloud provider's web console, if you want visibility on your infrastructure through versioned and auditable code, then go try out Terraform. Really, just go and try it out: start with a local state file, import a few of your existing resources and issue your first terraform plan commands to see the result. It's non-destructive and will do no harm; see it as a “dry-run” of your potential infrastructure changes. We are sure you will enjoy the ride!

Have fun, stay free and stay kind.