<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Tim Birkett ]]></title><description><![CDATA[cat /dev/urandom]]></description><link>http://www.pysysops.com</link><image><url>/images/index.jpg</url><title>Tim Birkett </title><link>http://www.pysysops.com</link></image><generator>RSS for Node</generator><lastBuildDate>Tue, 15 Oct 2019 20:41:54 GMT</lastBuildDate><atom:link href="http://www.pysysops.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Taming Terraform with Modules]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>Some of us know what Terraform is. It&#8217;s an Open Source tool created and maintained by HashiCorp to help us specify and manage our Infrastructure as Code across multiple Infrastructure as a Service (IaaS) and Cloud providers.</p>
</div>
<div class="paragraph">
<p>When you start learning Terraform you make use of its primitives: <code>data</code> sources and <code>resource</code> statements. If you&#8217;re trying it out and simply creating a Ghost publication in <em>The Cloud</em>, these primitives do the job. It&#8217;s a great way to tie together your understanding of the building blocks to make something on whatever Cloud provider you&#8217;re using. Typically, a VPC, a few Subnets, a Routing Table, a NAT Gateway, an ELB, a handful of Security Groups, an Instance and a Database will have you up and running with a shiny new Ghost publication. A bit of an expensive Ghost publication.</p>
</div>
<div class="paragraph">
<p>Unfortunately, most of us aren&#8217;t doing simple, one-off things in our day jobs. We&#8217;re building complex and often evolutionary infrastructures across multiple projects, products, teams, accounts, locations and cloud providers. Handcrafting all of that Terraform code from the primitives offered would be a nightmare from the beginning. Maintaining consistency across <em>all the things</em> would be pretty much impossible. I&#8217;ve been there and I ran away. Fast! This is where Terraform <strong>modules</strong> help us out.</p>
</div>
<div class="paragraph">
<p>Terraform has some good documentation on writing basic modules:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><a href="https://www.terraform.io/docs/modules/index.html" class="bare">https://www.terraform.io/docs/modules/index.html</a></p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Modules allow us to create reusable collections of resources that are often used together, and expose them as a single <code>module</code> block in Terraform.</p>
</div>
<div class="paragraph">
<p>As an example: EC2 instances will usually belong to an Autoscaling Group, have a Launch Configuration and be attached to some Security Groups. By bundling these into a module you can reuse the same code by passing different parameters into the module and you get a similar result each time. Before we break out the <em>module shotgun</em> it&#8217;s worth understanding that there are multiple types of module.</p>
</div>
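<div class="paragraph">
<p>As a rough sketch, calling such a module might look like this (the module name, source path and inputs here are all hypothetical):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code># Hypothetical module call: same code, different parameters each time
module "web_asg" {
  source = "./modules/asg" # hypothetical local module bundling ASG + LC + SG

  name          = "web"
  instance_type = "t3.micro"
  min_size      = 2
  max_size      = 6
}</code></pre>
</div>
</div>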
</div>
</div>
<div class="sect1">
<h2 id="_component_modules">Component Modules</h2>
<div class="sectionbody">
<div class="paragraph">
<p>A component module has a focussed job: it interacts with a single cloud provider and uses only the low-level Terraform resources. It might group together an ASG, Launch Configuration and Security Group, or a VPC, Subnets, Route Tables and Internet / NAT Gateways.</p>
</div>
<div class="paragraph">
<p>Component modules are typically the type of module you&#8217;ll find on the Terraform Registry. Some good examples are:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><a href="https://registry.terraform.io/modules/terraform-aws-modules/vpc/aws" class="bare">https://registry.terraform.io/modules/terraform-aws-modules/vpc/aws</a></p>
</li>
<li>
<p><a href="https://registry.terraform.io/modules/terraform-aws-modules/autoscaling/aws" class="bare">https://registry.terraform.io/modules/terraform-aws-modules/autoscaling/aws</a></p>
</li>
<li>
<p><a href="https://registry.terraform.io/modules/terraform-aws-modules/security-group/aws" class="bare">https://registry.terraform.io/modules/terraform-aws-modules/security-group/aws</a></p>
</li>
<li>
<p><a href="https://registry.terraform.io/modules/terraform-aws-modules/alb/aws" class="bare">https://registry.terraform.io/modules/terraform-aws-modules/alb/aws</a></p>
</li>
</ul>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_infrastructure_or_service_modules">Infrastructure or Service Modules</h2>
<div class="sectionbody">
<div class="paragraph">
<p>So, you&#8217;ve started using upstream component modules and you start to realise that you&#8217;re doing the same things, adding the same tags, using <code>autoscaling</code>, <code>security-group</code> and <code>alb</code> together to create a service and inputting a lot of default boilerplate values.</p>
</div>
<div class="paragraph">
<p>This is where Infrastructure modules take over. You can use Terraform modules within other modules. You can create modules that enforce best practice or certain requirements and create things in a consistent way across your infrastructures. They sometimes make use of multiple providers. Some examples:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><a href="https://github.com/dinocorp/webs-infra/tree/master/terraform/local-modules/asg-elb" class="bare">https://github.com/dinocorp/webs-infra/tree/master/terraform/local-modules/asg-elb</a></p>
</li>
<li>
<p><a href="https://github.com/dinocorp/webs-infra/tree/master/terraform/local-modules/mysql" class="bare">https://github.com/dinocorp/webs-infra/tree/master/terraform/local-modules/mysql</a></p>
</li>
</ul>
</div>
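<div class="paragraph">
<p>A hedged sketch of what the inside of such a module might look like — composing registry component modules and enforcing common tags (the variable names are illustrative, and the exact input names vary between registry modules and versions):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code># Illustrative infrastructure module: one "service" built from component modules
locals {
  common_tags = {
    Team        = var.team
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

module "asg" {
  source = "terraform-aws-modules/autoscaling/aws"

  name = var.service_name
  # ...other inputs, with sensible defaults baked in...
}

module "alb" {
  source = "terraform-aws-modules/alb/aws"

  name = var.service_name
  tags = local.common_tags
  # ...
}</code></pre>
</div>
</div>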
</div>
</div>
<div class="sect1">
<h2 id="_data_modules">Data Modules</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Modules that do nothing more than give static output for use in other modules. They might be used to enforce available instance types, provide some data like an instance price to use for spot prices or normalise instance sizes across cloud providers. An example of a data module might be something like:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><a href="https://github.com/dinocorp/webs-infra/tree/master/terraform/local-modules/ec2_spot_prices" class="bare">https://github.com/dinocorp/webs-infra/tree/master/terraform/local-modules/ec2_spot_prices</a></p>
</li>
<li>
<p>Implemented here: <a href="https://github.com/dinocorp/webs-infra/blob/master/terraform/local-modules/asg-elb/main.tf#L16" class="bare">https://github.com/dinocorp/webs-infra/blob/master/terraform/local-modules/asg-elb/main.tf#L16</a></p>
</li>
</ul>
</div>
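<div class="paragraph">
<p>The shape of a data module is simple: no resources at all, just inputs mapped to outputs. A minimal sketch (the prices here are made up):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code># Illustrative data module: map an instance type to a maximum spot price
variable "instance_type" {
  type = string
}

locals {
  spot_prices = {
    "t3.micro" = "0.0047" # made-up values for illustration
    "m5.large" = "0.0430"
  }
}

output "spot_price" {
  value = local.spot_prices[var.instance_type]
}</code></pre>
</div>
</div>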
<div class="paragraph">
<p>Hopefully some of this makes sense and can help you evolve your Terraform code into something more beautiful than it was previously.</p>
</div>
<div class="paragraph">
<p>Something I haven&#8217;t touched on here is managing Terraform module versions and dependencies. I might braindump a post on this soon. Thanks for reading!</p>
</div>
</div>
</div>]]></description><link>http://www.pysysops.com/2019/10/03/Taming-Terraform-with-Modules.html</link><guid isPermaLink="true">http://www.pysysops.com/2019/10/03/Taming-Terraform-with-Modules.html</guid><category><![CDATA[Terraform]]></category><category><![CDATA[Terrafile]]></category><category><![CDATA[xterrafile]]></category><category><![CDATA[Infrastructure as Code]]></category><dc:creator><![CDATA[Tim Birkett]]></dc:creator><pubDate>Thu, 03 Oct 2019 00:00:00 GMT</pubDate></item><item><title><![CDATA[Running Tasks Based on Public Holidays]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>Over the past few years I&#8217;ve worked for various financial organisations in different areas. Something that has come up quite often is public holidays, or bank holidays in the UK. They&#8217;re a non-working day for most people but I&#8217;ve still encountered organisations that need certain things, like cron jobs, to run or not to run on these days.</p>
</div>
<div class="paragraph">
<p>Recently the scheduled task in question was a file transfer to a payment provider which required manual acceptance from someone in the business. I noticed a calendar invite pop-up for me to "disable epayment file transfer" before every bank holiday and another calendar invite to "re-enable epayment file transfer" before the next file transfer at 6am. HELL NO! I hate repetitive manual things with a passion, even if it&#8217;s only 8 days a year.</p>
</div>
<div class="paragraph">
<p>I began my search for a cli tool to help me. Failing to find such a tool I began a hunt for a Python package to help me out as the scheduled task was written in Python. I found a few packages but some required a call to external web services. Then I found the <code>holidays</code> package: <a href="https://pypi.org/project/holidays/" class="bare">https://pypi.org/project/holidays/</a></p>
</div>
<div class="paragraph">
<p>As I started to write the code to handle public holidays&#8230;&#8203; <code>import holidays</code>&#8230;&#8203; I stopped myself. I had originally been searching for a cli tool to run as part of the systemd timer so I could run something like this:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code>is_it_a_holiday_command || /opt/scripts/epayment_transfer</code></pre>
</div>
</div>
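<div class="paragraph">
<p>In systemd terms, that condition can live in the service unit that the timer triggers — a hedged sketch (the unit name and schedule are made up, and <code>is_it_a_holiday_command</code> is the placeholder from above):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code># /etc/systemd/system/epayment-transfer.service (illustrative)
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'is_it_a_holiday_command || /opt/scripts/epayment_transfer'

# /etc/systemd/system/epayment-transfer.timer (illustrative)
[Timer]
OnCalendar=*-*-* 06:00:00
Persistent=true

[Install]
WantedBy=timers.target</code></pre>
</div>
</div>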
<div class="paragraph">
<p>So, rather than building the logic into the transfer script I wrote a cli tool to do it.</p>
</div>
<div class="paragraph">
<p>The tool which I&#8217;ve named <code>publicholiday</code> is on Github: <a href="https://github.com/timbirk/python-publicholiday" class="bare">https://github.com/timbirk/python-publicholiday</a> and has been published to PyPi: <a href="https://pypi.org/project/publicholiday/" class="bare">https://pypi.org/project/publicholiday/</a></p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_installation">Installation</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Installation is as simple as: <code>pip install publicholiday</code></p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_usage">Usage</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Using <code>publicholiday</code> is easy: the command will exit with a status code of 0 if today is a public holiday and a status code of 1 if it is not:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code>$ publicholiday --help
Usage: publicholiday [OPTIONS]

  Is it a public holiday?

Options:
  -c, --country TEXT  Supported country name or code.
  --help              Show this message and exit.

# Run a script on a public holiday
$ publicholiday &amp;&amp; /thing/to/run.sh

# Don't run a script on public holidays (run it on all other days).
$ publicholiday || /thing/to/run.sh</code></pre>
</div>
</div>
<div class="paragraph">
<p>By default, <code>publicholiday</code> uses UK bank holidays. It is possible to pass in a country using either the name or the short code, as defined by the <code>holidays</code> package: <a href="https://pypi.org/project/holidays/" class="bare">https://pypi.org/project/holidays/</a>. Examples:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code># Run a script on an Argentinian public holiday
$ publicholiday -c Argentina &amp;&amp; /thing/to/run.sh

# Don't run a script on US public holidays
$ publicholiday -c US || /thing/to/run.sh</code></pre>
</div>
</div>
<div class="paragraph">
<p>Currently, only countries are supported. Province / state level holidays would be reasonably easy to implement and I&#8217;d be open to a PR with tests and documentation for that but it&#8217;s outside of the scope of my current needs.</p>
</div>
<div class="paragraph">
<p>Hopefully if you stumble upon this post it helps solve your problem.</p>
</div>
</div>
</div>]]></description><link>http://www.pysysops.com/2018/07/23/Running-Tasks-Based-on-Public-Holidays.html</link><guid isPermaLink="true">http://www.pysysops.com/2018/07/23/Running-Tasks-Based-on-Public-Holidays.html</guid><category><![CDATA[Python]]></category><category><![CDATA[Bank Holidays]]></category><category><![CDATA[Automation]]></category><category><![CDATA[Public Holidays]]></category><dc:creator><![CDATA[Tim Birkett]]></dc:creator><pubDate>Mon, 23 Jul 2018 00:00:00 GMT</pubDate></item><item><title><![CDATA[Securing Jenkins Workspaces]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>Jenkins is a powerful tool for building, packaging and deploying your software. It can stand in as an ad-hoc task runner or an orchestrator, and replace your cron jobs to give some visibility into what tasks have run, when, and by whom or what.</p>
</div>
<div class="paragraph">
<p>I&#8217;ve been using Jenkins for most of my tech life and think that in terms of flexibility, automation and power it&#8217;s one of the best tools to have in your box. There is one problem I have with Jenkins though: by default it isn&#8217;t very secure.</p>
</div>
<div class="paragraph">
<p>Even after following <a href="https://jenkins.io/doc/book/system-administration/security/">Securing Jenkins</a>, it has, amongst other things, one feature which can be a security concern: workspace access through the web UI.</p>
</div>
<div class="paragraph">
<p>By default you have 3 options to control access to job workspaces:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>they are readable (anonymously or by a group / specified users)</p>
</li>
<li>
<p>they are not readable</p>
</li>
<li>
<p>they are deleted / cleaned up after a build</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>I generally stick to options 2 and 3: remove read access and clean the job workspace post build. This keeps build agents&#8217; disk space happy and reduces the chance of leaking secrets or sensitive information through logs or files in the workspace.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_overview">Overview</h2>
<div class="sectionbody">
<div class="paragraph">
<p>I&#8217;ve set up automated internal CAs to handle issuing certificates and re-generating them on expiry. In recent years, I&#8217;ve moved to using LetsEncrypt wherever possible.</p>
</div>
<div class="paragraph">
<p>I&#8217;ve found that in cloud environments that often scale up and down, running an agent like certbot on each instance is an anti-pattern.</p>
</div>
<div class="paragraph">
<p>Each time an instance is provisioned it requests some certificates, and LetsEncrypt will issue a new certificate to each instance. If you&#8217;ve ever tried the same thing, you&#8217;ll find that you can hit LetsEncrypt rate limits pretty quickly.</p>
</div>
<div class="paragraph">
<p>To solve this problem I created a Jenkins job to handle the certificate generation with the Certbot Docker container (<code>lego</code> is a good alternative), build an rpm package containing the certificates and push the rpm up to our private yum repos. From there all instances can have the <code>cp-certs</code> package installed as part of their bootstrapping / initial provisioning.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_the_problem">The Problem</h2>
<div class="sectionbody">
<div class="paragraph">
<p>This job runs on a schedule every 7 days. On the first run of certbot, various configuration files and private keys are generated. Whilst Jenkins itself is relatively secure I don&#8217;t like the idea of these files sitting there in the workspace for anyone to get hold of, try to use, accidentally commit, use to decode https traffic&#8230;&#8203;</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_the_solution">The Solution</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Initially I hunted round for a while for some sort of "Secure" or "Hidden" workspace plugin. There&#8217;s nothing. I sat and thought: "What is the actual risk I&#8217;m trying to reduce?" and came up with:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>access via the Jenkins UI</p>
</li>
<li>
<p>user access on the file system</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>It dawned on me that the solution might be easier than I thought (damn brain getting in the way).</p>
</div>
<div class="paragraph">
<p>If I move the relevant sensitive files out of the workspace, there&#8217;s no longer a risk of someone borrowing them from the UI and if I encrypt / archive the files with a strong secret then they&#8217;ll be reasonably secure on the file system.</p>
</div>
<div class="paragraph">
<p>Obviously, someone who knows or can access the secret (me) and has access to the file system as the root or Jenkins user (me) could get at them but it&#8217;s a reasonable attempt at improving security.</p>
</div>
<div class="sect2">
<h3 id="_how_do_you_do_it">How do you do it?</h3>
<div class="paragraph">
<p>⚠️ You&#8217;ll need GPG installed on your Jenkins CI agents.</p>
</div>
<div class="paragraph">
<p>First, create a strong secret in the Jenkins credentials store. In the job you want to secure, add a credentials binding to expose the strong secret as an environment variable: <code>SECURE_KEY</code> in this example.</p>
</div>
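<div class="paragraph">
<p>If you&#8217;re using Pipeline jobs rather than freestyle jobs, the Credentials Binding plugin gives you the same thing — a hedged sketch, where the credential ID and script name are hypothetical:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code>// Illustrative Pipeline step; 'secure-workspace-key' is a hypothetical credential ID
withCredentials([string(credentialsId: 'secure-workspace-key', variable: 'SECURE_KEY')]) {
    sh './restore_secured_workspace.sh' // hypothetical script wrapping the steps below
}</code></pre>
</div>
</div>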
<div class="paragraph">
<p>Add something like the following to your job as the first shell step:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code>#!/bin/bash
set -eu -o pipefail

#
# Check for encrypted workspace archive and extract or create directory.
#

SECURE_DIR="/var/lib/jenkins/secured-workspace/$JOB_NAME"

if [[ -f "${SECURE_DIR}/secured_state.tgz.gpg" ]]
then
    echo 'INFO: Restoring secured workspace'
    gpg --yes --batch --passphrase="${SECURE_KEY}" \
        "${SECURE_DIR}/secured_state.tgz.gpg"

    tar -xzf "${SECURE_DIR}/secured_state.tgz"
else
    echo 'INFO: Creating directory for secured workspace'
    mkdir -m 700 -p "${SECURE_DIR}"
fi
exit 0</code></pre>
</div>
</div>
<div class="paragraph">
<p>As the last shell step or a postbuild script something like the following should work:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code>#!/bin/bash
# Avoid -x here: xtrace would print ${SECURE_KEY} into the build log
set -eu -o pipefail

#
# GPG encrypt workspace and remove all insecure files.
#

SECURE_DIR="/var/lib/jenkins/secured-workspace/$JOB_NAME"

echo 'INFO: Creating secured workspace'
tar -czf "${SECURE_DIR}/secured_state.tgz" .
gpg --yes --batch --passphrase="${SECURE_KEY}" -c "${SECURE_DIR}/secured_state.tgz"
rm -f "${SECURE_DIR}/secured_state.tgz"
# Delete workspace contents including dotfiles ('${WORKSPACE}/.*' would match '..')
find "${WORKSPACE}" -mindepth 1 -delete</code></pre>
</div>
</div>
<div class="paragraph">
<p>At this point you&#8217;ll have a workspace with only safe files in it. You can always apply a workspace cleanup after the archive / encryption has run to get rid of all files.</p>
</div>
<div class="paragraph">
<p>Hopefully you find this helpful if you need to secure your workspace contents. Thanks for reading!</p>
</div>
</div>
</div>
</div>]]></description><link>http://www.pysysops.com/2018/06/09/Securing-Jenkins-Workspaces.html</link><guid isPermaLink="true">http://www.pysysops.com/2018/06/09/Securing-Jenkins-Workspaces.html</guid><category><![CDATA[Jenkins]]></category><category><![CDATA[CI]]></category><category><![CDATA[Security]]></category><category><![CDATA[Secrets]]></category><dc:creator><![CDATA[Tim Birkett]]></dc:creator><pubDate>Sat, 09 Jun 2018 00:00:00 GMT</pubDate></item><item><title><![CDATA[Testing DNS Infrastructure with Goss]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>In a previous post we got an introduction to <a href="http://www.pysysops.com/2017/01/10/Easy-Infrastructure-Testing-with-Goss.html">"Easy Infrastructure Testing with Goss"</a>.</p>
</div>
<div class="paragraph">
<p>In this post we&#8217;ll take a look at a feature I added to <a href="http://goss.rocks">Goss</a> a while ago: enhanced DNS validation.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_why_test_dns">Why test DNS?</h2>
<div class="sectionbody">
<div class="paragraph">
<p>DNS is easy right?! It&#8217;s just an IP address and a hostname. Easy&#8230;&#8203; We&#8217;ve definitely never had an outage or failed to deploy a new application because of a DNS issue have we?</p>
</div>
<div class="paragraph">
<p>DNS can get a little more interesting when you start chaining CNAMEs, have multiple A records for a hostname and introduce DNSSEC.</p>
</div>
<div class="paragraph">
<p>PTR records which reverse map an IP to a hostname are often used by various server applications for security purposes <a href="https://community.oracle.com/message/6415013">(Java + SSL)</a>.</p>
</div>
<div class="paragraph">
<p>If DNS configuration is out of your control and another team forgets to add the records you need correctly you can end up wasting hours troubleshooting why various applications won&#8217;t start up, clients fail to connect and you have SSL connection errors.</p>
</div>
<div class="paragraph">
<p><strong>Testing your DNS with Goss will solve ALL these problems!</strong> Okay, that&#8217;s a lie. It can however help you identify when DNS records aren&#8217;t quite right, have changed, or are missing before deploying a new application.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_what_can_goss_test">What can Goss test?</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Goss can validate that any of the following record types are resolvable, and can validate the values of the records.</p>
</div>
<div class="ulist">
<ul>
<li>
<p>A</p>
</li>
<li>
<p>AAAA</p>
</li>
<li>
<p>CAA</p>
</li>
<li>
<p>CNAME</p>
</li>
<li>
<p>MX</p>
</li>
<li>
<p>NS</p>
</li>
<li>
<p>PTR</p>
</li>
<li>
<p>SRV</p>
</li>
<li>
<p>TXT</p>
</li>
</ul>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_how_do_i_test_dns_records">How do I test DNS records?</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Here are a few examples of DNS record tests:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code>dns:
  # Validate a CAA record
  CAA:dnstest.io:
    resolvable: true
    addrs:
    - 0 issue comodoca.com
    - 0 issue letsencrypt.org
    - 0 issuewild ;
    timeout: 2000
    server: 8.8.8.8

  # Validate a CNAME record
  CNAME:dnstest.github.io:
    resolvable: true
    server: 8.8.8.8
    addrs:
    - "github.map.fastly.net."

  # Validate a PTR record
  PTR:8.8.8.8:
    resolvable: true
    server: 8.8.8.8
    addrs:
    - "google-public-dns-a.google.com."

  # Validate an SRV record
  SRV:_https._tcp.dnstest.io:
    resolvable: true
    server: 8.8.8.8
    addrs:
    - "0 5 443 a.dnstest.io."
    - "10 10 443 b.dnstest.io."

  # Validate an MX record
  MX:dnstest.io:
    resolvable: true
    addrs:
    - 10 b.dnstest.io.
    - 5 a.dnstest.io.
    timeout: 2000
    server: 8.8.8.8</code></pre>
</div>
</div>
<div class="paragraph">
<p>The above examples will query Google&#8217;s public DNS server: <code>8.8.8.8</code> for results. You can remove the <code>server</code> parameter which will result in the system DNS resolver being used.</p>
</div>
<div class="paragraph">
<p>Combining this with the <a href="https://github.com/aelsabbahy/goss/blob/master/docs/manual.md#example-4">nagios</a> output and creating a monitoring check from it could be helpful in identifying future issues or alerting when a record might have been "cleaned up".</p>
</div>
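<div class="paragraph">
<p>For example, a Nagios / Sensu check command could be as simple as the following (assuming your DNS tests live in a gossfile called <code>goss-dns.yaml</code>):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code>$ goss -g goss-dns.yaml validate --format nagios_verbose</code></pre>
</div>
</div>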
</div>
</div>]]></description><link>http://www.pysysops.com/2017/02/20/Testing-DNS-Infrastructure-with-Goss.html</link><guid isPermaLink="true">http://www.pysysops.com/2017/02/20/Testing-DNS-Infrastructure-with-Goss.html</guid><category><![CDATA[goss]]></category><category><![CDATA[DNS]]></category><category><![CDATA[Testing]]></category><category><![CDATA[DevOps]]></category><category><![CDATA[Linux]]></category><category><![CDATA[Monitoring]]></category><dc:creator><![CDATA[Tim Birkett]]></dc:creator><pubDate>Mon, 20 Feb 2017 00:00:00 GMT</pubDate></item><item><title><![CDATA[Easy Infrastructure Testing with Goss]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>In the world of infrastructure; new servers, VMs, containers or applications are often manually validated by a human. Some form of build document or checkbox exercise takes place to confirm that the piece of infrastructure is ready for use. Even with configuration management, mistakes happen and humans make mistakes (damn humans).</p>
</div>
<div class="paragraph">
<p>Wouldn&#8217;t it be great if there were an easy way to automatically validate new servers before they go live, rather than finding the problems in production? Well, there is! Try <a href="https://github.com/aelsabbahy/goss">Goss</a>.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_what_is_goss">What is Goss?</h2>
<div class="sectionbody">
<div class="paragraph">
<p><a href="https://github.com/aelsabbahy/goss">Goss</a> is a tool that let&#8217;s you easily and quickly validate infrastructure. Like <a href="http://serverspec.org/">Serverspec</a> but without all the code. Goss allows you to define what a piece of infrastructure should look like with YAML or JSON. This is made even easier for us with the ability to auto add resources to the Goss configuration on the command line.</p>
</div>
<div class="paragraph">
<p>Goss allows you to validate many different resource types such as files, users, groups, packages, services and http connectivity. You can read the full Goss documentation <a href="https://github.com/aelsabbahy/goss/blob/master/docs/manual.md#available-tests">here</a>.</p>
</div>
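<div class="paragraph">
<p>For instance, you can have Goss generate the configuration from the current system state and then run the tests (the resource names here are just examples):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code># Add resources to goss.yaml based on the current system state
$ goss add package httpd
$ goss add service httpd
$ goss add user deployment

# Run the validation
$ goss validate</code></pre>
</div>
</div>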
</div>
</div>
<div class="sect1">
<h2 id="_why_goss">Why Goss?</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Well, a few things make goss an awesome tool for server validation:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Written in Go - This means it&#8217;s a self contained binary with no dependencies on other libraries or interpreters.</p>
</li>
<li>
<p>It&#8217;s super fast - Taking advantage of Go&#8217;s concurrency model: tests are executed and returned almost instantly.</p>
</li>
<li>
<p>It&#8217;s easy to get started with - Defining resources in YAML or JSON makes it easy for your entire team to get to grips with.</p>
</li>
</ul>
</div>
<div class="sect2">
<h3 id="_an_example">An Example</h3>
<div class="paragraph">
<p>We build a web server running Apache.</p>
</div>
<div class="paragraph">
<p>Before going live someone checks that the Apache <code>httpd</code> package is the correct version of <code>2.4.25</code>, the <code>deployment</code> user is in the <code>www-data</code> group and there is an application directory at <code>/srv/www/app</code>. They also check the <code>httpd</code> service is running, we can connect to the application at <code><a href="http://localhost/app" class="bare">http://localhost/app</a></code> and going to <code><a href="http://localhost/" class="bare">http://localhost/</a></code> gives a <code>404</code> error page.</p>
</div>
<div class="paragraph">
<p>To automate the above procedure with Goss, the YAML configuration would look like this:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-yaml" data-lang="yaml">---
package:
  httpd:
    installed: true
    versions:
    - 2.4.25
user:
  deployment:
    exists: true
    groups:
    - deployment
    - www-data
file:
  /srv/www/app:
    exists: true
    filetype: directory
service:
  httpd:
    enabled: true
    running: true
http:
  http://localhost/app:
    status: 200
    timeout: 1000
  http://localhost:
    status: 404
    timeout: 1000</code></pre>
</div>
</div>
<div class="paragraph">
<p>You can see clearly what is being validated and this goss.yaml file can be used to consistently validate all servers of the same configuration.</p>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_server_validation_and_monitoring">Server Validation and Monitoring</h2>
<div class="sectionbody">
<div class="paragraph">
<p>All this validation of files, processes, services, ports and connectivity sounds familiar. It&#8217;s something that we quite often try to achieve with our monitoring tools like Nagios, Zabbix or Sensu.</p>
</div>
<div class="paragraph">
<p>With Goss we can create a single monitoring check that tests many resources at once. Goss has several different <a href="https://github.com/aelsabbahy/goss/#supported-output-formats">outputs</a>. The <code>nagios_verbose</code> output gives you out-of-the-box compatibility with Nagios or Sensu and gives Nagios long output explaining failures:</p>
</div>
<div class="imageblock">
<div class="content">
<img src="https://cloud.githubusercontent.com/assets/1253072/18037748/76f65a32-6d83-11e6-9aba-bceabb8430a3.png" alt="Goss - nagios_verbose output">
</div>
</div>
<div class="paragraph">
<p>Server validation can now become part of your monitoring ecosystem ensuring that any problems are identified quickly with a single monitoring check.</p>
</div>
<div class="admonitionblock warning">
<table>
<tr>
<td class="icon">
<i class="fa icon-warning" title="Warning"></i>
</td>
<td class="content">
Goss won&#8217;t replace all of your checks, it doesn&#8217;t check things like HDD space, RAM usage or errors in log files. But it makes a sweet addition to server / service monitoring.
</td>
</tr>
</table>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_http_health_endpoint">HTTP Health Endpoint</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Many applications expose "health" endpoints for applications and services.</p>
</div>
<div class="paragraph">
<p>Tools like Google&#8217;s Borgmon monitoring system and <a href="https://prometheus.io/">Prometheus</a> use this HTTP(S) scrape or pull model to retrieve monitoring and metrics results from their services.</p>
</div>
<div class="paragraph">
<p>Goss has a <a href="https://github.com/aelsabbahy/goss/blob/master/docs/manual.md#serve-s---serve-a-health-endpoint"><code>serve</code> command</a> that exposes an HTTP endpoint for scraping. You can then use something like <a href="https://www.phpservermonitor.org/">PHP Server Monitor</a> to show the validation status of each piece of infrastructure.</p>
</div>
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
If instrumenting your applications interests you, there&#8217;s plenty of libraries to assist you. Check out: <a href="http://metrics.dropwizard.io/3.1.0/">Dropwizard Metrics</a> and <a href="http://blog.kristian.io/django-health-check/">Django Health Check</a>.
</td>
</tr>
</table>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_final_thoughts">Final Thoughts</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Hopefully you can see the value of validation testing your infrastructure and building validation into your monitoring systems.</p>
</div>
<div class="paragraph">
<p>Goss is a young project and currently only supports Linux but it&#8217;s very active and open to contributions. You can help out by opening issues, discussing enhancements and submitting pull requests for review.</p>
</div>
<div class="paragraph">
<p>I&#8217;ll cover some advanced Goss usage in future posts. Thanks for reading!</p>
</div>
</div>
</div>]]></description><link>http://www.pysysops.com/2017/01/10/Easy-Infrastructure-Testing-with-Goss.html</link><guid isPermaLink="true">http://www.pysysops.com/2017/01/10/Easy-Infrastructure-Testing-with-Goss.html</guid><category><![CDATA[Configuration Management]]></category><category><![CDATA[Testing]]></category><category><![CDATA[Security]]></category><category><![CDATA[goss]]></category><category><![CDATA[Monitoring]]></category><dc:creator><![CDATA[Tim Birkett]]></dc:creator><pubDate>Tue, 10 Jan 2017 00:00:00 GMT</pubDate></item><item><title><![CDATA[Puppet Anti-Patterns]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>Over my years in the tech industry I&#8217;ve gained a lot of experience with Configuration Management tools such as Puppet, Chef and Ansible. In this post I&#8217;d like to share with you my experiences, opinions and advice on using Puppet as a Configuration Management tool. Hopefully this helps some of you out there to beat Puppet into submission.</p>
</div>
<div class="paragraph">
<p>Let&#8217;s just jump straight in with the patterns that aren&#8217;t too great.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_everything_in_manifests">Everything in Manifests</h2>
<div class="sectionbody">
<div class="paragraph">
<p>In many beginner tutorials you&#8217;re taught to put all of your code in manifests such as <code>site.pp</code> or <code>nodes.pp</code>. For example:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code>node 'puppetclient1.mydomain.net' {
  include httpd_class
}

node 'puppetclient2.mydomain.net' {
  include nginx_class
  file {'/opt/deployment_script':
    ensure =&gt; 'file',
    owner  =&gt; 'deploy',
    group  =&gt; 'deploy',
    mode   =&gt; '0750'
  }
}

node default {
  package { 'perl':
    ensure =&gt; present
  }
}</code></pre>
</div>
</div>
<div class="paragraph">
<p>This is great when you&#8217;re just starting out with a few servers to manage. You think you get it. Then you add a few more servers, you start adding more node-specific config, and before you know it you&#8217;ve got 10,000 lines of hand-crafted artisanal Puppet code. This was common in the early days of Puppet use. It was how I started back with Puppet 0.24.</p>
</div>
<div class="paragraph">
<p>Although it&#8217;s not the best way to manage your entire infrastructure, this approach is actually a reasonably simple way to bootstrap cloud instances, with separate manifests based on server type (<code>web.pp</code>, <code>app.pp</code>, <code>lb.pp</code> etc.). These can then be applied using <a href="https://cloudinit.readthedocs.io/en/latest/">cloud-init</a> to create an immutable bootstrapped node.</p>
</div>
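<div class="paragraph">
<p>To sketch that idea, here&#8217;s a hypothetical cloud-init user-data fragment (the manifest path and server type are illustrative, assuming the manifests are baked into the image):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code>#cloud-config
# Apply the server-type-specific manifest once at first boot
runcmd:
  - [puppet, apply, /etc/puppet/manifests/web.pp]</code></pre>
</div>
</div>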
</div>
</div>
<div class="sect1">
<h2 id="_monolithic_code_modules_code_directory">Monolithic <code>modules</code> Directory</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Quite often I see repos where people have <code>puppet module install</code>&#8217;d straight into the <code>modules</code> directory, or they&#8217;ve downloaded a module and extracted it there. The whole repo, including their own modules mixed in with upstream modules, is then committed to source control.</p>
</div>
<div class="paragraph">
<p>This pattern has a few problems:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>you don&#8217;t know what is a locally developed module and what is an upstream module</p>
</li>
<li>
<p>there&#8217;s no way of easily seeing what versions of modules are deployed</p>
</li>
<li>
<p>it adds a lot of extra code to your Puppet repository</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>Although this approach works and your module versions are effectively pinned, there are tools out there that make it much easier to manage your Puppet modules, such as <a href="http://librarian-puppet.com/">librarian-puppet</a> and <a href="https://github.com/puppetlabs/r10k">r10k</a>.</p>
</div>
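<div class="paragraph">
<p>Both tools read a <code>Puppetfile</code> that pins each upstream module to a version, keeping vendored code out of your repo. A minimal sketch (the module names, versions and git remote are illustrative):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code># Puppetfile - consumed by librarian-puppet or r10k
forge 'https://forge.puppet.com'

mod 'puppetlabs/stdlib', '4.13.1'
mod 'puppetlabs/ntp', '6.0.0'

# An internal module pinned to a git tag
mod 'profile',
  :git =&gt; 'git@github.com:example/puppet-profile.git',
  :ref =&gt; 'v1.2.0'</code></pre>
</div>
</div>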
</div>
</div>
<div class="sect1">
<h2 id="_configuration_data_in_code">Configuration Data in Code</h2>
<div class="sectionbody">
<div class="paragraph">
<p>When writing Puppet code it&#8217;s sometimes tempting to hard-code things like IP addresses or node specific things. For example:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code># DNS Config
class profile::base::dns {
  $dns_servers = ['192.168.1.1', '192.168.1.2']
  file { '/etc/resolv.conf':
    ensure  =&gt; present,
    owner   =&gt; 'root',
    group   =&gt; 'root',
    mode    =&gt; '0444',
    content =&gt; template('profile/resolv.conf.erb')
  }
}</code></pre>
</div>
</div>
<div class="paragraph">
<p>This works, but the code isn&#8217;t re-usable. If you deploy to a different network or DC, are your DNS servers still the same?</p>
</div>
<div class="paragraph">
<p>To improve re-usability, change the variable to be a class parameter with an optional default value:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code># DNS Config
class profile::base::dns (
  $dns_servers = ['8.8.8.8', '8.8.4.4']
){
  file { '/etc/resolv.conf':
    ensure  =&gt; present,
    owner   =&gt; 'root',
    group   =&gt; 'root',
    mode    =&gt; '0444',
    content =&gt; template('profile/resolv.conf.erb')
  }
}</code></pre>
</div>
</div>
<div class="paragraph">
<p>Then you can specify environment (network, node, DC) specific configuration in Hiera:</p>
</div>
<div class="paragraph">
<p>dc1.example.com.yaml:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code>---
profile::base::dns::dns_servers:
  - '192.168.1.1'
  - '192.168.1.2'</code></pre>
</div>
</div>
<div class="paragraph">
<p>dc2.example.com.yaml:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code>---
profile::base::dns::dns_servers:
  - '10.10.0.1'
  - '10.10.1.1'</code></pre>
</div>
</div>
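<div class="paragraph">
<p>Which file wins is decided by your Hiera hierarchy. A minimal Hiera 3 <code>hiera.yaml</code> along these lines would select the per-DC files above by the node&#8217;s <code>domain</code> fact (the datadir path is illustrative):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code>---
:backends:
  - yaml
:yaml:
  :datadir: '/etc/puppetlabs/code/environments/%{environment}/hieradata'
:hierarchy:
  - "%{::domain}"
  - common</code></pre>
</div>
</div>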
<div class="paragraph">
<p>Now you avoid multiple classes or writing <code>case</code> or <code>if {&#8230;&#8203;} else {&#8230;&#8203;}</code> logic in your class file.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_everything_in_separate_repos">Everything in Separate Repos</h2>
<div class="sectionbody">
<div class="paragraph">
<p>In the example above, I&#8217;ve mentioned a "profile" class. A common pattern when dealing with Puppet code is the <strong>Roles and Profiles Pattern</strong>. The idea is that you assign one "Role" per server and the role is made up of individual bite-sized "Profiles". Roles and profiles are just your own or your company&#8217;s custom Puppet modules that make use of upstream modules or Puppet resources to configure systems.</p>
</div>
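<div class="paragraph">
<p>A hypothetical sketch of the pattern (the class names and the upstream module parameter are made up for illustration):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code># One role per server; a role only includes profiles
class role::webserver {
  include profile::base
  include profile::nginx
}

# A bite-sized profile wrapping an upstream module
class profile::nginx {
  class { 'nginx':
    manage_repo =&gt; true,
  }
}</code></pre>
</div>
</div>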
<div class="paragraph">
<p>A common and problematic pattern I&#8217;ve seen is maintaining an SCM repo for roles, an SCM repo for profiles, an SCM repo for the "control repo" and a separate SCM repo for hieradata. This can then turn into multiple branches of each, and before you know it you&#8217;re maintaining 12 versions of a diverging code base. It&#8217;s common to see this when people have followed some "best practice" blog post without fully thinking things through. You also often see it when the code was initially created by developers; they love git branches and complexity, especially if they&#8217;re trying to bring Gitflow to Puppet code.</p>
</div>
<div class="paragraph">
<p>Here&#8217;s my top tip when deciding how to organize your repos: KISS. Have no more than one repo for your product&#8217;s Puppet code. Multiple repos and branches lead to managing multiple very different code bases, different module versions in each environment, merge tasks, complex git fixing&#8230;&#8203; a nightmare. There is an awesome control-repo by the guys at Example42 here: <a href="https://github.com/example42/control-repo" class="bare">https://github.com/example42/control-repo</a></p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_misuse_of_puppet_environments">Misuse of Puppet Environments</h2>
<div class="sectionbody">
<div class="paragraph">
<p>With the r10k tool it&#8217;s easy to manage environments dynamically with Git branches. You can then assign subsets of servers to use different environments (branches of Puppet code). This sounds great: we have applications deployed to different environments like test, dev, stage, uat or production. Cool! Let&#8217;s create all those branches&#8230;&#8203; STOP!!</p>
</div>
<div class="paragraph">
<p>Puppet environments are a powerful thing, but I don&#8217;t believe you should confuse Puppet environments with Application environments. You should aim to manage your infrastructure in a single Puppet environment: "production". If you arrange your hieradata hierarchy sensibly you can manage the differences in configuration in a single branch.</p>
</div>
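<div class="paragraph">
<p>Under r10k that means a single control repo whose <code>production</code> branch is the one environment nearly everything tracks, and the tool&#8217;s configuration stays tiny (the git remote here is, of course, illustrative):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code># /etc/puppetlabs/r10k/r10k.yaml
:sources:
  :puppet:
    remote: 'git@github.com:example/control-repo.git'
    basedir: '/etc/puppetlabs/code/environments'</code></pre>
</div>
</div>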
<div class="paragraph">
<p>Puppet branches should be used when you need to test big changes or new features in a controlled way (of course you&#8217;re developing and testing on Vagrant). Create a branch like "new_feature", develop it locally, testing it on Vagrant, then test the changes on a suitable piece of infrastructure by running <code>puppet agent --test --environment new_feature</code>. Don&#8217;t forget your security and perf testing at this point ;) If everything looks good, open a PR, get it reviewed and merged to production, and delete the branch.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_closing_thoughts">Closing Thoughts</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Most of the opinions I&#8217;ve formed are from being forced to work with some painful Puppet code setups and processes. My best advice to anyone developing their infrastructure code: Keep it simple, think about it before you go ahead, keep an open mind and don&#8217;t be afraid to change your mind or refactor when necessary.</p>
</div>
</div>
</div>]]></description><link>http://www.pysysops.com/2016/11/10/1123-Puppet-Anti-Patterns.html</link><guid isPermaLink="true">http://www.pysysops.com/2016/11/10/1123-Puppet-Anti-Patterns.html</guid><category><![CDATA[Puppet]]></category><category><![CDATA[Automation]]></category><category><![CDATA[Configuration Management]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Tim Birkett]]></dc:creator><pubDate>Thu, 10 Nov 2016 00:00:00 GMT</pubDate></item></channel></rss>