
AWS re:Invent 2018: Daily Recap – Wednesday

Every year AWS re:Invent gets bigger and better. There are more people attending, and even more participating remotely, than in any previous year. There are also more vendors on hand, showing the strength of the AWS ecosystem.

You realized why when Andy Jassy started his keynote session Wednesday morning.  The growth rate of AWS is phenomenal.  Adoption is up, revenues are up, and AWS responds with customer-driven changes. Three years ago, there were fewer than 100 AWS services; now, with yesterday’s announcements, there are more than 140. Jassy covered a lot in the keynote, but the focus was on three major themes:

Storage/Database

The first theme was around Storage/Database, with services such as Amazon FSx, which provides a platform for offerings like FSx for Windows File Server. This is like Amazon EFS, but instead of supporting the NFS protocol it supports the SMB protocol. For those running workloads on Windows, you now have a shared filesystem. If you need a file system for a High Performance Computing cluster, FSx also supports Lustre. I would look for more protocols and services in the future.

FSx was just the tip of the iceberg, with new options like DynamoDB Read/Write Capacity On Demand, another storage tier for Glacier called Deep Archive, a time-series database named Timestream, a fully managed ledger database (QLDB), and even a Managed Blockchain service.  Read more about these from AWS:

Glacier Deep Archive
Amazon FSx for Windows File Servers
Amazon FSx for Lustre
DynamoDB Read/Write Capacity On Demand
Amazon Timestream
Amazon Quantum Ledger Database
Amazon Managed Blockchain

Security

The second theme was around Security.  It surprises no one that AWS is always expanding their offerings in this space.  They are fond of saying that security is Job One at AWS.  Two interesting announcements here were AWS Control Tower and AWS Security Hub. These will assist in many aspects of managing your AWS accounts and strengthening your security posture across your entire AWS account footprint.

Machine Learning/Artificial Intelligence

The final theme was around Machine Learning/Artificial Intelligence. We see a lot of effort being put into AWS’ Machine Learning and Artificial Intelligence solutions. This shows in the number of announcements this year. New SageMaker offerings, Elastic Inference, and even their own specialized inference chip all point to a focus in this area.

Amazon Elastic Inference
AWS Inferentia
Amazon SageMaker Ground Truth
AWS Marketplace for machine learning
Amazon SageMaker RL
AWS DeepRacer

Amazon Textract
Amazon Personalize
Amazon Forecast

And we can’t forget the cool toy of the show – DeepRacer. Like Amazon DeepLens from last year, this “toy” car will help you explore machine learning. It has sensors and compute onboard, so you can teach it how to drive. There’s even a DeepRacer League, where you can compete for a trophy at AWS re:Invent 2019!

Outposts

Although not one of the three main themes, and not available until 2019, AWS Outposts was another exciting announcement yesterday. Want to run your own “region” in your datacenter? Take a look at this. It is fully managed, maintained, and supported infrastructure for your datacenter. It comes in two variants: 1) VMware Cloud on AWS Outposts, which allows you to use the same VMware control plane and APIs you already use to run your infrastructure, and 2) the AWS native variant of AWS Outposts, which allows you to use the exact same APIs and control plane you use to run in the AWS cloud, but on-premises.

If you can’t come to the cloud, it can come to you.

Sessions and Events

There are more sessions than ever at this year’s re:Invent, and the conference agenda is full of interesting and useful events and demos. It’s always great to know that, even if you missed a session, you can stream it on-demand later on the AWS re:Invent YouTube channel. And we can’t forget the expo hall, which has been very heavily trafficked. If you haven’t yet, stop by and see 2nd Watch in booth 2440. We’re giving away one more of those awesome Amazon DeepLens cameras we mentioned earlier in this post. This year’s re:Invent shows that AWS is bigger and better than ever!

David Nettles – Solutions Architect


Fully Coded And Automated CI/CD Pipelines: The Weeds

The Why

In my last post we went over why we’d want to go the CI/CD/automated route and the cultural reasons why it is so beneficial. In this post, we’re going to delve a little deeper and examine the technical side of the tooling. Remember, a primary point of doing a release is mitigating risk. CI/CD is all about mitigating risk… fast.

There’s a Process

The previous article noted that you can’t do CI/CD without building on a set of steps, and I’m going to take this approach here as well. Unsurprisingly, we’ll follow the steps we laid out in the “Why” article, and tackle each in turn.

Step I: Automated Testing

You must automate your testing. There is no other way to describe this. In this particular step, however, we can concentrate on unit testing: testing the small chunks of code you produce (usually functions or methods). There’s some chatter about TDD (Test Driven Development) vs BDD (Behavior Driven Development) in the development community, but I don’t think it really matters, just so long as you are writing test code alongside your production code. On our team, we prefer the BDD-style testing paradigm. I’ve always liked the semantically descriptive nature of BDD tests over strictly code-driven ones. However, it should be said that both are effective and either is better than none, so this is more of a personal preference. On our team we’ve been coding in golang, and our BDD framework of choice is the Ginkgo/Gomega combo.

Here’s a snippet of one of our tests that’s not entirely simple:

Describe("IsValidFormat", func() {
  for _, check := range AvailableFormats {
    Context("when checking "+check, func() {
      It("should return true", func() {
        Ω(IsValidFormat(check)).To(BeTrue())
      })
    })
  }
 
  Context("when checking foo", func() {
    It("should return false", func() {
      Ω(IsValidFormat("foo")).To(BeFalse())
    })
  })
})

So as you can see, the Ginkgo (i.e. BDD) formatting is pretty descriptive about what’s happening. I can instantly understand what’s expected. The function IsValidFormat should return true for every entry in the range (list) of AvailableFormats. A format of foo (which is not a valid format) should return false. It’s both tested and understandable to the future change agent (me or someone else).

Step II: Continuous Integration

Continuous Integration takes Step 1 further, in that it brings all the changes to your codebase together at a single point and builds an artifact for deployment. This means you’ll need an external system to automatically handle merges / pushes. We use Jenkins as our automation server, running it in Kubernetes using the Pipeline style of job description. I’ll get into the way we do our builds using Make in a bit, but the fact that we can include our build code with our projects is a huge win.

Here’s a (modified) Jenkinsfile we use for one of our CI jobs:

def notifyFailed() {
  slackSend (color: '#FF0000', message: "FAILED: '${env.JOB_NAME} [${env.BUILD_NUMBER}]' (${env.BUILD_URL})")
}
 
podTemplate(
  label: 'fooProject-build',
  containers: [
    containerTemplate(
      name: 'jnlp',
      image: 'some.link.to.a.container:latest',
      args: '${computer.jnlpmac} ${computer.name}',
      alwaysPullImage: true,
    ),
    containerTemplate(
      name: 'image-builder',
      image: 'some.link.to.another.container:latest',
      ttyEnabled: true,
      alwaysPullImage: true,
      command: 'cat'
    ),
  ],
  volumes: [
    hostPathVolume(
      hostPath: '/var/run/docker.sock',
      mountPath: '/var/run/docker.sock'
    ),
    hostPathVolume(
      hostPath: '/home/jenkins/workspace/fooProject',
      mountPath: '/home/jenkins/workspace/fooProject'
    ),
    secretVolume(
      secretName: 'jenkins-creds-for-aws',
      mountPath: '/home/jenkins/.aws-jenkins'
    ),
    hostPathVolume(
      hostPath: '/home/jenkins/.aws',
      mountPath: '/home/jenkins/.aws'
    )
  ]
)
{
  node ('fooProject-build') {
    try {
      checkout scm
 
      wrap([$class: 'AnsiColorBuildWrapper', 'colorMapName': 'XTerm']) {
        container('image-builder'){
          stage('Prep') {
            sh '''
              cp /home/jenkins/.aws-jenkins/config /home/jenkins/.aws/.
              cp /home/jenkins/.aws-jenkins/credentials /home/jenkins/.aws/.
              make get_images
            '''
          }
 
          stage('Unit Test'){
            sh '''
              make test
              make profile
            '''
          }
 
          step([
            $class:              'CoberturaPublisher',
            autoUpdateHealth:    false,
            autoUpdateStability: false,
            coberturaReportFile: 'report.xml',
            failUnhealthy:       false,
            failUnstable:        false,
            maxNumberOfBuilds:   0,
            sourceEncoding:      'ASCII',
            zoomCoverageChart:   false
          ])
 
          stage('Build and Push Container'){
            sh '''
              make push
            '''
          }
        }
      }
 
      stage('Integration'){
        container('image-builder') {
          sh '''
            make deploy_integration
            make toggle_integration_service
          '''
        }
        try {
          wrap([$class: 'AnsiColorBuildWrapper', 'colorMapName': 'XTerm']) {
            container('image-builder') {
              sh '''
                sleep 45
                export KUBE_INTEGRATION=https://fooProject-integration
                export SKIP_TEST_SERVER=true
                make integration
              '''
            }
          }
        } catch(e) {
          container('image-builder'){
            sh '''
              make clean
            '''
          }
          throw(e)
        }
      }
 
      stage('Deploy to Production'){
        container('image-builder') {
          sh '''
            make clean
            make deploy_dev
          '''
        }
      }
    } catch(e) {
      container('image-builder'){
        sh '''
          make clean
        '''
      }
      currentBuild.result = 'FAILED'
      notifyFailed()
      throw(e)
    }
  }
}

There’s a lot going on here, but the important part to notice is that I grabbed this from the project repo. The build instructions are included with the project itself. It’s creating an artifact, running our tests, etc. But it’s all part of our project code base. It’s checked into git. It’s code like all the other code we mess with. The steps are somewhat inconsequential at this level of discussion, but it works. We also have it set up to run when there’s a push to GitHub (AND nightly). This ensures that we are continuously running this build and integrating everything that’s happened to the repo in a day. It helps us keep on top of all the possible changes to the repo as well as our environment.

Hey… what’s all that make crap?

Make

Our team uses a lot of tools. We subscribe to the maxim: Use what’s best for the particular situation. I can’t remember every tool we use. Neither can my teammates. Neither can 90% of the people that “do the devops.” I’ve heard a lot of folks say, “No! We must solidify on our toolset!” Let your teams use what they need to get the job done the right way. Now, the fear of experiencing tool “overload” seems like a legitimate one in this scenario, but the problem isn’t the number of tools… it’s how you manage and use them.

Enter Makefiles! (aka: make)

Make has been a mainstay in the UNIX world for a long time (especially in the C world). It is a build tool that’s utilized to help satisfy dependencies, create system-specific configurations, and compile code from various sources independent of platform. This is fantastic, except we couldn’t care less about that in the context of our CI/CD Pipelines. We use it because it’s great at running “buildy” commands.

Make is our unifier. It links our Jenkins CI/CD build functionality with our Dev functionality. Specifically, opening up the docker port here in the Jenkinsfile:

volumes: [
  hostPathVolume(
    hostPath: '/var/run/docker.sock',
    mountPath: '/var/run/docker.sock'
  ),

…allows us to run THE SAME COMMANDS WHEN WE’RE DEVELOPING AS WE DO IN OUR CI/CD PROCESS. This socket allows us to run containers from containers, and since Jenkins is running on a container, this allows us to run our toolset containers in Jenkins, using the same commands we’d use in our local dev environment. On our local dev machines, we use docker nearly exclusively as a wrapper to our tools. This ensures we have library, version, and platform consistency on all of our dev environments as well as our build system. We use containers for our prod microservices so production is part of that “chain of consistency” as well. It ensures that we see consistent behavior across the horizon of application development through production. It’s a beautiful thing! We use the Makefile as the means to consistently interface with the docker “tool” across differing environments.

Ok, I know your interest is piqued at this point. (Or at least I really hope it is!)
So here’s a generic makefile we use for many of our projects:

CONTAINER=$(shell basename $$PWD | sed -E 's/^ia-image-//')
.PHONY: install install_exe install_test_exe deploy test
 
install:
    docker pull sweet.path.to.a.repo/$(CONTAINER)
    docker tag sweet.path.to.a.repo/$(CONTAINER):latest $(CONTAINER):latest
 
install_exe:
    if [[ ! -d $(HOME)/bin ]]; then mkdir -p $(HOME)/bin; fi
    echo "docker run -itP -v \$$PWD:/root $(CONTAINER) \"\$$@\"" > $(HOME)/bin/$(CONTAINER)
    chmod u+x $(HOME)/bin/$(CONTAINER)
 
install_test_exe:
    if [[ ! -d $(HOME)/bin ]]; then mkdir -p $(HOME)/bin; fi
    echo "docker run -itP -v \$$PWD:/root $(CONTAINER)-test \"\$$@\"" > $(HOME)/bin/$(CONTAINER)
    chmod u+x $(HOME)/bin/$(CONTAINER)
 
test:
    docker build -t $(CONTAINER)-test .
 
deploy:
    captain push

This is a Makefile we use to build our tooling images. It’s much simpler than our project Makefiles, but I think this illustrates how you can use Make to wrap EVERYTHING you use in your development workflow. This also allows us to settle on similar/consistent terminology between different projects. %> make test? That’ll run the tests regardless of whether we are working on a golang project or a python lambda project, or, in this case, build a test container and tag it as whatever-test. Make unifies “all the things.”

This also codifies how to execute the commands, i.e. what arguments to pass, what inputs to provide, etc. If I can’t even remember the name of the command, I’m not going to remember the arguments. To remedy that, I just open up the Makefile and can see instantly.
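To tie this back to the Jenkinsfile above, here is a minimal, hypothetical sketch of what a couple of project-level targets might look like. The target names (test, push) match the ones the pipeline calls, but the image names and recipes are illustrative assumptions rather than our actual project Makefile (and remember that real Makefile recipes are indented with tabs):

IMAGE=sweet.path.to.a.repo/fooProject

test:
    # Run the unit tests inside a toolchain container so dev and CI use the same environment
    docker run --rm -v $$PWD:/src -w /src golang:1.11 go test ./...

push:
    # Build the service image and push it to the registry
    docker build -t $(IMAGE):latest .
    docker push $(IMAGE):latest

Because the recipes just wrap docker, %> make test behaves the same on a laptop as it does inside the Jenkins pod.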

Step III: Continuous Deployment

After the last post (you read it, right?), some might have noticed that I skipped the “Delivery” portion of the “CD” pipeline. As far as I’m concerned, there is no “Delivery” in a “Deployment” pipeline. The “Delivery” is the actual deployment of your artifact. Since the ultimate goal should be Deployment, I’ve just skipped over that intermediate step.

Okay, sure, if you want to hold off on deploying automatically to Prod, then have that gate. But Dev, Int, QA, etc? Deployment to those non-prod environments should be automated just like the rest of your code.

If you guessed we use make to deploy our code, you’d be right! We put all our deployment code with the project itself, just like the rest of the code concerning that particular object. For services, we use a Dockerfile that describes the service container and several yaml files (e.g. deployment_<env>.yaml) that describe the configurations (e.g. ingress, services, deployments) we use to configure and deploy to our Kubernetes cluster.

Here’s an example:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: sweet-aws-service
    stage: dev
  name: sweet-aws-service-dev
  namespace: sweet-service-namespace
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: sweet-aws-service
      name: sweet-aws-service
    spec:
      containers:
      - name: sweet-aws-service
        image: path.to.repo.for/sweet-aws-service:latest
        imagePullPolicy: Always
        env:
          - name: PORT
            value: "50000"
          - name: TLS_KEY
            valueFrom:
              secretKeyRef:
                name: grpc-tls
                key: key
          - name: TLS_CERT
            valueFrom:
              secretKeyRef:
                name: grpc-tls
                key: cert

This is an example of a deployment into Kubernetes for dev. That %> make deploy_dev from the Jenkinsfile above? That’s pushing this to our Kubernetes cluster.
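The post doesn’t show that recipe, but as a rough sketch (assuming kubectl access to the cluster and the deployment_<env>.yaml naming convention described above), a deploy_dev target might look something like this:

deploy_dev:
    # Apply the dev manifests to the Kubernetes cluster.
    # File names follow the deployment_<env>.yaml convention; adjust to your layout.
    kubectl apply -f deployment_dev.yaml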

Conclusion

There is a lot of information to take in here, but there are two points to really take home:

  1. It is totally possible.
  2. Use a unifying tool to… unify your tools. (“one tool to rule them all”)

For us, Point 1 is moot… it’s what we do. For Point 2, we use Make, and we use Make THROUGH THE ENTIRE PROCESS. I use Make locally in dev and on our build server. It ensures we’re using the same commands, the same containers, the same tools to do the same things. Test, integrate (test), and deploy. It’s not just about writing functional code anymore. It’s about writing a functional process to get that code, that value, to your customers!

And remember, as with anything, this stuff gets easier with practice. Once you start doing it you will get the hang of it, and life becomes easier and better. If you’d like some help getting started, download our datasheet to learn about our Modern CI/CD Pipeline.

-Craig Monson, Sr Automation Architect

 


How We Organize Terraform Code at 2nd Watch

When IT organizations adopt infrastructure as code (IaC), the benefits in productivity, quality, and ability to function at scale are manifold. However, the first few steps on the journey to full automation and immutable infrastructure bliss can be a major disruption to a more traditional IT operations team’s established ways of working. One of the common problems faced in adopting infrastructure as code is how to structure the files within a repository in a consistent, intuitive, and scalable manner. Even IT operations teams whose members have development skills will still face this anxiety-inducing challenge simply because adopting IaC involves new tools whose conventions differ somewhat from more familiar languages and frameworks.

In this blog post, we’ll go over how we structure our IaC repositories within 2nd Watch professional services and managed services engagements with a particular focus on Terraform, an open-source tool by Hashicorp for provisioning infrastructure across multiple cloud providers with a single interface.

First Things First: README.md and .gitignore

The first task in any new repository is to create a README file. Many git repositories (especially on GitHub) have adopted Markdown as a de facto standard format for README files. A good README file will include the following information:

  1. Overview: A brief description of the infrastructure the repo builds. A high-level diagram is often an effective method of expressing this information. 2nd Watch uses LucidChart for general diagrams (exported to PNG or a similar format) and mscgen_js for sequence diagrams.
  2. Pre-requisites: Installation instructions (or links thereto) for any software that must be installed before building or changing the code.
  3. Building The Code: What commands to run in order to build the infrastructure and/or run the tests when applicable. 2nd Watch uses Make in order to provide a single tool with a consistent interface to build all codebases, regardless of language or toolset. If using Make in Windows environments, Windows Subsystem for Linux is recommended for Windows 10 in order to avoid having to write two sets of commands in Makefiles: Bash, and PowerShell.
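To make the third item concrete, here is a minimal, hypothetical Makefile for a Terraform root. The target names, the ENV variable, and the env/<env>.tfvars convention (described later in this post) are assumptions for illustration, not a prescribed standard:

ENV ?= dev

.PHONY: init plan apply

init:
    terraform init

plan: init
    terraform plan -var-file=env/$(ENV).tfvars -out=$(ENV).tfplan

apply:
    terraform apply $(ENV).tfplan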

It’s important that you do not neglect this basic documentation for two reasons (even if you think you’re the only one who will work on the codebase):

  1. The obvious: Writing this critical information down in an easily viewable place makes it easier for other members of your organization to onboard onto your project and will prevent the need for a panicked knowledge transfer when projects change hands.
  2. The not-so-obvious: The act of writing a description of the design clarifies your intent to yourself and will result in a cleaner design and a more coherent repository.

All repositories should also include a .gitignore file with the appropriate settings for Terraform. GitHub’s default Terraform .gitignore is a decent starting point, but in most cases you will not want to ignore .tfvars files because they often contain environment-specific parameters that allow for greater code reuse as we will see later.
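As a rough example (adapted from the patterns in GitHub’s Terraform template; treat it as a starting point rather than a definitive list), such a .gitignore might contain:

# Local .terraform directories (provider plugins, module cache)
**/.terraform/*

# State files; these belong in remote state, never in git
*.tfstate
*.tfstate.*

# Crash logs
crash.log

# Note: *.tfvars is deliberately NOT listed here, since our .tfvars files hold
# per-environment parameters that we want under version control.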

Terraform Roots and Multiple Environments

A Terraform root is the unit of work for a single terraform apply command. We group our infrastructure into multiple terraform roots in order to limit our “blast radius” (the amount of damage a single errant terraform apply can cause).

  • Repositories with multiple roots should contain a roots/ directory with a subdirectory for each root (e.g. the VPC, one per application), each with a main.tf file as its primary entry point.
  • Note that the roots/ directory is optional for repositories that only contain a single root, e.g. infrastructure for an application team which includes only a few resources which should be deployed in concert. In this case, modules/ may be placed in the same directory as main.tf.
  • Roots which are deployed into multiple environments should include an env/ subdirectory at the same level as main.tf. Each environment corresponds to a .tfvars file under env/ named after the environment, e.g. staging.tfvars. Each .tfvars file contains parameters appropriate for its environment, e.g. EC2 instance sizes.

Here’s what our roots directory might look like for a sample repository with a VPC, two application stacks, and three environments (QA, Staging, and Production):
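A hypothetical rendering of that layout, following the conventions above (app1 and app2 are placeholder names):

roots/
├── vpc/
│   ├── main.tf
│   └── env/
│       ├── qa.tfvars
│       ├── staging.tfvars
│       └── prod.tfvars
├── app1/
│   ├── main.tf
│   └── env/
│       ├── qa.tfvars
│       ├── staging.tfvars
│       └── prod.tfvars
└── app2/
    ├── main.tf
    └── env/
        ├── qa.tfvars
        ├── staging.tfvars
        └── prod.tfvars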

Terraform modules

Terraform modules are self-contained packages of Terraform configurations that are managed as a group. Modules are used to create reusable components, improve organization, and to treat pieces of infrastructure as a black box. In short, they are the Terraform equivalent of functions or reusable code libraries.

Terraform modules come in two flavors:

  1. Internal modules, whose source code is consumed by roots that live in the same repository as the module.
  2. External modules, whose source code is consumed by roots in multiple repositories. The source code for external modules lives in its own repository, separate from any consumers and separate from other modules to ensure we can version the module correctly.

In this post, we’ll only be covering internal modules.

  • Each internal module should be placed within a subdirectory under modules/.
  • Module subdirectories/repositories should follow the standard module structure per the Terraform docs.
  • External modules should always be pinned at a version: a git revision or a version number. This practice allows for reliable and repeatable builds. Failing to pin module versions may cause a module to be updated between builds, breaking the build without any obvious changes in our code. Even worse, failing to pin our module versions might cause a plan to be generated with changes we did not anticipate.

Here’s what our modules directory might look like:
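A hypothetical modules/ layout (the module names are placeholders), with each module following the standard structure:

modules/
├── ecs_service/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
└── s3_bucket/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf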

Terraform and Other Tools

Terraform is often used alongside other automation tools within the same repository. Some frequent collaborators include Ansible for configuration management and Packer for compiling identical machine images across multiple virtualization platforms or cloud providers. When using Terraform in conjunction with other tools within the same repo, 2nd Watch creates a directory per tool from the root of the repo:
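For example, a repository that combines Terraform with Packer and Ansible might be laid out like this (a sketch, not a mandate):

.
├── ansible/
├── packer/
└── terraform/
    ├── modules/
    └── roots/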

Putting it all together

The following illustrates a sample Terraform repository structure with all of the concepts outlined above:
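Pulling the pieces together, a hypothetical repository might look like this (directory and file names are illustrative):

.
├── README.md
├── .gitignore
├── Makefile
├── ansible/
├── packer/
└── terraform/
    ├── modules/
    │   ├── ecs_service/
    │   └── s3_bucket/
    └── roots/
        ├── vpc/
        │   ├── main.tf
        │   └── env/
        │       ├── qa.tfvars
        │       ├── staging.tfvars
        │       └── prod.tfvars
        └── app1/
            ├── main.tf
            └── env/
                ├── qa.tfvars
                ├── staging.tfvars
                └── prod.tfvars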

Conclusion

There’s no single repository format that’s optimal, but we’ve found that this standard works for the majority of our use cases in our extensive use of Terraform on dozens of projects. That said, if you find a tweak that works better for your organization – go for it! The structure described in this post will give you a solid and battle-tested starting point to keep your Terraform code organized so your team can stay productive.

Additional resources

  • The Terraform Book by James Turnbull provides an excellent introduction to Terraform all the way through repository structure and collaboration techniques.
  • The Hashicorp AWS VPC Module is one of the most popular modules in the Terraform Registry and is an excellent example of a well-written Terraform module.
  • The source code for James Nugent’s Hashidays NYC 2017 talk is an exemplary Terraform repository. Although it’s based on an older version of Terraform (before providers were broken out from the main Terraform executable), the code structure, formatting, and use of Makefiles are still current.

For help getting started adopting Infrastructure as Code, contact us.

-Josh Kodroff, Associate Cloud Consultant


Corrupted Stolen CPU Time

There is a feature in the Linux kernel, relevant to VMs hosted on Xen servers, called the “steal percentage.”  When the guest OS requests CPU time from the host system and the host CPU is currently tied up with another VM, the Xen server sends an increment to the guest Linux instance, which increases the steal percentage.  This is a great feature as it shows exactly how busy the host system is, and it is available on many AWS instance types, as AWS hosts them on Xen.  It is said that Netflix will terminate an AWS instance when the steal percentage crosses a certain threshold and start it up again, which causes the instance to spin up on a new host server, as a proactive step to ensure their systems are utilizing their resources to the fullest.

What I wanted to discuss here is that it turns out there is a bug in the Linux kernel versions 4.8, 4.9 and 4.10 where the steal percentage can be corrupted during a live migration on the physical Xen server, which causes the CPU utilization to be reported as 100% by the agent.

When looking at Top you will see something like this:

As you can see in the screenshot of Top, the %st metric on the CPU(s) line shows an obviously incorrect number.

During a live migration on the physical Xen server, the steal time gets a little out of sync and ends up being decremented.  If the value was already at or close to zero, it becomes negative and, due to type conversions in the code, it overflows.

CloudWatch’s CPU Utilization monitor calculates that utilization by adding the System and User percentages together.  However, this only gives a partial view into your system.  With our agent, we can see what the OS sees.

That is the Steal percentage spiking due to that corruption.  Normally this metric could be monitored and actioned as desired, but with this bug it causes noise and false positives.  If Steal were legitimately high, then the applications on that instance would be running much slower.
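If you want to check an instance yourself, the raw counter top reads lives in /proc/stat: the eighth numeric field on the cpu line is the cumulative steal time in clock ticks. A quick, rough way to eyeball it (a corrupted counter will be absurdly large):

# Aggregate CPU counters; the 8th field after the "cpu" label is steal time in ticks
grep '^cpu ' /proc/stat

# Or watch the st column as a percentage using top in batch mode
top -b -n 1 | grep -i 'cpu(s)'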

There is some discussion online about how to fix this issue, and there are some kernel patches to say “if the steal time is less than zero, just make it zero.”  Eventually this fix will make it through the Linux releases and into the latest OS version, but until then it needs to be dealt with.

We have found that a reboot will clear the corrupted percentage.  The other option is to patch the kernel… which also requires a reboot.  If a reboot is just not possible at the time, the only impact to the system is that it makes monitoring the steal percentage impossible until the number is reset.

It is not a very common issue, but due to the large number of instances we monitor here at 2nd Watch, it is something that we’ve come across frequently enough to investigate in detail and develop a process around.

If you have any questions as to whether or not your servers hosted in the cloud might be affected by this issue, please contact us to discuss how we might be able to help.

-James Brookes, Product Manager


CI/CD for Infrastructure as Code with Terraform and Atlantis

In this post, we’ll go over a complete workflow for continuous integration (CI) and continuous delivery (CD) for infrastructure as code (IaC) with just two tools: Terraform and Atlantis.

What is Terraform?

So what is Terraform? According to the Terraform website:

Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.

In practice, this means that Terraform allows you to declare what you want your infrastructure to look like – in any cloud provider – and will automatically determine the changes necessary to make it so. Because of its simple syntax and cross-cloud compatibility, it’s 2nd Watch’s choice for infrastructure as code.

Pain You May Be Experiencing Working With Terraform

When you have multiple collaborators (individuals, teams, etc.) working on a Terraform codebase, some common problems are likely to emerge:

  1. Enforcing peer review becomes difficult. In any codebase, you’ll want to ensure that your code is peer reviewed in order to ensure better quality in accordance with The Second Way of DevOps: Feedback. The role of peer review in IaC codebases is even more important. IaC is a powerful tool, but that tool is double-edged – we are clearly more productive for using it, but that increased productivity also means that a simple typo could potentially cause a major issue with production infrastructure. In order to minimize the potential for bad code to be deployed, you should require peer review on all proposed changes to a codebase (e.g. GitHub Pull Requests with at least one reviewer required). Terraform’s open source offering has no facility to enforce this rule.
  2. Terraform plan output is not easily integrated in code reviews. In all code reviews, you must examine the source code to ensure that your standards are followed, that the code is readable, that it’s reasonably optimized, etc. In this aspect, reviewing Terraform code is like reviewing any other code. However, Terraform code has the unique requirement that you must also examine the effect the code change will have upon your infrastructure (i.e. you must also review the output of a terraform plan command). When you potentially have multiple feature branches in the review process, it becomes critical that you are assured that the terraform plan output is what will be executed when you run terraform apply. If the state of infrastructure changes between a run of terraform plan and a run of terraform apply, the effect of this difference in state could range from inconvenient (the apply fails) to catastrophic (a significant production outage). Terraform itself offers locking capabilities but does not provide an easy way to integrate locking into a peer review process in its open source product.
  3. Too many sets of privileged credentials. Highly-privileged credentials are often required to perform Terraform actions, and the greater the number of principals you have with privileged access, the higher your attack surface area becomes. Therefore, from a security standpoint, we’d like to have fewer sets of admin credentials which can potentially be compromised.

What is Atlantis?

And what is Atlantis? Atlantis is an open source tool that allows safe collaboration on Terraform projects by making sure that proposed changes are reviewed and that the proposed change is the actual change which will be executed on your infrastructure. Atlantis is compatible (at the time of writing) with GitHub and GitLab, so if you’re not using either of these Git hosting systems, you won’t be able to use Atlantis.

How Atlantis Works With Terraform

Atlantis is deployed as a single binary executable with no system-wide dependencies. An operator adds a GitHub or GitLab token for a repository containing Terraform code. The Atlantis installation process then adds webhooks to the repository which allow communication with the Atlantis server during the Pull Request process.

You can run Atlantis in a container or a small virtual machine – the only requirement is that the Atlantis instance can communicate with both your version control system (e.g. GitHub) and the infrastructure (e.g. AWS) you’re changing. Once Atlantis is configured for a repository, the typical workflow is:

  1. A developer creates a feature branch in git, makes some changes, and creates a Pull Request (GitHub) or Merge Request (GitLab).
  2. The developer enters atlantis plan in a PR comment.
  3. Via the installed web hooks, Atlantis locally runs terraform plan. If there are no other Pull Requests in progress, Atlantis adds the resulting plan as a comment to the Merge Request.
    • If there are other Pull Requests in progress, the command fails because we can’t ensure that the plan will be valid once applied.
  4. The developer ensures the plan looks good and adds reviewers to the Merge Request.
  5. Once the PR has been approved, the developer enters atlantis apply in a PR comment. This will trigger Atlantis to run terraform apply and the changes will be deployed to your infrastructure.
    • The command will fail if the Merge Request has not been approved.

The following sequence diagram illustrates the sequence of actions described above:

Atlantis sequence diagram

We can see how our pain points in Terraform collaboration are addressed by Atlantis:

  1. In order to enforce code review, you can launch Atlantis with the --require-approval flag: https://github.com/runatlantis/atlantis#approvals
  2. In order to ensure that your terraform plan accurately reflects the change to your infrastructure that will be made when you run terraform apply, Atlantis performs locking on a project or workspace basis: https://github.com/runatlantis/atlantis#locking
  3. In order to prevent creating multiple sets of privileged credentials, you can deploy Atlantis to run on an EC2 instance with a privileged IAM role in its instance profile (e.g. in AWS). In this way, all of your Terraform commands run through a single set of privileged credentials and obviate the need to distribute multiple sets of privileged credentials: https://github.com/runatlantis/atlantis#aws-credentials
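As a rough idea of what deployment looks like, launching the server is a single command along the lines of the following. The flag names here reflect the Atlantis documentation from around the time of this post and may have changed in newer releases, so check the current docs before copying:

atlantis server \
  --gh-user my-atlantis-bot \
  --gh-token $GITHUB_TOKEN \
  --gh-webhook-secret $WEBHOOK_SECRET \
  --repo-whitelist 'github.com/myorg/*' \
  --require-approval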

Conclusion

You can see that with minimal additional infrastructure you can establish a safe and reliable CI/CD pipeline for your infrastructure as code, enabling you to get more done safely! To find out how you can deploy a CI/CD pipeline in less than 60 days, Contact Us.

-Josh Kodroff, Associate Cloud Consultant


Creating a simple Alexa Skill

Why do it?

Alexa gets a lot of use in our house, and it is very apparent to me that the future is not a touch screen or a mouse, but voice.  You can learn the basics of creating an Alexa skill by watching videos and such, but actually building a skill is a great way to understand the ins and outs of the process and what the backend systems (like AWS Lambda) are capable of.

First you need a problem

To get started, you need a problem to solve.  Once you have the problem, you’ll need to think about the solution before you write a line of code.  What will your skill do?  You need to define the requirements.  For my skill, I wanted to ask Alexa to “park my cloud” and have her stop all EC2 instances or RDS databases in my environment.

Building a solution one word at a time

Now that I’ve defined the problem and have an idea for the requirements of the solution, it’s time to start building the skill.  The first thing you’ll notice is that the Alexa Skills portal is not in the standard AWS console.  You need to go to developer.amazon.com/Alexa, create a developer account, and sign in there.  Once inside, there is a lot of good information and videos on creating Alexa skills that are worth reviewing.  Click the “Create Skill” button to get started.  In my example, I’m building a custom skill.

Build

The process for building a skill is broken into major sections: Build, Test, Launch, and Measure.  In each one you’ll have a number of things to complete before moving on to the next section.  The major areas of each section are broken down on the left-hand side of the console.  On the initial dashboard you’re also presented with the “Skill builder checklist” on the right as a visual reminder of what you need to do before moving on.

Interaction model

This is the first area you’ll work on in the Build phase of your Alexa skill.  This is setting up how your users will interact with your skill.

Invocation

Invocation sets up how your users will launch your skill.  For simplicity’s sake, this is often just the name of the skill.  The common patterns will be “Alexa, ask [my skill] [some request],” or “Alexa, launch [my skill].”  You’ll want to make sure the invocation for your skill sounds natural to a native speaker.

Intents

I think of intents as the “functions” or “methods” for my Alexa skill.  There are a number of built-in intents that should always be included (Cancel, Help, Stop) as well as your custom intents that will compose the main functionality of your skill.  Here my intent is called “park” since that will have the logic for parking my AWS systems.  The name here will only be exposed to your own code, so it isn’t necessarily important what it is.

Utterances

Utterances are the defined patterns of how people will use your skill.  You’ll want to focus on natural language and normal patterns of speech for native speakers in your target audience.  I would recommend doing some research and speaking to a diverse group of people to get a good cross section of utterances for your skill.  More is better.

Slots

Amazon also provides the option to use slots (variables) in your utterances.  This allows your skill to do things that are dynamic in nature.  When you create a variable in an utterance you also need to create a slot and give it a slot type.  This is like providing a type to a variable in a programming language (Number, String, etc.) and will allow Amazon to understand what to expect when hearing the utterance.  In our simple example, we don’t need any slots.

Interfaces

Interfaces allow you to interface your skill with other services to provide audio, display, or video options.  These aren’t needed for a simple skill, so you can skip it.

Endpoint

Here’s where you’ll connect your Alexa skill to the endpoint you want to handle the logic for your skill.  The easiest setup is to use AWS Lambda.  There are lots of example Lambda blueprints using different programming languages and doing different things.  Use those to get started because the json response formatting can be difficult otherwise.  If you don’t have an Alexa skill id here, you’ll need to Save and Build your skill first.  Then a skill id will be generated, and you can use it when configuring your Lambda triggers.
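For reference, the response your Lambda returns to Alexa is JSON. A minimal, hand-written response body looks roughly like this (the speech text is just an example for the “park my cloud” skill):

{
  "version": "1.0",
  "response": {
    "outputSpeech": {
      "type": "PlainText",
      "text": "Okay, parking your cloud now."
    },
    "shouldEndSession": true
  }
}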

AWS Account Lambda

Assuming you already have an AWS account, you’ll want to deploy a new Lambda from a blueprint that looks somewhat similar to what you’re trying to accomplish with your skill (deployed in US-East-1).  Even if nothing matches well, pick any one of them as they have the json return formatting set up so you can use it in your code.  This will save you a lot of time and effort.  Take a look at the information here and here for more information about how to setup and deploy Lambda for Alexa skills.  You’ll want to configure your Alexa skill as the trigger for the Lambda in the configuration, and here’s where you’ll copy in your skill id from the developer console “Endpoints” area of the Build phase.

While the actual coding of the Lambda isn’t the purpose of the article, I will include a couple of highlights that are worth mentioning.  Below, see the part of the code from the AWS template that would block the Lambda from being run by any Alexa skill other than my own.  While the chances of this are rare, there’s no reason for my Lambda to be open to everyone.  Here’s what that code looks like in Python:

if (event['session']['application']['applicationId'] != "amzn1.ask.skill.000000000000000000000000"):
    raise ValueError("Invalid Application ID")

Quite simply, if the Alexa application id passed in the session doesn’t match my known Alexa skill id, then raise an error.  The other piece of advice I’d give about the Lambda is to create different methods for each intent to keep the logic separated and easy to follow.  Make sure you remove any response language from your code that is from the original blueprint.  If your responses are inconsistent, Amazon will fail your skill (I had this happen multiple times because I borrowed from the “Color Picker” Lambda blueprint and had some generic responses left in the code).  Also, you’ll want to handle your Cancel, Help, and Stop requests correctly.  Lastly, as best practice in all code, add copious logging to CloudWatch so you can diagnose issues.  Note the ARN of your Lambda function as you’ll need it for configuring the endpoints in the developer portal.

Test

Once your Lambda is deployed in AWS, you can go back into the developer portal and begin testing the skill.  First, put your Lambda function ARN into the endpoint configuration for your skill.  Next, click over to the Test phase at the top and choose “Alexa Simulator.”  You can try recording your voice on your computer microphone or typing in the request.  I recommend you do both to get a sense of how Alexa will interpret what you say and respond.  Note that I’ve found the actual Alexa is better at natural language processing than the test options using a microphone on my laptop.  When you do a test, the console will show you the JSON input and output.  You can take this INPUT pane and copy that information to build a Lambda test script on your Lambda function.  If you need to do a lot of work on your Lambda, it’s a lot easier to test from there than to flip back and forth.  Pay special attention to your utterances.  You’ll learn quickly that your proposed utterances weren’t as natural as you thought.  Make updates to the utterances and Lambda as needed and keep testing.

Launch

Once you have your skill in a place where it’s viable, it’s time to launch.  Under “Skill Preview,” you’ll pick names, icons, and a category for your skill.  You’ll also create “cards” that people will see in the Alexa app that explain how to use your skill and the kinds of things you can say.  Lastly, you’ll need a URL to your “Privacy Policy” and “Terms of Use.”  Under “Privacy and Compliance,” you answer more questions about what kind of information your skill collects, whether it targets children or contains advertising, if it’s ok to export, etc.  Note that if you categorize your skill as a “Game,” it will automatically be considered to be directed at children, so you may want to avoid that with your first skill if you collect any identifiable information. Under “Availability,” you’ll decide if the skill should be public and what countries you want to be able to access the skill.  The last section is “Submission,” where you hopefully get the green checkmark and be allowed to submit.

Skill Certification

Now you wait.  Amazon seems to have a number of automated processes that catch glaring issues, but you will likely end up with some back and forth between yourself and an Amazon employee regarding some part of your skill that needs to be updated.  It took about a week to get my final approval and my skill posted.

Conclusion

Creating your own simple Alexa skill is a fun and easy way to get some experience creating applications that respond to voice and understand what’s possible on the platform.  Good luck!

-Coin Graham, Senior Cloud Consultant
