
Fully Coded And Automated CI/CD Pipelines: The Weeds

The Why

In my last post we went over why we’d want to go the CI/CD/Automated route and the more cultural reasons of why it is so beneficial. In this post, we’re going to delve a little bit deeper and examine the technical side of tooling. Remember, a primary point of doing a release is mitigating risk. CI/CD is all about mitigating risk… fast.

There’s a Process

The previous article noted that you can’t do CI/CD without building on a set of steps, and I’m going to take this approach here as well. Unsurprisingly, we’ll follow the steps we laid out in the “Why” article, and tackle each in turn.

Step I: Automated Testing

You must automate your testing. There is no other way to describe this. In this particular step, however, we can concentrate on unit testing: testing the small chunks of code you produce (usually functions or methods). There’s some chatter about TDD (Test Driven Development) vs BDD (Behavior Driven Development) in the development community, but I don’t think it really matters, just so long as you are writing test code alongside your production code. On our team, we prefer the BDD style of testing. I’ve always liked the semantically descriptive nature of BDD tests over strictly code-driven ones. However, it should be said that both are effective and either is better than none, so this is more of a personal preference. On our team we’ve been coding in golang, and our BDD framework of choice is the Ginkgo/Gomega combo.

Here’s a snippet of one of our tests that’s not entirely simple:

Describe("IsValidFormat", func() {
  for _, check := range AvailableFormats {
    Context("when checking "+check, func() {
      It("should return true", func() {
        Ω(IsValidFormat(check)).To(BeTrue())
      })
    })
  }
 
  Context("when checking foo", func() {
    It("should return false", func() {
      Ω(IsValidFormat("foo")).To(BeFalse())
    })
  })
})

So as you can see, the Ginkgo (i.e. BDD) formatting is pretty descriptive about what’s happening. I can instantly understand what’s expected: the function IsValidFormat should return true for each entry in the range (list) of AvailableFormats, and a format of foo (which is not a valid format) should return false. It’s both tested and understandable to the future change agent (me or someone else).

Step II: Continuous Integration

Continuous Integration takes Step I further: it brings all the changes to your codebase together at a single point and builds an artifact for deployment. This means you’ll need an external system to automatically handle merges and pushes. We use Jenkins as our automation server, running it in Kubernetes and using the Pipeline style of job description. I’ll get into the way we do our builds using Make in a bit, but the fact that we can include our build code in with our projects is a huge win.

Here’s a (modified) Jenkinsfile we use for one of our CI jobs:

def notifyFailed() {
  slackSend (color: '#FF0000', message: "FAILED: '${env.JOB_NAME} [${env.BUILD_NUMBER}]' (${env.BUILD_URL})")
}
 
podTemplate(
  label: 'fooProject-build',
  containers: [
    containerTemplate(
      name: 'jnlp',
      image: 'some.link.to.a.container:latest',
      args: '${computer.jnlpmac} ${computer.name}',
      alwaysPullImage: true,
    ),
    containerTemplate(
      name: 'image-builder',
      image: 'some.link.to.another.container:latest',
      ttyEnabled: true,
      alwaysPullImage: true,
      command: 'cat'
    ),
  ],
  volumes: [
    hostPathVolume(
      hostPath: '/var/run/docker.sock',
      mountPath: '/var/run/docker.sock'
    ),
    hostPathVolume(
      hostPath: '/home/jenkins/workspace/fooProject',
      mountPath: '/home/jenkins/workspace/fooProject'
    ),
    secretVolume(
      secretName: 'jenkins-creds-for-aws',
      mountPath: '/home/jenkins/.aws-jenkins'
    ),
    hostPathVolume(
      hostPath: '/home/jenkins/.aws',
      mountPath: '/home/jenkins/.aws'
    )
  ]
)
{
  node ('fooProject-build') {
    try {
      checkout scm
 
      wrap([$class: 'AnsiColorBuildWrapper', 'colorMapName': 'XTerm']) {
        container('image-builder'){
          stage('Prep') {
            sh '''
              cp /home/jenkins/.aws-jenkins/config /home/jenkins/.aws/.
              cp /home/jenkins/.aws-jenkins/credentials /home/jenkins/.aws/.
              make get_images
            '''
          }
 
          stage('Unit Test'){
            sh '''
              make test
              make profile
            '''
          }
 
          step([
            $class:              'CoberturaPublisher',
            autoUpdateHealth:    false,
            autoUpdateStability: false,
            coberturaReportFile: 'report.xml',
            failUnhealthy:       false,
            failUnstable:        false,
            maxNumberOfBuilds:   0,
            sourceEncoding:      'ASCII',
            zoomCoverageChart:   false
          ])
 
          stage('Build and Push Container'){
            sh '''
              make push
            '''
          }
        }
      }
 
      stage('Integration'){
        container('image-builder') {
          sh '''
            make deploy_integration
            make toggle_integration_service
          '''
        }
        try {
          wrap([$class: 'AnsiColorBuildWrapper', 'colorMapName': 'XTerm']) {
            container('image-builder') {
              sh '''
                sleep 45
                export KUBE_INTEGRATION=https://fooProject-integration
                export SKIP_TEST_SERVER=true
                make integration
              '''
            }
          }
        } catch(e) {
          container('image-builder'){
            sh '''
              make clean
            '''
          }
          throw(e)
        }
      }
 
      stage('Deploy to Production'){
        container('image-builder') {
          sh '''
            make clean
            make deploy_dev
          '''
        }
      }
    } catch(e) {
      container('image-builder'){
        sh '''
          make clean
        '''
      }
      currentBuild.result = 'FAILED'
      notifyFailed()
      throw(e)
    }
  }
}

There’s a lot going on here, but the important part to notice is that I grabbed this from the project repo. The build instructions are included with the project itself. It’s creating an artifact, running our tests, etc. But it’s all part of our project code base. It’s checked into git. It’s code like all the other code we mess with. The individual steps are somewhat inconsequential at this level of discussion; the point is that it works. We also have it set up to run when there’s a push to GitHub (AND nightly). This ensures that we are continuously running this build and integrating everything that’s happened to the repo in a day. It helps us keep on top of all the possible changes to the repo as well as our environment.

Hey… what’s all that make crap?

Make

Our team uses a lot of tools. We subscribe to the maxim: use what’s best for the particular situation. I can’t remember every tool we use. Neither can my teammates. Neither can 90% of the people that “do the devops.” I’ve heard a lot of folks say, “No! We must standardize on our toolset!” Let your teams use what they need to get the job done the right way. Now, the fear of tool “overload” seems like a legitimate one in this scenario, but the problem isn’t the number of tools… it’s how you manage and use them.

Enter Makefiles! (aka: make)

Make has been a mainstay in the UNIX world for a long time (especially in the C world). It is a build tool that’s utilized to help satisfy dependencies, create system-specific configurations, and compile code from various sources, independent of platform. This is fantastic, except we couldn’t care less about that in the context of our CI/CD pipelines. We use it because it’s great at running “buildy” commands.

Make is our unifier. It links our Jenkins CI/CD build functionality with our Dev functionality. Specifically, opening up the docker port here in the Jenkinsfile:

volumes: [
  hostPathVolume(
    hostPath: '/var/run/docker.sock',
    mountPath: '/var/run/docker.sock'
  ),

…allows us to run THE SAME COMMANDS WHEN WE’RE DEVELOPING AS WE DO IN OUR CI/CD PROCESS. This socket allows us to run containers from containers, and since Jenkins is running on a container, this allows us to run our toolset containers in Jenkins, using the same commands we’d use in our local dev environment. On our local dev machines, we use docker nearly exclusively as a wrapper to our tools. This ensures we have library, version, and platform consistency on all of our dev environments as well as our build system. We use containers for our prod microservices so production is part of that “chain of consistency” as well. It ensures that we see consistent behavior across the horizon of application development through production. It’s a beautiful thing! We use the Makefile as the means to consistently interface with the docker “tool” across differing environments.

Ok, I know your interest is piqued at this point. (Or at least I really hope it is!)
So here’s a generic makefile we use for many of our projects:

CONTAINER=$(shell basename $$PWD | sed -E 's/^ia-image-//')
.PHONY: install install_exe install_test_exe deploy test
 
install:
    docker pull sweet.path.to.a.repo/$(CONTAINER)
    docker tag sweet.path.to.a.repo/$(CONTAINER):latest $(CONTAINER):latest
 
install_exe:
    if [[ ! -d $(HOME)/bin ]]; then mkdir -p $(HOME)/bin; fi
    echo "docker run -itP -v \$$PWD:/root $(CONTAINER) \"\$$@\"" > $(HOME)/bin/$(CONTAINER)
    chmod u+x $(HOME)/bin/$(CONTAINER)
 
install_test_exe:
    if [[ ! -d $(HOME)/bin ]]; then mkdir -p $(HOME)/bin; fi
    echo "docker run -itP -v \$$PWD:/root $(CONTAINER)-test \"\$$@\"" > $(HOME)/bin/$(CONTAINER)
    chmod u+x $(HOME)/bin/$(CONTAINER)
 
test:
    docker build -t $(CONTAINER)-test .
 
deploy:
    captain push

This is a Makefile we use to build our tooling images. It’s much simpler than our project Makefiles, but I think this illustrates how you can use Make to wrap EVERYTHING you use in your development workflow. This also allows us to settle on similar, consistent terminology between different projects. %> make test? That’ll run the tests regardless of whether we’re working on a golang project or a python lambda project, or, in this case, building a test container and tagging it as whatever-test. Make unifies “all the things.”

This also codifies how to execute the commands, i.e. what arguments to pass, what inputs to provide, etc. If I can’t even remember the name of the command, I’m not going to remember the arguments. To remedy that, I just open up the Makefile and can instantly see.

Step III: Continuous Deployment

After the last post (you read it, right?), some might have noticed that I skipped the “Delivery” portion of the “CD” pipeline. As far as I’m concerned, there is no “Delivery” in a “Deployment” pipeline. The “Delivery” is the actual deployment of your artifact. Since the ultimate goal should be Deployment, I’ve just skipped over that intermediate step.

Okay, sure, if you want to hold off on deploying automatically to Prod, then have that gate. But Dev, Int, QA, etc? Deployment to those non-prod environments should be automated just like the rest of your code.

If you guessed we use make to deploy our code, you’d be right! We put all our deployment code with the project itself, just like the rest of the code concerning that particular object. For services, we use a Dockerfile that describes the service container and several yaml files (e.g. deployment_<env>.yaml) that describe the configurations (e.g. ingress, services, deployments) we use to configure and deploy to our Kubernetes cluster.

Here’s an example:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: sweet-aws-service
    stage: dev
  name: sweet-aws-service-dev
  namespace: sweet-service-namespace
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: sweet-aws-service
      name: sweet-aws-service
    spec:
      containers:
      - name: sweet-aws-service
        image: path.to.repo.for/sweet-aws-service:latest
        imagePullPolicy: Always
        env:
          - name: PORT
            value: "50000"
          - name: TLS_KEY
            valueFrom:
              secretKeyRef:
                name: grpc-tls
                key: key
          - name: TLS_CERT
            valueFrom:
              secretKeyRef:
                name: grpc-tls
                key: cert

This is an example of a deployment into Kubernetes for dev. That %> make deploy_dev from the Jenkinsfile above? That’s pushing this to our Kubernetes cluster.
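For reference, a deploy_dev target can be as simple as the following sketch (the exact kubectl invocation will vary by setup; this is illustrative, not our literal Makefile):

deploy_dev:
    # Apply the dev manifest to the Kubernetes cluster. In our setup the
    # kubectl tooling itself runs from a container, but the idea is the same.
    kubectl apply -f deployment_dev.yaml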

Conclusion

There is a lot of information to take in here, but there are two points to really take home:

  1. It is totally possible.
  2. Use a unifying tool to… unify your tools. (“one tool to rule them all”)

For us, Point 1 is moot… it’s what we do. For Point 2, we use Make, and we use Make THROUGH THE ENTIRE PROCESS. I use Make locally in dev and on our build server. It ensures we’re using the same commands, the same containers, the same tools to do the same things. Test, integrate (test), and deploy. It’s not just about writing functional code anymore. It’s about writing a functional process to get that code, that value, to your customers!

And remember, as with anything, this stuff gets easier with practice. So once you start doing it, you will get the hang of it, and life becomes easier and better. If you’d like some help getting started, download our datasheet to learn about our Modern CI/CD Pipeline.

-Craig Monson, Sr Automation Architect

 


How We Organize Terraform Code at 2nd Watch

When IT organizations adopt infrastructure as code (IaC), the benefits in productivity, quality, and ability to function at scale are manifold. However, the first few steps on the journey to full automation and immutable infrastructure bliss can be a major disruption to a more traditional IT operations team’s established ways of working. One of the common problems faced in adopting infrastructure as code is how to structure the files within a repository in a consistent, intuitive, and scalable manner. Even IT operations teams whose members have development skills will still face this anxiety-inducing challenge, simply because adopting IaC involves new tools whose conventions differ somewhat from more familiar languages and frameworks.

In this blog post, we’ll go over how we structure our IaC repositories within 2nd Watch professional services and managed services engagements with a particular focus on Terraform, an open-source tool by Hashicorp for provisioning infrastructure across multiple cloud providers with a single interface.

First Things First: README.md and .gitignore

The first task in any new repository is to create a README file. Many git repositories (especially on GitHub) have adopted Markdown as a de facto standard format for README files. A good README file will include the following information:

  1. Overview: A brief description of the infrastructure the repo builds. A high-level diagram is often an effective method of expressing this information. 2nd Watch uses LucidChart for general diagrams (exported to PNG or a similar format) and mscgen_js for sequence diagrams.
  2. Pre-requisites: Installation instructions (or links thereto) for any software that must be installed before building or changing the code.
  3. Building The Code: What commands to run in order to build the infrastructure and/or run the tests when applicable. 2nd Watch uses Make in order to provide a single tool with a consistent interface to build all codebases, regardless of language or toolset. If using Make in Windows environments, Windows Subsystem for Linux is recommended for Windows 10 in order to avoid having to write two sets of commands in Makefiles: Bash, and PowerShell.

It’s important that you do not neglect this basic documentation for two reasons (even if you think you’re the only one who will work on the codebase):

  1. The obvious: Writing this critical information down in an easily viewable place makes it easier for other members of your organization to onboard onto your project and will prevent the need for a panicked knowledge transfer when projects change hands.
  2. The not-so-obvious: The act of writing a description of the design clarifies your intent to yourself and will result in a cleaner design and a more coherent repository.

All repositories should also include a .gitignore file with the appropriate settings for Terraform. GitHub’s default Terraform .gitignore is a decent starting point, but in most cases you will not want to ignore .tfvars files because they often contain environment-specific parameters that allow for greater code reuse as we will see later.
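As a rough sketch, a Terraform .gitignore along these lines covers the common cases while still letting .tfvars files be committed:

# Local .terraform directories
**/.terraform/*

# State files and backups (these can contain sensitive data and belong in
# remote state, not in git)
*.tfstate
*.tfstate.*

# Crash log files
crash.log

# Note: unlike the default Terraform .gitignore templates, we deliberately do
# NOT ignore *.tfvars, because our env/*.tfvars files hold environment-specific
# parameters that we want under version control.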

Terraform Roots and Multiple Environments

A Terraform root is the unit of work for a single terraform apply command. We group our infrastructure into multiple terraform roots in order to limit our “blast radius” (the amount of damage a single errant terraform apply can cause).

  • Repositories with multiple roots should contain a roots/ directory with a subdirectory for each root (e.g. the VPC, one per application), each with a main.tf file as the primary entry point.
  • Note that the roots/ directory is optional for repositories that only contain a single root, e.g. infrastructure for an application team which includes only a few resources which should be deployed in concert. In this case, modules/ may be placed in the same directory as main.tf.
  • Roots which are deployed into multiple environments should include an env/ subdirectory at the same level as main.tf. Each environment corresponds to a tfvars file under env/ named after the environment, e.g. staging.tfvars. Each .tfvars file contains parameters appropriate for each environment, e.g. EC2 instance sizes.

Here’s what our roots directory might look like for a sample with a VPC and 2 application stacks, and 3 environments (QA, Staging, and Production):
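Sketched out with illustrative names, it follows the conventions above:

roots/
  vpc/
    main.tf
    env/
      qa.tfvars
      staging.tfvars
      prod.tfvars
  app1/
    main.tf
    env/
      qa.tfvars
      staging.tfvars
      prod.tfvars
  app2/
    main.tf
    env/
      qa.tfvars
      staging.tfvars
      prod.tfvars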

Terraform modules

Terraform modules are self-contained packages of Terraform configurations that are managed as a group. Modules are used to create reusable components, improve organization, and to treat pieces of infrastructure as a black box. In short, they are the Terraform equivalent of functions or reusable code libraries.

Terraform modules come in two flavors:

  1. Internal modules, whose source code is consumed by roots that live in the same repository as the module.
  2. External modules, whose source code is consumed by roots in multiple repositories. The source code for external modules lives in its own repository, separate from any consumers and separate from other modules to ensure we can version the module correctly.

In this post, we’ll only be covering internal modules.

  • Each internal module should be placed within a subdirectory under modules/.
  • Module subdirectories/repositories should follow the standard module structure per the Terraform docs.
  • External modules should always be pinned at a version: a git revision or a version number. This practice allows for reliable and repeatable builds. Failing to pin module versions may cause a module to be updated between builds, breaking the build without any obvious changes in our code. Even worse, failing to pin our module versions might cause a plan to be generated with changes we did not anticipate.

Here’s what our modules directory might look like:
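Sketched out, following the standard module structure (the module names are just examples):

modules/
  ecs-cluster/
    main.tf
    variables.tf
    outputs.tf
    README.md
  s3-bucket/
    main.tf
    variables.tf
    outputs.tf
    README.md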

Terraform and Other Tools

Terraform is often used alongside other automation tools within the same repository. Some frequent collaborators include Ansible for configuration management and Packer for compiling identical machine images across multiple virtualization platforms or cloud providers. When using Terraform in conjunction with other tools within the same repo, 2nd Watch creates a directory per tool from the root of the repo:
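For example (a sketch, not a mandated layout), the top level of such a repo simply has one directory per tool:

ansible/
packer/
terraform/
Makefile
README.md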

Putting it all together

The following illustrates a sample Terraform repository structure with all of the concepts outlined above:
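Sketched out, pulling the conventions above together (names are illustrative):

README.md
Makefile
.gitignore
ansible/
packer/
terraform/
  modules/
    ecs-cluster/
    s3-bucket/
  roots/
    vpc/
      main.tf
      env/
        qa.tfvars
        staging.tfvars
        prod.tfvars
    app1/
      main.tf
      env/
        qa.tfvars
        staging.tfvars
        prod.tfvars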

Conclusion

There’s no single repository format that’s optimal, but we’ve found that this standard works for the majority of our use cases in our extensive use of Terraform on dozens of projects. That said, if you find a tweak that works better for your organization – go for it! The structure described in this post will give you a solid and battle-tested starting point to keep your Terraform code organized so your team can stay productive.

Additional resources

  • The Terraform Book by James Turnbull provides an excellent introduction to Terraform all the way through repository structure and collaboration techniques.
  • The Hashicorp AWS VPC Module is one of the most popular modules in the Terraform Registry and is an excellent example of a well-written Terraform module.
  • The source code for James Nugent’s Hashidays NYC 2017 talk is an exemplary Terraform repository. Although it’s based on an older version of Terraform (before providers were broken out from the main Terraform executable), the code structure, formatting, and use of Makefiles are still current.

For help getting started adopting Infrastructure as Code, contact us.

-Josh Kodroff, Associate Cloud Consultant


How to Use AWS IAM with STS for access to AWS resources

With increased focus on security and governance in today’s digital economy, I want to highlight a simple but important use case that demonstrates how to use AWS Identity and Access Management (IAM) with Security Token Service (STS) to give trusted AWS accounts access to resources that you control and manage.

Security Token Service is an extension of IAM and is one of several web services offered by AWS that does not incur any cost to use.  But, unlike IAM, there is no user interface on the AWS console to manage and interact with STS. Rather, all interaction is done entirely through one of several extensive SDKs or directly using the common HTTP protocol.  I will be using Terraform to create some simple resources in my sandbox account and the .NET Core SDK to demonstrate how to interact with STS.

The main purpose and function of STS is to issue temporary security credentials for AWS resources to trusted and authenticated entities.  These credentials operate identically to the long-term keys that typical IAM users have, with a couple of special characteristics:

  • They automatically expire and become unusable after a short and defined period of time elapses
  • They are issued dynamically

These characteristics offer several advantages in terms of application security and development and are useful for cross-account delegation and access.  STS solves two problems for owners of AWS resources:

  • Meets the IAM best-practices requirement to regularly rotate access keys
  • You do not need to distribute access keys to external entities or store them within an application

One common scenario where STS is useful involves sharing resources between AWS accounts.  Let’s say, for example, that your organization captures and processes data in S3, and one of your clients would like to push large amounts of data from resources in their AWS account to an S3 bucket in your account in an automated and secure fashion.

While you could create an IAM user for your client, your corporate data policy requires that you rotate access keys on a regular basis, and this introduces challenges for automated processes.  Additionally, you would like to limit the distribution of access keys to your resources to external entities.  Let’s use STS to solve this!

To get started, let’s create some resources in your AWS cloud.  Do you even Terraform, bro?

Let’s create a new S3 bucket and set the bucket ACL to be private, meaning nobody but the bucket owner (that’s you!) has access.  Remember that bucket names must be unique across all existing buckets, and they should comply with DNS naming conventions.  Here is the Terraform HCL syntax to do this:
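A minimal sketch of that HCL, using the bucket name referenced later in this post, looks something like this:

resource "aws_s3_bucket" "xaccount_bucket" {
  # Bucket names must be globally unique and DNS-compliant.
  bucket = "d4h2123b9-xaccount-bucket"

  # Private ACL: only the bucket owner has access.
  acl = "private"
}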

Great! We now have a bucket… but for now, only the owner can access it.  This is a good start from a security perspective (i.e. “least permissive” access).

What an empty bucket may look like

Let’s create an IAM role that, once assumed, will allow IAM users with access to this role to have permissions to put objects into our bucket.  Roles are a secure way to grant trusted entities access to your resources.  You can think about roles in terms of a jacket that an IAM user can wear for a short period of time, and while wearing this jacket, the user has privileges that they wouldn’t normally have when they aren’t wearing it.  Kind of like a bright yellow Event Staff windbreaker!

For this role, we will specify that users from our client’s AWS account are the only ones that can wear the jacket. This is done by including the client’s AWS account ID in the Principal statement.  AWS Account IDs are not considered to be secret, so your client can share this with you without compromising their security.  If you don’t have a client but still want to try this stuff out, put your own AWS account ID here instead.
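A sketch of that role in Terraform might look like this (CLIENT_ACCOUNT_ID is a placeholder for your client’s actual AWS account ID; the role name matches the ARN shown later in this post):

resource "aws_iam_role" "sts_delegate_role" {
  name = "sts-delegate-role"

  # Trust policy: only principals in the client's AWS account may assume
  # (i.e. "wear") this role.
  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::CLIENT_ACCOUNT_ID:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}
POLICY
}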


Great, now we have a role that our trusted client can wear.  But, right now our client can’t do anything except wear the jacket.  Let’s give the jacket some special powers, such that anyone wearing it can put objects into our S3 bucket.  We will do this by creating a security policy for this role.  This policy will specify what exactly can be done to S3 buckets that it is attached to. Then we will attach it to the bucket we want our client to use.  Here is the Terraform syntax to accomplish this:
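A sketch of that policy as an aws_iam_role_policy (with an illustrative policy name) might look like this:

resource "aws_iam_role_policy" "put_objects" {
  name = "xaccount-put-objects"
  role = "${aws_iam_role.sts_delegate_role.id}"

  # Allow putting objects into our bucket, but only when the request
  # explicitly grants the bucket owner full control of the object.
  policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "${aws_s3_bucket.xaccount_bucket.arn}/*",
      "Condition": {
        "StringEquals": { "s3:x-amz-acl": "bucket-owner-full-control" }
      }
    }
  ]
}
POLICY
}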

A couple of things to note about this snippet – first, we are using Terraform interpolation to inject values from previous Terraform statements into a couple of places in the policy – specifically the role and the ARN of the bucket we created previously.  Second, we are specifying a condition for the S3 policy – one that requires a specific object ACL for the action s3:PutObject, which is accomplished by requiring the HTTP request header x-amz-acl to have a value of bucket-owner-full-control on the PUT object request.  By default, objects PUT in S3 are owned by the account that created them, even if they are stored in someone else’s bucket.  For our scenario, this condition requires your client to explicitly grant ownership of objects placed in your bucket to you; otherwise the PUT request will fail.

So, now we have a bucket, a role that our trusted client can assume, and a policy attached to that role that grants access to the bucket.  Now your client needs to get to work writing some code that will allow them to assume the role (wear the jacket) and start putting objects into your bucket.  Your client will need to know a couple of things from you before they get started:

  1. The bucket name and the region it was created in (the example above created a bucket named d4h2123b9-xaccount-bucket in us-west-2)
  2. The ARN for the role (Terraform can output this for you). It will look something like this but will have your actual AWS Account ID: arn:aws:iam::123456789012:role/sts-delegate-role

They will also need to create an IAM User in their account and attach a policy allowing the user to assume roles via STS.  The policy will look similar to this:
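Roughly, that policy is a single statement allowing sts:AssumeRole on your role’s ARN:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::123456789012:role/sts-delegate-role"
    }
  ]
}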


Let’s help your client out a bit and provide some C# code snippets for .NET Core 2.0 (available for Windows, macOS and Linux). To get started, install the .NET SDK for your OS, then fire up a command prompt in a favorite directory and run these commands:
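Roughly, the commands look like this:

dotnet new console -o s3cli
cd s3cli
dotnet add package AWSSDK.Core
dotnet add package AWSSDK.SecurityToken
dotnet add package AWSSDK.S3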

The first command will create a new console app in the subdirectory s3cli. The remaining commands switch context to that directory, pull in the AWS SDK for .NET Core, and add packages for the SecurityToken and S3 services.
Once you have the libraries in place, fire up your favorite IDE or text editor (I use Visual Studio Code), then open Program.cs and add some code:
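A minimal sketch of that code, with illustrative method and session names, might look like this:

using System.Threading.Tasks;
using Amazon.SecurityToken;
using Amazon.SecurityToken.Model;

// ...

static async Task<Credentials> GetTemporaryCredentialsAsync(string roleArn)
{
    // The STS client picks up the client's own IAM user credentials (e.g.
    // from ~/.aws/credentials); that user must be allowed to call
    // sts:AssumeRole on the supplied role ARN.
    using (var stsClient = new AmazonSecurityTokenServiceClient())
    {
        var response = await stsClient.AssumeRoleAsync(new AssumeRoleRequest
        {
            RoleArn = roleArn,                      // e.g. arn:aws:iam::123456789012:role/sts-delegate-role
            RoleSessionName = "xaccount-s3-upload"  // any descriptive session name
        });

        // Temporary AccessKeyId, SecretAccessKey, SessionToken and Expiration.
        return response.Credentials;
    }
}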

This snippet sends a request to STS for temporary credentials using the specified ARN.  Note that the client must provide IAM user credentials to call STS, and that IAM user must have a policy applied that allows it to assume a role from STS.

This next snippet takes the STS credentials, bucket name, and region name, and then uploads the Program.cs file that you’re editing and assigns it a random key/name.  Also note that it explicitly applies the Canned ACL that is required by the sts-delegate-role:
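Again as a sketch, with illustrative names:

using System;
using System.Threading.Tasks;
using Amazon;
using Amazon.Runtime;
using Amazon.S3;
using Amazon.S3.Model;
using Amazon.SecurityToken.Model;

// ...

static async Task UploadFileAsync(Credentials tempCredentials, string bucketName, string regionName)
{
    // Wrap the temporary STS credentials so the S3 client can use them.
    var sessionCredentials = new SessionAWSCredentials(
        tempCredentials.AccessKeyId,
        tempCredentials.SecretAccessKey,
        tempCredentials.SessionToken);

    using (var s3Client = new AmazonS3Client(sessionCredentials, RegionEndpoint.GetBySystemName(regionName)))
    {
        var putRequest = new PutObjectRequest
        {
            BucketName = bucketName,
            Key        = Guid.NewGuid().ToString(),         // a random key/name for the object
            FilePath   = "Program.cs",                      // upload the file you're editing
            CannedACL  = S3CannedACL.BucketOwnerFullControl // explicitly grant the bucket owner full control
        };

        await s3Client.PutObjectAsync(putRequest);
    }
}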

So, to put this all together, run this code block and make the magic happen!  Of course, you will have to define and provide proper variable values for your environment, including securely storing your credentials.

Try it out from the command prompt:
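Assuming the snippets above are wired into your console app’s Main method, that’s simply:

dotnet run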

If all goes well, you will have a copy of Program.cs in the bucket. Not very useful itself, but it illustrates how to accomplish the task.

What a bucket with something in it may look like

Here is a high-level diagram of what we put together:

Putting it all together

Steps:

  1. Your client uses their IAM user to call AWS STS and requests temporary credentials for the role ARN you gave them.
  2. STS authenticates the client’s IAM user and verifies the policy for the ARN role, then issues a temporary credential to the client.
  3. The client can use the temporary credentials to access your S3 bucket (they will expire soon), and since they are now wearing the Event Staff jacket, they can successfully PUT stuff in your bucket!

There are many other use-cases for STS. This is just one very simplistic example. However, with this brief introduction to the concepts, you should now have a decent idea of how STS works with IAM roles and policies, and how you can use STS to give access to your AWS resources for trusted entities. For more tips like this, contact us.

-Jonathan Eropkin, Cloud Consultant


Corrupted Stolen CPU Time

There is a feature in the Linux kernel, relevant to VMs hosted on Xen servers, called the “steal percentage.”  When the OS requests use of the CPU and the host’s CPU is currently tied up with another VM, the Xen server sends an increment to the guest Linux instance, which increases the steal percentage.  This is a great feature, as it shows exactly how busy the host system is, and it is available on many AWS instances, as they host using Xen.  It is actually said that Netflix will terminate an AWS instance when the steal percentage crosses a certain threshold and start it up again, which causes the instance to spin up on a new host server, as a proactive step to ensure their systems are utilizing their resources to the fullest.

What I wanted to discuss here is that it turns out there is a bug in the Linux kernel versions 4.8, 4.9 and 4.10 where the steal percentage can be corrupted during a live migration on the physical Xen server, which causes the CPU utilization to be reported as 100% by the agent.

When looking at Top you will see something like this:

As you can see in this screenshot of top, the %st metric on the CPU(s) line shows an obviously incorrect number.

During a live migration on the physical Xen server, the steal time gets a little out of sync and ends up decrementing the time.  If the time was already at or close to zero, it causes the value to become negative and, due to type conversions in the code, it overflows.

CloudWatch’s CPU Utilization monitor calculates that utilization by adding the System and User percentages together.  However, this only gives a partial view into your system.  With our agent, we can see what the OS sees.

What you would see there is the Steal percentage spiking due to that corruption.  Normally this metric could be monitored and actioned as desired, but with this bug it causes noise and false positives.  If Steal were legitimately high, then the applications on that instance would be running much slower.

There is some discussion online about how to fix this issue, and there are some kernel patches to say “if the steal time is less than zero, just make it zero.”  Eventually this fix will make it through the Linux releases and into the latest OS version, but until then it needs to be dealt with.

We have found that a reboot will clear the corrupted percentage.  The other option is to patch the kernel… which also requires a reboot.  If a reboot is just not possible at the time, the only impact to the system is that it makes monitoring the steal percentage impossible until the number is reset.

It is not a very common issue, but due to the large number of instances we monitor here at 2nd Watch, it is something that we’ve come across frequently enough to investigate in detail and develop a process around.

If you have any questions as to whether or not your servers hosted in the cloud might be affected by this issue, please contact us to discuss how we might be able to help.

-James Brookes, Product Manager


Creating a simple Alexa Skill

Why do it?

Alexa gets a lot of use in our house, and it is very apparent to me that the future is not a touch screen or a mouse, but voice.  Creating an Alexa skill is easy to learn by watching videos and such, but actually creating the skill is a great way to understand the ins and outs of the process and what the backend systems (like AWS Lambda) are capable of.

First you need a problem

To get started, you need a problem to solve.  Once you have the problem, you’ll need to think about the solution before you write a line of code.  What will your skill do?  You need to define the requirements.  For my skill, I wanted to ask Alexa to “park my cloud” and have her stop all EC2 instances or RDS databases in my environment.

Building a solution one word at a time

Now that I’ve defined the problem and have an idea of the requirements for the solution, it’s time to start building the skill.  The first thing you’ll notice is that the Alexa skill portal is not in the standard AWS console.  You need to go to developer.amazon.com/Alexa, create a developer account, and sign in there.  Once inside, there is a lot of good information and videos on creating Alexa skills that are worth reviewing.  Click the “Create Skill” button to get started.  In my example, I’m building a custom skill.

Build

The process for building a skill is broken into major sections: Build, Test, Launch, and Measure.  In each one you’ll have a number of things to complete before moving on to the next section.  The major areas of each section are broken down on the left-hand side of the console.  On the initial dashboard you’re also presented with the “Skill builder checklist” on the right as a visual reminder of what you need to do before moving on.

Interaction model

This is the first area you’ll work on in the Build phase of your Alexa skill.  This is setting up how your users will interact with your skill.

Invocation

Invocation will setup how your users will launch your skill.  For simplicity’s sake, this is often just the name of the skill.  The common patterns will be “Alexa, ask [my skill] [some request],” or “Alexa, launch [my skill].”  You’ll want to make sure the invocation for your skill sounds natural to a native speaker.

Intents

I think of intents as the “functions” or “methods” for my Alexa skill.  There are a number of built-in intents that should always be included (Cancel, Help, Stop) as well as your custom intents that will compose the main functionality of your skill.  Here my intent is called “park” since that will have the logic for parking my AWS systems.  The name here will only be exposed to your own code, so it isn’t necessarily important what it is.

Utterances

Utterances are your defined patterns of how people will use your skill.  You’ll want to focus on natural language and normal patterns of speech for native speakers in your target audience.  I would recommend doing some research and speaking to a diverse group of people to get a good cross section of utterances for your skill.  More is better.

Slots

Amazon also provides the option to use slots (variables) in your utterances.  This allows your skill to do things that are dynamic in nature.  When you create a variable in an utterance you also need to create a slot and give it a slot type.  This is like providing a type to a variable in a programming language (Number, String, etc.) and will allow Amazon to understand what to expect when hearing the utterance.  In our simple example, we don’t need any slots.

Interfaces

Interfaces allow you to interface your skill with other services to provide audio, display, or video options.  These aren’t needed for a simple skill, so you can skip it.

Endpoint

Here’s where you’ll connect your Alexa skill to the endpoint you want to handle the logic for your skill.  The easiest setup is to use AWS Lambda.  There are lots of example Lambda blueprints using different programming languages and doing different things.  Use those to get started because the json response formatting can be difficult otherwise.  If you don’t have an Alexa skill id here, you’ll need to Save and Build your skill first.  Then a skill id will be generated, and you can use it when configuring your Lambda triggers.

AWS Account Lambda

Assuming you already have an AWS account, you’ll want to deploy a new Lambda from a blueprint that looks somewhat similar to what you’re trying to accomplish with your skill (deployed in US-East-1).  Even if nothing matches well, pick any one of them as they have the json return formatting set up so you can use it in your code.  This will save you a lot of time and effort.  Take a look at the information here and here for more information about how to setup and deploy Lambda for Alexa skills.  You’ll want to configure your Alexa skill as the trigger for the Lambda in the configuration, and here’s where you’ll copy in your skill id from the developer console “Endpoints” area of the Build phase.

While the actual coding of the Lambda isn’t the purpose of the article, I will include a couple of highlights that are worth mentioning.  Below, see the part of the code from the AWS template that would block the Lambda from being run by any Alexa skill other than my own.  While the chances of this are rare, there’s no reason for my Lambda to be open to everyone.  Here’s what that code looks like in Python:

if event['session']['application']['applicationId'] != "amzn1.ask.skill.000000000000000000000000":
    raise ValueError("Invalid Application ID")

Quite simply, if the Alexa application id passed in the session doesn’t match my known Alexa skill id, then raise an error.  The other piece of advice I’d give about the Lambda is to create different methods for each intent to keep the logic separated and easy to follow.  Make sure you remove any response language from your code that is from the original blueprint.  If your responses are inconsistent, Amazon will fail your skill (I had this happen multiple times because I borrowed from the “Color Picker” Lambda blueprint and had some generic responses left in the code).  Also, you’ll want to handle your Cancel, Help, and Stop requests correctly.  Lastly, as best practice in all code, add copious logging to CloudWatch so you can diagnose issues.  Note the ARN of your Lambda function as you’ll need it for configuring the endpoints in the developer portal.

Test

Once your Lambda is deployed in AWS, you can go back into the developer portal and begin testing the skill.  First, put your Lambda function ARN into the endpoint configuration for your skill.  Next, click over to the Test phase at the top and choose “Alexa Simulator.”  You can try recording your voice on your computer microphone or typing in the request.  I recommend you do both to get a sense of how Alexa will interpret what you say and respond.  Note that I’ve found the actual Alexa is better at natural language processing than the test options using a microphone on my laptop.  When you do a test, the console will show you the JSON input and output.  You can take this INPUT pane and copy that information to build a Lambda test script on your Lambda function.  If you need to do a lot of work on your Lambda, it’s a lot easier to test from there than to flip back and forth.  Pay special attention to your utterances.  You’ll learn quickly that your proposed utterances weren’t as natural as you thought.  Make updates to the utterances and Lambda as needed and keep testing.

Launch

Once you have your skill in a place where it’s viable, it’s time to launch.  Under “Skill Preview,” you’ll pick names, icons, and a category for your skill.  You’ll also create “cards” that people will see in the Alexa app that explain how to use your skill and the kinds of things you can say.  Lastly, you’ll need a URL to your “Privacy Policy” and “Terms of Use.”  Under “Privacy and Compliance,” you answer more questions about what kind of information your skill collects, whether it targets children or contains advertising, if it’s ok to export, etc.  Note that if you categorize your skill as a “Game,” it will automatically be considered to be directed at children, so you may want to avoid that with your first skill if you collect any identifiable information. Under “Availability,” you’ll decide if the skill should be public and what countries you want to be able to access the skill.  The last section is “Submission,” where you hopefully get the green checkmark and be allowed to submit.

Skill Certification

Now you wait.  Amazon seems to have a number of automated processes that catch glaring issues, but you will likely end up with some back and forth between yourself and an Amazon employee regarding some part of your skill that needs to be updated.  It took about a week to get my final approval and my skill posted.

Conclusion

Creating your own simple Alexa skill is a fun and easy way to get some experience creating applications that respond to voice and understand what’s possible on the platform.  Good luck!

-Coin Graham, Senior Cloud Consultant
