There is a feature in the Linux kernel, relevant to VMs hosted on Xen servers, called the “steal percentage.” When the guest OS requests CPU time from the host system and the host CPU is currently tied up with another VM, the Xen server sends an increment to the guest Linux instance, which increases the steal percentage. This is a great feature because it shows how busy the host system is, and it is available on many AWS instances because AWS hosts them on Xen. Netflix is said to terminate an AWS instance when the steal percentage crosses a certain threshold and start it up again, which causes the instance to spin up on a new host server, as a proactive step to ensure their system is utilizing its resources to the fullest.
What I want to discuss here is a bug in Linux kernel versions 4.8, 4.9 and 4.10 where the steal percentage can be corrupted during a live migration on the physical Xen server, which causes the agent to report CPU utilization as 100%.
When looking at Top you will see something like this:
As you can see in the screenshot of Top, the %st metric on the CPU(s) line shows an obviously incorrect number.
During a live migration on the physical Xen server, the steal time gets a little out of sync and ends up being decremented. If the time was already at or close to zero, it becomes negative and, due to type conversions in the code, overflows.
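To make the overflow concrete, here is a minimal Python sketch of the arithmetic. It is an illustration only, not the kernel's code: the 64-bit unsigned counter width, the helper name, and the sample values are all assumptions chosen for the example.

```python
# Illustrative only: mimics storing a negative result in an unsigned 64-bit field.
U64_MASK = (1 << 64) - 1  # assumed counter width for this sketch


def as_unsigned_64(value: int) -> int:
    """Reinterpret a (possibly negative) integer as an unsigned 64-bit value."""
    return value & U64_MASK


steal_before = 5   # hypothetical steal time, already close to zero
decrement = 12     # hypothetical correction applied after the migration

corrupted = as_unsigned_64(steal_before - decrement)
print(corrupted)  # 18446744073709551609, i.e. 2**64 - 7
```

A value that should have been slightly negative becomes an enormous positive number, which is why the reported percentage is so obviously wrong.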
CloudWatch’s CPU Utilization monitor calculates that utilization by adding the System and User percentages together. However, this only gives a partial view into your system. With our agent, we can see what the OS sees.
That is the Steal percentage spiking due to that corruption. Normally this metric could be monitored and actioned as desired, but with this bug it causes noise and false positives. If Steal were legitimately high, then the applications on that instance would be running much slower.
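As a rough sketch of what “seeing what the OS sees” can mean, the Python below derives percentages from two samples of the aggregate cpu line in /proc/stat. The sample lines and function names are made up for illustration, and this is not how our agent or CloudWatch is actually implemented; it only contrasts a User plus System figure with the steal figure.

```python
from typing import Dict

# Field order of the aggregate "cpu" line in /proc/stat on Linux.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Turn a 'cpu ...' line from /proc/stat into a name -> ticks mapping."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:])))


def cpu_percentages(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Percent of elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    # guest and guest_nice are already counted inside user and nice.
    total = sum(v for k, v in deltas.items() if k not in ("guest", "guest_nice"))
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Hypothetical samples taken a short interval apart.
sample_1 = parse_cpu_line("cpu 1000 0 500 8000 100 0 0 400 0 0")
sample_2 = parse_cpu_line("cpu 1300 0 600 8500 100 0 0 500 0 0")

pct = cpu_percentages(sample_1, sample_2)
user_plus_system = pct["user"] + pct["system"]
print(f"user+system: {user_plus_system:.1f}%  steal: {pct['steal']:.1f}%")
```

With these invented numbers, User plus System comes to 40% while another 10% of the interval was stolen by the hypervisor, which a User plus System view alone would not show.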
There is some discussion online about how to fix this issue, and there are some kernel patches to say “if the steal time is less than zero, just make it zero.” Eventually this fix will make it through the Linux releases and into the latest OS version, but until then it needs to be dealt with.
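The idea behind those patches can be sketched in a few lines of Python. This is not the kernel code, which is written in C; the function name and the sample readings are invented for illustration.

```python
def steal_delta(previous: int, current: int) -> int:
    """Steal time accrued between two readings, never less than zero."""
    delta = current - previous
    # If the counter appears to have gone backwards, report zero instead of
    # letting a negative value wrap around in an unsigned field.
    return max(delta, 0)


print(steal_delta(100, 130))  # normal case: 30
print(steal_delta(100, 90))   # counter went backwards: clamped to 0
```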
We have found that a reboot will clear the corrupted percentage. The other option is to patch the kernel… which also requires a reboot. If a reboot is just not possible at the time, the only impact to the system is that it makes monitoring the steal percentage impossible until the number is reset.
It is not a very common issue, but due to the large number of instances we monitor here at 2nd Watch, it is something that we’ve come across frequently enough to investigate in detail and develop a process around.
If you have any questions as to whether or not your servers hosted in the cloud might be affected by this issue, please contact us to discuss how we might be able to help.
Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.
In practice, this means that Terraform allows you to declare what you want your infrastructure to look like – in any cloud provider – and will automatically determine the changes necessary to make it so. Because of its simple syntax and cross-cloud compatibility, it’s 2nd Watch’s choice for infrastructure as code.
Pain You May Be Experiencing Working With Terraform
When you have multiple collaborators (individuals, teams, etc.) working on a Terraform codebase, some common problems are likely to emerge:
Enforcing peer review becomes difficult. In any codebase, you’ll want your code to be peer reviewed to ensure better quality, in accordance with The Second Way of DevOps: Feedback. Peer review is even more important in IaC codebases. IaC is a powerful tool, but it is double-edged: we are clearly more productive for using it, but that increased productivity also means that a simple typo could potentially cause a major issue with production infrastructure. In order to minimize the potential for bad code to be deployed, you should require peer review on all proposed changes to a codebase (e.g. GitHub Pull Requests with at least one reviewer required). Terraform’s open source offering has no facility to enforce this rule.
Terraform plan output is not easily integrated in code reviews. In all code reviews, you must examine the source code to ensure that your standards are followed, that the code is readable, that it’s reasonably optimized, etc. In this aspect, reviewing Terraform code is like reviewing any other code. However, Terraform code has the unique requirement that you must also examine the effect the code change will have upon your infrastructure (i.e. you must also review the output of a terraform plan command). When you potentially have multiple feature branches in the review process, it becomes critical that you are assured that the terraform plan output is what will be executed when you run terraform apply. If the state of infrastructure changes between a run of terraform plan and a run of terraform apply, the effect of this difference in state could range from inconvenient (the apply fails) to catastrophic (a significant production outage). Terraform itself offers locking capabilities but does not provide an easy way to integrate locking into a peer review process in its open source product.
Too many sets of privileged credentials. Highly-privileged credentials are often required to perform Terraform actions, and the greater the number of principals you have with privileged access, the larger your attack surface becomes. Therefore, from a security standpoint, we’d like to have fewer sets of admin credentials which can potentially be compromised.
What is Atlantis?
Atlantis is an open source tool that allows safe collaboration on Terraform projects by making sure that proposed changes are reviewed and that the proposed change is the actual change which will be executed on your infrastructure. Atlantis is compatible (at the time of writing) with GitHub and GitLab, so if you’re not using either of these Git hosting systems, you won’t be able to use Atlantis.
How Atlantis Works With Terraform
Atlantis is deployed as a single binary executable with no system-wide dependencies. An operator adds a GitHub or GitLab token for a repository containing Terraform code. The Atlantis installation process then adds hooks to the repository which allows communication to the Atlantis server during the Pull Request process.
You can run Atlantis in a container or a small virtual machine; the only requirement is that the Atlantis instance can communicate with both your version control (e.g. GitHub) and the infrastructure (e.g. AWS) you’re changing. Once Atlantis is configured for a repository, the typical workflow is:
A developer creates a feature branch in git, makes some changes, and creates a Pull Request (GitHub) or Merge Request (GitLab).
The developer enters atlantis plan in a PR comment.
Via the installed webhooks, Atlantis locally runs terraform plan. If there are no other Pull Requests in progress, Atlantis adds the resulting plan as a comment to the Merge Request.
If there are other Pull Requests in progress, the command fails because we can’t ensure that the plan will be valid once applied.
The developer ensures the plan looks good and adds reviewers to the Merge Request.
Once the PR has been approved, the developer enters atlantis apply in a PR comment. This will trigger Atlantis to run terraform apply and the changes will be deployed to your infrastructure.
The command will fail if the Merge Request has not been approved.
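The two gates in this workflow, a lock taken at plan time and an approval check at apply time, can be modeled with a small Python toy. This is only a sketch of the rules as described above, not Atlantis’s actual implementation: the class names, messages, and the choice to release the lock right after a successful apply are all simplifying assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PullRequest:
    number: int
    approved: bool = False


@dataclass
class ToyAtlantis:
    """Toy model: one lock per project, approval required before apply."""
    locks: Dict[str, int] = field(default_factory=dict)  # project -> PR number

    def plan(self, project: str, pr: PullRequest) -> str:
        holder: Optional[int] = self.locks.get(project)
        if holder is not None and holder != pr.number:
            return f"plan failed: project locked by PR #{holder}"
        self.locks[project] = pr.number  # this PR now holds the lock
        return "plan succeeded: output posted as a comment"

    def apply(self, project: str, pr: PullRequest) -> str:
        if self.locks.get(project) != pr.number:
            return "apply failed: run plan first"
        if not pr.approved:
            return "apply failed: PR has not been approved"
        del self.locks[project]  # simplification: release once deployed
        return "apply succeeded"


server = ToyAtlantis()
first, second = PullRequest(1), PullRequest(2)
print(server.plan("network", first))    # plan succeeded: output posted as a comment
print(server.plan("network", second))   # plan failed: project locked by PR #1
print(server.apply("network", first))   # apply failed: PR has not been approved
first.approved = True
print(server.apply("network", first))   # apply succeeded
```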
The following sequence diagram illustrates the sequence of actions described above:
Atlantis sequence diagram
We can see how our pain points in Terraform collaboration are addressed by Atlantis:
In order to ensure that your terraform plan accurately reflects the change to your infrastructure that will be made when you run terraform apply, Atlantis performs locking on a project or workspace basis: https://github.com/runatlantis/atlantis#locking
In order to prevent creating multiple sets of privileged credentials, you can deploy Atlantis (e.g. in AWS) on an EC2 instance with a privileged IAM role in its instance profile. In this way, all of your Terraform commands run through a single set of privileged credentials, which obviates the need to distribute multiple sets of privileged credentials: https://github.com/runatlantis/atlantis#aws-credentials
You can see that with minimal additional infrastructure you can establish a safe and reliable CI/CD pipeline for your infrastructure as code, enabling you to get more done safely! To find out how you can deploy a CI/CD pipeline in less than 60 days, Contact Us.
Alexa gets a lot of use in our house, and it is very apparent to me that the future is not a touch screen or a mouse, but voice. Creating an Alexa skill is easy to learn by watching videos and such, but actually creating the skill is a great way to understand the ins and outs of the process and what the backend systems (like AWS Lambda) are capable of.
First you need a problem
To get started, you need a problem to solve. Once you have the problem, you’ll need to think about the solution before you write a line of code. What will your skill do? You need to define the requirements. For my skill, I wanted to ask Alexa to “park my cloud” and have her stop all EC2 instances or RDS databases in my environment.
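To give a feel for what “parking” might involve on the AWS side, here is a hedged Python sketch that stops running EC2 instances. It is not the code from my skill: the function name is invented, the ec2 argument is assumed to behave like boto3.client("ec2"), pagination of large result sets is ignored, and RDS is left out entirely.

```python
from typing import Any, List


def park_running_instances(ec2: Any) -> List[str]:
    """Stop every running EC2 instance visible to the given client.

    `ec2` is expected to behave like boto3.client("ec2"); it is passed in so
    the function can be exercised with a stand-in object.
    """
    response = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:  # avoid calling stop_instances with an empty list
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids
```

In a Lambda function you would typically create the client once with boto3 and pass it in; keeping it as a parameter also makes the logic easy to test without touching a real account.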
Building a solution one word at a time
Now that I’ve defined the problem and have an idea for the requirements of the solution, it’s time to start building the skill. The first thing you’ll notice is that the Alexa Skill portal is not in the standard AWS portal. You need to go to developer.amazon.com/Alexa and create a developer account and sign in there. Once inside, there is a lot of good information and videos on creating Alexa skills that are worth reviewing. Click the “Create Skill” button to get started. In my example, I’m building a custom skill.
The process for building a skill is broken into major sections: Build, Test, Launch, Measure. In each one you’ll have a number of things to complete before moving on to the next section. The major areas of each section are broken down on the left-hand side of the console. On the initial dashboard you’re also presented with the “Skill builder checklist” on the right as a visual reminder of what you need to do before moving on.
This is the first area you’ll work on in the Build phase of your Alexa skill. This is setting up how your users will interact with your skill.
Invocation sets up how your users will launch your skill. For simplicity’s sake, this is often just the name of the skill. The common patterns will be “Alexa, ask [my skill] [some request],” or “Alexa, launch [my skill].” You’ll want to make sure the invocation for your skill sounds natural to a native speaker.
I think of intents as the “functions” or “methods” for my Alexa skill. There are a number of built-in intents that should always be included (Cancel, Help, Stop) as well as your custom intents that will compose the main functionality of your skill. Here my intent is called “park” since that will have the logic for parking my AWS systems. The name here will only be exposed to your own code, so it isn’t necessarily important what it is.
Utterances are your defined patterns of how people will use your skill. You’ll want to focus on natural language and normal patterns of speech for native speakers in your target audience. I would recommend doing some research and speaking to a diverse range of people to get a good cross section of utterances for your skill. More is better.
Amazon also provides the option to use slots (variables) in your utterances. This allows your skill to do things that are dynamic in nature. When you create a variable in an utterance you also need to create a slot and give it a slot type. This is like providing a type to a variable in a programming language (Number, String, etc.) and will allow Amazon to understand what to expect when hearing the utterance. In our simple example, we don’t need any slots.
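Putting invocation, intents, utterances, and slots together, the interaction model for a skill like this might look roughly like the structure below, written as a Python dictionary and printed as JSON. The layout is assumed to follow the JSON shown in the developer console’s editor; the invocation name and sample utterances are illustrative guesses rather than the ones from my published skill.

```python
import json

# Hypothetical interaction model for the "park" example (no slots needed).
interaction_model = {
    "interactionModel": {
        "languageModel": {
            "invocationName": "park my cloud",  # assumed invocation name
            "intents": [
                # Built-in intents that should always be included.
                {"name": "AMAZON.CancelIntent", "samples": []},
                {"name": "AMAZON.HelpIntent", "samples": []},
                {"name": "AMAZON.StopIntent", "samples": []},
                # Custom intent holding the main functionality.
                {
                    "name": "park",
                    "slots": [],
                    "samples": [
                        "park everything",
                        "stop my instances",
                        "shut down my servers",
                    ],
                },
            ],
            "types": [],  # custom slot types would be declared here
        }
    }
}

print(json.dumps(interaction_model, indent=2))
```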
Interfaces allow you to interface your skill with other services to provide audio, display, or video options. These aren’t needed for a simple skill, so you can skip it.
Here’s where you’ll connect your Alexa skill to the endpoint you want to handle the logic for your skill. The easiest setup is to use AWS Lambda. There are lots of example Lambda blueprints using different programming languages and doing different things. Use those to get started because the JSON response formatting can be difficult otherwise. If you don’t have an Alexa skill id here, you’ll need to Save and Build your skill first. Then a skill id will be generated, and you can use it when configuring your Lambda triggers.
AWS Account Lambda
Assuming you already have an AWS account, you’ll want to deploy a new Lambda from a blueprint that looks somewhat similar to what you’re trying to accomplish with your skill (deployed in US-East-1). Even if nothing matches well, pick any one of them as they have the JSON return formatting set up so you can use it in your code. This will save you a lot of time and effort. Take a look at the information here and here for more information about how to set up and deploy Lambda for Alexa skills. You’ll want to configure your Alexa skill as the trigger for the Lambda in the configuration, and here’s where you’ll copy in your skill id from the developer console “Endpoints” area of the Build phase.
While the actual coding of the Lambda isn’t the purpose of the article, I will include a couple of highlights that are worth mentioning. Below, see the part of the code from the AWS template that would block the Lambda from being run by any Alexa skill other than my own. While the chances of this are rare, there’s no reason for my Lambda to be open to everyone. Here’s what that code looks like in Python:
    if event['session']['application']['applicationId'] != "amzn1.ask.skill.000000000000000000000000":
        raise ValueError("Invalid Application ID")
Quite simply, if the Alexa application id passed in the session doesn’t match my known Alexa skill id, then raise an error. The other piece of advice I’d give about the Lambda is to create different methods for each intent to keep the logic separated and easy to follow. Make sure you remove any response language from your code that is from the original blueprint. If your responses are inconsistent, Amazon will fail your skill (I had this happen multiple times because I borrowed from the “Color Picker” Lambda blueprint and had some generic responses left in the code). Also, you’ll want to handle your Cancel, Help, and Stop requests correctly. Lastly, as best practice in all code, add copious logging to CloudWatch so you can diagnose issues. Note the ARN of your Lambda function as you’ll need it for configuring the endpoints in the developer portal.
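To illustrate the advice about one method per intent and the JSON response envelope, here is a minimal Python sketch of a handler. It is not the code from my published skill: the handler names, spoken responses, and the skill id are placeholders, and the response layout is assumed to match what the blueprints produce.

```python
from typing import Any, Callable, Dict, Optional

MY_SKILL_ID = "amzn1.ask.skill.000000000000000000000000"  # placeholder id


def build_response(text: str, end_session: bool = True) -> Dict[str, Any]:
    """Wrap plain text in the JSON envelope returned to Alexa."""
    return {
        "version": "1.0",
        "sessionAttributes": {},
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


def handle_park(event: Dict[str, Any]) -> Dict[str, Any]:
    # The real parking logic would be called from here.
    return build_response("Parking your cloud.")


def handle_help(event: Dict[str, Any]) -> Dict[str, Any]:
    return build_response("Say park everything to stop your systems.",
                          end_session=False)


def handle_stop(event: Dict[str, Any]) -> Dict[str, Any]:
    return build_response("Goodbye.")


# One function per intent keeps the logic separated and easy to follow.
INTENT_HANDLERS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "park": handle_park,
    "AMAZON.HelpIntent": handle_help,
    "AMAZON.CancelIntent": handle_stop,
    "AMAZON.StopIntent": handle_stop,
}


def lambda_handler(event: Dict[str, Any], context: Any) -> Optional[Dict[str, Any]]:
    if event["session"]["application"]["applicationId"] != MY_SKILL_ID:
        raise ValueError("Invalid Application ID")
    request = event["request"]
    if request["type"] == "LaunchRequest":
        return handle_help(event)
    if request["type"] == "IntentRequest":
        name = request["intent"]["name"]
        print(f"Handling intent {name}")  # print output lands in CloudWatch Logs
        handler = INTENT_HANDLERS.get(name)
        if handler is None:
            raise ValueError(f"Unknown intent: {name}")
        return handler(event)
    # Other request types (such as SessionEndedRequest) get no spoken reply here.
    return None
```

Because the dispatch table is just a dictionary, adding a new intent means writing one function and adding one entry.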
Once your Lambda is deployed in AWS, you can go back into the developer portal and begin testing the skill. First, put your Lambda function ARN into the endpoint configuration for your skill. Next, click over to the Test phase at the top and choose “Alexa Simulator.” You can try recording your voice on your computer microphone or typing in the request. I recommend you do both to get a sense of how Alexa will interpret what you say and respond. Note that I’ve found the actual Alexa is better at natural language processing than the test options using a microphone on my laptop. When you do a test, the console will show you the JSON input and output. You can take this INPUT pane and copy that information to build a Lambda test script on your Lambda function. If you need to do a lot of work on your Lambda, it’s a lot easier to test from there than to flip back and forth. Pay special attention to your utterances. You’ll learn quickly that your proposed utterances weren’t as natural as you thought. Make updates to the utterances and Lambda as needed and keep testing.
Now you wait. Amazon seems to have a number of automated processes that catch glaring issues, but you will likely end up with some back and forth between yourself and an Amazon employee regarding some part of your skill that needs to be updated. It took about a week to get my final approval and my skill posted.
Creating your own simple Alexa skill is a fun and easy way to get some experience creating applications that respond to voice and understand what’s possible on the platform. Good luck!
In cloud migrations, the elastic nature of the cloud is often touted as a critical capability in delivering on a business’ key initiatives. However, if not accounted for in your Security and Compliance plans, you could be facing some real challenges. Always counting on a virtual host to be running, for example, will cause issues when that host is rebooted or retired. This is why managing Security and Compliance in the cloud is a continuous action requiring both forethought and automation.
At AWS re:Invent 2017, 2nd Watch hosted a breakout session titled “Continuous Compliance on AWS at Scale” where attendees learned how a leading, next generation, Managed Cloud Provider uses automation and cloud expertise to successfully manage Security and Compliance at scale in an ever-changing environment. This journey starts with account creation, goes through deployment of infrastructure and code and never ends.
Through code examples and live demos, presenters Peter Meister and Lars Cromley demonstrated the tools and automation you can use to provide continuous compliance of your cloud infrastructure from inception to ongoing management. In case you missed the session or simply wish to get a refresher on the content that was presented, you can now view the breakout session recording below.
“Whatever you do in life, surround yourself with smart people who argue with you.” – John Wooden
Many AWS customers and practitioners have leveraged the Well-Architected Framework methodology in building new applications or migrating existing applications. Once a build or migration is complete, how many companies implement Well-Architected Framework reviews and perform those reviews regularly? We have found that many companies today do not conduct regular Well-Architected Framework reviews and, as a result, potentially face a multitude of risks.
What is a Well-Architected Framework?
The Well-Architected Framework is a methodology designed to provide high-level guidance on best practices when using AWS products and services. Whether building new or migrating existing workloads, security, reliability, performance, cost optimization, and operational excellence are vital to the integrity of the workload and can even be critical to the success of the company. A review of your architecture is especially critical given the rate at which Cloud Service Providers (CSPs) create and release new products and services.
2nd Watch Well-Architected Framework Reviews
At 2nd Watch, we provide Well-Architected Framework reviews for our existing and prospective clients. The review process allows customers to make informed architecture decisions and to understand the potential impact those decisions have on their business and the tradeoffs they are making. 2nd Watch offers its clients free Well-Architected Framework reviews, conducted on a regular basis, for mission-critical workloads that could have a negative business impact upon failure.
Examples of issues we have uncovered and remediated through Well-Architected Reviews:
Security: Not protecting data in transit and at rest through encryption
Cost: Low utilization and inability to map cost to business units
Reliability: Single points of failure where recovery processes have not been tested
Performance: A lack of benchmarking or proactive selection of services and sizing
Operations: Not tracking changes to configuration management on your workload
Using a standards-based methodology, 2nd Watch will work closely with your team to thoroughly review the workload and will produce a detailed report outlining actionable items and timeframes, as well as prescriptive guidance in each of the key architectural pillars.
In reviewing your workload and architecture, 2nd Watch will identify areas of improvement, along with a detailed report of our findings. A separate paid engagement will be available to clients and prospects who want our AWS Certified Solutions Architects and AWS Certified DevOps Engineer Professionals to remediate our findings. To schedule your free Well-Architected Framework review, contact 2nd Watch today.
Thursday’s General Session Keynote kicked off with Amazon CTO Werner Vogels taking the stage to deliver additional product and service announcements along with deeper technical content. Revisiting his vision for 21st Century Architectures from the first re:Invent in 2012, Werner focused on what he sees as key guiding principles for next-gen workloads.
Voice represents the next major disruption in computing. Stressing this point, Werner announced the general availability of Alexa for Business to help improve productivity by introducing voice automation into your business.
Use automation to make experimentation easier
Encryption is the ‘key’ to controlling access to your data. As such, encrypting data (at rest and in transit) should be a default behavior.
All the code you should ever write is business logic.
Werner also highlighted the fact that AWS has released 3,951 new services and features since 2012. These services were not built for today but built for the workloads of the future. The goal for AWS, Werner says, is to be your partner for the future.
One of the highlights of the keynote was when Abby Fuller, evangelist for containers at AWS, came on stage to talk about the future of containers at AWS. She demoed the use of Fargate which is AWS’s fully managed container service. Think of Fargate as Elastic Beanstalk but for containers. Per AWS documentation “It’s a technology that allows you to use containers as a fundamental compute primitive without having to manage the underlying instances. All you need to do is build your container image, specify the CPU and memory requirements, define your networking and IAM policies, and launch. With Fargate, you have flexible options to closely match your application needs and you’re billed with per-second granularity.”
The Cloud9 acquisition was also a highlight of the keynote. Cloud9 is a browser-based IDE for developers. Cloud9 is completely integrated with AWS, and you can create cloud environments, develop code, and push that code to your cloud environment all from within the tool. It’s really going to be useful for writing and debugging Lambda functions for developers that have gone all in on serverless technologies.
AWS Lambda Function Execution Activity Logging – Log all execution activity for your Lambda functions. Previously you could only log management events, but this allows you to log data events and get additional details.
AWS Lambda Doubles Maximum Memory Capacity for Lambda Functions – You can now allocate 3008MB of memory to your AWS Lambda functions.
AWS Cloud9 – Cloud9 is a cloud based IDE for writing, running, and debugging your code.
API Gateway now supports endpoint integrations with Private VPCs – You can now provide access to HTTP(S) resources within your Amazon Virtual Private Cloud (VPC) without exposing them directly to the public Internet.
AWS Serverless Application Repository – The Serverless Application Repository is a collection of serverless applications published by developers, companies, and partners in the serverless community.
We expect AWS to announce many more awesome features and services before the day ends so stay tuned for our AWS re:Invent 2017 Products & Services Review and 2017 Conference Recap blog posts for a summary of all of the announcements that are being delivered at AWS re:Invent 2017.