
Standardizing & Automating Infrastructure Development Processes

Introduction

Let’s start with a quick look at the current landscape of technology and how we arrived here. There aren’t many areas of tech that have not been, or are not currently, in a state of flux. Everything from software delivery vehicles and development practices to infrastructure creation has experienced some degree of transformation over the past several years. From VMs to containers, it seems like almost every day the technology tool belt grows a little bigger, and our world gets a little better (though perhaps more complex) due to these advancements. For me, this was incredibly apparent when I began to delve into configuration management, which later evolved into what we now call “infrastructure as code”.

The transformation of the development process began with the simple tools we once used to manage a few machines (like bash scripts or Makefiles), which then morphed into more complex systems (CFEngine, Puppet, and Chef) capable of managing thousands of machines. As configuration management software matured, engineers and developers began leaning on it to do more things. With the advent of hypervisors and the rise of virtual machines, it was only a short time before hardware requests changed to API requests, and thus infrastructure as a service (IaaS) was born. With all the new capabilities and options in this brave new world, we once again started to lean on our configuration management systems—this time for provisioning, and not just convergence.

Provisioning & Convergence

I mentioned two terms that I want to clarify: provisioning and convergence. Say you were a car manufacturer and you wanted to make a car. Provisioning would be the step in which you request the raw materials to make the parts for your automobile. This is where we would use tools like Terraform, CloudFormation, or Heat. Convergence, on the other hand, is the assembly line by which we check each part and assemble the final product (utilizing config management software).

By and large, the former tends to be declarative with little in the way of conditionals or logic, while the latter is designed to be robust and malleable software that supports all the systems we run and plan on running. This is the frame for the remainder of what we are going to talk about.

By separating the concerns of our systems, we can create a clear delineation of purpose for each tool, so we don’t feel like we are trying to jam everything into an interface that doesn’t have the best support for our platform or, more importantly, our users. The remainder of this post will be directed toward the provisioning aspect of configuration management.

Standards and Standardization

These are two different things in my mind. Standardization is extremely prescriptive and can often seem particularly oppressive to professional knowledge workers, such as engineers or developers. It can be seen as taking the innovation away from the job. Standards, on the other hand, provide boundaries, frame the problem, and allow for innovative ways of approaching solutions. I am not saying standardization in some areas is entirely bad, but we should let the people who do the work have the opportunity to grow and innovate in their own way, with guidance. The topic of standards and standardization is part of a larger conversation about culture and change. We intend to follow up with a series of blog articles relating to organizational change in the era of the public cloud in the coming weeks.

So, let’s say that we make a standard for our new EC2 instances running Ubuntu. We’ll say that all instances must be running the latest official Canonical Ubuntu 14.04 AMI and must have these three tags: Owner, Environment, and Application. How can we enforce that during development of our infrastructure? On AWS, we can create AWS Config Rules, but that is reactive and requires ad-hoc remediation. What we really want is a more prescriptive approach that brings our standards closer to the development pipeline. One of the ways I like to solve this issue is by creating an abstraction. Say we have a Terraform template that looks like this:

# Create a new instance of the latest Ubuntu 14.04 on an EC2 instance
provider "aws" {
  region = "us-west-2"
}

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical
}

resource "aws_instance" "web" {
  ami           = "${data.aws_ami.ubuntu.id}"
  instance_type = "t2.micro"

  tags {
    Owner       = "DevOps Ninja"
    Environment = "Dev"
    Application = "Web01"
  }
}

This would meet the standard that we have set forth, but we are relying on the developer or engineer to adhere to that standard. What if we enforce this standard by codifying it in an abstraction? Let’s take that existing template and turn it into a terraform module instead.

Module

# Create a new instance of the latest Ubuntu 14.04 on an EC2 instance

variable "aws_region" {}
variable "ec2_owner" {}
variable "ec2_env" {}
variable "ec2_app" {}
variable "ec2_instance_type" {}

provider "aws" {
  region = "${var.aws_region}"
}

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical
}

resource "aws_instance" "web" {
  ami           = "${data.aws_ami.ubuntu.id}"
  instance_type = "${var.ec2_instance_type}"

  tags {
    Owner       = "${var.ec2_owner}"
    Environment = "${var.ec2_env}"
    Application = "${var.ec2_app}"
  }
}

Now we can have our developers and engineers leverage our tf_ubuntu_ec2_instance module.

New Terraform Plan

module "Web01" { source =
"git::ssh://git@github.com/SomeOrg/tf_u buntu_ec2_instance"

aws_region = "us-west-2" ec2_owner = "DevOps Ninja" ec2_env	= "Dev"
ec2_app	= "Web01"
}

This doesn’t enforce the usage of the module, but it does create an abstraction that provides an easy way to maintain standards without a ton of overhead. It also provides an example for the further creation of modules that enforce these particular standards.

This leads us to another method of implementing standards, one that is more prescriptive and falls into the category of standardization (eek!). One of the most underutilized services in the AWS product stable has to be AWS Service Catalog.

AWS Service Catalog allows organizations to create and manage catalogs of IT services that are approved for use on AWS. These IT services can include everything from virtual machine images, servers, software, and databases to complete multi-tier application architectures. AWS Service Catalog allows you to centrally manage commonly deployed IT services, and helps you achieve consistent governance and meet your compliance requirements, while enabling users to quickly deploy only the approved IT services they need.
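To make this concrete, here is a minimal sketch (in Python with boto3) of how an end user or pipeline might launch an approved Service Catalog product programmatically. The product ID, provisioning artifact ID, and parameter names are hypothetical placeholders; in practice you would look yours up with search_products and describe_product first.

# Minimal sketch: provision an approved Service Catalog product with boto3.
# The IDs and parameter names below are hypothetical placeholders.
import boto3

sc = boto3.client("servicecatalog", region_name="us-west-2")

response = sc.provision_product(
    ProductId="prod-examplexxxxxxx",           # hypothetical product ID
    ProvisioningArtifactId="pa-examplexxxxx",  # hypothetical product version
    ProvisionedProductName="Web01-Dev",
    ProvisioningParameters=[
        {"Key": "InstanceType", "Value": "t2.micro"},
        {"Key": "Environment", "Value": "Dev"},
    ],
)
print(response["RecordDetail"]["Status"])

Because the catalog only contains pre-approved products, users get self-service speed while the organization keeps its standards intact.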

The Interface

Once we have a few of these projects in place (e.g. a service catalog or a repo full of composable infrastructure modules that meet our standards), how do we serve them out? How you spur adoption of these tools, and how they are consumed, can be very different depending on your organizational structure. We don’t want to upset workflow and how work gets done, we just want it to go faster and be more reliable. This is what we talk about when we mention the interface. Whichever way work flows in, we should supplement it with some type of software or automation to link those pieces of work together. Here are a few examples of how this might look (depending on your organization):

1.) Central IT Managed Provisioning

If you have an organization that manages requests for infrastructure, this new paradigm shift might seem daunting. The interface in this case is the ticketing system. This is where we would create an integration with our ticketing software to automatically pull the correct project from the service catalog or module repo based on some criteria in the ticket. The interface doesn’t change but is instead supplemented by some automation to answer these requests, saving time and providing faster delivery of service.
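As a rough illustration (not a drop-in integration), the glue between the ticketing system and the module repo could look something like the following Python sketch: once a ticket is approved, it renders a Terraform configuration that calls the standard module from earlier and applies it. The ticket field names and the runtime details are assumptions.

# Rough sketch of ticket-driven provisioning: render a Terraform config that
# calls the standard module, then apply it. Ticket fields are hypothetical.
import os
import subprocess
import tempfile

MODULE_SOURCE = "git::ssh://git@github.com/SomeOrg/tf_ubuntu_ec2_instance"

TEMPLATE = """
module "{app}" {{
  source = "{source}"

  aws_region        = "{region}"
  ec2_owner         = "{owner}"
  ec2_env           = "{env}"
  ec2_app           = "{app}"
  ec2_instance_type = "{instance_type}"
}}
"""

def provision_from_ticket(ticket):
    """Render a Terraform config from an approved ticket and apply it."""
    rendered = TEMPLATE.format(
        app=ticket["application"],
        source=MODULE_SOURCE,
        region=ticket.get("region", "us-west-2"),
        owner=ticket["requestor"],
        env=ticket["environment"],
        instance_type=ticket.get("instance_type", "t2.micro"),
    )
    workdir = tempfile.mkdtemp(prefix="tf-")
    with open(os.path.join(workdir, "main.tf"), "w") as f:
        f.write(rendered)
    subprocess.check_call(["terraform", "init"], cwd=workdir)
    subprocess.check_call(["terraform", "apply"], cwd=workdir)

# Example: called by a webhook when a ticket moves to "Approved"
provision_from_ticket({
    "application": "Web01",
    "requestor": "DevOps Ninja",
    "environment": "Dev",
})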

2.) Full Stack Engineers

If you have engineers that develop software and the infrastructure that runs their applications, this is the easiest scenario to address in some regards and the hardest in others. Your interface might be a build server, or it could simply be the adoption of an internal open source model where each team develops modules and shares them in a common place, constantly trying to save time and not re-invent the wheel.

Supplementing with software or automation can be done in a ton of ways. Check out an example Kelsey Hightower wrote using Jira.

“A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.” – John Gall

All good automation starts out with a manual and well-defined process. Standardizing and automating infrastructure development processes begins with understanding how our internal teams can best be organized to perform work efficiently before we begin automating anything. Work with your teammates to create a value stream map and understand the process entirely before putting any effort into automating a workflow.

With 2nd Watch designs and automation you can deploy quicker, learn faster and modify as needed with Continuous Integration / Continuous Deployment (CI/CD). Our Workload Solutions transform on-premises workloads to digital solutions in the public cloud with next generation products and services.  To accelerate your infrastructure development so that you can deploy faster, learn more often and adapt to customer requirements more effectively, speak with a 2nd Watch cloud deployment expert today.

– Lars Cromley, Director of Engineering, Automation, 2nd Watch


Automating Windows Patching of EC2 Autoscaling Group Instances

Background

Dealing with Windows patching can be a royal pain, as you may know. At least once a month, Windows machines are subject to system security and stability patches, thanks to Microsoft’s Patch Tuesday. With Windows 10 (and its derivatives), Microsoft has shifted towards more of a Continuous Delivery model in how it manages system patching. It is a welcome change; however, it still doesn’t guarantee that Windows patching won’t require a system reboot.

Rebooting an EC2 instance that is a member of an Auto Scaling Group (depending upon how you have your Auto Scaling health-check configured) is something that will typically cause an Elastic Load Balancing (ELB) HealthCheck failure and result in instance termination (this occurs when Auto Scaling notices that the instance is no longer reporting “in service” with the load balancer). Auto Scaling will of course replace the terminated instance with a new one, but the new instance will be launched using an image that is presumably unpatched, thus leaving your Windows servers vulnerable. The next patch cycle will once again trigger a reboot and the vicious cycle continues. Furthermore, if the patching and reboots aren’t carefully coordinated, it could severely impact your application performance and availability (think multiple Auto Scaling Group members rebooting simultaneously). If you are running an earlier version of Windows OS (e.g. Windows Server 2012r2), rebooting at least once a month on Patch Tuesday is almost a certainty.

Another major problem with utilizing the AWS stock Windows AMIs with Auto Scaling is that AWS makes those AMIs unavailable after just a few months. This means that unless you update your Auto Scaling Launch Configuration to use the newer AMI IDs on a continual basis, future Auto Scaling instance launches will fail as they try to access an AMI that is no longer accessible. Anguish.

Potential Solutions

Given the aforementioned scenario, how on earth are you supposed to automatically and reliably patch your Auto-Scaled Windows instances?!

One approach would be to write some sort of orchestration layer that detects when Auto Scaling members have been patched and are awaiting their obligatory reboot, suspends the Auto Scaling processes that would detect and replace perceived failed instances (e.g. HealthCheck), and then reboots the instances one-by-one. This would be rather painful to orchestrate and has a potentially severe drawback: cluster capacity is reduced to N-1 during the rebooting (maybe worse if you don’t take into account service availability between reboots). Reducing capacity to N-1 might not be a big deal if you have a cluster of 20 instances, but if you are running a smaller cluster of, say, 4, 3, or 2 instances, then that has a significant impact on your overall cluster capacity. And, if you are running an Auto Scaling group with a single instance (not as uncommon as you might think), then your application is completely down during the reboot of that single member. This of course doesn’t solve the issue of expired stock AWS AMIs.

Another approach is to maintain and patch a “golden image” that the Auto Scaling Launch Configuration uses to create new instances from. If you are unfamiliar with the term, a golden-image is an operating system image that has everything pre-installed, configured, and saved in a pre-baked image file (an AMI in the case of Amazon EC2). This approach requires a significant amount of work to make this happen in a reasonably automated fashion and has numerous potential pitfalls.

While a golden image prevents the outage caused by an expired public AMI (the Launch Configuration references an AMI you own and control), you still need a way to reliably and automatically handle the image-building process. Using a tool like Hashicorp’s Packer can get you partially there, but you would still have to write a number of provisioners to handle the installation of Windows Updates and anything else you need to do in order to prep the system for imaging. In the end, you would still have to develop or employ a fair number of tools and processes to completely automate the entire process of detecting new Windows Updates, creating a patched AMI with those updates, and orchestrating the update of your Auto Scaling Groups.

A Cloud-Minded Approach

I believe that Auto Scaling Windows servers intelligently requires a paradigm shift. One assumption we have to make is that some form of configuration management (e.g. Puppet, Chef)—or at least a basic bootstrap script executed via cfn-init/UserData—is automating the configuration of the operating system, applications, and services upon instance launch. If configuration management or bootstrap scripts are not in play, then it is likely that a golden-image is being utilized. Without one of these two approaches, you don’t have true Auto Scaling because it would require some kind of human interaction to configure a server (ergo, not “auto”) every time a new instance was created.

Both approaches (launch-time configuration vs. golden-image) have their pros and cons. I generally prefer launch-time configuration as it allows for more flexibility, provides for better governance/compliance, and enables pushing changes dynamically. But…(and this is especially true of Windows servers) sometimes launch-time configuration simply takes longer to happen than is acceptable, and the golden-image approach must be used to allow for a more rapid deployment of new Auto Scaling group instances.

Either approach can be easily automated using a solution like the one I am about to outline, and thankfully AWS publishes new stock Windows Server AMIs immediately following every Patch Tuesday. This means that, if you aren’t using a golden image, patching your instances is as simple as updating your Auto Scaling Launch Configuration to use the new AMI(s) and performing a rolling replacement of the instances. Even if you are using a golden image or applying some level of customization to the stock AMI, you can easily integrate Packer into the process to create a new patched image that includes your customizations.
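For illustration, here is a minimal boto3 sketch of the “update the Launch Configuration to use the new AMI” step. The names are hypothetical, and in practice you would more likely drive this change through your CloudFormation or Terraform templates than through the API directly.

# Minimal sketch: clone the current Launch Configuration with a new (patched)
# AMI and point the Auto Scaling Group at it. Names are placeholders.
import time
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

def roll_launch_configuration(asg_name, old_lc_name, new_ami_id):
    old_lc = autoscaling.describe_launch_configurations(
        LaunchConfigurationNames=[old_lc_name]
    )["LaunchConfigurations"][0]

    new_lc_name = "{}-{}".format(asg_name, int(time.time()))
    autoscaling.create_launch_configuration(
        LaunchConfigurationName=new_lc_name,
        ImageId=new_ami_id,                       # the freshly patched AMI
        InstanceType=old_lc["InstanceType"],
        SecurityGroups=old_lc["SecurityGroups"],
        KeyName=old_lc["KeyName"],
        # ...copy any other settings you rely on (instance profile, user data)
    )

    # New launches now use the patched image; replacing the existing
    # instances still requires a rolling update (see below).
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        LaunchConfigurationName=new_lc_name,
    )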

The Solution

At a high level, the solution can be summarized as:

1.  An Orchestration Layer (e.g. AWS SNS and Lambda, Jenkins, AWS Step Functions) that detects and responds when new patched stock Windows AMIs have been released by Amazon.

2.  A Packer Launcher process that manages launching Packer jobs in order to create custom AMIs. Note: this step is only required if copying the AWS stock AMIs into your own AWS account is desired, or if you want to apply customizations to the stock AMI; either use case requires that the custom images remain available indefinitely. We solved this by creating an EC2 instance with a Python UserData script that launches Packer jobs (in parallel) to create copies of the new stock AMIs in our AWS account (a rough sketch follows this list). If you are using something like Jenkins, this could instead be handled by having Jenkins launch a local script or even a Docker container to manage launching the Packer jobs.

3.  A New AMI Messaging Layer (e.g. Amazon SNS) to publish notifications when new/patched AMIs have been created

4.  Some form of an Auto Scaling Group Rolling Updater will be required to replace existing Auto Scaling Group instances with new ones based on the patched AMI.
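To give a feel for step 2, here is a rough Python sketch of a Packer Launcher that finds the newest Amazon-owned stock Windows AMIs and kicks off Packer builds in parallel. The AMI name patterns and the Packer template filename are assumptions; adjust them to the Windows versions and customizations you actually use.

# Rough sketch of a "Packer Launcher": look up the newest stock Windows AMIs
# and run one Packer build per image, in parallel. Patterns/paths are assumed.
import subprocess
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

NAME_PATTERNS = [
    "Windows_Server-2012-R2_RTM-English-64Bit-Base-*",
    "Windows_Server-2016-English-Full-Base-*",
]

def latest_stock_ami(name_pattern):
    """Return the newest Amazon-owned AMI matching the name pattern."""
    images = ec2.describe_images(
        Owners=["amazon"],
        Filters=[{"Name": "name", "Values": [name_pattern]}],
    )["Images"]
    return max(images, key=lambda image: image["CreationDate"])

jobs = []
for pattern in NAME_PATTERNS:
    ami = latest_stock_ami(pattern)
    jobs.append(subprocess.Popen([
        "packer", "build",
        "-var", "source_ami={}".format(ami["ImageId"]),
        "windows_golden_image.json",  # hypothetical Packer template
    ]))

for job in jobs:
    job.wait()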

Great news for anyone using AWS CloudFormation… CFT inherently supports Rolling Updates for Auto Scaling Groups! Utilizing it requires attaching an UpdatePolicy and adding a UserData or cfn-init script to notify CloudFormation when the instance has finished its configuration and is reporting as healthy (e.g. InService on the ELB). There are some pretty good examples of how to accomplish this using CloudFormation out there, but here is one specifically that AWS provides as an example.

If you aren’t using CloudFormation, all hope is not lost. Despite Hashicorp Terraform’s ever-increasing popularity for deploying and managing AWS infrastructure as code, Terraform has yet to implement a Rolling Update feature for AWS Auto Scaling Groups. There is a Terraform feature request from a few years ago for this exact feature, but as of today, it is not yet available, nor do the Terraform developers have any short-term plans to implement it. However, several people (including Hashicorp’s own engineers) have developed a number of ways to work around the lack of an integrated Auto Scaling Group Rolling Updater in Terraform. Here are a few I like:

a.  Using a nested CloudFormation template to manage the Auto Scaling group (and utilizing Auto Scaling’s UpdatePolicy as described above)

b.  Using Terraform’s create_before_destroy and min_elb_capacity parameters to gracefully create replacement Auto Scaling Groups and Launch Configurations

c.  Utilizing the make system as a wrapper to augment the previous approach

Of course, you can always roll your own solution using a combination of AWS services (e.g. SNS, Lambda, Step Functions), or whatever tooling best fits your needs. Creating your own solution will allow you added flexibility if you have additional requirements that can’t be met by CloudFormation, Terraform, or another orchestration tool.

The following is an example framework for performing automated Rolling Updates to Auto Scaling Groups utilizing AWS SNS and AWS Lambda:

a.  An Auto Scaling Launch Config Modifier worker that subscribes to the New AMI messaging layer and performs an update to the Auto Scaling Launch Configuration(s) when a new AMI is released. In this use case, we are using an AWS Lambda function to subscribe to an SNS topic. Upon notification of new AMIs, the worker must then update the predefined (or programmatically derived) Auto Scaling Launch Configurations to use the new AMI. This is best handled by using infrastructure templating tools like CloudFormation or Terraform to make updating the Auto Scaling Launch Configuration ImageId as simple as updating a parameter/variable in the template and performing an update/apply operation.

b.  An Auto Scaling Group Instance Cycler messaging layer (again, an Amazon SNS topic) to be notified when an Auto Scaling Launch Configuration ImageId has been updated by the worker.

c.  An Auto Scaling Group Instance Cycler worker that will handle replacing the Auto Scaling Group instances in a safe, reliable, and automated fashion. For example, another AWS Lambda function that subscribes to the SNS topic and triggers new instances by increasing the Auto Scaling Desired Instance count to twice the current number of ASG instances (a sketch of this worker follows the list below).

d.  Once the scale-up event generated by the Auto Scaling Group Instance Cycler worker has completed and the new instances are reporting as healthy, another message will be published to the Auto Scaling Group Instance Cycler SNS topic indicating scale-up has completed.

e.  The Auto Scaling Group Instance Cycler worker will respond to the prior event and return the Auto Scaling group back to its original size which will terminate the older instances leaving the Auto Scaling Group with only the patched instances launched from the updated AMI. This assumes that we are utilizing the default AWS Auto Scaling Termination Policy which ensures that instances launched from the oldest Launch Configurations are terminated first.

NOTE: The AWS Auto Scaling default termination policy will not guarantee that the older instances are terminated first! If the Auto Scaling Group is spanned across multiple Availability Zones (AZ) and there is an imbalance in the number of instances in each AZ, it will terminate the extra instance(s) in that AZ before terminating based on the oldest Launch Configuration. Terminating on Launch Configuration age will certainly ensure that the oldest instances will be replaced first. My recommendation is to use the OldestInstance termination policy to make absolutely certain that the oldest (i.e. unpatched) instances are terminated during the Instance Cycler scale-down process.  Consult the AWS documentation on the Auto Scaling termination policies for more on this topic.
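Here is a pared-down sketch of what the Instance Cycler worker (step c above) might look like as a Python Lambda function subscribed to the SNS topic. The message fields are assumptions, and a production version would also need to bump MaxSize if necessary, confirm instance health, and handle failures.

# Pared-down sketch of the Instance Cycler Lambda. A "scale_up" message
# doubles the desired capacity; a later "scale_down" message restores the
# original size so the oldest (unpatched) instances are terminated.
import json
import boto3

autoscaling = boto3.client("autoscaling")

def handler(event, context):
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    asg_name = message["asg_name"]          # hypothetical message field

    asg = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"][0]

    if message["action"] == "scale_up":
        # Assumes MaxSize is large enough; a real worker would raise it first.
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=asg_name,
            DesiredCapacity=asg["DesiredCapacity"] * 2,
            HonorCooldown=False,
        )
    elif message["action"] == "scale_down":
        # With the OldestInstance termination policy, the unpatched
        # instances are the ones removed when we shrink back down.
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=asg_name,
            DesiredCapacity=message["original_capacity"],
            HonorCooldown=False,
        )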

In Conclusion

Whichever solution you choose to implement to handle the Rolling Updates to your Auto Scaling Group, the solution outlined above will provide you with a sure-fire way to ensure your Windows Auto Scaled servers are always patched automatically and minimize the operational overhead for ensuring patch compliance and server security. And the good news is that the heavy lifting is already being handled by AWS Auto Scaling and Hashicorp Packer. There is a bit of trickery to getting the Packer configs and provisioners working just right with the EC2 Config service and Windows Sysprep, but there are a number of good examples out on github to get you headed in the right direction. The one I referenced in building our solution can be found here.

One final word of caution... if you do not disable the EC2Config Set Computer Name option when baking a custom AMI, your Windows hostname will ALWAYS be reset to the EC2Config default upon reboot. This is especially problematic for configuration management tools like Puppet or Chef which may use the hostname as the SSL Client Certificate subject name (default behavior), or for deriving the system role/profile/configuration.

Here is my ec2config.ps1 Packer provisioner script which disables the Set Computer Name option:

$EC2SettingsFile="C:\\Program
Files\\Amazon\\Ec2ConfigService\\Settin
gs\\Config.xml"
$xml = [xml](get-content
$EC2SettingsFile)
$xmlElement =
$xml.get_DocumentElement()
$xmlElementToModify =
$xmlElement.Plugins
foreach ($element in
$xmlElementToModify.Plugin)
{
if ($element.name -eq
"Ec2SetPassword")
{
$element.State="Enabled"
}
elseif ($element.name -eq
"Ec2SetComputerName")
{
$element.State="Disabled"
}
elseif ($element.name -eq
"Ec2HandleUserData")
{
$element.State="Enabled"
}
elseif ($element.name -eq
"Ec2DynamicBootVolumeSize")
{
$element.State="Enabled"
}
}
$xml.Save($EC2SettingsFile)

Hopefully, at this point, you have a pretty good idea of how you can leverage existing software, tools, and services—combined with a bit of scripting and automation workflow—to reliably and automatically manage the patching of your Windows Auto Scaling Group EC2 instances!  If you require additional assistance, are resource-bound for getting something implemented, or you would just like the proven Cloud experts to manage Automating Windows Patching of your EC2 Autoscaling Group Instances, contact 2nd Watch today!

 

Disclaimer

We strongly advise that processes like the ones described in this article be performed in a test environment prior to production, to properly validate that the changes have not negatively affected your application’s functionality, performance, or availability.

This is something that your orchestration layer in the first step should be able to handle. This is also something that should integrate well with a Continuous Integration and/or Delivery workflow.

 

-Ryan Kennedy, Principal Cloud Automation Architect, 2nd Watch


Cloud Automation Increases Agility

In April 2017, we sponsored an online survey focused on cloud automation in order to understand if—and how—corporate IT departments are using automation to develop and deliver new workloads and applications. More than 1,000 IT professionals from US companies with at least 1,000 employees participated in the survey. The majority of respondents (56%) said that at least half of their deployment pipelines are now automated, and 63% said they can deploy new applications in less than six weeks.

According to the results of the survey, companies that have embraced cloud automation can deploy new applications and workloads faster and more frequently, while recovering from failures with more agility than organizations that struggle to adopt automated processes, testing, and monitoring. Furthermore, per the survey results, 41% of corporate IT departments are producing more than 10 new cloud workloads every year, and 56% have automated at least half of all their artifact creation and deployment pipelines. Another 66% said that at least half of all their quality assessments (lint, unit tests, etc.) are automated.

“The survey results reiterate what we’re hearing from clients and prospects: automation, driven by cloud technologies, is critical to the rapid delivery of new workloads and applications,” says Jeff Aden, EVP of Marketing & Strategic Business Development & Co-Founder at 2nd Watch. “Companies are automating everything from artifact creation to deployment pipelines and process, which includes metrics, documentation and data. The result is faster time-to-market for new applications, and less application downtime.”

More survey results:

  • 63% said that deploying new applications takes less than six weeks
  • 44% said that deploying new code to production takes a day or less
  • 54% said they are deploying new code changes at least once a week
  • 50% said it takes a day or less to recover from application failure
  • 55% said they are measuring application quality by testing everything

 

Download the infographic highlighting the results of the Cloud Automation survey here.  For questions about how 2nd Watch can help you embrace cloud automation, please contact us today!

 

-Katie Laas-Ellis, Marketing Manager, 2nd Watch


Migrating Terraform Remote State to a “Backend” in Terraform v.0.9+

(AKA: Where the heck did ‘terraform remote config’ go?!!!)

If you are working with cloud-based architectures or working in a DevOps shop, you’ve no doubt been managing your infrastructure as code. It’s also likely that you are familiar with tools like Amazon CloudFormation and Terraform for defining and building your cloud architecture and infrastructure. For a good comparison on Amazon CloudFormation and Terraform check out Coin Graham’s blog on the matter: AWS CFT vs. Terraform: Advantages and Disadvantages.

If you are already familiar with Terraform, then you may have encountered a recent change to the way remote state is handled, starting with Terraform v0.9. Continue reading to find out more about migrating Terraform Remote State to a “Backend” in Terraform v.0.9+.

First off… if you are unfamiliar with what remote state is, check out this page.

Remote state is a big ol’ blob of JSON that stores the configuration details and state of your Terraform configuration and infrastructure that has actually been deployed. This is pretty dang important if you ever plan on changing your environment (which is “likely”, to put it lightly) and especially important if you want to have more than one person managing/editing/maintaining the infrastructure, or if you have even the most basic rationale as it pertains to backup and recovery.

Terraform supports almost a dozen backend types (as of this writing) including:

  • Artifactory
  • Azure
  • Consul
  • etcd
  • GCS
  • HTTP
  • Manta
  • S3
  • Swift
  • Terraform Enterprise (AKA: Atlas)

 

Why not just keep the Terraform state in the same git repo I keep the Terraform code in?

You don’t want to store the state file in a code repository because it may contain sensitive information like DB passwords, or simply because the state is prone to frequent changes and it might be easy to forget to push those changes to your git repo any time you run Terraform.

So, what happened to terraform remote anyway?

If you’re like me, you probably run the latest version of HashiCorp’s Terraform tool as soon as it is available (we actually have a hook in our team Slack channel that notifies us when a new version is released). With the release of Terraform v.0.9 last month, we were endowed with the usual heaping helping of excellent new features and bug-fixes we’ve come to expect from the folks at HashiCorp, but were also met with an unexpected change in the way remote state is handled.

Unless you are religious about reading the release notes, you may have missed an important change in v.0.9 around remote state. The release notes don’t specifically call out the removal (not even deprecation, but FULL removal) of the prior method (i.e. terraform remote config), but the Upgrade Guide does call out the process of migrating from the legacy method to the new method of managing remote state. More specifically, it provides a link to a guide for migrating from the legacy remote state config to the new backend system. The steps are pretty straightforward, and the new approach is much improved over the prior method for managing remote state. So, while the change is good, a deprecation warning in v.0.8 would have been much appreciated. At least it is still backwards compatible with the legacy remote state files (up to version 0.10), making the migration process much less painful.

Prior to v.0.9, you may have been managing your Terraform remote state in an S3 bucket utilizing the Terraform remote config command. You could provide arguments like: backend and backend-config to configure things like the S3 region, bucket, and key where you wanted to store your remote state. Most often, this looked like a shell script in the root directory of your Terraform directory that you ran whenever you wanted to initialize or configure your backend for that project.

Something like…

Terraform Legacy Remote S3 Backend Configuration Example
#!/bin/sh
export AWS_PROFILE=myprofile
terraform remote config \
--backend=s3 \
--backend-config="bucket=my-tfstates" \
--backend-config="key=projectX.tfstate" \
--backend-config="region=us-west-2"

This was a bit clunky but functional. Regardless, it was rather annoying having some configuration elements outside of the normal terraform config (*.tf) files.

Along came Terraform v.0.9

The introduction of Terraform v.0.9 with its newfangled “Backends” makes things much more seamless and transparent. Now we can replicate that same remote state backend configuration with a backend block in a Terraform configuration like so:

Terraform S3 Backend Configuration Example
terraform {
  backend "s3" {
    bucket = "my-tfstates"
    key    = "projectX.tfstate"
    region = "us-west-2"
  }
}
A Migration Example

So, using our examples above, let’s walk through the process of migrating from a legacy “remote config” to a “backend”.  Detailed instructions for the following can be found here.

1. (Prior to upgrading to Terraform v.0.9+) Pull remote config with pre v.0.9 Terraform

> terraform remote pull
Local and remote state in sync

2. Backup your terraform.tfstate file

> cp .terraform/terraform.tfstate 
/path/to/backup

3. Upgrade Terraform to v.0.9+

4. Configure the new backend

terraform {
  backend "s3" {
    bucket = "my-tfstates"
    key    = "projectX.tfstate"
    region = "us-west-2"
  }
}

5. Run Terraform init

> terraform init
Downloading modules (if any)...
 
Initializing the backend...
New backend configuration detected with legacy remote state!
 
Terraform has detected that you're attempting to configure a new backend.
At the same time, legacy remote state configuration was found. Terraform will
first configure the new backend, and then ask if you'd like to migrate
your remote state to the new backend.
 
 
Do you want to copy the legacy remote state from "s3"?
  Terraform can copy the existing state in your legacy remote state
  backend to your newly configured backend. Please answer "yes" or "no".
 
  Enter a value: no
  
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Terraform has been successfully initialized!
 
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
 
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your environment. If you forget, other
commands will detect it and remind you to do so if necessary.

6. Verify the new state is copacetic

> terraform plan
 
...
 
No changes. Infrastructure is up-to-date.
 
This means that Terraform did not detect any differences between your
configuration and real physical resources that exist. As a result, Terraform
doesn't need to do anything.

7.  Commit and push

In closing…

Managing your infrastructure as code isn’t rocket science, but it also isn’t trivial.  Having a solid understanding of cloud architectures, the Well Architected Framework, and DevOps best practices can greatly impact the success you have.  A lot goes into architecting and engineering solutions in a way that maximizes your business value, application reliability, agility, velocity, and key differentiators.  This can be a daunting task, but it doesn’t have to be!  2nd Watch has the people, processes, and tools to make managing your cloud infrastructure as code a breeze! Contact us today to find out how.

 

— Ryan Kennedy, Principal Cloud Automation Architect, 2nd Watch

 


Budgets: The Simple Way to Encourage Cloud Cost Accountability

Controlling costs is one of the greatest challenges facing IT and Finance managers today.  The cloud, by nature, makes it easy to spin up new environments and resources that can cost thousands of dollars each month. And, while there are many ways to help control costs, one of the simplest and most effective methods is to set and manage cloud spend-to-budget. While most enterprise budgets are set at a business unit or department level, for cloud spend, mapping that budget down to the workload can establish strong accountability within the organization.

One popular method that workload owners use to manage spend is to track month-over-month cost variances.  However, if costs do not drastically increase from one month to another, this method does very little to control spend; it is often not until a department is faced with budget issues that workload owners work diligently to reduce costs.  When budgets are set for each workload, on the other hand, owners become more aware of how their cloud spend impacts the company financials and tend to manage their costs more carefully.

In this post, we provide four easy steps to help you manage workload spend-to-budget effectively.

Step 1: Group Your Cloud Resources by Workload and Environment

Use a financial management tool such as 2nd Watch CMP Finance Manager to group your cloud resources by workload and environment (Test, Dev, Prod).  This can easily be accomplished by creating a standard where each workload/environment has its own cloud account, or by using tags to identify the resources associated with each workload. If using tags, use a tag for the workload name such as workload_name: and a tag for the environment such as environment:. More tagging best practices can be found here.
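If you go the tagging route, applying the workload and environment tags is straightforward through the API. Here is a minimal boto3 sketch; the instance ID and tag values are placeholders.

# Minimal sketch: apply workload and environment tags to an EC2 instance.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

ec2.create_tags(
    Resources=["i-0123456789abcdef0"],  # hypothetical instance ID
    Tags=[
        {"Key": "workload_name", "Value": "web-store"},
        {"Key": "environment", "Value": "prod"},
    ],
)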

Step 2: Group Your Workloads and Environments by Business Group

Once your resources are grouped by workload/environment, CMP Finance Manager will allow you to organize your workload/environments into business groups. For example:

a. Business Group 1
   i. Workload A
      1. Workload A Dev
      2. Workload A Test
      3. Workload A Prod
   ii. Workload B
      1. Workload B Dev
      2. Workload B Test
      3. Workload B Prod
b. Business Group 2
   i. Workload C
      1. Workload C Dev
      2. Workload C Test
      3. Workload C Prod
   ii. Workload D
      1. Workload D Dev
      2. Workload D Test
      3. Workload D Prod

Step 3: Set Budgets

At this point, you are ready to set up budgets for each of your workloads (each workload/environment and the total workload as you may have different owners). We suggest you set annual budgets aligned to your fiscal year and have the tool you use programmatically recalculate the budget at the end of each month with the amount remaining in your annual budget.
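The recalculation itself is simple arithmetic: spread whatever is left of the annual budget evenly across the remaining months of the fiscal year. A tiny sketch, with illustrative figures only:

# Tiny sketch of the monthly budget recalculation described above.
def monthly_budget(annual_budget, spent_to_date, months_remaining):
    """Recompute the current month's budget from the unspent annual amount."""
    remaining = max(annual_budget - spent_to_date, 0)
    return remaining / months_remaining

# e.g. a $120,000 annual budget with $75,000 spent and 4 months left
print(monthly_budget(120000, 75000, 4))  # $45,000 left over 4 months = $11,250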

Step 4: Create Alerts

The final step is to create alerts to notify owners and yourself when workloads either have exceeded or are on track to exceed the current month or annual budget amount.  Here are some budget notifications we recommend:

  1. Month-end (ME) forecast exceeds the month budget
  2. Month-to-date (MTD) spend exceeds the MTD budget
  3. MTD spend exceeds the month budget
  4. Daily spend exceeds the daily budget
  5. Year-end (YE) forecast exceeds the year budget
  6. Year-to-date (YTD) spend exceeds the YE budget

Once alerts are set, owners can make timely decisions regarding spend.  The owner can now proactively shift to spot instances, purchase reserved instances, change instance sizes, park the environment when not in use, or even refactor the application to take advantage of cloud native services like AWS Lambda.

Our experience has shown that enterprises that diligently set up and manage spend-to-budget by workload have more control of their costs and ultimately, spend less on their cloud environments without sacrificing user experience.

 

–Timothy Hill, Senior Product Manager, 2nd Watch


Managing AWS Billing

Without a doubt, AWS has fundamentally changed how modern enterprises deploy IT infrastructure.  Their services are flexible, cost effective, scalable, secure, and reliable. And while moving from on-premises data centers to the cloud is, in most cases, the smart move, once you are there, managing your costs becomes much more complex.

On-premises costs are straightforward: enterprises purchase servers and amortize their costs over the expected life.  Shared services such as internet access, racks, power, and cooling are proportionally allocated to the cost of each server. AWS, on the other hand, invoices each usage type separately.  For example, if you are running a basic EC2 instance, you will not only be charged for the EC2 box usage but also for the data transfer, EBS storage, and associated snapshots. You could end up with as many as 13 line items of cost for a single EC2 instance.

Example: Pricing line items for a single c4.xlarge Linux virtual machine running in the US East Region

When examining the composition of various workload types, the number of line items to manage will vary. A traditional VM-based workload may have 50 cost line items for every $1,000 of spend, while an agile, cloud-native workload may have as many as 500 per $1,000, and a dynamic workload leveraging spot instances may have upwards of 1,200 per $1,000. This “parts bin” approach to pricing makes the job of cost accounting challenging.

To address this complexity and enable accurate accounting of your cloud costs, we recommend creating a business-relevant financial tagging schema to organize your resources and associated cost line items based on your specific financial accounting structure.

Here are some recommended financial management tags you should consider:

The integrity of your AWS tagging data is extremely important in ensuring the quality of the information it provides, and it depends directly on the rigor applied in adopting a systematic and disciplined approach to tagging.

Financial Management Tagging – Best Practices

  • Create a framework or standard for your enterprise that outlines required tag names, tag formatting rules, and governance of tags.
  • Tags should be enforced and automated at startup of the resource via CloudFormation templates or other infrastructure as code tools, such as Terraform, to ensure cost accounting details are captured from time of launch.
    • NOTE: Tags are point-in-time based. If a resource is launched without being tagged and then tagged sometime in the future, all hours the resource ran prior to being tagged will not be included in tag reports in the AWS console.
  • Manually creating tags and associated values is strongly discouraged, as it leads to mis-tagged and untagged resources and inaccurate cost accounting.
  • Select all-uppercase or all-lowercase keys and values to avoid discrepancies with capitalization.
    • NOTE: “Production” and “production” are considered two different tag names or values.
  • Monitor resources with AWS Config Rules and alert on newly created resources that are not tagged (see the sketch below).
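As a simple illustration of that last point, the following Python sketch performs the kind of check an AWS Config rule or a scheduled Lambda might run, flagging running EC2 instances that are missing required tags. The required tag keys shown here are examples.

# Sketch: flag running EC2 instances that are missing required tags.
import boto3

REQUIRED_TAGS = {"workload_name", "environment", "owner"}  # example keys

ec2 = boto3.client("ec2", region_name="us-west-2")

paginator = ec2.get_paginator("describe_instances")
pages = paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)
for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {tag["Key"] for tag in instance.get("Tags", [])}
            missing = REQUIRED_TAGS - tags
            if missing:
                print("{} is missing tags: {}".format(
                    instance["InstanceId"], ", ".join(sorted(missing))))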

Once your tagging schema is created, automation is in place to tag resources during startup, and alerts are set up to ensure tagging is managed, you can accurately view, track, and report your cost and usage along any of your tagging dimensions.
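For example, once a tag like workload_name has been activated as a cost allocation tag, month-to-date cost per workload can be pulled programmatically. A minimal boto3 sketch using the Cost Explorer API (the dates and tag key are placeholders):

# Minimal sketch: month-to-date unblended cost grouped by the workload tag.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2017-06-01", "End": "2017-07-01"},  # End is exclusive
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "workload_name"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])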

Financial Management Reporting – Best Practices

  • Using your tagging schema, group your resources by workload.
  • Apply Reserved Instance discounts to the workloads you intended them for.
    • NOTE: 2nd Watch’s CMP Finance Manager tool converts reserved instances into resources so that you can add them to the workload they were intended for.
  • Organize your groups to match your specific multi-level financial reporting structure.
  • Managed shared resources
    • Create groups for shared resources. If you have resources that are shared across multiple workloads, such as a database used by multiple applications or virtual machines with more than one application running on them, create groups to capture these costs and allocate them proportionally to the applications using them.
  • Manage un-taggable resources
    • Create a group for un-taggable resources. Some AWS resources are not taggable and should be grouped together and their associated costs proportionally allocated to all applications.
  • Manage spend to budget
    • Create budgets and budget alerts for each group to ensure you stay in budget throughout the year.
    • Key alerts
      • Forecasted month end cost exceeds alert threshold
      • MTD cost is over alert threshold
      • Forecasted year end cost exceeds alert threshold
      • YTD cost is over alert threshold
    • Sign up to receive monthly cost and usage reports for integration into your internal cost accounting system.
      • Cost by application, environment, business unit etc.

 

Even though AWS’ “parts bin” approach to pricing is complicated, following these guidelines will help ensure accurate cost accounting of your cloud spend.

 

–Timothy Hill, Senior Product Manager, 2nd Watch

 

 
