We recently took a DevOps poll of 1,000 IT professionals to get a pulse on where the industry sits regarding the adoption and completeness of vision around DevOps. The results were pretty interesting, and overall we can deduce that a large majority of the organizations that answered the survey are not truly practicing DevOps. Part of this may be due to a lack of clarity on what DevOps really is, so I’ll take a second to summarize it as succinctly as possible here.
DevOps is the practice of operations and development engineers participating together in the entire service lifecycle, from design through the development process to production support. This includes, but is not limited to, the culture, tools, organization, and practices required to accomplish this amalgamated methodology of delivering IT services.
In order to practice DevOps you must be in a DevOps state of mind and embrace its values and mantras unwaveringly.
The first thing that jumped out at me from our survey was the responses to the question “Within your organization, do separate teams manage infrastructure/operations and application development?” 78.2% of respondents answered “Yes” to that question. Truly practicing DevOps requires that the infrastructure and applications are managed within the context of the same team, so we can deduce that at least 78.2% of the respondents’ companies are not truly practicing DevOps. Perhaps they are using some infrastructure-as-code tools, some forms of automation, or even have CI/CD pipelines in place, but those things alone do not define DevOps.
Speaking of infrastructure-as-code… Another question, “How is your infrastructure deployed and managed?” had nearly 60% of respondents answering that they were utilizing infrastructure-as-code tools (e.g. Terraform, Configuration Management, Kubernetes) to manage their infrastructure, which is positive, but shows the disconnect between the use of DevOps tools and actually practicing DevOps (as noted in the previous paragraph). On the other hand, just over 38% of respondents indicated that they are managing infrastructure manually (e.g. through the console), which means not only are they not practicing DevOps, they aren’t even managing their infrastructure in a way that will ever be compatible with DevOps… yikes. The good news is that tools like Terraform allow you to import existing manually deployed infrastructure where it can then be managed as code and handled as “immutable infrastructure.” Manually deploying anything is a DevOps anti-pattern and must be avoided at all costs.
Aside from infrastructure we had several questions around application development and deployment as it pertains to DevOps. Testing code appears to be an area where a majority of respondents are staying proactive in a way that would be beneficial to a DevOps practice. The question “What is your approach to writing tests?” had the following breakdown on its answers:
We don’t really test: 10.90%
We get to it if/when we have time: 15.20%
We require some percentage of code to be covered by tests before it is ready for production: 32.10%
We require comprehensive unit and integration testing before code is pushed to production: 31.10%
Rigid TDD/BDD/ATDD/STDD approach – write tests first & develop code to meet those test requirements: 10.70%
We can see that around 75% of respondents are doing some form of consistent testing, which will go a long way in helping build out a DevOps practice, but a staggering 25% of respondents have little or no testing of code in place today (ouch!). Another question “How is application code deployed and managed?” shows that around 30% of respondents are using a completely manual process for application deployment and the remaining 70% are using some form of an automated pipeline. Again, the 70% is a positive sign for those wanting to embrace DevOps, but there is still a massive chunk at 30% who will have to build out automation around testing, building, and deploying code.
Another important factor in managing services the DevOps way is to have all your environments mirror each other. In response to the question “How well do your environments (e.g. dev, test, prod) mirror one another?” around 28% of respondents indicated that their environments are managed completely independently of each other. Another 47% indicated that “they share some portion of code but are not managed through identical code bases and processes,” and the remaining 25% are doing it properly by “managed identically using same code & processes employing variables to differentiate environments.” Lots of room for improvement in this area when organizations decide they are ready to embrace the DevOps way.
Our last question in the survey was “How are you notified when an application/process/system fails?” and I found the answers a bit staggering. Over 21% of respondents indicated that they are notified of outages by the end user. It’s pretty surprising to see that large of a percentage utilizing such a reactionary method of service monitoring. Another 32% responded that “someone in operation is watching a dashboard,” which isn’t as surprising but will definitely be something that needs to be addressed when shifting to a DevOps approach. Another 23% are using third-party tools like NewRelic and Pingdom to monitor their apps. Once again, we have that savvy ~25% group who are currently operating in a way that bodes well for DevOps adoption by answering “Monitoring is built into the pipeline, apps and infrastructure. Notifications are sent immediately.” The twenty-five-percenters are definitely on the right path if they aren’t already practicing DevOps today.
In summary, we have been able to deduce from our survey that, at best, around 25% of the respondents are actually engaging in a DevOps practice today. For more details on the results of our survey, download our infographic.
-Ryan Kennedy, Principal Cloud Automation Architect
Governance, Risk and Compliance (GRC) is a standard framework that helps drive organizations toward a common set of goals and principles. The overarching theme is strategically focused on how technology utilization and operations tie directly back to an organization’s business goals and, in many cases, aspirations.
There are many facets to GRC. In the cloud it means the same thing as it did in the datacenter. We need to ensure IT organizes around the business, and we need to make sure risk is minimized and compliance is maintained.
At 2nd Watch we work with clients across all areas of GRC. Clients take various levels of focus in each area, and some areas are more important based on the vertical the client is operating in.
The cloud extends beyond the physical bounds of an organization, and with that comes new challenges and the need for a shared cloud responsibility model. The CSP is responsible for the underlying infrastructure setup and physical maintenance of its cloud infrastructure. We work with our cloud ISVs’ and providers’ tools, technologies and best practices to help maintain strong governance and lower risk while meeting compliance.
The landscape of software, tools and solutions to support governance, risk and compliance in the cloud marketplace is daunting. 2nd Watch focuses on providing holistic support to our clients around GRC. We believe there are fantastic capabilities directly inside the cloud management portals to help customers along the journey to a strong GRC framework and practice.
In Microsoft Azure we can utilize Compliance Manager. Compliance Manager is a workflow-based assessment tool that enables organizations to track, assign and verify regulatory compliance procedures and activities in support of Microsoft Cloud technologies – including Office 365 and Dynamics. It supports ISO 27001, ISO 27018 and NIST and supports regulatory compliance around HIPAA and GDPR. It is a foundational tool to utilize within Microsoft Azure to help you along the path to achieving strong governance, risk and compliance around Microsoft Cloud technologies.
With Amazon Web Services we have a complete set of core cloud operations management tools to utilize within the AWS console to help us bolster governance and security and reduce risk. Amazon provides resources with a full prescriptive set of compliance quick reference guides, which provide an overview of how to maintain a cloud compliant environment through strong security and controls validation, and insight and monitoring for activity and security assurance.
Amazon has a complete Cloud Compliance Center where clients can tap into an abundant set of resources to help along the way.
Beyond the tools, both Microsoft Azure and AWS provide strategic support with partners around compliance. There are many accelerators and programs that organizations can request from Amazon and Microsoft to help them achieve and maintain GRC specifically tuned to the cloud.
GRC is unique to each organization. Cloud providers bring a substantial set of resources and technologies, along with great prescriptive guidance and best practices to help and guide you in achieving a strategic GRC framework and set of processes and procedures in your organization.
Take advantage of these built-in capabilities as you start to look at other tools and technologies to complete your holistic approach to governance, risk and compliance, and please reach out to 2nd Watch to find out how we can support you along the way.
With increased focus on security and governance in today’s digital economy, I want to highlight a simple but important use case that demonstrates how to use AWS Identity and Access Management (IAM) with Security Token Service (STS) to give trusted AWS accounts access to resources that you control and manage.
Security Token Service is an extension of IAM and is one of several web services offered by AWS that does not incur any cost to use. But, unlike IAM, there is no user interface on the AWS console to manage and interact with STS. Rather, all interaction is done entirely through one of several extensive SDKs or directly using common HTTP protocol. I will be using Terraform to create some simple resources in my sandbox account and the .NET Core SDK to demonstrate how to interact with STS.
The main purpose and function of STS is to issue temporary security credentials for AWS resources to trusted and authenticated entities. These credentials operate identically to the long-term keys that typical IAM users have, with a couple of special characteristics:
They automatically expire and become unusable after a short and defined period of time elapses
They are issued dynamically
These characteristics offer several advantages in terms of application security and development and are useful for cross-account delegation and access. STS solves two problems for owners of AWS resources:
You do not need to distribute access keys to external entities or store them within an application
You do not need to rotate or revoke the credentials you hand out – they are issued dynamically and expire automatically, so a leaked credential has a very limited window of usefulness
One common scenario where STS is useful involves sharing resources between AWS accounts. Let’s say, for example, that your organization captures and processes data in S3, and one of your clients would like to push large amounts of data from resources in their AWS account to an S3 bucket in your account in an automated and secure fashion.
While you could create an IAM user for your client, your corporate data policy requires that you rotate access keys on a regular basis, and this introduces challenges for automated processes. Additionally, you would like to limit the distribution of access keys to your resources to external entities. Let’s use STS to solve this!
To get started, let’s create some resources in your AWS cloud. Do you even Terraform, bro?
Let’s create a new S3 bucket and set the bucket ACL to be private, meaning nobody but the bucket owner (that’s you!) has access. Remember that bucket names must be unique across all existing buckets, and they should comply with DNS naming conventions. Here is the Terraform HCL syntax to do this:
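Here is a minimal sketch of what that might look like (the bucket name below matches the example referenced later and the region is us-west-2 – adjust both for your environment):

provider "aws" {
  region = "us-west-2"
}

# Private bucket: only the bucket owner has access by default.
resource "aws_s3_bucket" "xaccount_bucket" {
  bucket = "d4h2123b9-xaccount-bucket"   # must be globally unique and DNS-compliant
  acl    = "private"
}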
Great! We now have a bucket… but for now, only the owner can access it. This is a good start from a security perspective (i.e. “least permissive” access).
What an empty bucket may look like
Let’s create an IAM role that, once assumed, will allow IAM users with access to this role to have permissions to put objects into our bucket. Roles are a secure way to grant trusted entities access to your resources. You can think about roles in terms of a jacket that an IAM user can wear for a short period of time, and while wearing this jacket, the user has privileges that they wouldn’t normally have when they aren’t wearing it. Kind of like a bright yellow Event Staff windbreaker!
For this role, we will specify that users from our client’s AWS account are the only ones that can wear the jacket. This is done by including the client’s AWS account ID in the Principal statement. AWS Account IDs are not considered to be secret, so your client can share this with you without compromising their security. If you don’t have a client but still want to try this stuff out, put your own AWS account ID here instead.
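A sketch of that role in Terraform might look like the following (the account ID 111122223333 is a placeholder for your client’s account ID, and the role name matches the example ARN referenced later):

# Only principals in the client's AWS account (the Principal below) may assume this role.
resource "aws_iam_role" "sts_delegate_role" {
  name = "sts-delegate-role"

  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}
POLICY
}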
Great, now we have a role that our trusted client can wear. But, right now our client can’t do anything except wear the jacket. Let’s give the jacket some special powers, such that anyone wearing it can put objects into our S3 bucket. We will do this by creating a security policy for this role. This policy will specify what exactly can be done to S3 buckets that it is attached to. Then we will attach it to the bucket we want our client to use. Here is the Terraform syntax to accomplish this:
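One way to express this is as a bucket policy that names the role as its principal – a sketch:

resource "aws_s3_bucket_policy" "xaccount_bucket_policy" {
  bucket = "${aws_s3_bucket.xaccount_bucket.id}"

  policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowDelegatedPut",
      "Effect": "Allow",
      "Principal": { "AWS": "${aws_iam_role.sts_delegate_role.arn}" },
      "Action": "s3:PutObject",
      "Resource": "${aws_s3_bucket.xaccount_bucket.arn}/*",
      "Condition": {
        "StringEquals": { "s3:x-amz-acl": "bucket-owner-full-control" }
      }
    }
  ]
}
POLICY
}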
A couple of things to note about this snippet – First, we are using Terraform interpolation to inject values from previous Terraform statements into a couple of places in the policy – specifically the ARNs from the role and bucket we created previously. Second, we are specifying a condition for the S3 policy – one that requires a specific object ACL for the action s3:PutObject, which is accomplished by requiring the HTTP request header x-amz-acl to have a value of bucket-owner-full-control on the PUT object request. By default, objects PUT in S3 are owned by the account that created them, even if they are stored in someone else’s bucket. For our scenario, this condition will require your client to explicitly grant ownership of objects placed in your bucket to you, otherwise the PUT request will fail.
So, now we have a bucket, a policy in place on our bucket, and a role that our client can assume. Now your client needs to get to work writing some code that will allow them to assume the role (wear the jacket) and start putting objects into your bucket. Your client will need to know a couple of things from you before they get started:
The bucket name and the region it was created in (the example above created a bucket named d4h2123b9-xaccount-bucket in us-west-2)
The ARN for the role (Terraform can output this for you). It will look something like this but will have your actual AWS Account ID: arn:aws:iam::123456789012:role/sts-delegate-role
They will also need to create an IAM User in their account and attach a policy allowing the user to assume roles via STS. The policy will look similar to this:
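A sketch, using the example role ARN from above (swap in the real ARN the resource owner provides):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::123456789012:role/sts-delegate-role"
    }
  ]
}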
Let’s help your client out a bit and provide some C# code snippets for .NET Core 2.0 (available for Windows, macOS and Linux). To get started, install the .NET SDK for your OS, then fire up a command prompt in a favorite directory and run these commands:
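Something along these lines (the package names are the standard AWS SDK for .NET packages on NuGet):

dotnet new console -o s3cli
cd s3cli
dotnet add package AWSSDK.Core
dotnet add package AWSSDK.SecurityToken
dotnet add package AWSSDK.S3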
The first command will create a new console app in the subdirectory s3cli. Then switch context to that directory and import the AWS SDK for .NET Core, and then add packages for SecurityToken and S3 services.
Once you have the libraries in place, fire up your favorite IDE or text editor (I use Visual Studio Code), then open Program.cs and add some code:
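Something like the following – a sketch rather than the exact original code, assuming the method lives inside the Program class and that the AWSSDK packages added above are installed:

using System;
using System.Threading.Tasks;
using Amazon.SecurityToken;
using Amazon.SecurityToken.Model;

// Request temporary credentials from STS by assuming the cross-account role.
// The calling IAM user's long-term keys are resolved from the default credential
// chain (environment variables, shared credentials file, etc.).
static async Task<Credentials> AssumeDelegateRoleAsync(string roleArn)
{
    using (var stsClient = new AmazonSecurityTokenServiceClient())
    {
        var request = new AssumeRoleRequest
        {
            RoleArn = roleArn,
            RoleSessionName = "xaccount-put-session", // arbitrary label for this session
            DurationSeconds = 900                     // the credentials expire automatically
        };

        AssumeRoleResponse response = await stsClient.AssumeRoleAsync(request);
        return response.Credentials; // AccessKeyId, SecretAccessKey, SessionToken, Expiration
    }
}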
This snippet sends a request to STS for temporary credentials using the specified ARN. Note that the client must provide IAM user credentials to call STS, and that IAM user must have a policy applied that allows it to assume a role from STS.
This next snippet takes the STS credentials, bucket name, and region name, and then uploads the Program.cs file that you’re editing and assigns it a random key/name. Also note that it explicitly applies the Canned ACL that is required by the sts-delegate-role:
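A sketch of that upload logic (it sits alongside the method above in Program.cs, with the extra using directives moved to the top of the file):

using Amazon;
using Amazon.S3;
using Amazon.S3.Model;

// Use the temporary STS credentials to PUT Program.cs into the bucket under a random key.
static async Task UploadWithTemporaryCredentialsAsync(Credentials stsCredentials, string bucketName, string regionName)
{
    var s3Client = new AmazonS3Client(
        stsCredentials.AccessKeyId,
        stsCredentials.SecretAccessKey,
        stsCredentials.SessionToken,
        RegionEndpoint.GetBySystemName(regionName));

    var putRequest = new PutObjectRequest
    {
        BucketName = bucketName,
        Key = Guid.NewGuid().ToString(),               // random key/name
        FilePath = "Program.cs",                       // the file you are editing right now
        CannedACL = S3CannedACL.BucketOwnerFullControl // required by the bucket policy on sts-delegate-role
    };

    PutObjectResponse response = await s3Client.PutObjectAsync(putRequest);
    Console.WriteLine($"PUT completed with HTTP status {response.HttpStatusCode}");
}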
So, to put this all together, run this code block and make the magic happen! Of course, you will have to define and provide proper variable values for your environment, including securely storing your credentials.
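A sketch of how the pieces might be wired up in Main (the ARN, bucket and region are the example values from earlier – swap in your own, and pull the IAM user keys from a secure store rather than hard-coding anything):

static void Main(string[] args)
{
    string roleArn = "arn:aws:iam::123456789012:role/sts-delegate-role";
    string bucket  = "d4h2123b9-xaccount-bucket";
    string region  = "us-west-2";

    // Assume the role, then use the short-lived credentials to upload.
    Credentials temporaryCredentials = AssumeDelegateRoleAsync(roleArn).GetAwaiter().GetResult();
    UploadWithTemporaryCredentialsAsync(temporaryCredentials, bucket, region).GetAwaiter().GetResult();
}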
Try it out from the command prompt:
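That is, from within the s3cli project directory:

dotnet run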
If all goes well, you will have a copy of Program.cs in the bucket. Not very useful itself, but it illustrates how to accomplish the task.
What a bucket with something in it may look like
Here is a high-level diagram of what we put together:
Putting it all together
Your client uses their IAM user credentials to call AWS STS and request temporary credentials for the role ARN you gave them.
STS authenticates the client’s IAM user and verifies the policy for the ARN role, then issues a temporary credential to the client.
The client can use the temporary credentials to access your S3 bucket (they will expire soon), and since they are now wearing the Event Staff jacket, they can successfully PUT stuff in your bucket!
There are many other use-cases for STS. This is just one very simplistic example. However, with this brief introduction to the concepts, you should now have a decent idea of how STS works with IAM roles and policies, and how you can use STS to give access to your AWS resources for trusted entities. For more tips like this, contact us.
I love an all-you-can-eat buffet. There’s a ton to choose from, and you can eat as much, or as little, as you want for a fixed price.
In the same regard, I love the freedom and vast array of technologies that the cloud affords you. A technological all-you-can-eat buffet, if you will. However, there is no fixed price when it comes to the cloud. You pay for every resource! And as you can imagine, it can become quite costly if you are not mindful.
So, how do organizations govern and ensure that their cloud spend is managed efficiently? Well, in Microsoft’s Azure cloud you can mitigate this issue using Azure resource policies.
Azure resource policies allow you to define what, where or how resources are provisioned, thus allowing an organization to set restrictions and enable some granular control over their cloud spend.
Azure resource policies allow an organization to control things like:
Where resources are deployed – Azure has more than 20 regions all over the world. Resource policies can dictate what regions their deployments should remain within.
Virtual Machine SKUs – Resource policies can define only the VM sizes that the organization allows.
Azure resources – Resource policies can define the specific resources that are within an organization’s supportable technologies and restrict others that are outside the standards. For instance, if your organization supports SQL and Oracle databases but not Cosmos DB or MySQL, resource policies can enforce these standards.
OS types – Resource policies can define which OS flavors and versions are deployable in an organization’s environment. No longer support Windows Server 2008, or want to limit the Linux distros to a small handful? Resource policies can assist.
Azure resource policies are applied at the resource group or the subscription level. This allows granular control of the policy assignments. For instance, in a non-prod subscription you may want to allow non-standard and non-supported resources to allow the development teams the ability to test and vet new technologies, without hampering innovation. But in a production environment standards and supportability are of the utmost importance, and deployments should be highly controlled. Policies can also be excluded from a scope. For instance, an application that requires a non-standard resource can be excluded at the resource level from the subscription policy to allow the exception.
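As a hedged illustration of how an assignment with an exclusion might look using the Azure CLI (the policy definition, parameter values and scopes below are all placeholders):

az policy assignment create \
  --name "allowed-locations" \
  --display-name "Restrict deployments to approved regions" \
  --policy "<policy-definition-name-or-id>" \
  --params '{ "listOfAllowedLocations": { "value": ["westus2", "eastus"] } }' \
  --scope "/subscriptions/<subscription-id>" \
  --not-scopes "/subscriptions/<subscription-id>/resourceGroups/sandbox-rg"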
A number of pre-defined Azure resource policies are available for your use, including:
Allowed locations – Used to enforce geo-location requirements by restricting which regions resources can be deployed in.
Allowed virtual machine SKUs – Restricts the virtual machine sizes/SKUs that can be deployed to a predefined set of SKUs. Useful for controlling costs of virtual machine resources.
Enforce tag and its value – Requires resources to be tagged. This is useful for tracking resource costs for purposes of department chargebacks.
Not allowed resource types – Identifies resource types that cannot be deployed. For example, you may want to prevent a costly HDInsight cluster deployment if you know your group would never need it.
Azure also allows custom resource policies when you need a restriction not covered by a built-in policy. A policy definition is described using JSON and includes a policy rule.
This JSON example denies a storage account from being created without blob encryption being enabled:
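Here is a sketch of such a definition, based on the classic blob-encryption example (the exact property alias can vary by provider API version):

{
  "if": {
    "allOf": [
      {
        "field": "type",
        "equals": "Microsoft.Storage/storageAccounts"
      },
      {
        "field": "Microsoft.Storage/storageAccounts/enableBlobEncryption",
        "equals": "false"
      }
    ]
  },
  "then": {
    "effect": "deny"
  }
}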
There is a feature in the Linux kernel, relevant to VMs hosted on Xen servers, called the “steal percentage.” When the guest OS requests CPU time from the host system and the host CPU is currently tied up with another VM, the Xen server sends an increment to the guest Linux instance, which increases the steal percentage. This is a great feature as it shows exactly how busy the host system is, and it is available on many AWS instance types since AWS hosts them on Xen. It is said that Netflix will terminate an AWS instance when the steal percentage crosses a certain threshold and start it up again, which causes the instance to spin up on a new host server – a proactive step to ensure their systems are utilizing their resources to the fullest.
It turns out there is a bug in Linux kernel versions 4.8, 4.9 and 4.10 where the steal percentage can be corrupted during a live migration on the physical Xen server, which causes the CPU utilization to be reported as 100% by the agent.
When looking at Top you will see something like this:
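Something along these lines – this is an illustrative mock-up rather than an actual capture, but the giveaway is the absurd steal value at the end of the %Cpu(s) line:

top - 10:12:01 up 34 days,  3:14,  1 user,  load average: 0.08, 0.05, 0.01
%Cpu(s):  1.3 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 4294967294.0 st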
As you can see in the screenshot of top, the %st metric on the CPU(s) line shows an obviously incorrect number.
During a live migration on the physical Xen server, the steal time gets a little out of sync and ends up being decremented. If the value was already at or close to zero, it becomes negative and, due to type conversions in the code, it overflows.
CloudWatch’s CPU Utilization monitor calculates that utilization by adding the System and User percentages together. However, this only gives a partial view into your system. With our agent, we can see what the OS sees.
That is the Steal percentage spiking due to that corruption. Normally this metric could be monitored and actioned as desired, but with this bug it causes noise and false positives. If Steal were legitimately high, then the applications on that instance would be running much slower.
There is some discussion online about how to fix this issue, and there are some kernel patches to say “if the steal time is less than zero, just make it zero.” Eventually this fix will make it through the Linux releases and into the latest OS version, but until then it needs to be dealt with.
We have found that a reboot will clear the corrupted percentage. The other option is to patch the kernel… which also requires a reboot. If a reboot is just not possible at the time, the only impact to the system is that it makes monitoring the steal percentage impossible until the number is reset.
It is not a very common issue, but due to the large number of instances we monitor here at 2nd Watch, it is something that we’ve come across frequently enough to investigate in detail and develop a process around.
If you have any questions as to whether or not your servers hosted in the cloud might be affected by this issue, please contact us to discuss how we might be able to help.
Welcome to Full Disclosure, our new video blog series with expert technical tips and tricks for navigating the world of the cloud.
In our first vlog we discuss what DevOps is, what the origins of the method and the term are, and how you can get started. At 2nd Watch, we define DevOps as a cultural change. Learn more about the myths of DevOps with Lars Cromley, 2nd Watch Director of Engineering.
Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.
In practice, this means that Terraform allows you to declare what you want your infrastructure to look like – in any cloud provider – and will automatically determine the changes necessary to make it so. Because of its simple syntax and cross-cloud compatibility, it’s 2nd Watch’s choice for infrastructure as code.
Pain You May Be Experiencing Working With Terraform
When you have multiple collaborators (individuals, teams, etc.) working on a Terraform codebase, some common problems are likely to emerge:
Enforcing peer review becomes difficult. In any codebase, you’ll want to ensure that your code is peer reviewed in order to ensure better quality in accordance with The Second Way of DevOps: Feedback. The role of peer review in IaC codebases is even more important. IaC is a powerful tool, but that tool has a double-edge – we are clearly more productive for using it, but that increased productivity also means that a simple typo could potentially cause a major issue with production infrastructure. In order to minimize the potential for bad code to be deployed, you should require peer review on all proposed changes to a codebase (e.g. GitHub Pull Requests with at least one reviewer required). Terraform’s open source offering has no facility to enforce this rule.
Terraform plan output is not easily integrated in code reviews. In all code reviews, you must examine the source code to ensure that your standards are followed, that the code is readable, that it’s reasonably optimized, etc. In this aspect, reviewing Terraform code is like reviewing any other code. However, Terraform code has the unique requirement that you must also examine the effect the code change will have upon your infrastructure (i.e. you must also review the output of a terraform plan command). When you potentially have multiple feature branches in the review process, it becomes critical that you are assured that the terraform plan output is what will be executed when you run terraform apply. If the state of infrastructure changes between a run of terraform plan and a run of terraform apply, the effect of this difference in state could range from inconvenient (the apply fails) to catastrophic (a significant production outage). Terraform itself offers locking capabilities but does not provide an easy way to integrate locking into a peer review process in its open source product.
Too many sets of privileged credentials. Highly privileged credentials are often required to perform Terraform actions, and the greater the number of principals you have with privileged access, the larger your attack surface becomes. Therefore, from a security standpoint, we’d like to have fewer sets of admin credentials which can potentially be compromised.
What is Atlantis?
Atlantis is an open source tool that allows safe collaboration on Terraform projects by making sure that proposed changes are reviewed and that the proposed change is the actual change which will be executed on your infrastructure. Atlantis is compatible (at the time of writing) with GitHub and GitLab, so if you’re not using either of these Git hosting systems, you won’t be able to use Atlantis.
How Atlantis Works With Terraform
Atlantis is deployed as a single binary executable with no system-wide dependencies. An operator adds a GitHub or GitLab token for a repository containing Terraform code. The Atlantis installation process then adds hooks to the repository which allows communication to the Atlantis server during the Pull Request process.
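As a rough sketch (flag names can vary between Atlantis releases, and all of the values below are placeholders), starting the server for a GitHub-hosted repository looks something like this:

atlantis server \
  --atlantis-url="https://atlantis.example.com" \
  --gh-user="my-atlantis-bot" \
  --gh-token="$GITHUB_TOKEN" \
  --gh-webhook-secret="$WEBHOOK_SECRET" \
  --repo-whitelist="github.com/my-org/my-terraform-repo"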
You can run Atlantis in a container or a small virtual machine – the only requirement is that the Atlantis instance can communicate with both your version control system (e.g. GitHub) and the infrastructure (e.g. AWS) you’re changing. Once Atlantis is configured for a repository, the typical workflow is:
A developer creates a feature branch in git, makes some changes, and creates a Pull Request (GitHub) or Merge Request (GitLab).
The developer enters atlantis plan in a PR comment.
Via the installed web hooks, Atlantis locally runs terraform plan. If there are no other Pull Requests in progress, Atlantis adds the resulting plan as a comment to the Merge Request.
If there are other Pull Requests in progress, the command fails because we can’t ensure that the plan will be valid once applied.
The developer ensures the plan looks good and adds reviewers to the Merge Request.
Once the PR has been approved, the developer enters atlantis apply in a PR comment. This will trigger Atlantis to run terraform apply and the changes will be deployed to your infrastructure.
The command will fail if the Merge Request has not been approved.
The following sequence diagram illustrates the sequence of actions described above:
Atlantis sequence diagram
We can see how our pain points in Terraform collaboration are addressed by Atlantis:
In order to ensure that your terraform plan accurately reflects the change to your infrastructure that will be made when you run terraform apply, Atlantis performs locking on a project or workspace basis: https://github.com/runatlantis/atlantis#locking
In order to prevent creating multiple sets of privileged credentials, you can deploy Atlantis to run on an EC2 instance with a privileged IAM role in its instance profile (e.g. in AWS). In this way, all of your Terraform commands run through a single set of privileged credentials and obviate the need to distribute multiple sets of privileged credentials: https://github.com/runatlantis/atlantis#aws-credentials
You can see that with minimal additional infrastructure you can establish a safe and reliable CI/CD pipeline for your infrastructure as code, enabling you to get more done safely! To find out how you can deploy a CI/CD pipeline in less than 60 days, Contact Us.
In my previous blog I gave a fairly high-level overview of what automated AWS account management could (or rather should) entail. This blog will drill deeper into the processes and give you some real-world code samples of what this looks like.
AWS Organizations and Linked Account Creation:
As mentioned in my last blog, AWS recently announced the general availability of AWS Organizations, allowing you to create linked or nested AWS accounts under a master account and apply policy-based management under the umbrella of the root account. It also allows for hierarchical management (up to five levels deep) of linked accounts by Organizational Units (OU). Policies can be applied at the global level, OU level, and individual account level. It is important to note that conflicting policies always defer to the parent entity’s permission set, meaning an IAM user/role in an account may have permissions to perform some action, but if the account, OU, or global settings at the Organizations level deny those actions, the resulting action for the IAM resource will be denied. Likewise, the effective permissions for a resource are the intersection of the resource’s direct permissions assigned in IAM and the permissions that are controlled by Organizations. This means you can lock linked accounts down to do things like “only manage DNS Route53 resources” or “only manage S3 resources” using Organizations policies. Pretty nice way of segmenting off security and reducing the potential blast radius.
I am going to pick the most common denominator for my following examples… AWS CLI. Though I rarely use it for actual automation code, I figure most folks are familiar with it and it has a pretty intuitive syntax.
Step 1: Enable Organizations on your root account
Ensure that your AWS Profile environment variable is set to your desired root account AWS profile that has the necessary permissions to work with AWS Organizations. Alternatively, if you don’t want to use an environment variable, you can either ensure the default AWS profile is the one which has permissions on your root account or you can specify the --profile argument when typing your AWS CLI commands. I’m going to use the AWS_DEFAULT_PROFILE environment variable in my examples here (output redacted).
> export AWS_DEFAULT_PROFILE=myrootacctadmin
This of course assumes you have a profile set up under your HOME dir in the .aws/credentials file named myrootacctadmin.
Now that we have our environment set we can get on with running the AWS CLI commands to create our organization.
Let’s be safe and make sure we don’t already have an organization created under our root account:
$ aws organizations list-roots
An error occurred (AWSOrganizationsNotInUseException) when calling the ListRoots operation: Your account is not a member of an organization.
As the error message indicates, this account is not currently a part of any organization and will need to be configured to use organizations if we want to use this as our master account and create linked accounts underneath it.
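So let’s enable it. The create-organization call both creates the organization and makes the current account its master – here is a sketch (the IDs, ARNs and email are placeholders, and the output is trimmed):

$ aws organizations create-organization
{
    "Organization": {
        "Id": "o-exampleorgid",
        "Arn": "arn:aws:organizations::111111111111:organization/o-exampleorgid",
        "FeatureSet": "ALL",
        "MasterAccountArn": "arn:aws:organizations::111111111111:account/o-exampleorgid/111111111111",
        "MasterAccountId": "111111111111",
        "MasterAccountEmail": "myrootacctadmin@example.com"
    }
}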
Indeed! our myrootacctadmin account is listed as the root (i.e. master) of our entire organization. This is exactly what we wanted. Now let’s see what AWS accounts are identified as part of this organization…
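Here’s roughly what that looks like (again, IDs, ARNs and emails are placeholders and the output is trimmed):

$ aws organizations list-accounts
{
    "Accounts": [
        {
            "Id": "111111111111",
            "Arn": "arn:aws:organizations::111111111111:account/o-exampleorgid/111111111111",
            "Email": "myrootacctadmin@example.com",
            "Name": "my-root-account",
            "Status": "ACTIVE"
        }
    ]
}

Step 2: Create a Linked Account

Only the master account so far, so let’s create our first linked account. Each linked account needs its own unique email address (the one below is a placeholder):

$ aws organizations create-account --email my-new-linked-account@example.com --account-name "my-new-linked-account"
{
    "CreateAccountStatus": {
        "Id": "car-examplecreateaccountrequestid",
        "AccountName": "my-new-linked-account",
        "State": "IN_PROGRESS",
        "RequestedTimestamp": 1524610000.0
    }
}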
The actual creation of the account is not instantaneous, and the API responds to the create-account call before the new account creation is complete. While it is pretty quick to complete, unless we ensure that it is completed before performing any additional automation against it, we may receive an error from the API indicating the account is not yet ready. So prior to performing additional configuration on the new account, we need to ensure the State has reached SUCCEEDED. You will generally just loop until the State is equal to SUCCEEDED in your automation code before moving on to the next step. Also, it might be a good idea to catch failures (e.g. State == “FAILED”) and handle those gracefully. The account creation status can be retrieved as follows:
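For example, using the request Id returned by the create-account call above (output is illustrative):

$ aws organizations describe-create-account-status --create-account-request-id car-examplecreateaccountrequestid
{
    "CreateAccountStatus": {
        "Id": "car-examplecreateaccountrequestid",
        "AccountName": "my-new-linked-account",
        "State": "SUCCEEDED",
        "AccountId": "333333333333",
        "RequestedTimestamp": 1524610000.0,
        "CompletedTimestamp": 1524610042.0
    }
}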
Congratulations! You’ve just enabled AWS Organizations and created your first linked account!
At this point you should have a couple of emails from AWS in the inbox of the email address used to create the new account. They are standard boiler-plate emails. One of which is a “Welcome to Amazon Web Services” email and the other tells you that your account is ready and has some “getting started” type links.
Step 3: Reset New Linked Account Root Password
Now that your linked account has been created, you will need to go through the AWS root account password reset workflow to make your new account accessible from either the AWS Web Console or the AWS APIs. The recommended approach here is to reset the root account password, enable MFA, create an IAM user with Administrator privileges, store the root account secrets in a VERY secure place, and only use them as a last resort for account access.
Here’s a shortened URL that will take you directly to the root account password reset page: http://amzn.pw/45Nxe
Step 4: (Optionally) Create Organizational Units
Let’s go through a couple of examples of Organizational Units, with sketches of the service control policies for each shown after the list.
OU for only allowing S3 services
OU for only allowing services in us-west-2 and us-east-1 regions
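Here’s roughly what the underlying service control policies for those two OUs might look like. These are sketches: for an allow-list SCP like the first one to actually restrict anything, the default FullAWSAccess policy must be detached from the OU, and in practice the region restriction would usually carve out exceptions for global services such as IAM.

S3-only OU policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "*"
    }
  ]
}

Region-restricted OU policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["us-west-2", "us-east-1"]
        }
      }
    }
  ]
}

Each OU is created with aws organizations create-organizational-unit, and the policy is attached with aws organizations create-policy followed by aws organizations attach-policy.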
“What if I want to bring my existing accounts under the umbrella of Organizations?” you ask
Good news! You can invite existing AWS accounts to join your organization. Using the API you can issue an invitation to an existing account by Account ID, Email, or Organization. For the sake of simplicity, let’s use an Account ID (222222222222) for the following example (again, using the root/master account AWS profile):
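Something like this will do the trick (the handshake Id is illustrative and the output is trimmed):

$ aws organizations invite-account-to-organization --target '{"Type": "ACCOUNT", "Id": "222222222222"}'
{
    "Handshake": {
        "Id": "h-examplehandshakeid111",
        "Arn": "arn:aws:organizations::111111111111:handshake/o-exampleorgid/invite/h-examplehandshakeid111",
        "State": "OPEN",
        "Action": "INVITE",
        "RequestedTimestamp": 1524610827.55,
        "ExpirationTimestamp": 1525906827.55,
        "Parties": [
            {
                "Id": "222222222222",
                "Type": "ACCOUNT"
            },
            {
                "Id": "o-exampleorgid",
                "Type": "ORGANIZATION"
            }
        ]
    }
}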
A couple of things to note – The handshake Id is what will be required to accept the invitation on the linked account side. Notice the difference between the RequestedTimestamp (epoch 1524610827.55) and the ExpirationTimestamp (epoch 1525906827.55): 1,296,000 seconds. Divide that by the 86,400 seconds in a day and we get 15 days.
At this point you have 15 days to issue an acceptance of the invitation (aka the handshake) from the target AWS account. You could simply log in to the AWS Web Console, navigate to Organizations, and accept the invitation, but that’s not what this article is about now, is it? We’re talking automation here! And, as all good DevOpsers know, we utilize security entities that employ PoLP (Principle of Least Privilege) to perform process-specific tasks.
This means we aren’t going to do something ludicrous like adding AWS Access Keys to our root account login (please don’t ever do this). Nor are we going to create an IAM User with Administrator access for this very specific task. You can either create a User or a Role in the target account to accept the handshake, although, creating a Role will require you to assume that Role using STS, which might be overkill. On the other hand, you might use a lambda function to automate the handshake in which case you most certainly would utilize an IAM Role. Either way, the following IAM Policy Document will provide the User/Role with the required permissions to accept (or delete) the invitation:
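Here is a sketch of such a policy document (depending on your needs you may be able to trim it further):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "organizations:ListHandshakesForAccount",
        "organizations:AcceptHandshake",
        "organizations:DeclineHandshake"
      ],
      "Resource": "*"
    }
  ]
}

With that in place, the acceptance itself is a single call made with credentials from the target account:

$ aws organizations accept-handshake --handshake-id h-examplehandshakeid111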
Compliance is a constant challenge today. Keeping our system images in a healthy and trusted state of compliance requires time and effort. There are millions of tools and technologies in market to help customers maintain compliance and state, so where do I start?
Amazon has built a rich set of core technologies within the Amazon Web Services console. Systems Manager is a fantastic operations management platform tool that can assist you with setting up and maintaining configuration and state management.
One of the first things we must focus on when we build out our core images in the cloud is the configuration of those images. What is the role of the image, what operating system am I going to utilize and what applications and/or core services do I need to enable, configure and maintain? In the datacenter, we call these Gold Images. The same applies in the cloud.
We define these roles for our images and we place them in different functional areas – Infrastructure, Web Services, Applications. We may have many core image templates for our enterprise workloads. By building these base images and maintaining them continuously, we set in motion a solid foundation for core security and core compliance of our cloud environment.
Amazon Systems Manager looks across my cloud environment and allows me to bring together all the key information around all my operating resources in the cloud. It allows me to centralize the gathering of all core baseline information for my resources in one place. In the past I would have had to look at my AWS CloudWatch information in one area, my AWS CloudTrail information in another area and my configuration information in yet another area. Centralizing this information in one console allows you to see the holistic state of your cloud environment baselines in one console.
AWS Systems Manager provides built-in Insights and Dashboards that allow you to look across your entire cloud environment and see into and act upon your cloud resources. It allows you to see the configuration compliance of all your resources as well as the state management and associations across them. It provides a rich ability to customize configuration and state management for your workloads, applications and resource types, and to scan and analyze to ensure those configurations and states are maintained continuously. With AWS Systems Manager you can also create your own custom compliance types to match your organization’s baseline and business requirements. With that in place, you can constantly scan and analyze against these compliance baselines to maintain the desired operational configuration and state.
We analyze and report on the current state and quickly determine, centrally, whether our cloud services and resources are in or out of compliance. We can create baseline reports around our compliance position at any time, and with this knowledge we can set in motion remediation to return our services and resources to a compliant state and configuration.
With Amazon Systems Manager we can scan all resources for patch state, determine which patches are missing, and remediate those patches manually, on a schedule, or automatically to maintain patch management compliance.
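As a quick, hedged illustration (the instance ID is a placeholder), a scan against the assigned patch baseline and a roll-up of compliance results might look like this:

aws ssm send-command \
  --document-name "AWS-RunPatchBaseline" \
  --targets "Key=instanceids,Values=i-0123456789abcdef0" \
  --parameters "Operation=Scan"

aws ssm list-compliance-summaries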
Amazon Systems Manager also integrates with Chef InSpec, allowing you to leverage Chef InSpec to operate in a continuous compliance framework for your cloud resources.
On the road to compliance it is important to flex the tools and capabilities of your Cloud Provider. Amazon gives us a rich set of Systems Management capabilities across configuration, state management, patch management and remediation, as well as reporting. Amazon Systems Manager is provided at no cost to Amazon customers and will help you along your Journey to realizing continuous compliance of your cloud environment across the Amazon Cloud and the Hybrid Cloud. To learn more about using Amazon Systems Manager or your systems’ compliance, contact us.
What I am talking about here is the automation of the following:
AWS Linked Account Creation (the creation of secondary accounts under a single master account)
Account Initialization and Configuration
It is commonplace for organizations to manage their AWS assets/resources across a wide range of different AWS accounts. This is nothing new, and we’ve seen some of our customers scale this into the hundreds. This has some pretty obvious implications from an operational, security, and accounting standpoint.
AWS Linked Account Creation
First there is the creation of the linked account itself, which can be a time-consuming and arduous (if only one-time) process. Even if you have a rigid process for this, it is inevitable that some human error will introduce drift or inconsistency at some point in time. It’s not a matter of if, just a matter of when. There is also the tracking of the root account credentials and everything that goes along with that. Looks like another process that is ripe for some sweet, sweet automation. Until very recently there was no API available for this, but AWS released a beta API to create linked accounts around a year ago, and it has recently gone to general availability. So score one for automation!
Account Initialization and Configuration
Now you’ve got your shiny new linked account, but for every account you manage you have to ensure that all of your base settings and resources are properly set up (e.g. AWS CloudTrail, AWS Config, IAM password policies, SAML federation with your central AD, and on and on). Not only set up, but set up in a consistent way so that you don’t have drift between accounts. OK, so you could put together a nice CloudFormation template (CFT), manage it in Terraform, or possibly use a homegrown set of scripts (bash+AWS CLI, Python, Ruby, etc.). Those are all a great start, but you still need to be able to audit those resources to ensure they are what they are supposed to be. Also, you need to support the ability to push changes to those resources.
A few examples…
IT AD Admin: The ADFS servers are updating their XML metadata doc, so we need you to go update the ADFS SAML Federation for our 37 linked accounts.
IT Security Admin: We need to actively manage our set of IAM Roles that map to ADFS groups and their respective permissions on a regular and ongoing basis. How are we going to quickly and consistently do that across our 37 linked accounts?
IT Security Admin: Hey, our email address for AWS CloudTrail notifications (SNS subscription) needs to be updated to use a new email address. I need you to get that updated on all of our 37 linked accounts ASAP!
And on and on it goes. Suffice it to say, there is a never-ending need to be able to make modifications across one, several, or all of your linked AWS accounts. You need an approach for handling what would normally be an unwieldy and tedious bit of guaranteed work. The more human intervention required to manage these things, the more likely we are to see inconsistencies, errors, and misses. And we’ve seen enough cautionary tales of failed security practices in the news in the past few years that I don’t need to stress the importance of getting this stuff right. Every time. All the time.
Once you have these things configured you really need a way to continually audit those resources and settings on an ongoing basis and, ideally, be able to automatically respond to drift events. This one is a bit trickier than the others because, while you can use tools like CloudFormation or Terraform to set up your initial settings and configurations, the resources they create can be modified afterwards outside of the tool they were created/configured with in the first place. Tools like AWS CloudTrail and AWS Config provide valuable tracking information for helping audit resources but alone don’t solve this puzzle, especially if you are talking about managing this across a few dozen accounts. Something more robust must be employed to collect that data and do something intelligent with it.
How do I escape this sort of multi-account management nightmare you are describing?!!
I’ll be going into a deeper dive into this in my next blog, but here is a high-level overview of the architecture and accompanying tools and technologies you can put in place to pull it off.
AWS Linked Account Creation
With the somewhat recent release of the Organizations API this has become a reality. As per the CreateAccount API documentation, you will need to ensure that AWS Organizations is enabled in the master account. But fear not! You probably already are. Specifically, if you are already running multiple accounts under a master account, then you most certainly are. I won’t bore you with details, and AWS already has a very nice article detailing the process required to use Organizations and the API to automate account creation. Pretty spiffy!
Account Initialization and Configuration
Once you have created the linked account using the CreateAccount API the next step is to apply any and all org-specific initialization and configuration to the new account to get it all ready for action. This step and the Continuous Compliance step can also be managed by the same tool if that is how you decide to architect it.
The key is that this is where we initialize the account with its base configuration. Whether you do that with custom scripting/code, CloudFormation, Terraform, or some amalgamation of those and/or other tools/services is not of paramount importance. What is important is having a way to track those resources and their state. Make sure you keep that in mind when architecting a solution. One nice thing about CloudFormation is that the state tracking is built right into the service itself. You can easily list all resources within a CloudFormation stack, and you can include stack Outputs to track any custom data you may generate or derive during the CFT stack launch.
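For example (assuming a hypothetical baseline stack named account-baseline):

aws cloudformation list-stack-resources --stack-name account-baseline
aws cloudformation describe-stacks --stack-name account-baseline --query "Stacks[0].Outputs"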
You could do something similar with Terraform through the use of their state files, but it (non-enterprise Terraform) lacks the same API queryability that CloudFormation has built in. Also, it is less transparent to the casual onlooker in the AWS console where resources are originating from. Of course, once you query the resources you will still require a method for determining and tracking the state of those resources. But now we’re getting ahead of ourselves.
This is going to require a service that will allow you to:
Track the state of resources we care about
Audit the state of those resources automatically on an ongoing basis
Report on any configuration drift
Optionally, automatically remediate drift
Using the AWS CloudTrail and AWS Config services gives us the ability to track changes real-time and tie those changes to a specific user/role. But what about services that are not yet supported by AWS Config? In that case you may want to (as we have done) build a suite of services to handle these tasks. Resources and configurations are registered with a service that tracks their known-desired state. Another service is responsible for querying the current state of those items and raising a flag if there is drift. Potentially another service could report on those flagged out-of-compliance resources/settings. Optionally you could deploy a service that remediates drift in your desired configuration state on all out-of-compliance resources, or possibly just a subset.
At 2nd Watch we’ve architected and built out our own Managed Cloud-specific implementation of automated account creation and continuous compliance. If you would rather focus your energy on your business’s core competencies and not on building foundational cloud management tooling, why not come on board and let us empower you to deliver your product and drive shareholder value in the most secure, stable, and cost-effective way possible? We’ve got the tools and the people to make it happen! Contact us to learn more.