
How to Use AWS IAM with STS for access to AWS resources

With increased focus on security and governance in today’s digital economy, I want to highlight a simple but important use case that demonstrates how to use AWS Identity and Access Management (IAM) with Security Token Service (STS) to give trusted AWS accounts access to resources that you control and manage.

Security Token Service is an extension of IAM and is one of several AWS web services that does not incur any cost to use. But unlike IAM, there is no user interface in the AWS console for managing and interacting with STS. Rather, all interaction is done entirely through one of the SDKs or directly via the HTTP API. I will be using Terraform to create some simple resources in my sandbox account and the .NET Core SDK to demonstrate how to interact with STS.

The main purpose and function of STS is to issue temporary security credentials for AWS resources to trusted and authenticated entities.  These credentials operate identically to the long-term keys that typical IAM users have, with a couple of special characteristics:

  • They automatically expire and become unusable after a short and defined period of time elapses
  • They are issued dynamically

These characteristics offer several advantages in terms of application security and development and are useful for cross-account delegation and access.  STS solves two problems for owners of AWS resources:

  • It satisfies the IAM best-practice requirement to regularly rotate access keys
  • It removes the need to distribute access keys to external entities or store them within an application

One common scenario where STS is useful involves sharing resources between AWS accounts.  Let’s say, for example, that your organization captures and processes data in S3, and one of your clients would like to push large amounts of data from resources in their AWS account to an S3 bucket in your account in an automated and secure fashion.

While you could create an IAM user for your client, your corporate data policy requires that you rotate access keys on a regular basis, and this introduces challenges for automated processes.  Additionally, you would like to limit the distribution of access keys to your resources to external entities.  Let’s use STS to solve this!

To get started, let’s create some resources in your AWS cloud.  Do you even Terraform, bro?

Let’s create a new S3 bucket and set the bucket ACL to be private, meaning nobody but the bucket owner (that’s you!) has access.  Remember that bucket names must be unique across all existing buckets, and they should comply with DNS naming conventions.  Here is the Terraform HCL syntax to do this:
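Something along these lines should work (a minimal sketch; the bucket name and region match the example referenced later, and older versions of the AWS provider accept the acl argument directly on the bucket resource, while newer ones split it into a separate aws_s3_bucket_acl resource):

provider "aws" {
  region = "us-west-2"
}

# Private bucket - only the bucket owner (that's you!) has access
resource "aws_s3_bucket" "xaccount_bucket" {
  bucket = "d4h2123b9-xaccount-bucket" # must be globally unique and DNS-compliant
  acl    = "private"
}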

Great! We now have a bucket… but for now, only the owner can access it.  This is a good start from a security perspective (i.e. “least permissive” access).

[Image: what an empty bucket may look like]

Let’s create an IAM role that, once assumed, allows IAM users with access to it to put objects into our bucket.  Roles are a secure way to grant trusted entities access to your resources.  You can think of a role as a jacket that an IAM user can wear for a short period of time, and while wearing this jacket, the user has privileges that they wouldn’t normally have when they aren’t wearing it.  Kind of like a bright yellow Event Staff windbreaker!

For this role, we will specify that users from our client’s AWS account are the only ones that can wear the jacket. This is done by including the client’s AWS account ID in the Principal statement.  AWS Account IDs are not considered to be secret, so your client can share this with you without compromising their security.  If you don’t have a client but still want to try this stuff out, put your own AWS account ID here instead.
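A minimal sketch of the role (the resource name xaccount_role is just a label reused by the later snippets; substitute your client's 12-digit account ID for CLIENT_ACCOUNT_ID):

resource "aws_iam_role" "xaccount_role" {
  name = "sts-delegate-role"

  # Trust policy: only principals from the client's AWS account can assume (wear) this role
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::CLIENT_ACCOUNT_ID:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}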


Great, now we have a role that our trusted client can wear.  But right now our client can’t do anything except wear the jacket.  Let’s give the jacket some special powers, such that anyone wearing it can put objects into our S3 bucket.  We will do this by creating a bucket policy that grants the role permission to put objects.  The policy specifies exactly what the wearer of the jacket can do to the bucket it is attached to, and we will attach it to the bucket we want our client to use.  Here is the Terraform syntax to accomplish this:
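A sketch of the bucket policy, reusing the resource names from the earlier snippets (the interpolated role and bucket ARNs and the x-amz-acl condition are discussed below):

resource "aws_s3_bucket_policy" "xaccount_bucket_policy" {
  bucket = "${aws_s3_bucket.xaccount_bucket.id}"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DelegatedPut",
      "Effect": "Allow",
      "Principal": { "AWS": "${aws_iam_role.xaccount_role.arn}" },
      "Action": "s3:PutObject",
      "Resource": "${aws_s3_bucket.xaccount_bucket.arn}/*",
      "Condition": {
        "StringEquals": { "s3:x-amz-acl": "bucket-owner-full-control" }
      }
    }
  ]
}
EOF
}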

A couple of things to note about this snippet.  First, we are using Terraform interpolation to inject values from previous Terraform statements into a couple of places in the policy, specifically the ARNs of the role and bucket we created previously.  Second, we are specifying a condition on the S3 policy, one that requires a specific object ACL for the s3:PutObject action; the client satisfies it by sending the HTTP request header x-amz-acl with a value of bucket-owner-full-control on the PUT request.  By default, objects PUT into S3 are owned by the account that created them, even when they are stored in someone else’s bucket.  For our scenario, this condition requires your client to explicitly grant ownership of objects placed in your bucket to you; otherwise, the PUT request will fail.

So now we have a bucket, a policy in place on that bucket, and a role that our trusted client can assume.  Next, your client needs to get to work writing some code that will allow them to assume the role (wear the jacket) and start putting objects into your bucket.  Your client will need to know a couple of things from you before they get started:

  1. The bucket name and the region it was created in (the example above created a bucket named d4h2123b9-xaccount-bucket in us-west-2)
  2. The ARN for the role (Terraform can output this for you). It will look something like this but will have your actual AWS Account ID: arn:aws:iam::123456789012:role/sts-delegate-role

They will also need to create an IAM User in their account and attach a policy allowing the user to assume roles via STS.  The policy will look similar to this:
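A sketch of that policy, using the example role ARN from above (swap in the real ARN you were given):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::123456789012:role/sts-delegate-role"
    }
  ]
}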


Let’s help your client out a bit and provide some C# code snippets for .NET Core 2.0 (available for Windows, macOS, and Linux).  To get started, install the .NET Core SDK for your OS, then fire up a command prompt in your favorite directory and run these commands:
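Something like this (the AWSSDK.* names are the NuGet identifiers for the core SDK and the SecurityToken and S3 service clients):

dotnet new console -o s3cli
cd s3cli
dotnet add package AWSSDK.Core
dotnet add package AWSSDK.SecurityToken
dotnet add package AWSSDK.S3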

The first command creates a new console app in the subdirectory s3cli.  The next commands switch to that directory, pull in the core AWS SDK for .NET, and then add the packages for the SecurityToken and S3 services.
Once you have the libraries in place, fire up your favorite IDE or text editor (I use Visual Studio Code), then open Program.cs and add some code:
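A minimal sketch of Program.cs (the role ARN is the example from above, the session name is arbitrary, and the STS client resolves your IAM user credentials and region from the default credential chain):

using System;
using System.Threading.Tasks;
using Amazon;
using Amazon.Runtime;
using Amazon.S3;
using Amazon.S3.Model;
using Amazon.SecurityToken;
using Amazon.SecurityToken.Model;

namespace s3cli
{
    class Program
    {
        static void Main(string[] args)
        {
            RunAsync().GetAwaiter().GetResult();
        }

        private static async Task RunAsync()
        {
            var roleArn = "arn:aws:iam::123456789012:role/sts-delegate-role";

            // Ask STS for temporary credentials by assuming the delegate role.
            // The STS call itself is authenticated with the IAM user's long-term keys.
            var stsClient = new AmazonSecurityTokenServiceClient();
            var assumeRoleResponse = await stsClient.AssumeRoleAsync(new AssumeRoleRequest
            {
                RoleArn = roleArn,
                RoleSessionName = "xaccount-put-session",
                DurationSeconds = 900 // the temporary credentials expire automatically
            });
            var tempCredentials = assumeRoleResponse.Credentials;

            // ...the upload code from the next snippet goes here
        }
    }
}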

This snippet sends a request to STS for temporary credentials using the specified ARN.  Note that the client must provide IAM user credentials to call STS, and that IAM user must have a policy applied that allows it to assume a role from STS.

This next snippet takes the STS credentials, bucket name, and region name, and then uploads the Program.cs file that you’re editing and assigns it a random key/name.  Also note that it explicitly applies the Canned ACL that is required by the sts-delegate-role:
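Continuing inside RunAsync from the snippet above (the bucket name and region match the Terraform example, and the key is just a random GUID):

// Build short-lived session credentials from the STS response
var sessionCredentials = new SessionAWSCredentials(
    tempCredentials.AccessKeyId,
    tempCredentials.SecretAccessKey,
    tempCredentials.SessionToken);

using (var s3Client = new AmazonS3Client(sessionCredentials, RegionEndpoint.USWest2))
{
    var putRequest = new PutObjectRequest
    {
        BucketName = "d4h2123b9-xaccount-bucket",
        Key = Guid.NewGuid().ToString(),               // random object key/name
        FilePath = "Program.cs",                       // upload this source file
        CannedACL = S3CannedACL.BucketOwnerFullControl // required by the bucket policy
    };
    await s3Client.PutObjectAsync(putRequest);
}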

So, to put this all together, run the code and make the magic happen!  Of course, you will have to define and provide proper variable values for your environment, including securely storing your credentials.

Try it out from the command prompt:
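dotnet run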

If all goes well, you will have a copy of Program.cs in the bucket. Not very useful itself, but it illustrates how to accomplish the task.

[Image: what a bucket with something in it may look like]

Here is a high-level diagram of what we put together:

[Diagram: putting it all together]

Steps:

  1. Your client uses their IAM user to call AWS STS and request the role, using the role ARN you gave them
  2. STS authenticates the client’s IAM user, verifies that the role’s trust policy allows the request, and then issues temporary credentials to the client.
  3. The client can use the temporary credentials to access your S3 bucket (they will expire soon), and since they are now wearing the Event Staff jacket, they can successfully PUT stuff in your bucket!

There are many other use-cases for STS. This is just one very simplistic example. However, with this brief introduction to the concepts, you should now have a decent idea of how STS works with IAM roles and policies, and how you can use STS to give access to your AWS resources for trusted entities. For more tips like this, contact us.

-Jonathan Eropkin, Cloud Consultant


Corrupted Stolen CPU Time

There is a feature in the Linux kernel, relevant to VMs hosted on Xen servers, called the “steal percentage.”  When the guest OS asks for CPU time and the host CPU is currently tied up with another VM, the Xen server sends an increment to the guest Linux instance, which increases the steal percentage.  This is a great feature, as it shows exactly how busy the host system is, and it is available on many AWS instance types since AWS hosts them on Xen.  It is even said that Netflix will terminate an AWS instance when the steal percentage crosses a certain threshold and start it up again, causing the instance to spin up on a new host server, as a proactive step to ensure their systems are utilizing their resources to the fullest.

What I want to discuss here is that it turns out there is a bug in Linux kernel versions 4.8, 4.9, and 4.10 where the steal percentage can be corrupted during a live migration on the physical Xen server, which causes the CPU utilization to be reported as 100% by the agent.

When looking at top, you will see something like this:

As you can see in the screenshot of top, the %st metric on the CPU(s) line shows an obviously incorrect number.
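If you want to check the raw counter rather than eyeballing top, the steal field is the eighth CPU counter in /proc/stat; a one-liner like this prints it:

awk '/^cpu / {print "steal jiffies:", $9}' /proc/stat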

During a live migration on the physical Xen server, the steal time gets a little out of sync and ends up being decremented.  If the value was already at or close to zero, it becomes negative and, due to type conversions in the code, it overflows.

CloudWatch’s CPU Utilization monitor calculates that utilization by adding the System and User percentages together.  However, this only gives a partial view into your system.  With our agent, we can see what the OS sees.

That is the Steal percentage spiking due to that corruption.  Normally this metric could be monitored and actioned as desired, but with this bug it causes noise and false positives.  If Steal were legitimately high, then the applications on that instance would be running much slower.

There is some discussion online about how to fix this issue, and there are some kernel patches to say “if the steal time is less than zero, just make it zero.”  Eventually this fix will make it through the Linux releases and into the latest OS version, but until then it needs to be dealt with.

We have found that a reboot will clear the corrupted percentage.  The other option is to patch the kernel… which also requires a reboot.  If a reboot is just not possible at the time, the only impact to the system is that it makes monitoring the steal percentage impossible until the number is reset.

It is not a very common issue, but due to the large number of instances we monitor here at 2nd Watch, it is something that we’ve come across frequently enough to investigate in detail and develop a process around.

If you have any questions as to whether or not your servers hosted in the cloud might be affected by this issue, please contact us to discuss how we might be able to help.

-James Brookes, Product Manager


Creating a simple Alexa Skill

Why do it?

Alexa gets a lot of use in our house, and it is very apparent to me that the future is not a touch screen or a mouse, but voice.  Creating an Alexa skill is easy to learn by watching videos and such, but actually creating the skill is a great way to understand the ins and outs of the process and what the backend systems (like AWS Lambda) are capable of.

First you need a problem

To get started, you need a problem to solve.  Once you have the problem, you’ll need to think about the solution before you write a line of code.  What will your skill do?  You need to define the requirements.  For my skill, I wanted to ask Alexa to “park my cloud” and have her stop all EC2 instances or RDS databases in my environment.

Building a solution one word at a time

Now that I’ve defined the problem and have an idea for the requirements of the solution, it’s time to start building the skill.  The first thing you’ll notice is that the Alexa Skills portal is not part of the standard AWS console.  You need to go to developer.amazon.com/Alexa, create a developer account, and sign in there.  Once inside, there is a lot of good information and videos on creating Alexa skills that are worth reviewing.  Click the “Create Skill” button to get started.  In my example, I’m building a custom skill.

Build

The process for building a skill is broken into four major sections: Build, Test, Launch, and Measure.  In each one you’ll have a number of things to complete before moving on to the next section.  The major areas of each section are broken down on the left-hand side of the console.  On the initial dashboard you’re also presented with the “Skill builder checklist” on the right as a visual reminder of what you need to do before moving on.

Interaction model

This is the first area you’ll work on in the Build phase of your Alexa skill.  It defines how your users will interact with your skill.

Invocation

Invocation sets up how your users will launch your skill.  For simplicity’s sake, this is often just the name of the skill.  The common patterns are “Alexa, ask [my skill] [some request],” or “Alexa, launch [my skill].”  You’ll want to make sure the invocation for your skill sounds natural to a native speaker.

Intents

I think of intents as the “functions” or “methods” for my Alexa skill.  There are a number of built-in intents that should always be included (Cancel, Help, Stop) as well as your custom intents that will compose the main functionality of your skill.  Here my intent is called “park” since that will have the logic for parking my AWS systems.  The name here will only be exposed to your own code, so it isn’t necessarily important what it is.

Utterances

Utterances are the defined patterns of how people will use your skill.  You’ll want to focus on natural language and normal patterns of speech for native speakers in your target audience.  I would recommend doing some research and speaking to a variety of people to get a good cross section of utterances for your skill.  More is better.
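For example, the “park” intent in this skill might include sample utterances along these lines (illustrative only):

park my cloud
park the cloud
stop all of my instances
shut down my environment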

Slots

Amazon also provides the option to use slots (variables) in your utterances.  This allows your skill to do things that are dynamic in nature.  When you create a variable in an utterance you also need to create a slot and give it a slot type.  This is like providing a type to a variable in a programming language (Number, String, etc.) and will allow Amazon to understand what to expect when hearing the utterance.  In our simple example, we don’t need any slots.

Interfaces

Interfaces let you connect your skill to other services to provide audio, display, or video options.  These aren’t needed for a simple skill, so you can skip them.

Endpoint

Here’s where you’ll connect your Alexa skill to the endpoint you want to handle the logic for your skill.  The easiest setup is to use AWS Lambda.  There are lots of example Lambda blueprints using different programming languages and doing different things.  Use those to get started because the json response formatting can be difficult otherwise.  If you don’t have an Alexa skill id here, you’ll need to Save and Build your skill first.  Then a skill id will be generated, and you can use it when configuring your Lambda triggers.

AWS Account Lambda

Assuming you already have an AWS account, you’ll want to deploy a new Lambda from a blueprint that looks somewhat similar to what you’re trying to accomplish with your skill (deployed in US-East-1).  Even if nothing matches well, pick any one of them, as they have the JSON return formatting set up so you can use it in your code.  This will save you a lot of time and effort.  Take a look at the information here and here for more information about how to set up and deploy Lambda for Alexa skills.  You’ll want to configure your Alexa skill as the trigger for the Lambda in the configuration, and here’s where you’ll copy in your skill id from the developer console “Endpoints” area of the Build phase.

While the actual coding of the Lambda isn’t the purpose of the article, I will include a couple of highlights that are worth mentioning.  Below, see the part of the code from the AWS template that would block the Lambda from being run by any Alexa skill other than my own.  While the chances of this are rare, there’s no reason for my Lambda to be open to everyone.  Here’s what that code looks like in Python:

if (event['session']['application']['applicationId'] != "amzn1.ask.skill.000000000000000000000000"):
    raise ValueError("Invalid Application ID")

Quite simply, if the Alexa application id passed in the session doesn’t match my known Alexa skill id, then raise an error.  The other piece of advice I’d give about the Lambda is to create a separate method for each intent to keep the logic separated and easy to follow.  Make sure you remove any response language from your code that came from the original blueprint.  If your responses are inconsistent, Amazon will fail your skill (I had this happen multiple times because I borrowed from the “Color Picker” Lambda blueprint and had some generic responses left in the code).  Also, you’ll want to handle your Cancel, Help, and Stop requests correctly.  Lastly, as is best practice in all code, add copious logging to CloudWatch so you can diagnose issues.  Note the ARN of your Lambda function, as you’ll need it for configuring the endpoints in the developer portal.
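Here is a rough sketch of that per-intent structure in Python (the handle_park_intent helper is hypothetical; build_response shows a minimal Alexa-style response envelope):

def build_response(speech_text, should_end_session):
    # Minimal Alexa response envelope with plain-text speech
    return {
        'version': '1.0',
        'response': {
            'outputSpeech': {'type': 'PlainText', 'text': speech_text},
            'shouldEndSession': should_end_session
        }
    }

def on_intent(intent_request, session):
    intent_name = intent_request['intent']['name']

    # Route each intent to its own method to keep the logic separated
    if intent_name == 'park':
        return handle_park_intent(session)
    elif intent_name == 'AMAZON.HelpIntent':
        return build_response('Ask me to park your cloud.', should_end_session=False)
    elif intent_name in ('AMAZON.CancelIntent', 'AMAZON.StopIntent'):
        return build_response('Goodbye!', should_end_session=True)
    else:
        raise ValueError('Invalid intent')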

Test

Once your Lambda is deployed in AWS, you can go back into the developer portal and begin testing the skill.  First, put your Lambda function ARN into the endpoint configuration for your skill.  Next, click over to the Test phase at the top and choose “Alexa Simulator.”  You can try recording your voice on your computer microphone or typing in the request.  I recommend you do both to get a sense of how Alexa will interpret what you say and respond.  Note that I’ve found the actual Alexa device is better at natural language processing than the test options using a microphone on my laptop.  When you do a test, the console will show you the JSON input and output.  You can copy the contents of the INPUT pane and use them to build a test event for your Lambda function.  If you need to do a lot of work on your Lambda, it’s a lot easier to test from there than to flip back and forth.  Pay special attention to your utterances.  You’ll learn quickly that your proposed utterances weren’t as natural as you thought.  Make updates to the utterances and Lambda as needed and keep testing.

Launch

Once you have your skill in a place where it’s viable, it’s time to launch.  Under “Skill Preview,” you’ll pick names, icons, and a category for your skill.  You’ll also create “cards” that people will see in the Alexa app that explain how to use your skill and the kinds of things you can say.  Lastly, you’ll need a URL to your “Privacy Policy” and “Terms of Use.”  Under “Privacy and Compliance,” you answer more questions about what kind of information your skill collects, whether it targets children or contains advertising, if it’s ok to export, etc.  Note that if you categorize your skill as a “Game,” it will automatically be considered to be directed at children, so you may want to avoid that with your first skill if you collect any identifiable information.  Under “Availability,” you’ll decide if the skill should be public and which countries should be able to access it.  The last section is “Submission,” where you hopefully get the green checkmark and are allowed to submit.

Skill Certification

Now you wait.  Amazon seems to have a number of automated processes that catch glaring issues, but you will likely end up with some back and forth between yourself and an Amazon employee regarding some part of your skill that needs to be updated.  It took about a week to get my final approval and my skill posted.

Conclusion

Creating your own simple Alexa skill is a fun and easy way to get some experience creating applications that respond to voice and understand what’s possible on the platform.  Good luck!

-Coin Graham, Senior Cloud Consultant


Fully Coded And Automated CI/CD Pipelines: The Process

The Why

Originally, I thought I’d give a deep dive into the mechanics of some of our automated workflows at 2nd Watch, but I really think it’s best to start at the very beginning. We need to understand why we need to automate our delivery. In enterprise organizations, delivery is usually set up in a very waterfall way: the artifact is handed off to another team to push to the different environments and to QA to test. Sometimes it works, but usually not so much. In the “not so much” case, it’s handed back to DEV, which interrupts their current work.

That back and forth between teams is known as waste in the Lean/Agile world, also known as “throwing it over the wall” or “handoffs.” It is a primary thing any Agile process intends to eliminate, and it’s what led to the DevOps movement.

Now, DevOps is a loaded term, much like “Agile” and “SCRUM.” It has its ultimate meaning, but most companies go part way and then declare victory. The changes that get the biggest positive effects are cultural, but many look at DevOps and see the shiny “automation” as the point of it all. Automation helps, but the core of your automation should be driven by a culture of quality above all. Just keep that in mind as you read through this article, which is specifically about all that yummy automation.

There’s a Process

Baby steps here. You can’t have one without the other, so there are a series of things that need to happen before you make it to a fully automated process.

Before that though, we need to look at what a deployment is and what the components of a deployment are.

First and foremost, so long as you have separate environments, development has no impact on the customer and therefore no impact on the business at large. There is, essentially, no risk while a feature is in development. However, the business assumes ALL the risk when there is a deployment. Once you cross that line, the customer will interact with it and either love it, hate it, or ignore it. From a development standpoint, you work to minimize that risk before you cross the deployment line: different environments, testing, a release process, etc. These are all things that can be automated, but only when that risk has been sufficiently minimized.

Step I: Automated testing

You can’t do CI or CD without testing, and it’s the logical first step. In order to help minimize the deployment risk, you should automate ALL of your testing. This will greatly increase your confidence that the changes introduced will not impact the product in ways you don’t know about BEFORE you cross that risk point in deployment. The closer an error is caught to the time it was introduced, the better. Automated testing greatly reduces this gap by providing feedback to the implementer faster, while also providing repeatable results.

Step II: Continuous Integration

Your next step is to automate your integration (and integration tests, right?), which should further increase your confidence in the changes that you and your peers have introduced. The smaller the gap between integrations (just as with testing), the better, as you’ll provide feedback to the implementers faster. This means you can work on any problems while your changes are fresh in your mind. Utilizing multiple build strategies for the same product can help as well, for instance running integration on every push to your SCM (source control management) system as well as nightly builds.

Remember, this is shrinking that risk factor before deployment.

Step III: Continuous Deployment

With Continuous Deployment you take traditional Continuous Delivery a step further by automatically pushing the artifacts created by a Continuous Delivery process into production. This automation of deployments is the final step and is another important process in mitigating that risk for when you push to production. Deploying to each environment and then running that environment’s specific set of tests is your final check before you are able to say with confidence that a change did not introduce a fault. Remember, you can automate the environments as well by using infrastructure as code tooling around virtual technology (i.e. The Cloud).

Conclusion

Continuous Deployment is the ultimate goal, as a change introduced into the system triggers all of your confidence-building tools to minimize the risk to your customers once it’s been deployed to the production system. Automating it all not only improves quality but also shortens the feedback loop to the implementer, increasing efficiency as well.

I hope that’s a good introduction! In our next post, we’ll take a more technical look at the tooling and automation we use in one of our work flows.

-Craig Monson, Sr Automation Architect (Primary)

-Lars Cromley, Director of Engineering (Editor)

-Ryan Kennedy, Principal Cloud Automation Architect (Editor)


Five Pitfalls to Avoid When Migrating to the Cloud

Tech leaders are increasingly turning to the cloud for cost savings and convenience, but getting to the cloud is not so easy. Here are the top 5 pitfalls to avoid when migrating to the cloud.

1. Choosing the wrong migration approach

There are 6 migration strategies, and getting to these 6 takes a considerable amount of work. Jumping into a cloud migration without the due diligence, analysis, grouping, and risk ranking is ill-advised. Organizations need to conduct in-depth application analyses to determine the correct migration approach. Not all applications are cloud-ready, and those that are may still take some toying with once they get there. Take the time to really understand HOW your application works, how it will work in the cloud, and what needs to be done to migrate it there successfully. 2nd Watch approaches all cloud migrations using our Cloud Factory Model, which includes the following phases: discovery, design and requirement gathering, application analysis, migration design, migration planning, and migration(s).

These 6 migration strategies include:

  • Retain – Leaving it as is. It could be a mistake to move the application to the cloud.
  • Rehost (aka “lift and shift”) – Migrating the application as-is into the cloud.
  • Replatform – Migrating the application with a handful of cloud optimizations, such as moving to a managed database or application platform, without changing its core architecture.
  • Repurchase – Dropping the existing application and moving to a different product, typically a SaaS offering.
  • Retire – Decommissioning the application rather than migrating it; there is no migration target.
  • Re-architect/Refactor – Migration of the current application to use the cloud in the most efficient, thorough way possible, incorporating the best features to modernize the application. This is the most complex migration method as it often involves re-writing of code to decouple the application to fully support all the major benefits the cloud provides. The redesign and re-engineering of the application and infrastructure architecture are also key in this type of migration.

From a complexity standpoint, replatform and re-architect/refactor are the most complicated migration approaches. However, it depends on the application and the approach (for example, repurchasing a SaaS product may be a very simple transition; rebuilding your application on Lambda and DynamoDB, not so much).

2. Big Bang Migration

Some organizations are under the impression that they must move everything at once. This couldn’t be further from the truth. The reality is that organizations often operate in hybrid models (on-prem and cloud) for a long time because some workloads are very hard to move.

It is key to come up with a migration design and plan which includes a strategic portfolio analysis or Cloud Readiness Assessment that assesses each application’s cloud readiness, identifies dependencies between applications, ranks applications by complexity and importance, and identifies the ideal migration path.

3. Underestimating Work Involved and Integration

Migrating to the cloud is not a walk in the park. You must have the knowledge, skill, and a solid migration design to successfully migrate workloads to the cloud. When businesses hear the words “lift and shift,” they are mistakenly under the impression that all one has to do is press a button (migrate) and then it’s “in the cloud.” This is a misconception that needs to be corrected. Underestimating integration is one of the largest factors of failure.

With all of the cheerleading about the benefits of moving to the cloud, it is easy to overlook the fact that deploying to the cloud adds a layer of complexity, especially when organizations are layering cloud solutions on top of legacy systems and software. It is key to ensure that the migration solution chosen can be integrated with your existing systems. Moving workloads to the cloud requires integration, and an investment in it as well. Organizations need to have a solid overall architectural design and determine what’s owned, what’s being accessed, and ultimately what’s being leveraged.

Lastly, the code changes required to make the move are also often underestimated. Organizations need to remember it is not just about moving virtual machines. The code may not work the same way running in the cloud, which means the subsequent changes required may be deep and wide.

4. Poor Business Case

Determine the value of a cloud migration before jumping into one. What does this mean? Determine what your company expects to gain after you migrate. Is it cost savings from exiting the data center? Will this create new business opportunities? Faster time to market? Organizations need to quantify the benefits before making the move.

I have seen some companies experience buyer’s remorse due to the fact that their business case was not multifaceted. It was myopic – exiting the datacenter. Put more focus on the benefits your organization will receive from the agility and ability to enter new markets faster using cloud technologies. Yes, the CapEx savings are great, but the long-lasting business impacts carry a lot of weight as well because you might find that, once you get to the cloud, you don’t save much on infrastructure costs.

5. Not trusting Project Management

An experienced, well versed and savvy project manager needs to lead the cloud migration in concert with the CIO. While the project manager oversees and implements the migration plan and leads the migration process and technical teams, the CIO is educating the business decision makers at the same time. This “team” approach does a number of things. First, it allows the CIO to act as the advisor and consultant to the business – helping them select the right kind of services to meet their needs. Second, it leaves project management to a professional. And lastly, by allowing the project manager to manage, the CIO can evaluate and monitor how the business uses the service to make sure it’s providing the best return on investment.

-Yvette Schmitter, Sr Manager, Project Portfolio Management, 2nd Watch
