Amazon Web Services best practices tell us to build stateless systems: in a perfect world, any server can serve any function with absolutely no impact on customers. That sounds great, but reality intrudes on our perfect world, and many websites and applications are not so perfectly stateless. So how can we make use of the strengths of AWS in areas like elasticity and auto scaling without completely rewriting applications to conform? After all, one of the key benefits of moving to the cloud is cost savings, and those savings get eaten away when development resources are spent rewriting code.
The solution is thankfully built into Amazon’s Elastic Load Balancer (ELB): those who need a customer’s session to stay on one server can enable the “sticky” option. This keeps transactions processing and real-time communication alive, and it spares businesses from redesigning such code or giving up auto scaling. So how does it work?
The first option is duration-based session stickiness. This is enabled at the ELB under port configuration. From there, the “stickiness” option can be enabled, and the ELB will generate a session cookie with a limited duration (default is 60 seconds). So long as the client checks in with the ELB before the cookie expires, the session is held on that instance and that instance will not be terminated by auto scaling.
The second option is application-controlled stickiness. This requires more development effort unless the existing platform already makes use of custom cookies; however, it gives application developers far more control than a fixed number of seconds before timeout. By using application control, a web developer can keep a client connection directed to a specific instance through the ELB with no fear that a required instance will be terminated prematurely.
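To make the duration-based option concrete, here is a minimal Python sketch of the idea. It is a toy model, not the ELB implementation: the StickyRouter class, the instance names, and the round-robin fallback are all hypothetical, and only the 60-second duration comes from the description above.

```python
import itertools


class StickyRouter:
    """Toy model of duration-based stickiness (illustrative, not ELB code)."""

    def __init__(self, instances, duration_seconds=60):
        self.duration = duration_seconds
        # Assumption: requests without a valid cookie are spread round robin.
        self._next_instance = itertools.cycle(instances)
        self._sessions = {}  # cookie -> (instance, expiry time in seconds)

    def route(self, cookie, now):
        """Return the instance that serves a request arriving at time `now`."""
        session = self._sessions.get(cookie)
        if session is not None and now < session[1]:
            return session[0]  # cookie still valid: stay on the same instance
        # New client or expired cookie: pick an instance and issue a new cookie.
        instance = next(self._next_instance)
        self._sessions[cookie] = (instance, now + self.duration)
        return instance


router = StickyRouter(["i-aaa", "i-bbb"], duration_seconds=60)
first = router.route("client-1", now=0)     # no cookie yet: assigned an instance
second = router.route("client-1", now=30)   # inside the 60 s window: same instance
third = router.route("client-1", now=200)   # cookie expired: assigned afresh
```

In this model a client that returns within the cookie's lifetime always lands on the instance that holds its session, while a client that returns later is treated as new.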
-Keith Homewood, Cloud Architect
After a seven-year career at Cisco, I am thrilled to be working with 2nd Watch to help make cloud work for our customers. Disaster recovery is a great use case for companies looking for a quick cloud win. Traditional on-premise backup and disaster recovery technologies are complex and can require major capital investments. I have seen many emails from IT management to senior management explaining the risk to their business if they do not spend money on backup/DR. The associated costs with on-premise solutions for DR are often put off to next year’s budget and seem to always get cut midway throughout the year. When the business does invest in backup/DR, there are countless corners cut in order to maximize the reliability and performance with a limited budget.
The public cloud is a great resource for addressing the requirements of backup and disaster recovery. Organizations can avoid the sunk costs of data center infrastructure (power, cooling, flooring, etc.) while having all of the performance and resources available for their growing needs. Atlanta-based What’s Up Interactive saved over $1 million with their AWS DR solution (case study here).
I will highlight a few of the top benefits any company can expect when leveraging AWS for their disaster recovery project.
1. Eliminate costly tape backups that require transporting, storing, and retrieving the tape media. This is replaced by fast disk-based storage that provides the performance needed to run your mission-critical applications.
2. AWS provides an elastic solution that can scale to the growth of data or application requirements without the costly capital expenses of traditional technology vendors. Companies can also expire and delete archived data according to organizational policy. This allows companies to pay for only what they use.
3. AWS provides a secure, durable platform that helps companies meet compliance requirements by keeping data easy to access when deadlines arrive.
Every day we help new customers leverage the cloud to support their business. Our team of Solutions Architects and Cloud Engineers can assist you in creating a plan to reduce risk in your current backup/DR solution. Let us know how we can help you get started in your journey to the cloud.
-Matt Whitney, Cloud Sales Executive
Auto-Scaling gives the ability to scale your EC2 instances up or down according to demand to handle the load on the service. With auto-scaling you don’t have to worry about whether or not the number of instances you’re using will be able to handle a demand spike or if you’re overspending during a slower period. Auto-scaling automatically scales for you for seamless performance.
For instance, if there are currently 3 m1.xlarge instances handling the service, and they spend a large portion of their time only 20% loaded with a smaller portion of their time heavily loaded, they can be vertically scaled down to smaller instance sizes and horizontally scaled out or in to more or fewer instances to automatically accommodate whatever load they have at that time. This can also save money, because you pay only for the smaller instance size. More savings can be attained by using reserved instance billing for the minimum number of instances defined by the Auto-Scaling configuration and letting the scaled-out instances pay the on-demand rate while running. This is a little tricky, though, because an instance’s billing cannot be changed while the instance is running. When scaling in, make sure to terminate the newest instances, since they are running at the on-demand billing rate.
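As a rough illustration of that billing split, the sketch below compares a month billed entirely on demand with one where the minimum group size is covered by reserved pricing. The function name, the hourly rates (in cents), and the 730-hour month are hypothetical placeholders rather than AWS prices, and upfront reservation fees are ignored.

```python
HOURS_IN_MONTH = 730  # assumed average month length, for illustration only


def monthly_cost_cents(min_instances, burst_instance_hours,
                       baseline_rate_cents, on_demand_rate_cents):
    """Cost of an always-on baseline plus scaled-out instance hours.

    Rates are in cents per instance hour; every value here is hypothetical.
    """
    baseline = min_instances * HOURS_IN_MONTH * baseline_rate_cents
    burst = burst_instance_hours * on_demand_rate_cents
    return baseline + burst


# Hypothetical rates: 10 cents/hour on demand, 5 cents/hour effective reserved.
all_on_demand = monthly_cost_cents(2, 100, 10, 10)
with_reserved_baseline = monthly_cost_cents(2, 100, 5, 10)
savings = all_on_demand - with_reserved_baseline
```

With these made-up numbers the baseline dominates the bill, which is why covering the minimum group size with reserved pricing matters more than the rate paid for short bursts.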
Vertical Scaling is typically referred to as scale-up or scale-down by changing the instance size, while Horizontal Scaling is typically referred to as scale-out or scale-in by changing the number of instances.
When traffic to a service running on AWS rises or falls, predictably or unpredictably, Auto-Scaling can keep customers happy with the service because their response times stay more consistent and High Availability is more reliable.
Auto-Scaling to Improve HA
If there is only one server instance, Auto-scaling can be used to put a new server in place, in a few minutes, when the running one fails. Just set both Min and Max number of instances to 1.
Auto-Scaling to Improve Response Time Consistency
If there are multiple servers and the load on them becomes so heavy that the response time slows, expand horizontally only for the time necessary to cover the extra load, and keep the response time low.
AWS Auto-Scaling Options to Set
When Auto-Scaling up or down, there are a lot of things to think about:
- Evaluation Period is the time, in seconds, between checks of the load on the Scaling Group.
- Cool Down is the time, in seconds, that must pass after a scaling operation before a new scaling operation can be performed. When scaling out, this time should be fairly short in case the load is too heavy for one Scale-Out operation to absorb. When scaling in, this time should be at least twice that of the Scale-Out operation.
- With Scale-Out, make sure it scales fast enough to quickly handle a load heavier than one expansion. 300 seconds is a good starting point.
- With Scale-In, make sure it scales slowly enough that the group does not keep going out and in. We call this “Flapping”. Some call it “Thrashing”.
- When the Auto-Scale Group includes multiple AZs, scaling out and in should be incremented by the number of AZs involved. If only one AZ is scaled out and something happens to that AZ, the impact is much more noticeable.
- Scale-In can be accomplished by different rules:
- Terminate Oldest Instance
- Terminate Newest Instance
- Terminate Instance Closest to the next Instance Hour (Best Cost Savings)
- Terminate Oldest Launch Configuration (default)
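The first three Scale-In rules above can be sketched in a few lines of Python. This is an illustrative model only: the Instance class, the rule names, and the timestamps are hypothetical, and the "closest to next instance hour" rule assumes hourly billing measured from launch time.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    launched_at: int  # launch time in seconds since an arbitrary reference


def seconds_to_next_instance_hour(instance, now):
    """Seconds left before the instance starts another billed hour."""
    elapsed = now - instance.launched_at
    return 3600 - (elapsed % 3600)


def pick_for_termination(instances, rule, now):
    """Choose the instance a Scale-In would remove under the given rule."""
    if rule == "oldest":
        return min(instances, key=lambda i: i.launched_at)
    if rule == "newest":
        return max(instances, key=lambda i: i.launched_at)
    if rule == "closest_to_next_hour":
        return min(instances, key=lambda i: seconds_to_next_instance_hour(i, now))
    raise ValueError(f"unknown rule: {rule}")


fleet = [Instance("a", 500), Instance("b", 3000), Instance("c", 9000)]
now = 10000
oldest = pick_for_termination(fleet, "oldest", now).instance_id
newest = pick_for_termination(fleet, "newest", now).instance_id
cheapest = pick_for_termination(fleet, "closest_to_next_hour", now).instance_id
```

Here instance "b" is 200 seconds away from starting a new billed hour, so terminating it wastes the least paid-for time even though it is neither the oldest nor the newest.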
Auto-Scaling is a two-stage process, and here is the rub: the AWS Management Console does not do Auto-Scaling, so it has to be set up through the AWS APIs or command line tools.
- Set up the Launch Configuration and assign it to a group of instances you want to control. If there is no user_data file that argument can be left out. The block-device-mapping argument can be found in the details for the ami_id.
- # as-create-launch-config <auto_scaling_launch_config_name> --region <region_name> --image-id <AMI_ID> --instance-type <type> --key <SSH_key_pair_name> --group <VPC_security_group_ID> --monitoring-enabled --user-data-file=<path_and_name_for_user_data_file> --block-device-mapping "<device_name>=<snap_id>:100:true:standard"
- # as-create-auto-scaling-group <auto_scaling_group_name> --region <region_name> --launch-configuration <auto_scaling_launch_config_name> --vpc-zone-identifier <VPC_Subnet_ID>,<VPC_Subnet_ID> --availability-zones <Availability_Zone>,<Availability_Zone> --load-balancers <load_balancer_name> --min-size <min_number_of_instances_that_must_be_running> --max-size <max_number_of_instances_that_can_be_running> --health-check-type ELB --grace-period <time_seconds_before_first_check> --tag "k=Name, v=<friendly_name>, p=true"
- Have CloudWatch initiate Scaling Activities. One CloudWatch Alarm for Scaling Out and one for Scaling In. Also send notifications when scaling.
- Scale Out (Alarm Actions output from first command are used by second command argument)
- # as-put-scaling-policy --name <auto_scaling_policy_name_for_high_CPU> --region <region_name> --auto-scaling-group <auto_scaling_group_name> --adjustment <Number_of_instances_to_change_by> --type ChangeInCapacity --cooldown <time_in_seconds_to_wait_to_check_after_adding_instances>
- # mon-put-metric-alarm --alarm-name <alarm_name_for_high_CPU> --region <region_name> --metric-name CPUUtilization --namespace AWS/EC2 --statistic Average --period <number_of_seconds_to_check_each_time_period> --evaluation-periods <number_of_periods_between_checks> --threshold <percent_number> --unit Percent --comparison-operator GreaterThanThreshold --alarm-description <description_use_alarm_name> --dimensions "AutoScalingGroupName=<auto_scaling_group_name>" --alarm-actions <arn_string_from_last_command>
- Scale In (Alarm Actions output from first command used as second command argument)
- # as-put-scaling-policy --name <auto_scaling_policy_name_for_low_CPU> --region <region_name> --auto-scaling-group <auto_scaling_group_name> "--adjustment=-<Number_of_instances_to_change_by>" --type ChangeInCapacity --cooldown <time_in_seconds_to_wait_to_check_after_removing_instances>
- # mon-put-metric-alarm --alarm-name <alarm_name_for_low_CPU> --region <region_name> --metric-name CPUUtilization --namespace AWS/EC2 --statistic Average --period <number_of_seconds_to_check_each_time_period> --evaluation-periods <number_of_periods_between_checks> --threshold <percent_number> --unit Percent --comparison-operator LessThanThreshold --alarm-description <description_use_alarm_name> --dimensions "AutoScalingGroupName=<auto_scaling_group_name>" --alarm-actions <arn_string_from_last_command>
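How the alarms, policies, and cooldowns interact can be sketched with a small Python simulation. This is a simplified model rather than CloudWatch or Auto Scaling code: it assumes an alarm fires when the most recent evaluation periods all breach the threshold, and every threshold, cooldown, and group size below is a hypothetical value. The step of 2 reflects a group spread across two AZs.

```python
def alarm_breached(samples, threshold, evaluation_periods, comparison):
    """Return True when the most recent `evaluation_periods` samples all breach."""
    if len(samples) < evaluation_periods:
        return False  # not enough data yet
    recent = samples[-evaluation_periods:]
    if comparison == "GreaterThanThreshold":
        return all(value > threshold for value in recent)
    if comparison == "LessThanThreshold":
        return all(value < threshold for value in recent)
    raise ValueError(f"unsupported comparison: {comparison}")


def change_in_capacity(current, adjustment, min_size, max_size):
    """Apply a ChangeInCapacity adjustment, clamped to the group limits."""
    return max(min_size, min(max_size, current + adjustment))


def simulate(cpu_averages, period=60, evaluation_periods=2,
             high=70, low=20, step=2, out_cooldown=300, in_cooldown=600,
             capacity=2, min_size=2, max_size=6):
    """Return the group size after each period of average CPU readings."""
    history, seen, next_allowed = [], [], 0
    for index, cpu in enumerate(cpu_averages):
        now = (index + 1) * period
        seen.append(cpu)
        if now >= next_allowed:  # no scaling while a cooldown is running
            if alarm_breached(seen, high, evaluation_periods, "GreaterThanThreshold"):
                capacity = change_in_capacity(capacity, step, min_size, max_size)
                next_allowed = now + out_cooldown
            elif alarm_breached(seen, low, evaluation_periods, "LessThanThreshold"):
                capacity = change_in_capacity(capacity, -step, min_size, max_size)
                next_allowed = now + in_cooldown
        history.append(capacity)
    return history


sizes = simulate([90, 90, 90, 10, 10, 10, 10, 10, 10, 10, 10])
```

In this run the group scales out once two hot periods have been seen, holds its size through the scale-out cooldown even though load has already dropped, and only then scales back in. The longer scale-in cooldown is what keeps the group from flapping.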
AMI Changes Require Auto-Scaling Updates
The instance configuration could change for any number of reasons:
- Security Patches
- New Features added
- Removal of un-used Old Features
Whenever the AMI specified in the Auto-Scaling definition is changed, the Auto-Scaling Group needs to be updated. The update requires creating a new Scaling Launch Config with the new AMI ID, updating the Auto-Scaling Group, then deleting the old Scaling Launch Config. Without this update the Scale out operation will use the old AMI.
1. Create new Launch Config:
# as-create-launch-config <new_auto_scaling_launch_config_name> --region <region_name> --image-id <AMI_ID> --instance-type <type> --key <SSH_key_pair_name> --group <VPC_security_group_ID> --monitoring-enabled --user-data-file=<path_and_name_for_user_data_file> --block-device-mapping "<device_name>=<snap_id>:100:true:standard"
2. Update Auto Scaling Group:
# as-update-auto-scaling-group <auto_scaling_group_name> --region <region_name> --launch-configuration <new_auto_scaling_launch_config_name> --vpc-zone-identifier <VPC_Subnet_ID>,<VPC_Subnet_ID> --availability-zones <Availability_Zone>,<Availability_Zone> --min-size <min_number_of_instances_that_must_be_running> --max-size <max_number_of_instances_that_can_be_running> --health-check-type ELB --grace-period <time_seconds_before_first_check>
3. Delete Old Launch Config:
# as-delete-launch-config <old_auto_scaling_launch_config_name> --region <region_name> --force
Now all Scale Outs should use the updated AMI.
-Charles Keagle, Senior Cloud Engineer
To leverage the full benefits of Amazon Web Services (AWS) and features such as instant elasticity and scalability, every AWS architect eventually considers Elastic Load Balancing and Auto Scaling. These features let you instantly scale an environment in or out based on the flow of internet traffic.
Once implemented, how do you test the configuration and application to make sure they’re scaling with the parameters you’ve set? You could always trust the design and logic, then wait for the environment to scale naturally with organic traffic. However, in most production environments this is not an option. You want to make sure the environment operates adequately under load. One cool way to do this is by generating a distributed traffic load through a program called Bees with Machine Guns.
The author describes Bees with Machine Guns as “A utility for arming (creating) many bees (micro EC2 instances) to attack (load test) targets (web applications).” This is a perfect solution for testing performance and functionality of an AWS environment because it allows you to use one master controller to call many bees for a distributed attack on an application. Using a distributed attack from several bees gives a more realistic attack profile that you can’t get from a single node. Bees with Machine Guns enables you to mount an attack with one or several bees with the same amount of effort.
Bees with Machine Guns isn’t just a randomly found open source tool. AWS endorses the project in several places on their website. AWS recommends Bees with Machine Guns for distributed testing in their article “Best Practices in Evaluating Elastic Load Balancing”. The author says “…you could consider tools that help you distribute tests, such as the open source Fabric framework combined with an interesting approach called Bees with Machine Guns, which uses the Amazon EC2 environment for launching clients that execute tests and report the results back to a controller.” AWS also provides a CloudFormation template for deploying Bees with Machine Guns on their AWS CloudFormation Sample Templates page.
To install Bees with Machine Guns you can either use the template provided on the AWS CloudFormation Sample Templates page called bees-with-machineguns.template or follow the install instructions from the GitHub project page. (Please be aware the template also deploys a scalable spot instance auto scale group behind an elastic load balancer, all of which you are responsible to pay for.)
Once the Bees with Machine Guns source is installed, you can run its commands from the controller, including “bees up”, “bees attack”, “bees report”, and “bees down”.
The first command we run will start up five bees that we will have control over for testing. We can use the -s option to specify the number of bees we want to spin up. The -k option is the SSH key pair name used to connect to the new servers. The -i option is the AMI used for each bee. The -g option is the security group in which the bees will be launched. If the key pair, security group, and AMI already exist in the region where you’re launching the bees, there is less chance you will see errors when running the command.
Once launched, you can see the bees that were instantiated and are under the control of the Bees with Machine Guns controller with the “bees report” command.
To make our bees attack we use the command “bees attack”. The -u option is the URL of the target to attack. Make sure to include the trailing slash in your URL or the command will error out. The -n option is the total number of connections to make to the target. The -c option is the number of concurrent connections made to the target. Here is an example run of an attack:
Notice that the attack was distributed among the bees in the following manner: “Each of 5 bees will fire 20 rounds, 2 at a time.” Since we had our total number of connections set to 100, each bee received an equal share of the requests. Depending on your choices for the -n and -c options, you can configure a different type of attack profile. For example, if you wanted to increase the duration of an attack, you would increase the total number of connections and the bees would take longer to complete the attack. This comes in useful when testing an Auto Scaling group in AWS because you can configure an attack that will trigger one of your CloudWatch alarms, which will in turn activate a scaling action. Another trick is to use the Linux “time” command before your “bees attack” command; once the attack completes, you can see the total duration of the attack.
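The even split described above is simple integer arithmetic, sketched here in Python. The plan_attack function is a hypothetical helper written for illustration, not code from the Bees with Machine Guns project, and the concurrency of 10 is an assumption inferred from the “2 at a time” in the quoted message with 5 bees.

```python
def plan_attack(total_requests, concurrency, bee_count):
    """Split the -n total and the -c concurrency evenly across the bees."""
    requests_per_bee = total_requests // bee_count
    concurrent_per_bee = concurrency // bee_count
    return requests_per_bee, concurrent_per_bee


bee_count = 5
requests_per_bee, concurrent_per_bee = plan_attack(100, 10, bee_count)
message = (
    f"Each of {bee_count} bees will fire {requests_per_bee} rounds, "
    f"{concurrent_per_bee} at a time."
)
```

Raising the total while keeping the concurrency fixed gives each bee more rounds at the same rate, which is one way to lengthen an attack enough to trip a scaling alarm.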
Once the command completes you get output for the number of requests that actually completed, the requests that were made per second, the time per request, and a “Mission Assessment,” in this case the “Target crushed bee offensive”.
To spin down your fleet of bees, run the “bees down” command.
This is a quick intro on how to use Bees with Machine Guns for distributed testing within AWS. The one big caution in using Bees with Machine Guns is that, as the author explains, “they are, more-or-less a distributed denial-of-service attack in a fancy package,” which means you should only use it against resources that you own, and you will be liable for any unauthorized use.
As you can see, Bees with Machine Guns can be a powerful tool for distributed load tests. It’s extremely easy to set up and tremendously easy to use. It is a great way to artificially create a production load to test the elasticity and scalability of your AWS environment.
-Derek Baltazar, Senior Cloud Engineer
Earlier this month, AWS announced that it was integrating its Route 53 DNS service with its CloudWatch management tools. That’s a fantastic development for folks managing EC2 architectures who want monitoring and alerting to accompany their DNS management.
If you’re unfamiliar with DNS, an over-simplified analogy is a phonebook. Just as a phonebook translates a person or business name into a phone number, DNS translates Internet domain names (e.g. the website address www.2ndwatch.com) into IP addresses (220.127.116.11). There’s really a lot more to DNS than that, but it is outside the scope of this article, and there are some very good books on the matter if you’re interested in learning more.
For those unfamiliar with Route 53, it is Amazon’s in-house simple DNS service designed with high availability, scalability, automation, and ease of use in mind. It can be a great tool for businesses with agile cloud services and basic (or even large-scale) DNS needs. Route 53 lets you build and easily manage your public DNS records as well as your private internal VPC DNS records if you so choose. In addition to the web-based AWS Management Console, Route 53 has its own API, so you can fully manage your DNS zones and records programmatically by integrating with your existing and future applications and tools. Here is a link to the Route 53 developer tools. There are also a number of free tools that others have written to leverage the Route 53 API. One of the more useful ones I’ve seen is cli53, a tool written in Python using the Boto library. Here’s a good article on it and a link to the GitHub site where you can download the latest version.
Route 53 is typically operated through the AWS Management Console. Within the management console navigation you can see two main sections: “Hosted Zones” and “Health Checks”. DNS records managed with Route 53 get organized into the “Hosted Zones” section, while the “Health Checks” section is used to house simple checks that will monitor the health of endpoints used in a dynamic Routing Policy. A hosted zone is simply a DNS zone where you can store the DNS records for a particular domain name. Upon creation of a hosted zone, Route 53 assigns four AWS nameservers as the zone’s “Delegation Set” and the zone apex NS records. At that point you can create, delete, or edit records and begin managing your zone. Bringing the zone into the real world is simply a matter of updating the master nameservers registered with your domain’s registrar to the four nameservers assigned to your zone’s Delegation Set.
If you’ve already got a web site or a zone you are hosting outside of AWS, you can easily import your existing DNS zone(s) and records into Route 53. You can do this either manually (using the AWS Management Console, which works OK if you only have a handful of records) or by using a command-line tool like cli53. Another option is to use Python and the Boto library to access the Route 53 API. If the scripting and automation pieces are out of your wheelhouse and you have a bunch of records you need to migrate, don’t worry: 2nd Watch has experienced engineers who can assist you with just this sort of thing! Once you have your zones and records imported into Route 53, all that’s left to do is update your master nameservers with your domain registrar. The master nameservers are the ones assigned as the “Delegation Set” when you create your zone in Route 53. These will now be the authoritative nameservers for your zones.
Route 53 supports the standard DNS record types: A, AAAA, CNAME, MX, NS, PTR, SOA, SPF, SRV, and TXT. In addition, Route 53 includes a special type of resource record called an “alias,” which is an extension to standard DNS functionality. An alias is a record within a Route 53 hosted zone that is similar to a CNAME but with some important differences. It maps to other AWS resources such as CloudFront distributions, ELB load balancers, S3 buckets with static web hosting enabled, or another Route 53 resource record in the same hosted zone. Unlike a CNAME record, an alias is not visible to resolvers, which see only the A record and the resulting IP address of the target. An alias can also be created at the zone apex, which standard CNAME records don’t support. This has the advantage of completely masking the somewhat cryptic DNS names associated with CloudFront, S3, ELB, and other resources. It allows you to disguise the fact that you’re utilizing a specific AWS service or resource if you so desire.
One of the more useful features in Route 53 is the support for policy based routing. You can configure it to answer DNS queries with specific IPs from a group based on the following policies:
- Latency: routing traffic to the region with the lowest latency in relation to the client
- Failover: if an IP in the group fails its health check, Route 53 will no longer answer queries with that IP
- Weighted: using specific ratios to direct more or less traffic at certain IPs in the group
- Round robin: standard round-robin DNS, which ensures an even distribution of traffic to all IPs in the group
*NOTE: It is important to keep TTL in mind when using routing policies with health checks, as a lower TTL will make your applications more responsive to dynamic changes when a failure is detected. This is especially important if you are using the Failover policy as a traffic switch for HA purposes.
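Two of these ideas lend themselves to a short Python sketch: how ratios translate into traffic shares once unhealthy IPs are dropped, and why TTL matters for failover. Both functions are hypothetical illustrations rather than Route 53 code. The record layout, the documentation-range IP addresses, and the health check interval and failure threshold are assumptions, and the failover figure is only a rough estimate that ignores resolver behavior.

```python
def weighted_shares(records):
    """Expected share of answers per IP, counting only healthy records."""
    healthy = [record for record in records if record["healthy"]]
    total = sum(record["weight"] for record in healthy)
    if total == 0:
        return {}  # nothing healthy to answer with
    return {record["ip"]: record["weight"] / total for record in healthy}


def rough_failover_seconds(ttl, check_interval, failure_threshold):
    """Rough time until clients stop using a failed IP.

    Assumes the failure must be seen on `failure_threshold` consecutive
    checks, after which cached answers linger for up to one TTL.
    """
    return check_interval * failure_threshold + ttl


records = [
    {"ip": "192.0.2.10", "weight": 3, "healthy": True},
    {"ip": "192.0.2.20", "weight": 1, "healthy": True},
    {"ip": "192.0.2.30", "weight": 1, "healthy": False},  # failed its check
]
shares = weighted_shares(records)
short_ttl = rough_failover_seconds(ttl=60, check_interval=30, failure_threshold=3)
long_ttl = rough_failover_seconds(ttl=3600, check_interval=30, failure_threshold=3)
```

Under these assumptions a 60-second TTL moves clients off a failed IP within a few minutes, while a one-hour TTL can leave them pointed at it for most of an hour.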
As mentioned at the beginning of this article, Amazon recently announced that it has added Route 53 support to CloudWatch. This means you can use CloudWatch to monitor your Route 53 zones, check health, set threshold alarms and trigger events based on health data returned. Any Route 53 health check you configure gets turned into a CloudWatch metric, so it’s constantly available in the CloudWatch management console and viewable in a graph view as well as the raw metrics.
If you’re running your web site off of an EC2 server farm or are planning on making a foray into AWS you should definitely look into both Route 53 and CloudWatch. This combination not only helps with initial DNS configuration, but the CloudWatch integration now makes it especially useful for monitoring and acting upon events in an automated fashion. Check it out.
-Ryan Kennedy, Senior Cloud Engineer
There is an endless supply of articles talking about “the dangers of the hidden costs of cloud computing”. Week after week there’s a new article from a new source highlighting (in the same way) how the movement to cloud won’t help the bottom line of a business because all of the “true costs” are not fully considered by most until it’s “too late”. Too late for what? These articles are an empty defensive move because of the inevitable movement our industry is experiencing toward cloud. Now to be fair…are some things overlooked by folks? Yes. Do some people jump in too quickly and start deploying before they plan properly? Yes. Is cloud still emerging/evolving with architecture, deployment and cost models shifting on a quarterly (if not monthly) basis? Yes. But this is what makes cloud so exciting. It’s a chance for us to rethink how we leverage technology, and isn’t that what we’ve done for years in IT? Nobody talks about the hidden savings of cloud, nor do they talk about the unspoken costs of status quo infrastructure.
Before jumping into an organization that was cloud-first, I worked for 13 years, in many roles, at an infrastructure/data center-first organization, and we did very well and helped many people. However, as the years progressed and as cloud went from a gimmick to a fad to a buzzword to now a completely mainstream and enterprise IT computing platform, I saw a pattern developing in that traditional IT data center projects were costing more and more whereas cloud was looking like it cost less. I’ll give you an unnamed customer example.
Four years ago a customer of mine who was growing their virtual infrastructure (VMware) and their storage infrastructure (EMC) deployed a full data center solution of compute, storage and virtualization that cost in the 4 million dollar range. From then until now they added some additional capacity overall for about another 500K. They also went through a virtual infrastructure platform (code) upgrade as well as software upgrades to the storage and compute platforms. So this is the usual story…they made a large purchase (actually it was an operational lease, ironically like cloud could be), then added to it, and spent a ton of time and man hours doing engineering around the infrastructure just to maintain status quo. I can quantify the infrastructure but not the man hours, but I’m sure you know what I’m talking about.
Four years later, guess what’s happening: they have to go through it all over again! They need to refresh their SAN and basically redeploy everything: migrate all the data off, test, validate, etc. And how much is all of this? 6 to 7 million, plus a massive amount of services and about 4 months of project execution. To be fair, they grew over 100%, made some acquisitions, and some of their stuff has to be within their own data center. However, there are hidden costs here in my opinion. Technology manufacturers have got customers into this cycle of doing a refresh every 3 years. How? They bake the support (3 years’ worth) into the initial purchase so there is no operational expense. Then after 3 years, maintenance kicks in, which becomes very expensive, and they just run a spreadsheet showing how a refresh avoids “x” dollars in maintenance and how it’s worth it to just get new technology. Somehow that approach still works. There are massive amounts of professional services to execute the migration, a multi-month disruption to business, and no innovation from the IT department. It’s maintaining status quo. The only reductions that can be realized in this regard are hardware and software price decreases over time, which are historically based on Moore’s law. Do you want your IT budget and staff at the mercy of Moore’s law and technology manufacturers that use funky accounting to show you “savings”?
Now let’s look at the other side, and let’s be fair. In cloud there can be hidden costs, but in my mind they exist only if you do one thing: forget about planning. Even with cloud you need to apply the same plan, design, build, migrate, and manage methodology to your IT infrastructure. Just because cloud is easy to deploy doesn’t mean you should forget about the steps you normally take. But that isn’t a problem with cloud. It’s a problem with how people deploy into the cloud, and that’s an easy fix. If you control your methodology there should be no hidden costs because you properly planned, architected and built your cloud infrastructure. In theory this is true, but let’s look at the other side people fail to highlight…the hidden SAVINGS!!
With Amazon Web Services there have been 37 price reductions in the 7 years they have been selling their cloud platform. That’s a little more than 5 per year. Do you get that on an ongoing basis after you spend 5 million on traditional infrastructure? With this approach, once you sign up you are almost guaranteed to get a credit at some point in the lifecycle of your cloud infrastructure, and those price reductions are not based on Moore’s law. Those price reductions are based on AWS having very much the same approach to their cloud as they do to their retail business. Amazon wants to extend the value to customers that exists because of their size and scale, and they set margin limits on their services. Once they are “making too much” on a service or product they cut the cost. So as they grow and become more efficient and gain more market share with their cloud business, you save more!
Another bonus is that there are no refresh cycles or migration efforts every 3 years. Once you migrate to the cloud AWS manages all the infrastructure migration efforts. You don’t have to worry about your storage platform or your virtual infrastructure. Everything from the hypervisor down is on AWS, and you manage your operating system and application. What does that mean? You are not incurring massive services every 3-4 years for a 3rd party to help you design/build/migrate your stuff, and you aren’t spending 3-4 months every few years on a disruption to your business and your staff not innovating.
-David Stewart, Solutions Architect