
How to Achieve Redundancy for High Availability in the Cloud

In the last of our four-part blog series with our strategic partner, Alert Logic, we explore business resumption for cloud environments. Check out last week’s article on Free Tools and Tips for Testing the Security of Your Environment Against Attacks first.

Business resumption, also known as disaster recovery, has always been a challenge for organizations. Aside from those in the banking and investment industry, many businesses don’t take business resumption as seriously as they should.

I formerly worked at a financial institution that would send its teams to another city in another state, where production data was backed up and could be restored in the event of a disaster. Employees would go to this location and use the production systems there to complete their daily workloads. This provided the redundancy of a single site, but what if you could have many redundant sites? What if you could have a global backup option and have redundancy not only when you need it, but as a daily part of your business strategy?

To achieve true redundancy, I recommend understanding your service provider’s offerings. Each service provider has different facilities located in different regions that are spread between different telecom service providers.

From a customer’s perspective, this creates a good opportunity to build out an infrastructure that has fully redundant load balancers, giving your business a regional presence in almost every part of the world. In addition, you are able to deliver application speed and efficiency to your regional consumers.

Look closely at your provider’s services, such as hardware health monitoring, log management, security monitoring, and the management services that accompany those solutions. If you need to conform to certain compliance regulations, you also need to make sure the services and technologies meet each regulation.

Organize your vendors and managed service providers so that your data is centralized by service across all providers and all layers of the stack. This is when you need to make sure that your partners share data, can ingest logs, and integrate with each other through APIs to effectively secure your environment.

Additionally, centralize the notification process so you get one call per incident instead of multiple calls across providers. This means that API connectivity or log collection needs to happen between the technologies that correlate triggered events across multiple platforms. Centralized notification increases efficiency and shortens detection time, which helps mitigate the risks introduced into your environment by outside and inside influences.
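As a rough illustration of "one call per incident," the sketch below groups events from several providers into incidents when they hit the same resource within a short time window. The `Event` fields, the provider names, and the 300-second window are all assumptions made for this example; a real correlation engine would use far richer rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    provider: str   # hypothetical source name, e.g. "ids" or "waf"
    resource: str   # the asset the event refers to
    timestamp: int  # seconds since some common epoch


def correlate(events, window_seconds=300):
    """Group events per resource; an event joins the resource's open
    incident if it arrives within window_seconds of that incident's
    latest event, otherwise it starts a new incident."""
    incidents = []
    open_incident = {}  # resource -> index of its latest incident
    for ev in sorted(events, key=lambda e: e.timestamp):
        idx = open_incident.get(ev.resource)
        if idx is not None and ev.timestamp - incidents[idx][-1].timestamp <= window_seconds:
            incidents[idx].append(ev)
        else:
            incidents.append([ev])
            open_incident[ev.resource] = len(incidents) - 1
    return incidents


events = [
    Event("ids", "web-1", 0),
    Event("waf", "web-1", 100),
    Event("logs", "db-1", 120),
    Event("ids", "web-1", 1000),
]
incidents = correlate(events)
```

Each resulting incident would trigger a single notification, no matter how many providers reported it.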

Lastly, to find incidents as quickly as possible, you need to find a managed services provider that will be able to ingest and correlate all events and logs across all infrastructures. There are also cloud migration services that will help you with all these decisions as they help move you to the cloud.

Learn more about 2W Managed Cloud Security and how our partnership with Alert Logic can ensure your environment’s security

Article contributed by Alert Logic



Managing Your Amazon Cloud Deployment to Save Money

Wired Innovation Insights published a blog article written by our own Chris Nolan yesterday. Chris discusses ways you can save money on your AWS cloud deployment in “How to Manage Your Amazon Cloud Deployment to Save Money.” Chris’ top tips include:

  1. Use CloudFormation or another configuration and orchestration tool.
  2. Watch out for cloud sprawl.
  3. Use AWS auto scaling.
  4. Turn the lights off when you leave the room.
  5. Use tools to monitor spend.
  6. Build in redundancy.
  7. Planning saves money.



AWS Auto-Scaling

Auto-Scaling gives you the ability to scale your EC2 instances up or down according to demand so they can handle the load on the service. With Auto-Scaling you don’t have to worry about whether the number of instances you’re using can handle a demand spike, or whether you’re overspending during a slower period. Auto-Scaling adjusts capacity for you so that performance stays seamless.

For instance, if three m1.xlarge instances currently handle the service, and they spend most of their time only 20% loaded and a smaller portion of their time heavily loaded, they can be vertically scaled down to a smaller instance size and horizontally scaled out or in to more or fewer instances to automatically accommodate the load at any given time. This also saves money, because you pay only for the smaller instance size. More savings can be attained by using reserved instance billing for the minimum number of instances defined by the Auto-Scaling configuration and letting the scaled-out instances pay the on-demand rate while running. This is a little tricky, though, because an instance’s billing cannot be changed while the instance is running. When scaling in, make sure to terminate the newest instances, since they are running at the on-demand billing rate.

Vertical Scaling is typically referred to as scale-up or scale-down by changing the instance size, while Horizontal Scaling is typically referred to as scale-out or scale-in by changing the number of instances.
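To make the sizing and billing idea concrete, here is a minimal sketch. The capacity units and the hourly rates (in cents) are invented for illustration and are not real AWS prices; the billing split simply follows the reasoning above, with reserved pricing covering the minimum group size and on-demand pricing covering anything beyond it.

```python
import math


def instances_needed(load, capacity_per_instance, min_size, max_size):
    """Horizontal scaling: how many instances cover `load`,
    clamped to the group's minimum and maximum size."""
    wanted = math.ceil(load / capacity_per_instance)
    return max(min_size, min(max_size, wanted))


def hourly_cost_cents(running, reserved_count, reserved_rate, on_demand_rate):
    """Reserved pricing covers up to `reserved_count` instances;
    every instance beyond that pays the on-demand rate."""
    reserved = min(running, reserved_count)
    return reserved * reserved_rate + (running - reserved) * on_demand_rate


# Hypothetical numbers: each small instance handles 100 load units,
# reserved rate is 10 cents/hour, on-demand rate is 25 cents/hour.
quiet = instances_needed(load=50, capacity_per_instance=100, min_size=2, max_size=10)
busy = instances_needed(load=250, capacity_per_instance=100, min_size=2, max_size=10)
quiet_cost = hourly_cost_cents(quiet, reserved_count=2, reserved_rate=10, on_demand_rate=25)
busy_cost = hourly_cost_cents(busy, reserved_count=2, reserved_rate=10, on_demand_rate=25)
```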

When traffic to a service running on AWS increases or decreases, whether predictably or not, Auto-Scaling can keep customers happy with the service because response times stay more consistent and High Availability is more reliable.

Auto-Scaling to Improve HA

If there is only one server instance, Auto-scaling can be used to put a new server in place, in a few minutes, when the running one fails.  Just set both Min and Max number of instances to 1.
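The Min = Max = 1 trick works because the group always tries to bring the number of healthy instances back inside its size limits. The function below is a simplified model of that reconciliation step, not the actual Auto-Scaling logic; the name `capacity_delta` is made up for this sketch.

```python
def capacity_delta(healthy, min_size, max_size):
    """Return how many instances to launch (positive) or terminate
    (negative) so the healthy count falls within [min_size, max_size]."""
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    if healthy < min_size:
        return min_size - healthy
    if healthy > max_size:
        return max_size - healthy
    return 0


# With Min and Max both set to 1, a failed server (0 healthy instances)
# results in exactly one replacement being launched.
replacement = capacity_delta(healthy=0, min_size=1, max_size=1)
```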

Auto-Scaling to Improve Response Time Consistency

If there are multiple servers and the load on them becomes so heavy that the response time slows, expand horizontally only for the time necessary to cover the extra load, and keep the response time low.

AWS Auto-Scaling Options to Set

When Auto-Scaling up or down, there are a lot of things to think about:

  • Evaluation Period is the time, in seconds, between checks of the load on the Scaling Group.
  • Cool Down is the time, in seconds, that must pass after a scaling operation before a new scaling operation can be performed.  When scaling out, this time should be fairly short in case the load is too heavy for one Scale-Out operation. When scaling in, this time should be at least twice that of the Scale-Out operation.
  • With Scale-Out, make sure it scales fast enough to quickly handle a load heavier than one expansion. 300 seconds is a good starting point.
  • With Scale-In, make sure it scales slow enough to not keep going out and in.  We call this “Flapping”. Some call it “Thrashing”.
  • When the Auto-Scale Group includes multiple AZs, Scaling out and in should be incremented by the number of AZs involved. If only one AZ is scaled out and something happens to that AZ, the impact on users is much more noticeable.
  • Scale-In can be accomplished by different rules:
    1. Terminate Oldest Instance
    2. Terminate Newest Instance
    3. Terminate Instance Closest to the next Instance Hour (Best Cost Savings)
    4. Terminate Oldest Launch Configuration (default)
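The options above interact, so a small model can help when reasoning about them. The sketch below is a simplification under stated assumptions: one shared cooldown clock for the group, a step equal to the number of AZs, a Scale-In cooldown twice the 300-second Scale-Out value, and three of the termination rules expressed as functions of launch time. All names (`GroupState`, `try_scale`, `pick_for_termination`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroupState:
    size: int
    min_size: int
    max_size: int
    az_count: int                  # scale by one instance per AZ
    scale_out_cooldown: int = 300  # seconds
    scale_in_cooldown: int = 600   # at least twice the Scale-Out cooldown
    last_change: Optional[int] = None


def try_scale(state, direction, now):
    """direction is +1 (Scale-Out) or -1 (Scale-In).
    Returns True only if the group size actually changed."""
    cooldown = state.scale_out_cooldown if direction > 0 else state.scale_in_cooldown
    if state.last_change is not None and now - state.last_change < cooldown:
        return False  # still cooling down; this is what damps flapping
    target = state.size + direction * state.az_count
    target = max(state.min_size, min(state.max_size, target))
    if target == state.size:
        return False
    state.size = target
    state.last_change = now
    return True


def pick_for_termination(launch_times, rule, now):
    """launch_times maps instance ID -> launch time in seconds."""
    if rule == "oldest":
        return min(launch_times, key=launch_times.get)
    if rule == "newest":
        return max(launch_times, key=launch_times.get)
    if rule == "closest_to_hour":
        # The instance furthest into its current billed hour.
        return max(launch_times, key=lambda i: (now - launch_times[i]) % 3600)
    raise ValueError(f"unknown rule: {rule}")


group = GroupState(size=2, min_size=2, max_size=8, az_count=2)
results = [
    try_scale(group, +1, now=0),    # scales out to 4
    try_scale(group, +1, now=100),  # blocked by the Scale-Out cooldown
    try_scale(group, +1, now=300),  # scales out to 6
    try_scale(group, -1, now=700),  # blocked by the longer Scale-In cooldown
    try_scale(group, -1, now=900),  # scales in to 4
]
```

The longer Scale-In cooldown is what keeps the group from bouncing out and in when the load hovers near a threshold.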

Auto-Scaling Examples

Auto-Scaling is a two-stage process, and here is the rub: the AWS Management Console does not do Auto-Scaling, so it has to be done through the AWS APIs.

  1. Set up the Launch Configuration and assign it to a group of instances you want to control.  If there is no user_data file, that argument can be left out.  The block-device-mapping argument can be found in the details for the ami_id.
    • # as-create-launch-config <auto_scaling_launch_config_name> --region <region_name> --image-id <AMI_ID> --instance-type <type> --key <SSH_key_pair_name> --group <VPC_security_group_ID> --monitoring-enabled --user-data-file=<path_and_name_for_user_data_file> --block-device-mapping "<device_name>=<snap_id>:100:true:standard"
    • # as-create-auto-scaling-group <auto_scaling_group_name> --region <region_name> --launch-configuration <auto_scaling_launch_config_name> --vpc-zone-identifier <VPC_Subnet_ID>,<VPC_Subnet_ID> --availability-zones <Availability_Zone>,<Availability_Zone> --load-balancers <load_balancer_name> --min-size <min_number_of_instances_that_must_be_running> --max-size <max_number_of_instances_that_can_be_running> --health-check-type ELB --grace-period <time_seconds_before_first_check> --tag "k=Name, v=<friendly_name>, p=true"
  2. Have CloudWatch initiate Scaling Activities.  Use one CloudWatch Alarm for Scaling Out and one for Scaling In.  Also send notifications when scaling.
    • Scale Out (the Alarm Actions ARN output by the first command is used as an argument to the second command)
    • # as-put-scaling-policy --name <auto_scaling_policy_name_for_high_CPU> --region <region_name> --auto-scaling-group <auto_scaling_group_name> --adjustment <Number_of_instances_to_change_by> --type ChangeInCapacity --cooldown <time_in_seconds_to_wait_to_check_after_adding_instances>
    • # mon-put-metric-alarm --alarm-name <alarm_name_for_high_CPU> --region <region_name> --metric-name CPUUtilization --namespace AWS/EC2 --statistic Average --period <number_of_seconds_to_check_each_time_period> --evaluation-periods <number_of_periods_between_checks> --threshold <percent_number> --unit Percent --comparison-operator GreaterThanThreshold --alarm-description <description_use_alarm_name> --dimensions "AutoScalingGroupName=<auto_scaling_group_name>" --alarm-actions <arn_string_from_last_command>
    • Scale In (the Alarm Actions ARN output by the first command is used as an argument to the second command)
    • # as-put-scaling-policy --name <auto_scaling_policy_name_for_low_CPU> --region <region_name> --auto-scaling-group <auto_scaling_group_name> "--adjustment=-<Number_of_instances_to_change_by>" --type ChangeInCapacity --cooldown <time_in_seconds_to_wait_to_check_after_removing_instances>
    • # mon-put-metric-alarm --alarm-name <alarm_name_for_low_CPU> --region <region_name> --metric-name CPUUtilization --namespace AWS/EC2 --statistic Average --period <number_of_seconds_to_check_each_time_period> --evaluation-periods <number_of_periods_between_checks> --threshold <percent_number> --unit Percent --comparison-operator LessThanThreshold --alarm-description <description_use_alarm_name> --dimensions "AutoScalingGroupName=<auto_scaling_group_name>" --alarm-actions <arn_string_from_last_command>
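To see how the period, evaluation-periods, and threshold arguments work together, the sketch below models one common reading of alarm evaluation: the alarm fires only when the average for each of the most recent evaluation periods is on the wrong side of the threshold. This is a simplified model for illustration, not CloudWatch's implementation, and the function name and sample values are assumptions.

```python
def alarm_breached(period_averages, threshold, evaluation_periods, greater=True):
    """period_averages holds one average CPU value per period, oldest first.
    Returns True when every one of the last `evaluation_periods` values
    breaches the threshold in the chosen direction."""
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(period_averages) < evaluation_periods:
        return False  # not enough data yet
    recent = period_averages[-evaluation_periods:]
    if greater:
        return all(value > threshold for value in recent)
    return all(value < threshold for value in recent)


# High-CPU alarm (GreaterThanThreshold): three consecutive periods above 80%.
scale_out = alarm_breached([50, 85, 90, 95], threshold=80, evaluation_periods=3)

# A single dip resets the run, so this one does not fire.
no_scale_out = alarm_breached([85, 70, 90, 95], threshold=80, evaluation_periods=3)

# Low-CPU alarm (LessThanThreshold) for Scale-In.
scale_in = alarm_breached([30, 15, 10], threshold=20, evaluation_periods=2, greater=False)
```

Requiring several consecutive periods before acting is another guard against flapping on short spikes.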

AMI Changes Require Auto-Scaling Updates

The instance configuration could change for any number of reasons:

  • Security Patches
  • New Features added
  • Removal of un-used Old Features

Whenever the AMI specified in the Auto-Scaling definition is changed, the Auto-Scaling Group needs to be updated.  The update requires creating a new Scaling Launch Config with the new AMI ID, updating the Auto-Scaling Group, then deleting the old Scaling Launch Config.  Without this update the Scale out operation will use the old AMI.

1. Create new Launch Config:

# as-create-launch-config <new_auto_scaling_launch_config_name> --region <region_name> --image-id <AMI_ID> --instance-type <type> --key <SSH_key_pair_name> --group <VPC_security_group_ID> --monitoring-enabled --user-data-file=<path_and_name_for_user_data_file> --block-device-mapping "<device_name>=<snap_id>:100:true:standard"

2. Update Auto Scaling Group:

# as-update-auto-scaling-group <auto_scaling_group_name> --region <region_name> --launch-configuration <new_auto_scaling_launch_config_name> --vpc-zone-identifier <VPC_Subnet_ID>,<VPC_Subnet_ID> --availability-zones <Availability_Zone>,<Availability_Zone> --min-size <min_number_of_instances_that_must_be_running> --max-size <max_number_of_instances_that_can_be_running> --health-check-type ELB --grace-period <time_seconds_before_first_check>

3. Delete Old Launch Config:

# as-delete-launch-config <old_auto_scaling_launch_config_name> --region <region_name> --force

Now all Scale Outs should use the updated AMI.
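The order of the three steps matters, and a toy in-memory model shows why. The `Registry` class below is purely illustrative and does not call AWS; it assumes that a Launch Config still referenced by a group cannot be deleted, which is why the group is repointed before the old config is removed. The names and AMI IDs are made up.

```python
class Registry:
    """Toy stand-in for Launch Configs and Auto-Scaling Groups."""

    def __init__(self):
        self.launch_configs = {}  # launch config name -> AMI ID
        self.groups = {}          # group name -> launch config name

    def create_launch_config(self, name, ami_id):
        if name in self.launch_configs:
            raise RuntimeError(f"launch config {name} already exists")
        self.launch_configs[name] = ami_id

    def update_group(self, group, launch_config):
        if launch_config not in self.launch_configs:
            raise RuntimeError(f"unknown launch config {launch_config}")
        self.groups[group] = launch_config

    def delete_launch_config(self, name):
        if name in self.groups.values():
            raise RuntimeError(f"launch config {name} is still in use")
        del self.launch_configs[name]

    def ami_for_scale_out(self, group):
        """The AMI a new instance in this group would launch from."""
        return self.launch_configs[self.groups[group]]


def rotate_ami(registry, group, new_config, new_ami):
    old_config = registry.groups[group]
    registry.create_launch_config(new_config, new_ami)  # step 1
    registry.update_group(group, new_config)            # step 2
    registry.delete_launch_config(old_config)           # step 3
    return old_config


registry = Registry()
registry.create_launch_config("lc-v1", "ami-old")
registry.update_group("web", "lc-v1")
removed = rotate_ami(registry, "web", "lc-v2", "ami-new")
```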

-Charles Keagle, Senior Cloud Engineer


Highly Available (HA) vs. Highly Reliable (HR)

The other day I was working with my neighbor’s kid on his baseball fundamentals and I found myself repeating the phrase “Remember the Basics.”  Over the course of the afternoon we played catch, worked on swinging the bat, and fielded grounders until our hands hurt. As the afternoon slipped into the evening hours, I started to see that baseball and my business have several similarities.

My business is Cloud Computing, and my company, 2nd Watch, is working to pioneer Cloud Adoption with Enterprise businesses.  As we discover new ways of integrating systems, performing workloads, and running applications, it is important for us not to forget the fundamentals. One of the most basic elements of this is using the proper terminology. I’ve found that in many cases my customers, partners, and even my colleagues can have different meanings for many of the same terms. One example that comes up frequently is the difference between having a Highly Available infrastructure vs. Highly Reliable infrastructure. I want to bring special attention to these two terms and help to clearly define their meaning.

High Availability (HA) is based on designing and implementing systems that are proactively built with the operational capacity to meet their required performance. For example, within Cloud Computing we leverage tools like Elastic Load Balancing and Auto Scaling to automate the scaling of infrastructure to handle the variable demand for web sites. As traffic increases, servers are spun up to handle the load, and as it decreases, they are shut down again.  If a user cannot access a website or it is deemed “unavailable,” then you risk the potential loss of readership, sales, or the attrition of customers.

On the other hand, Highly Reliable (HR) systems have to do with your Disaster Recovery (DR) model and how well you prepare for catastrophic events. In Cloud Computing, we design for failure because anything can happen at any time. Having a proper Disaster Recovery plan in place will enable your business to keep running if problems arise. Any company with sensitive IT operations should look into a proper DR strategy, which will support the company’s ability to be resilient in the event of failure. While a well-planned DR scheme may cost you more money upfront, being able to support both your internal and external customers will pay off in spades if an event takes place that requires you to fail over.

In today’s business market it is important to take the assumptions out of our day-to-day conversations and make sure that we’re all on the same page. The difference between Highly Available and Highly Reliable systems is a great example of this. By simply going back to the fundamentals, we can make sure that our expectations are met and that our colleagues, partners, and clients understand both our spoken and written words.

-Blake Diers, Cloud Sales Executive