CloudWatch is a tool for monitoring Amazon Web Services (AWS) cloud resources. With CloudWatch you can gather and monitor metrics for many of your AWS assets. CloudWatch for AWS EC2 allows 10 pre-selected metrics that are polled at five minute frequencies. These pre-selected metrics include CPU Utilization, Disk Reads, Disk Read Operations, Disk Writes, Disk Write Operations, Network-In, Network-Out, Status Check Failed (Any), Status Check Failed (Instance), and Status Check Failed (System). These metrics are designed to give you the most relevant information to help keep your environment running smoothly. CloudWatch goes one step further and offers seven pre-selected metrics that poll at an increased frequency of one-minute intervals for an additional charge. With CloudWatch you can set alarms based on thresholds set on any of your metrics. The alarms can trigger you to receive status notifications or to have the environment take automated action. For example you can set an alarm to notify you if one of your instances is experiencing high CPU load. As you can see from the graph below we’re using CloudWatch to gain insight on an instance’s average CPU Utilization over a period of 1 hour at 5 minute intervals:
You can clearly see that at 19:10 the CPU Utilization is at zero and then spikes over the next 35 minutes and is at 100% CPU utilization. 100% CPU utilization lasts for longer than 10 minutes. Without any monitoring this could be a real problem as the CPU of the system is being completely taxed, and performance would undoubtedly become sluggish. If this was a webserver, users would experience dropped connections, timeouts, or very slow response times. In this example it doesn’t matter what is causing the CPU spike, it matters how you would deal with it. If this happened in the middle of the night you would experience downtime and a disruption to your business. With a lot riding on uninterrupted 24×7 operations, processes must be in place to withstand unexpected events like this. With CloudWatch, AWS makes monitoring a little easier and setting alarms based on resource thresholds simple. Here is one way to do it for our previous CPU Utilization example:
1. Go to https://console.aws.amazon.com/cloudwatch/
2. In the Dashboard go to Metrics and select the instance and metric name in question. On the Right side of the screen you should also see a button that says Create Alarm. (See figure below)
3. Once you hit Create Alarm, the page will allow you to set an Alarm Threshold based on parameters that you choose. We’ll call our threshold “High CPU” and give it a description “Alarm when CPU is 85% for 10 minutes or more”.
4. Additionally you have to set the parameters to trigger the alarm. We choose “Whenever CPU Utilization is 85% for 2 consecutive periods” (remember our periods are 5 minutes each). This means after 10 minutes in an alarm state our action will take place.
5. For Actions we select “Whenever this alarm: State is ALARM” send notification to our SNS Topic MyHighCPU and send an email. This will cause the trigger to send an email to an email address or distribution list. (See the figure below)
6. Finally we hit Create Alarm, and we get the following:
7. Finally you have to go to the email account of the address you entered and confirm the SNS Notification subscription. You should see a message that says: “You have chosen to subscribe to the topic: arn:aws:sns:us-west-1:xxxxxxxxxxxxx:MyHighCPU. To confirm this subscription, click or visit the link below (If this was in error no action is necessary). Confirm subscription.
Overall the process of creating alarms for a couple metrics is pretty straight forward and simple. It can get more complex when you incorporate more complex logic. For example you could have a couple EC2 instances in an Auto Scale Group behind an Elastic Load Balancer, and if CPU spiked over 85% for 10 minutes you could have the Auto Scale Group take immediate automated action to spin up additional EC2 instances to take on the increased load. When that presumed web traffic that was causing the CPU spike subsides you can have a trigger that scales back instances so you are no longer paying for them. With the power of CloudWatch managing your AWS systems can become completely automated, and you can react immediately to any problems or changing conditions.
In many environments the act of monitoring and managing systems can become complicated and burdensome leaving you little time for developing your website or application. At 2nd Watch we provide a suite of managed services (http://2ndwatch.com/cloud-services/managed-cloud-platform/) to help you free up time for more important aspects of your business. We can put much of the complex logic in place for you to help minimize your administrative cloud burden. We take a lot of the headache out of managing your own systems and ensure that your operations are secure, reliable, and compliant at the lowest possible cost.
-Derek Baltazar, Senior Cloud Engineer
The pervasive technology industry has created the cloud and all the acronyms that go with it. Growth is fun, and the cloud is the talk of the town. From the California Sun to the Kentucky coal mines we are going to the cloud, although Janis Joplin may have been there before her time. Focus and clarity will come later.
There is so much data being stored today that the biggest challenge is going to be how to quantify it, store it, access it and recover it. Cloud-based disaster recovery has broad-based appeal across industry and segment size. Using a service from the AWS cloud enables more efficient disaster recovery of mission critical applications without any upfront cost or commitment. AWS allows customers to provision virtual private clouds using its infrastructure, which offers complete network isolation and security. The cloud can be used to configure a “pilot-light” architecture, which dramatically reduces cost over traditional data centers where the concept of “pilot” or “warm” is not an option – you pay for continual use of your infrastructure whether it’s used or not. With AWS, you only use what you pay for, and you have complete control of your data and its security.
Backing data up is relatively simple: select an object to be backed up and click a button. More often than not, the encrypted data reaches its destination, whether in a local storage device or to an S3 bucket in an AWS region in Ireland. Restoring the data has always been a perpetual challenge. What the cloud does is make ing of the backup capabilities more flexible and more cost effective. As the cost of cloud-based ing falls rapidly, from thousands of dollars or dinars, to hundreds, it results in more ing, and therefore, more success after a failure whether it’s from a superstore or superstorm, or even a supermodel one.
-Nick Desai, Solutions Architect
October 28th marked the one year anniversary of Hurricane Sandy, the epic storm that ravaged the Northeastern part of the United States. Living in NJ where the hurricane made landfall and having family that lives across much of the state we personally lived through the hurricane and its aftermath. It’s hard to believe that it’s been a year already. It’s an experience we’ll never forget, and we have made plans to ensure that we’re prepared in case anything like that happens again. Business mirrors Life in many cases, and when I speak with customers across the country the topic of disaster recovery comes up often. The conversations typically have the following predictable patterns:
- I’ve just inherited the technology and systems of company X (we’ll call it company X to protect the innocent), and we have absolutely no backup or disaster recovery strategy at all. Can you help me?
- We had a disaster recovery strategy, but we haven’t really looked at it in a very long time, I’ve heard Cloud Computing can help me. Is that true?
- We have a disaster recovery approach we’re thinking about. Can you review it and validate that we’re leveraging best practices?
- I’m spending a fortune on disaster recovery gear that just sits idle 95% of the time. There has to be a better way.
The list is endless with permutations, and yes there is a better way. Disaster recovery as a workload is a very common one for a Cloud Computing solution, and there’s a number of ways you can approach it. As with anything there are tradeoffs of cost vs. functionality and typically depends on the business requirements. For example a full active/active environment where you need complete redundancy and sub second failover can be costly but potentially necessary depending on your business requirements. In the Financial Services industry for example, having revenue generating systems down for even a few seconds can cost a company millions of dollars.
We have helped companies of all sizes think about, design and implement disaster recovery strategies. From Pilot Lights, where there’s just the glimmer of an environment, to warm standby’s to fully redundant systems. The first step is to plan for the future and not wait until it’s too late.
-Mike Triolo, General Manager East
Amazon and 2nd Watch have published numerous white papers and blog articles on various ways to use Amazon Web Services™ (AWS) for a disaster recovery strategy. And there is no doubt at all that AWS is an excellent place to run a disaster recovery environment for on premise data centers and save companies enormous amounts of capital while preserving their business with the security of a full DR plan. For more on this, I’ll refer you to our DR is Dead article as an example of how this works.
What happens though, when you truly cannot have any downtime for your systems or a subset of your systems? Considering recent events like Hurricanes Sandy and Katrina, how do critical systems use AWS for DR when Internet connectivity cannot be guaranteed? How can cities prone to earthquakes justify putting DR systems in the Cloud when true disasters they have to consider involve large craters and severed fiber cables? Suddenly having multiple Internet providers doesn’t seem quite enough redundancy when all systems are Cloud based. Now to be fair, in such major catastrophes most users have greater concerns than ‘can I get my email?’ or ‘where’s that TPS report?’ but what if your systems are supporting first responders? DR has an even greater level of importance.
Typically, this situation is what keeps systems that have links to first responders, medical providers, and government from adopting a Cloud strategy or Cloud DR strategy. This is where a Reverse DR strategy has merit: moving production systems into AWS but keeping a pilot light environment on premise. I won’t reiterate the benefits of moving to AWS, there are more articles on this than I can possibly reference (but please, contact the 2nd Watch sales team and they’ll be happy to expound upon the benefits!) or rehash Ryan’s article on DR is Dead. What I will say is this: if you can move to AWS without risking those critical disaster response systems, why aren’t you?
By following the pilot light model in reverse, customers can leave enough on premise to keep the lights on in the event of disaster. With regularly scheduled s to make sure those on premise systems are running and sufficient for emergencies, customers can take advantage of the Cloud for a significant portion of their environments. From my experiences, once an assessment is completed to validate which systems are required on premise to support enough staff in the event of a disaster, most customers find themselves able to move 90%+ of their environment to the Cloud, save a considerable amount of money, and suffer no loss of functionality.
So put everything you’ve been told about DR in the Cloud in reverse, move your production environments to AWS and leave just enough to handle those pesky hurricanes on premise, and you’ve got yourself a reverse DR strategy using AWS.
-Keith Homewood, Cloud Architect
Databases tend to host the most critical data your business has. From orders, customers, products and even employee information – it’s everything that your business depends on. How much of that can you afford to lose?
With AWS you have options for database recovery depending on your budget and Recovery Time Objective (RTO).
Low budget/Long RTO:
- Whether you are in the cloud or on premise, using the AWS Command Line Interface (CLI) tools you can script uploads of your database backups directly to S3. This can be added as a step to an existing backup job or an entirely new job.
- Another option would be to use a third party tool to mount an S3 bucket as a drive. It’s possible to backup directly to the S3 bucket, but if you have write issues you may need to write the backup locally and then move it to the mounted drive.
These methods have a longer RTO as they will require you to stand up a new DB server and then restore the backups, but is a low cost solution to ensure you can recover your business. The catch here is that you can only restore to the last backup that you have taken and copied to S3. You may want to review you backup plans to ensure you are comfortable with what you may lose. Just make sure you use the native S3 lifecycle policies to purge old backups otherwise your storage bill will slowly get out of hand.
High budget/short RTO:
- Almost all mainstream Relational Database Management Systems (RDBMS) have a native method of replication. You can setup an EC2 Instance database server to replicate your database to. This can be in real-time so that you can be positive that you will not lose a single transaction.
- What about RDS? While you cannot use native RDBMS replication there are third party replication tools that will do Change Data Capture (CDC) replication directly to RDS. These can be easier to setup than the native replication methods, but you will want to make sure you are monitoring these tools to ensure that you do not get into a situation where you can lose transactional data.
Since this is DR you can lower the cost of these solutions by downsizing the RDS or EC2 instance. This will increase the RTO as you will need to manually resize the instances in the event of failure, but can be a significant cost saver. Both of these solutions will require connectivity to the instance over VPN or Direct Connect.
Another benefit of this solution is that it can easily be utilized for QA, Testing and development needs. You can easily snapshot the RDS or EC2 instance and stand up a new one to work against. When you are done – just terminate it.
With all database DR solutions, make sure you script out the permissions & server configurations. This either needs to be saved off with the backups or applied to RDS/EC2 instances. These are constantly changing and can create recovery issues if you do not account for them.
With an AWS database recovery plan you can avoid losing critical business data.
-Mike Izumi, Cloud Architect
Backup and disaster recovery often require solutions that add complexity and additional cost to properly synchronize your data and systems. Amazon Web Services™ (AWS) helps drive this cost and complexity with a number of services. Amazon S3 provides a highly durable (99.999999999%) storage platform for your backups. This service backs up your data to multiple availability zones (AZ) to provide you the ultimate peace of mind for your data. AWS also provides an ultra-low cost service for long-term cold storage that is aptly named Glacier. At $0.01 per GB / month this service will force you to ask, “Why am I not using AWS today?”
AWS has developed the AWS Storage Gateway to make your backups secure and efficient. For only $125 per backup location per month, you will have a robust solution that provides the following features:
- Secure transfers of all data to AWS S3 storage
- Compatible with your current architecture – there is no need to call up your local storage vendor for a special adapter or firmware version to use Storage Gateway
- Designed for AWS – this provides a seamless integration of your current environment to AWS services
AWS Storage Gateway and Amazon EC2 (snapshots of machine images) together provide a simple cloud-hosted DR solution. Amazon EC2 allows you to quickly launch images of your production environment in AWS when you need them. The AWS Storage Gateway seamlessly orchestrates with S3 to provide you a robust backup and disaster recovery solution that meets anyone’s budget.
-Matt Whitney, Sales Executive