IT infrastructure is the hardware, network, services and software required for enterprise IT. It is the foundation that enables organizations to deliver IT services to their users. Disaster recovery (DR) is preparing for and recovering from natural and people-related disasters that impact IT infrastructure for critical business functions. Natural disasters include earthquakes, fires, etc. People-related disasters include human error, terrorism, etc. Business continuity differs from DR as it involves keeping all aspects of the organization functioning, not just IT infrastructure.
When planning for DR, companies must establish a recovery time objective (RTO) and recovery point objective (RPO) for each critical IT service. RTO is the acceptable amount of time in which an IT service must be restored. RPO is the acceptable amount of data loss measured in time. Companies establish both RTOs and RPOs to mitigate financial and other types of loss to the business. Companies then design and implement DR plans to effectively and efficiently recover the IT infrastructure necessary to run critical business functions.
For companies with corporate datacenters, the traditional approach to DR involves duplicating IT infrastructure at a secondary location to ensure available capacity in a disaster. The key downside is IT infrastructure must be bought, installed and maintained in advance to address anticipated capacity requirements. This often causes IT infrastructure in the secondary location to be over-procured and under-utilized. In contrast, Amazon Web Services (AWS) provides companies with access to enterprise-grade IT infrastructure that can be scaled up or down for DR as needed.
The four most common DR architectures on AWS are:
- Backup and Restore ($) – Companies can use their current backup software to replicate data into AWS. Companies use Amazon S3 for short-term archiving and Amazon Glacier for long-term archiving. In the event of a disaster, data can be made available on AWS infrastructure or restored from the cloud back onto an on-premise server.
- Pilot Light ($$) – While backup and restore are focused on data, pilot light includes applications. Companies only provision core infrastructure needed for critical applications. When disaster strikes, Amazon Machine Images (AMIs) and other automation services are used to quickly provision the remaining environment for production.
- Warm Standby ($$$) – Taking the Pilot Light model one step further, warm standby creates an active/passive cluster. The minimum amount of capacity is provisioned in AWS. When needed, the environment rapidly scales up to meet full production demands. Companies receive (near) 100% uptime and (near) no downtime.
- Hot Standby ($$$$) – Hot standby is an active/active cluster with both cloud and on-premise components to it. Using weighted DNS load-balancing, IT determines how much application traffic to process in-house and on AWS. If a disaster or spike in load occurs, more or all of it can be routed to AWS with auto-scaling.
In a non-disaster environment, warm standby DR is not scaled for full production, but is fully functional. To help adsorb/justify cost, companies can use the DR site for non-production work, such as quality assurance, ing, etc. For hot standby DR, cost is determined by how much production traffic is handled by AWS in normal operation. In the recovery phase, companies only pay for what they use in addition and for the duration the DR site is at full scale. In hot standby, companies can further reduce the costs of their “always on” AWS servers with Reserved Instances (RIs).
Smart companies know disaster is not a matter of if, but when. According to a study done by the University of Oregon, every dollar spent on hazard mitigation, including DR, saves companies four dollars in recovery and response costs. In addition to cost savings, smart companies also view DR as critical to their survival. For example, 51% of companies that experienced a major data loss closed within two years (Source: Gartner), and 44% of companies that experienced a major fire never re-opened (Source: EBM). Again, disaster is not a ready of if, but when. Be ready.
-Josh Lowry, General Manager – West
The jump to the cloud can be a scary proposition. For an enterprise with systems deeply embedded in traditional infrastructure like back office computer rooms and datacenters the move to the cloud can be daunting. The thought of having all of your data in someone else’s hands can make some IT admins cringe. However, once you start looking into cloud technologies you start seeing some of the great benefits, especially with providers like Amazon Web Services (AWS). The cloud can be cost-effective, elastic and scalable, flexible, and secure. That same IT admin cringing at the thought of their data in someone else’s hands may finally realize that AWS is a bit more secure than a computer rack sitting under an employee’s desk in a remote office. Once the decision is finally made to “try out” the cloud, the planning phase can begin.
Most of the time the biggest question is, “How do we start with the cloud?” The answer is to use a phased approach. By picking applications and workloads that are less mission critical, you can try the newest cloud technologies with less risk. When deciding which workloads to move, you should ask yourself the following questions; Is there a business need for moving this workload to the cloud? Is the technology a natural fit for the cloud? What impact will this have on the business? If all those questions are suitably answered, your workloads will be successful in the cloud.
One great place to start is with archiving and backups. These types of workloads are important, but the data you’re dealing with is likely just a copy of data you already have, so it is considerably less risky. The easiest way to start with archives and backups is to try out S3 and Glacier. Many of today’s backup utilities you may already be using, like Symantec Netbackup and Veeam Backup & Replication, have cloud versions that can directly backup to AWS. This allows you to use start using the cloud without changing much of your embedded backup processes. By moving less critical workloads you are taking the first steps in increasing your cloud footprint.
Now that you have moved your backups to AWS using S3 and Glacier, what’s next? The next logical step would be to try some of the other services AWS offers. Another workload that can often be moved to the cloud is Disaster Recovery. DR is an area that will allow you to more AWS services like VPC, EC2, EBS, RDS, Route53 and ELBs. DR is a perfect way to increase your cloud footprint because it will allow you to construct your current environment, which you should already be very familiar with, in the cloud. A Pilot Light DR solution is one type of DR solution commonly seen in AWS. In the Pilot Light scenario the DR site has minimal systems and resources with the core elements already configured to enable rapid recovery once a disaster happens. To build a Pilot Light DR solution you would create the AWS network infrastructure (VPC), deploy the core AWS building blocks needed for the minimal Pilot Light configuration (EC2, EBS, RDS, and ELBs), and determine the process for recovery (Route53). When it is time for recovery all the other components can be quickly provisioned to give you a fully working environment. By moving DR to the cloud you’ve increased your cloud footprint even more and are on your way to cloud domination!
The next logical step is to move Test and Dev environments into the cloud. Here you can get creative with the way you use the AWS technologies. When building systems on AWS make sure to follow the Architecting Best Practices: Designing for failure means nothing will fail, decouple your components, take advantage of elasticity, build security into every layer, think parallel, and don’t fear constraints! Start with proof-of-concept (POC) to the development environment, and use AWS reference architecture to aid in the learning and planning process. Next your legacy application in the new environment and migrate data. The POC is not complete until you validate that it works and performance is to your expectations. Once you get to this point, you can reevaluate the build and optimize it to exact specifications needed. Finally, you’re one step closer to deploying actual production workloads to the cloud!
Production workloads are obviously the most important, but with the phased approach you’ve taken to increase your cloud footprint, it’s not that far of a jump from the other workloads you now have running in AWS. Some of the important things to remember to be successful with AWS include being aware of the rapid pace of the technology (this includes improved services and price drops), that security is your responsibility as well as Amazon’s, and that there isn’t a one-size-fits-all solution. Lastly, all workloads you implement in the cloud should still have stringent security and comprehensive monitoring as you would on any of your on-premises systems.
Overall, a phased approach is a great way to start using AWS. Start with simple services and traditional workloads that have a natural fit for AWS (e.g. backups and archiving). Next, start to explore other AWS services by building out environments that are familiar to you (e.g. DR). Finally, experiment with POCs and the entire gambit of AWS to benefit for more efficient production operations. Like many new technologies it takes time for adoption. By increasing your cloud footprint over time you can set expectations for cloud technologies in your enterprise and make it a more comfortable proposition for all.
-Derek Baltazar, Senior Cloud Engineer
October 28th marked the one year anniversary of Hurricane Sandy, the epic storm that ravaged the Northeastern part of the United States. Living in NJ where the hurricane made landfall and having family that lives across much of the state we personally lived through the hurricane and its aftermath. It’s hard to believe that it’s been a year already. It’s an experience we’ll never forget, and we have made plans to ensure that we’re prepared in case anything like that happens again. Business mirrors Life in many cases, and when I speak with customers across the country the topic of disaster recovery comes up often. The conversations typically have the following predictable patterns:
- I’ve just inherited the technology and systems of company X (we’ll call it company X to protect the innocent), and we have absolutely no backup or disaster recovery strategy at all. Can you help me?
- We had a disaster recovery strategy, but we haven’t really looked at it in a very long time, I’ve heard Cloud Computing can help me. Is that true?
- We have a disaster recovery approach we’re thinking about. Can you review it and validate that we’re leveraging best practices?
- I’m spending a fortune on disaster recovery gear that just sits idle 95% of the time. There has to be a better way.
The list is endless with permutations, and yes there is a better way. Disaster recovery as a workload is a very common one for a Cloud Computing solution, and there’s a number of ways you can approach it. As with anything there are tradeoffs of cost vs. functionality and typically depends on the business requirements. For example a full active/active environment where you need complete redundancy and sub second failover can be costly but potentially necessary depending on your business requirements. In the Financial Services industry for example, having revenue generating systems down for even a few seconds can cost a company millions of dollars.
We have helped companies of all sizes think about, design and implement disaster recovery strategies. From Pilot Lights, where there’s just the glimmer of an environment, to warm standby’s to fully redundant systems. The first step is to plan for the future and not wait until it’s too late.
-Mike Triolo, General Manager East
Amazon and 2nd Watch have published numerous white papers and blog articles on various ways to use Amazon Web Services™ (AWS) for a disaster recovery strategy. And there is no doubt at all that AWS is an excellent place to run a disaster recovery environment for on premise data centers and save companies enormous amounts of capital while preserving their business with the security of a full DR plan. For more on this, I’ll refer you to our DR is Dead article as an example of how this works.
What happens though, when you truly cannot have any downtime for your systems or a subset of your systems? Considering recent events like Hurricanes Sandy and Katrina, how do critical systems use AWS for DR when Internet connectivity cannot be guaranteed? How can cities prone to earthquakes justify putting DR systems in the Cloud when true disasters they have to consider involve large craters and severed fiber cables? Suddenly having multiple Internet providers doesn’t seem quite enough redundancy when all systems are Cloud based. Now to be fair, in such major catastrophes most users have greater concerns than ‘can I get my email?’ or ‘where’s that TPS report?’ but what if your systems are supporting first responders? DR has an even greater level of importance.
Typically, this situation is what keeps systems that have links to first responders, medical providers, and government from adopting a Cloud strategy or Cloud DR strategy. This is where a Reverse DR strategy has merit: moving production systems into AWS but keeping a pilot light environment on premise. I won’t reiterate the benefits of moving to AWS, there are more articles on this than I can possibly reference (but please, contact the 2nd Watch sales team and they’ll be happy to expound upon the benefits!) or rehash Ryan’s article on DR is Dead. What I will say is this: if you can move to AWS without risking those critical disaster response systems, why aren’t you?
By following the pilot light model in reverse, customers can leave enough on premise to keep the lights on in the event of disaster. With regularly scheduled s to make sure those on premise systems are running and sufficient for emergencies, customers can take advantage of the Cloud for a significant portion of their environments. From my experiences, once an assessment is completed to validate which systems are required on premise to support enough staff in the event of a disaster, most customers find themselves able to move 90%+ of their environment to the Cloud, save a considerable amount of money, and suffer no loss of functionality.
So put everything you’ve been told about DR in the Cloud in reverse, move your production environments to AWS and leave just enough to handle those pesky hurricanes on premise, and you’ve got yourself a reverse DR strategy using AWS.
-Keith Homewood, Cloud Architect