October 28th marked the one-year anniversary of Hurricane Sandy, the epic storm that ravaged the Northeastern United States. Living in NJ, where the hurricane made landfall, and having family across much of the state, we personally lived through the hurricane and its aftermath. It’s hard to believe that it’s been a year already. It’s an experience we’ll never forget, and we have made plans to ensure that we’re prepared in case anything like that happens again. Business mirrors life in many cases, and when I speak with customers across the country, the topic of disaster recovery comes up often. The conversations typically follow these predictable patterns:
- I’ve just inherited the technology and systems of company X (we’ll call it company X to protect the innocent), and we have absolutely no backup or disaster recovery strategy at all. Can you help me?
- We had a disaster recovery strategy, but we haven’t really looked at it in a very long time. I’ve heard Cloud Computing can help me. Is that true?
- We have a disaster recovery approach we’re thinking about. Can you review it and validate that we’re leveraging best practices?
- I’m spending a fortune on disaster recovery gear that just sits idle 95% of the time. There has to be a better way.
The list of permutations is endless, and yes, there is a better way. Disaster recovery is a very common workload for a Cloud Computing solution, and there are a number of ways you can approach it. As with anything, there are tradeoffs of cost vs. functionality, and the right balance typically depends on the business requirements. For example, a full active/active environment with complete redundancy and sub-second failover can be costly but may be necessary for your business. In the Financial Services industry, for example, having revenue-generating systems down for even a few seconds can cost a company millions of dollars.
We have helped companies of all sizes think about, design and implement disaster recovery strategies, from Pilot Lights, where there’s just the glimmer of an environment, to warm standbys, to fully redundant systems. The first step is to plan for the future and not wait until it’s too late.
-Mike Triolo, General Manager East
Amazon and 2nd Watch have published numerous white papers and blog articles on various ways to use Amazon Web Services™ (AWS) for a disaster recovery strategy. There is no doubt at all that AWS is an excellent place to run a disaster recovery environment for on premise data centers, saving companies enormous amounts of capital while protecting their business with the security of a full DR plan. For more on this, I’ll refer you to our DR is Dead article as an example of how this works.
What happens, though, when you truly cannot have any downtime for your systems or a subset of your systems? Considering events like Hurricanes Sandy and Katrina, how do critical systems use AWS for DR when Internet connectivity cannot be guaranteed? How can cities prone to earthquakes justify putting DR systems in the Cloud when the true disasters they have to consider involve large craters and severed fiber cables? Suddenly, having multiple Internet providers doesn’t seem like quite enough redundancy when all systems are Cloud-based. Now to be fair, in such major catastrophes most users have greater concerns than ‘can I get my email?’ or ‘where’s that TPS report?’ but what if your systems are supporting first responders? Then DR takes on an even greater level of importance.
Typically, this situation is what keeps systems that have links to first responders, medical providers, and government from adopting a Cloud strategy or Cloud DR strategy. This is where a Reverse DR strategy has merit: moving production systems into AWS but keeping a pilot light environment on premise. I won’t reiterate the benefits of moving to AWS (there are more articles on this than I can possibly reference, but please contact the 2nd Watch sales team and they’ll be happy to expound upon the benefits!) or rehash Ryan’s DR is Dead article. What I will say is this: if you can move to AWS without risking those critical disaster response systems, why aren’t you?
By following the pilot light model in reverse, customers can leave enough on premise to keep the lights on in the event of disaster. With regularly scheduled tests to make sure those on premise systems are running and sufficient for emergencies, customers can take advantage of the Cloud for a significant portion of their environments. In my experience, once an assessment is completed to validate which systems are required on premise to support enough staff in the event of a disaster, most customers find themselves able to move 90%+ of their environment to the Cloud, save a considerable amount of money, and suffer no loss of functionality.
So put everything you’ve been told about DR in the Cloud in reverse, move your production environments to AWS and leave just enough to handle those pesky hurricanes on premise, and you’ve got yourself a reverse DR strategy using AWS.
-Keith Homewood, Cloud Architect
Databases tend to host the most critical data your business has. From orders, customers and products to employee information – it’s everything that your business depends on. How much of that can you afford to lose?
With AWS you have options for database recovery depending on your budget and Recovery Time Objective (RTO).
Low budget/Long RTO:
- Whether you are in the cloud or on premise, using the AWS Command Line Interface (CLI) tools you can script uploads of your database backups directly to S3. This can be added as a step to an existing backup job or an entirely new job.
- Another option would be to use a third party tool to mount an S3 bucket as a drive. It’s possible to back up directly to the S3 bucket, but if you have write issues you may need to write the backup locally and then move it to the mounted drive.
These methods have a longer RTO, as they require you to stand up a new DB server and then restore the backups, but they are a low-cost way to ensure you can recover your business. The catch here is that you can only restore to the last backup that you have taken and copied to S3. You may want to review your backup plans to ensure you are comfortable with what you may lose. Just make sure you use the native S3 lifecycle policies to purge old backups; otherwise your storage bill will slowly get out of hand.
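As a rough illustration of the low-budget approach above, here is a minimal Python sketch that builds a timestamped `aws s3 cp` command for a backup file and an S3 lifecycle configuration that expires old backups. The bucket name, file paths, `db-backups/` prefix, and 30-day retention are all assumptions made for the example, not recommendations; the helper names are hypothetical.

```python
import json
from datetime import datetime, timezone


def backup_key(db_name: str, taken_at: datetime, prefix: str = "db-backups/") -> str:
    """Build a timestamped S3 object key so each backup is stored separately."""
    stamp = taken_at.strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}{db_name}/{db_name}-{stamp}.bak"


def upload_command(local_path: str, bucket: str, key: str) -> list:
    """Return the argument list for `aws s3 cp`, suitable for subprocess.run()."""
    return ["aws", "s3", "cp", local_path, f"s3://{bucket}/{key}"]


def expiry_lifecycle(prefix: str, keep_days: int) -> dict:
    """Lifecycle configuration that expires objects under `prefix` after `keep_days` days."""
    return {
        "Rules": [
            {
                "ID": f"purge-backups-after-{keep_days}-days",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": keep_days},
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical database name, local path, and bucket, for illustration only.
    key = backup_key("orders", datetime.now(timezone.utc))
    print(" ".join(upload_command("/var/backups/orders.bak", "example-dr-bucket", key)))
    print(json.dumps(expiry_lifecycle("db-backups/", 30), indent=2))
```

The printed command can be added as a final step of an existing backup job. The JSON can be saved to a file and applied with `aws s3api put-bucket-lifecycle-configuration --bucket <bucket> --lifecycle-configuration file://<file>`.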
High budget/short RTO:
- Almost all mainstream Relational Database Management Systems (RDBMS) have a native method of replication. You can set up a database server on an EC2 instance as the replication target. This can run in real time so that you can be positive that you will not lose a single transaction.
- What about RDS? While you cannot use native RDBMS replication, there are third party replication tools that will do Change Data Capture (CDC) replication directly to RDS. These can be easier to set up than the native replication methods, but you will want to monitor these tools to ensure that you do not get into a situation where you can lose transactional data.
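Whichever replication method you choose, the monitoring mentioned above usually comes down to watching how far the replica trails the source. The following Python sketch shows one way such a check could look. It assumes you can already obtain the timestamp of the most recent change applied on the replica (how you get it depends on your RDBMS or replication tool), and the function names and thresholds are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(last_applied_at: datetime, now: datetime) -> timedelta:
    """Time between the newest change applied on the replica and now (never negative)."""
    return max(now - last_applied_at, timedelta(0))


def lag_status(lag: timedelta, warn_after: timedelta, alert_after: timedelta) -> str:
    """Classify the lag against two thresholds chosen from your own recovery objectives."""
    if lag >= alert_after:
        return "ALERT"
    if lag >= warn_after:
        return "WARN"
    return "OK"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical reading: the replica last applied a change 90 seconds ago.
    lag = replication_lag(now - timedelta(seconds=90), now)
    print(lag_status(lag, warn_after=timedelta(minutes=1), alert_after=timedelta(minutes=5)))
```

A check like this could run on a schedule and feed whatever alerting you already use, so that a stalled replication job is noticed before a disaster rather than during one.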
Since this is DR you can lower the cost of these solutions by downsizing the RDS or EC2 instance. This will increase the RTO as you will need to manually resize the instances in the event of failure, but can be a significant cost saver. Both of these solutions will require connectivity to the instance over VPN or Direct Connect.
Another benefit of this solution is that it can easily be utilized for QA, Testing and development needs. You can easily snapshot the RDS or EC2 instance and stand up a new one to work against. When you are done – just terminate it.
With all database DR solutions, make sure you script out the permissions & server configurations. These either need to be saved off with the backups or applied to the RDS/EC2 instances. They change constantly and can create recovery issues if you do not account for them.
With an AWS database recovery plan you can avoid losing critical business data.
-Mike Izumi, Cloud Architect
Backup and disaster recovery often require solutions that add complexity and additional cost to properly synchronize your data and systems. Amazon Web Services™ (AWS) helps drive down this cost and complexity with a number of services. Amazon S3 provides a highly durable (99.999999999%) storage platform for your backups. This service backs up your data to multiple availability zones (AZs) to provide you the ultimate peace of mind for your data. AWS also provides an ultra-low cost service for long-term cold storage that is aptly named Glacier. At $0.01 per GB / month, this service will force you to ask, “Why am I not using AWS today?”
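To put those two figures in perspective, here is a small Python sketch of the arithmetic. It uses only the numbers quoted above ($0.01 per GB per month for Glacier and 99.999999999% durability for S3); the 5,000 GB and 10,000,000-object inputs are hypothetical, and current AWS pricing should be checked before relying on any result.

```python
from decimal import Decimal

# Figures as quoted in the article; actual pricing may differ.
GLACIER_USD_PER_GB_MONTH = Decimal("0.01")
S3_DURABILITY = Decimal("0.99999999999")  # eleven nines, per year


def monthly_glacier_cost(stored_gb: int) -> Decimal:
    """Monthly storage cost in USD for `stored_gb` gigabytes at the quoted rate."""
    return GLACIER_USD_PER_GB_MONTH * stored_gb


def expected_annual_losses(object_count: int) -> Decimal:
    """Average number of objects expected to be lost per year at the quoted durability."""
    return (Decimal(1) - S3_DURABILITY) * object_count


if __name__ == "__main__":
    # Hypothetical workload: 5,000 GB of cold backups and ten million stored objects.
    print("Monthly Glacier cost (USD):", monthly_glacier_cost(5000))
    print("Expected objects lost per year:", expected_annual_losses(10_000_000))
```

Decimal arithmetic is used here because subtracting eleven nines from 1 in binary floating point introduces rounding noise that would obscure such a small number.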
AWS has developed the AWS Storage Gateway to make your backups secure and efficient. For only $125 per backup location per month, you will have a robust solution that provides the following features:
- Secure transfers of all data to AWS S3 storage
- Compatible with your current architecture – there is no need to call up your local storage vendor for a special adapter or firmware version to use Storage Gateway
- Designed for AWS – this provides a seamless integration of your current environment to AWS services
AWS Storage Gateway and Amazon EC2 (snapshots of machine images) together provide a simple cloud-hosted DR solution. Amazon EC2 allows you to quickly launch images of your production environment in AWS when you need them. The AWS Storage Gateway seamlessly orchestrates with S3 to provide you a robust backup and disaster recovery solution that meets anyone’s budget.
-Matt Whitney, Sales Executive
Having been in the IT Industry since the 90s, I’ve seen many iterations on Disaster Recovery principles and methodologies. The concept of DR of course far predates my tenure in the field, as the idea started coming about in the 1970s as businesses began to realize their dependence on information systems and the criticality of those services.
Over the past decade or so we’ve really seen the concept of running a DR site at a colo facility (either leased or owned) become a popular way for organizations to have a rapidly available disaster recovery option. The problem with a colo facility is that it is EXPENSIVE! In addition to potentially huge CapEx (if you are buying your own infrastructure), you have the facility and infrastructure OpEx and all the overhead expense of managing those systems and everything that comes along with that. In steps the cloud… AWS and the other players in the public cloud arena provide you the ability to run a DR site without really having any CapEx. Now you are only paying for the virtual infrastructure that you are actually using as an operational cost.
An intelligently designed DR solution could leverage something like Amazon’s Pilot Light to keep your costs reduced by running the absolute minimal core infrastructure needed to keep the DR site fully ready to scale up to production. Well, that is a big improvement over purchasing millions of dollars of hardware and having thousands and thousands of dollars in OpEx and overhead costs every month. Even still… there is a better way. If you architect your infrastructure and applications following the AWS best practices, then in a perfect world there is really no reason to have DR at all. By architecting your systems to balance across multiple AWS regions and availability zones, designing your architecture and applications to handle unpredictable and cascading failure, and scaling automatically and elastically to meet increases and decreases in demand, you can effectively eliminate the need for DR. Your data and infrastructure are distributed in a way that is highly available and impervious to failure or spikes/drops in demand. So in addition to inherent DR, you are getting HA and true capacity-on-demand. The whole concept of a disaster taking down a data center and the subsequent effects on your systems, applications, and users becomes irrelevant. It may take a bit of work to design (or redesign) an application to this new cloud geo-distributed model, but I assure you that in terms of business continuity, reduced TCO, scalability, and uptime it will pay off in spades.
That ought to put the proverbial nail in the coffin. RIP.
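The redundancy argument can be made a little more concrete with some back-of-the-envelope arithmetic. The Python sketch below computes the chance that at least one of several sites is up, under the simplifying assumption that sites fail independently of one another (real regions and availability zones only approximate this). The 99% per-site figure is purely hypothetical and is not an AWS number.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up."""
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    return 1.0 - (1.0 - site_availability) ** sites


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year at the given availability."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    # Hypothetical per-site availability of 99%, spread across 1 to 3 sites.
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} site(s): availability {a:.6f}, "
              f"about {downtime_hours_per_year(a):.2f} hours of downtime per year")
```

Under these assumptions each additional independent site multiplies the remaining downtime by the per-site failure probability, which is why spreading an application across zones and regions can do the job a separate DR site used to do.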
-Ryan Kennedy, Senior Cloud Engineer