Disaster Recovery in AWS

By Scott Ross

In this post we cover considerations a unit implementing AWS should be aware of when moving into the cloud.  Developing a disaster recovery (DR) plan for the cloud encompasses the same considerations as an on-premise DR plan, but there are differences between on-premise and cloud environments.  These differences usually manifest themselves in questions like who is responsible, how a unit implements a DR solution, and how one’s mindset changes when thinking about DR in the cloud.

It is true that DR in the cloud is more robust than on campus (another blog post…), but with this power comes more responsibility.  More tools are at the disposal of your development/operations team, and those tools are more flexible and dynamic. There are more architectural patterns, and some of those patterns differ dramatically from on-premise solutions (with which we have a long history).  And, of course, you are now dependent on more factors: networking, Amazon itself, and many, many others.

Let’s start by reintroducing what disaster recovery is, and what it is not.

Disaster Recovery (DR) is defined as the processes, policies, and procedures related to preparing for the recovery or continuation of technology infrastructure that is vital to an organization after a natural or human-induced crisis.

DR is not backups.
DR complements other high availability (HA) services, but while HA deals with disaster prevention, DR is for those times when prevention has failed.

Disaster Recovery planning is always a trade-off between Recovery Time Objective (RTO) and Recovery Point Objective (RPO) vs. cost/complexity.  At a very basic level, DR cannot be done without input from your business partners – they will need to be aware of Risks/Rewards, and thus will drive appropriate IT strategy for disaster recovery (in another post, we will discuss ways you can categorize your services for DR).

RTO, RPO and Data Replication are key conceptual terms when dealing with DR.  Here is a brief introduction:

Recovery Point Objective (RPO) – The maximum targeted period in which data might be lost from an IT service due to a major incident.

Recovery Time Objective (RTO) – The targeted duration of time, and the service level, within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.

Data Replication – sharing information so as to ensure consistency between redundant resources.
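
To make RPO and RTO concrete, here is a minimal sketch in Python.  The backup interval, the recovery step names, and all durations are made-up numbers for illustration, not measurements from any real system.  It assumes periodic backups (so the worst-case data loss is one full backup interval) and recovery steps that run one after another.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval):
    """With periodic backups, a failure just before the next backup
    loses up to one full interval of data."""
    return backup_interval


def estimated_recovery_time(step_durations):
    """Sum the durations of sequential recovery steps."""
    return sum(step_durations.values(), timedelta(0))


def meets_objectives(backup_interval, step_durations, rpo, rto):
    """True if the worst-case data loss fits within the RPO and the
    estimated recovery time fits within the RTO."""
    return (
        worst_case_data_loss(backup_interval) <= rpo
        and estimated_recovery_time(step_durations) <= rto
    )


# Hypothetical recovery steps for a service that is backed up nightly.
example_steps = {
    "retrieve backups": timedelta(minutes=15),
    "bring up infrastructure": timedelta(minutes=45),
    "restore and switch over": timedelta(minutes=30),
}
```

With these example numbers, nightly backups satisfy a 24 hour RPO but not a 1 hour RPO, which is the kind of gap your business partners need to weigh against cost.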

 

AWS Basics 

  • AWS Regions
    • 9 Regions worldwide (see map)
    • Completely independent with different teams and infrastructure
  • AWS Availability Zones (AZ)
    • Each region contains 1 or more AZs
    • Physically separated, but in the same geographical location
    • Shared teams and software infrastructure
  • US-EAST-1 (region)
    • Located in Virginia
    • 5 availability zones (as of October 2016)
  • AWS uses dynamic resource allocation
    • Pay for resources as you go
    • Create/Destroy resources quickly using dashboard or various CLI tools

Example AWS Disasters

  • Single-Resource Disaster

What is it?  A single resource (EBS, ELB, EC2, ...) stops functioning.

How to prepare?  Make sure that no single resource is a point of failure.  Use AMIs and autoscaling to help you make stateless instances (or use Docker!).  Configure RAIDs for volumes.  Use AWS managed services (like RDS).

  • Single-AZ Disaster

What is it?  A whole AZ goes down.

How to prepare?  Build your system so that it’s spread across multiple AZs and can survive downtime in any single AZ.  Connect subnets in different AZs to your ELB and turn on multi-AZ for RDS.
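
One way to check the "spread across multiple AZs" rule is sketched below in Python.  It assumes each instance is a dictionary shaped like the entries boto3's EC2 describe_instances returns, where the AZ sits under Placement / AvailabilityZone.  The function names and the min_remaining threshold are hypothetical, chosen for illustration.

```python
from collections import Counter


def az_distribution(instances):
    """Count instances per availability zone."""
    return Counter(i["Placement"]["AvailabilityZone"] for i in instances)


def survives_single_az_loss(instances, min_remaining=1):
    """True if at least `min_remaining` instances are left after
    losing any one AZ. An empty fleet never passes."""
    counts = az_distribution(instances)
    if not counts:
        return False
    total = sum(counts.values())
    return all(total - n >= min_remaining for n in counts.values())
```

A fleet with every instance in one AZ fails this check, while the same number of instances split across two AZs passes.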

  • Single-service Disaster

What is it?  A whole AWS service (such as RDS) goes down across the entire region.

How to prepare?  First, resist the urge to use AWS resources for everything (but remember this is a cost/complexity question as well).  Be ready to recreate your system in a different region (by using automation).  Invest in the upfront preparation to do this (VPC setup, networking, etc., which need to be in place before you can turn on a region).

  • Whole-Region disaster

What is it?  A whole AWS region goes down, taking all the applications running in it with it.

How to prepare? Implement a cross-region DR methodology.  Take snapshots of your instances and copy them to different regions.  Use automation tools (CloudFormation, Ansible, Chef, etc.) to define your application stack.  Copy AMIs to a different region.  Understand the AWS services you are using, and their ability to be used in different regions (RDS, for example).
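
The snapshot and AMI copy steps could be scripted roughly as follows.  This is a sketch, not a complete tool: it assumes dest_ec2 is a boto3 EC2 client created for the recovery region (for example boto3.client("ec2", region_name="us-west-2")), and the function name, description text, and AMI naming scheme are hypothetical.  It uses the EC2 copy_snapshot and copy_image calls, which are issued against the destination region and pull from the source region.

```python
def copy_backups_to_region(dest_ec2, source_region, snapshot_ids, image_ids):
    """Copy EBS snapshots and AMIs from source_region into the region
    that dest_ec2 points at. Returns a mapping of old IDs to new IDs."""
    copied = {"snapshots": {}, "images": {}}

    for snap_id in snapshot_ids:
        resp = dest_ec2.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=snap_id,
            Description=f"DR copy of {snap_id} from {source_region}",
        )
        copied["snapshots"][snap_id] = resp["SnapshotId"]

    for image_id in image_ids:
        resp = dest_ec2.copy_image(
            Name=f"dr-copy-{image_id}",  # hypothetical naming scheme
            SourceImageId=image_id,
            SourceRegion=source_region,
        )
        copied["images"][image_id] = resp["ImageId"]

    return copied
```

Passing the client in as a parameter keeps the function easy to exercise in a drill or a test without touching a real account.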

Disaster Recovery Patterns 

  • Backup and Restore
    • Advantage
      • Simple to get started
      • Very cost effective
    • Disadvantage
      • Likely not real time
      • Downtime will occur
    • Preparation Phase
      • Take backups of current systems (and schedule these backups)
      • Use S3 as the storage mechanism
      • Describe procedure to restore
    • Process…
      • Retrieve backups from S3
      • Bring up required infrastructure
        • EC2 instances with prepared AMIs, load balancing, etc.
      • Restore system from backup
      • Switch over to the new system
    • Objectives…
      • RTO: As long as it takes to bring up infrastructure and restore system from backups
      • RPO: From last backup
  • “Pilot Light” in different regions
    • Advantages:
      • Reduces RTO and RPO
      • Resources just need to be turned on
    • Disadvantage:
      • Cost
    • Preparation
      • Enable replication of data across regions
      • Automation of services in backup region
      • Switch over DNS to other region when downtime occurs
  • Fully working low capacity standbys
    • Advantages:
      • Reduces RTO and RPO
      • Resources just need to be turned on
      • Can handle production traffic
    • Disadvantage:
      • Cost
      • All necessary components are running 24/7
    • Preparation
      • Enable replication of data across regions
      • Automation of services in backup region
      • Switch over DNS to other region when downtime occurs
  • Multi-Site hot backup
    • Same as the low capacity standby, but fully scaled
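
The "switch over DNS" step in the patterns above can be prepared ahead of time.  The sketch below assumes DNS is hosted in Amazon Route 53, which this post does not state, and builds the change batch that the Route 53 change_resource_record_sets call expects.  The function name, record name, target hostname, and the 60 second TTL are hypothetical.

```python
def build_failover_change_batch(record_name, standby_target, ttl=60):
    """Build a Route 53 change batch that points record_name at the
    standby region's endpoint via a CNAME record."""
    return {
        "Comment": f"DR failover: point {record_name} at {standby_target}",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record or overwrite it
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": standby_target}],
                },
            }
        ],
    }


# Usage, assuming a boto3 Route 53 client and a hosted zone ID of your own:
#   route53.change_resource_record_sets(
#       HostedZoneId=zone_id,
#       ChangeBatch=build_failover_change_batch(
#           "app.example.edu.", "standby-elb.example.com"),
#   )
```

A short TTL matters here: clients keep using the cached record until it expires, so the TTL adds directly to your effective RTO.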

Tooling and Consistency

There is lots of tooling in the industry to help implement HA and DR.  Here, we are going to cover some of the tools used on campus to build infrastructure as code and the automation that’s critical to successful DR plans.  People on campus know these tools:

  • AWS CloudFormation
  • Configuration Management: Ansible, Chef, Puppet
  • Continuous Integration/deployment: Jenkins
  • Containerization: Docker
  • Preferred Languages / AWS SDK: Ruby, Python

For more, see https://confluence.cornell.edu/display/CLOUD/Cloud+Service+Matrix

AWS services DR checklists

Finally, these are some basic checklists that might help you as you discover the best strategies to implement DR.  

RDS DR Check List:

  • Set up production instances in a multi-AZ architecture
  • Enable RDS daily backups
  • Create a cross-region read replica in the recovery region of your choice (remember to consider your application in this! Data, without an app, doesn’t work!)
  • Maintain DB settings in both regions (better yet – script those settings! Use infrastructure as code!)
  • Make sure you have the appropriate alerting / monitoring set up, especially to monitor sync errors between the regions
  • Schedule drills!  Create failovers!  Practice!
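
The cross-region read replica and the failover drill from this checklist could be scripted along these lines.  This is a sketch under assumptions: dest_rds is a boto3 RDS client created for the recovery region, and the function names, account ID, and instance identifiers are hypothetical.  For a cross-region replica, the source instance is identified by its ARN.  Encrypted sources need extra parameters (such as a KMS key in the destination region) that are not shown here.

```python
def source_db_arn(region, account_id, db_identifier):
    """ARN of the source DB instance, required for cross-region replicas."""
    return f"arn:aws:rds:{region}:{account_id}:db:{db_identifier}"


def create_cross_region_replica(dest_rds, source_region, account_id,
                                source_db, replica_id):
    """Create a read replica in the region dest_rds points at."""
    return dest_rds.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_db_arn(
            source_region, account_id, source_db),
    )


def promote_replica(dest_rds, replica_id):
    """Promote the replica to a standalone, writable instance.
    This breaks replication, so in a drill use a throwaway replica."""
    return dest_rds.promote_read_replica(DBInstanceIdentifier=replica_id)
```

Keeping these steps in a script means the drill exercises the same code path you would run during a real regional outage.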

Basic Checklist:

  • Design for failure
  • Place services in multiple availability zones
  • Have a de-coupled architecture, to reduce single points of failure
  • Use load balancers, autoscaling, and health monitoring rules for HA
  • Use CloudFormation templates
  • Maintain up-to-date AMIs
  • Back up critical data sets to S3
  • Run “warm” servers in other regions for recovery
  • Keep a written plan
  • Test that plan often!

Some Takeaways – 

  1. Design DR (& HA) into your architecture when moving to the cloud. You can “lift and shift”, but then you are likely not taking advantage of the cloud and are still susceptible to the same disaster risks as on premise.
  2. Take advantage of what AWS offers you and use those as your building blocks.
  3. However, understand the impact of relying on these services and building blocks. Don’t treat AWS as a ‘black box’.
  4. Exercise your DR solution!!!! (that’s one of the many benefits of the cloud; we can do this now)