Overview

This article outlines ACME's backup and recovery practices for each type of data storage resource.




Backup and Recovery by Data Storage Resource Type

Workstations

ACME developers use company-provided, secured laptops to write code and create configurations. Files on each workstation are backed up regularly and stored encrypted. To further protect the confidentiality, integrity, and availability of data, all application code and configuration must be stored in a code repository system (GitHub [1]) at all times. All documents created on workstations that relate to daily operations are stored in Google Drive [2] in the cloud. ACME also requires a review and approval process both for checking in production code and for updating documents on Google Drive [2].


AWS S3 buckets

All static assets and documents related to the ACME applications are securely stored in AWS S3 buckets. Amazon S3 provides a highly durable storage infrastructure designed for mission-critical and primary data storage. Objects are stored redundantly on multiple devices across multiple facilities in an Amazon S3 Region: copies of all uploaded data are kept on at least three devices within a single AWS Region for failover. The service is designed for 99.999999999% data durability and 99.99% availability of objects over a given year, and to sustain the concurrent loss of data in two facilities. Accidental loss of data is prevented by Access Control Lists, no-deletion resource rules, and versioning.
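To put the durability figure in perspective, the following sketch converts an annual durability percentage into an expected number of objects lost per year. It is illustrative only: the function name and the 10,000,000-object example are assumptions rather than ACME figures, and the model simply treats durability as an independent per-object annual survival probability.

```python
from decimal import Decimal

def expected_annual_loss(object_count: int, durability_percent: str) -> Decimal:
    """Expected number of objects lost per year at a given annual durability.

    durability_percent is passed as a string (e.g. "99.999999999") so that
    Decimal keeps every digit exactly instead of rounding through a float.
    """
    loss_probability = (Decimal(100) - Decimal(durability_percent)) / Decimal(100)
    return object_count * loss_probability

# Hypothetical example: 10,000,000 objects at the 99.999999999% design target.
loss = expected_annual_loss(10_000_000, "99.999999999")

# loss is 0.0001 objects per year, i.e. about one object every 10,000 years.
years_per_lost_object = 1 / loss
```

Passing the percentage as a string avoids the floating-point rounding that would otherwise swamp a value this close to 100.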


Database servers

ACME uses two types of databases in the data layer: a MySQL relational database management system (RDBMS) and a MongoDB NoSQL data store.


The production MySQL RDBMS uses a primary-active secondary replication setup, which enables quick recovery if the primary database fails for any reason. ACME also takes EBS snapshots of the data storage volumes every 6 hours; the snapshots are stored with 99.999999999% data durability. Figure 1 shows how the MySQL primary-active secondary replication strategy works.


Figure 1 MySQL Primary-Active Secondary Replication
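The 6-hour snapshot cadence bounds how stale a snapshot-only restore can be. The sketch below assumes, purely for illustration, that snapshots are taken on fixed 6-hour boundaries counted from midnight UTC; the actual schedule is not stated in this article, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

SNAPSHOT_INTERVAL = timedelta(hours=6)  # cadence stated in this article

def latest_snapshot_before(failure_time: datetime,
                           interval: timedelta = SNAPSHOT_INTERVAL) -> datetime:
    """Return the most recent snapshot boundary at or before failure_time.

    Assumes snapshots fire on fixed boundaries counted from midnight UTC
    (an illustrative assumption, not ACME's documented schedule).
    """
    failure_utc = failure_time.astimezone(timezone.utc)
    midnight = failure_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    # Whole intervals elapsed since midnight (timedelta // timedelta -> int).
    completed_intervals = (failure_utc - midnight) // interval
    return midnight + completed_intervals * interval

# Hypothetical failure at 14:45 UTC: the newest snapshot is from 12:00 UTC,
# so a snapshot-only restore would be 2 h 45 min behind.
failure = datetime(2024, 1, 15, 14, 45, tzinfo=timezone.utc)
snapshot = latest_snapshot_before(failure)
staleness = failure - snapshot
```

Under this assumption, a snapshot-only restore is never more than one interval (just under 6 hours) behind the failure.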


The production MongoDB NoSQL database runs in a High Availability (HA) setup: a replica set with one primary and three secondary members. This provides 99.99%+ uptime and availability of the data. If the primary fails, one of the secondaries is automatically elected as the new primary with no downtime. ACME also takes EBS snapshots of the data volumes every 6 hours. Figure 2 shows how the HA MongoDB cluster works.


Figure 2 MongoDB Automatic Failover Setup
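Automatic failover depends on a majority of voting members being reachable to elect a new primary. As a rough sketch, and assuming that all four members described above (one primary and three secondaries) hold a vote, the helpers below compute the majority required and how many members can be unavailable while an election can still succeed. The function names are illustrative.

```python
def majority_required(voting_members: int) -> int:
    """Votes needed to elect a primary: a strict majority of voting members."""
    if voting_members < 1:
        raise ValueError("a replica set needs at least one voting member")
    return voting_members // 2 + 1

def fault_tolerance(voting_members: int) -> int:
    """Members that can be unavailable while a primary can still be elected."""
    return voting_members - majority_required(voting_members)

# One primary plus three secondaries, all assumed to be voting members.
members = 1 + 3
needed = majority_required(members)    # 3 votes
tolerated = fault_tolerance(members)   # 1 member
```

Under that assumption, four voting members tolerate the loss of one member, the same as a three-member set would.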



In addition to HA replication and point-in-time snapshots for both database types, further recovery safeguards for failover and failback are in place at the network layer of the physical infrastructure.



Key Data Recovery Metrics

In addition to implementing industry best practices for data backup and recovery, ACME has defined key data recovery metrics that it commits to in order to deliver an excellent end-user experience. The tables below detail the values guaranteed by ACME. The two metrics are:

  1. Recovery Time Objective (RTO): the amount of time it takes to recover normal business operations after an outage.
  2. Recovery Point Objective (RPO): the amount of data, measured in time, that can be lost in a disaster.
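The two definitions can be made concrete with timestamps from an incident. The sketch below is illustrative only: the `Incident` class, its field names, and the example times are hypothetical, and the objectives passed in simply mirror the MySQL column of Table 2.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    last_recoverable_point: datetime  # newest data that survived the outage
    outage_start: datetime            # when normal operations stopped
    service_restored: datetime        # when normal operations resumed

    def achieved_rpo(self) -> timedelta:
        """Data-loss window: last recoverable point up to the outage."""
        return self.outage_start - self.last_recoverable_point

    def achieved_rto(self) -> timedelta:
        """Downtime: start of the outage up to restoration of service."""
        return self.service_restored - self.outage_start

    def meets(self, rto: timedelta, rpo: timedelta) -> bool:
        """True when both measured values are within the given objectives."""
        return self.achieved_rto() <= rto and self.achieved_rpo() <= rpo

# Hypothetical incident: the replica was 0.4 s behind, recovery took 12 minutes.
incident = Incident(
    last_recoverable_point=datetime(2024, 1, 15, 9, 59, 59, 600_000),
    outage_start=datetime(2024, 1, 15, 10, 0, 0),
    service_restored=datetime(2024, 1, 15, 10, 12, 0),
)
ok = incident.meets(rto=timedelta(minutes=15), rpo=timedelta(seconds=1))
```

Keeping the two measurements separate makes it clear that an incident can meet one objective while missing the other.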


Table 1 Availability Objectives

Metric | Google Drive | S3 buckets | MySQL | MongoDB
Availability | >99.9% [3] | 99.99% [4] | >99.5% [5] | >99.99% [6]
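An availability percentage translates directly into a yearly downtime budget. The sketch below does that conversion for the objectives in Table 1, assuming a 365-day year; the function name is illustrative. Where Table 1 gives a lower bound on availability, the computed value is an upper bound on downtime.

```python
from datetime import timedelta
from fractions import Fraction

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year

def yearly_downtime_budget(availability_percent: str) -> timedelta:
    """Maximum downtime per year allowed by an availability target.

    The percentage is passed as a string so Fraction parses it exactly.
    """
    unavailable = 1 - Fraction(availability_percent) / 100
    microseconds = round(unavailable * SECONDS_PER_YEAR * 1_000_000)
    return timedelta(microseconds=microseconds)

# Objectives from Table 1.
budgets = {
    "Google Drive": yearly_downtime_budget("99.9"),   # 8 h 45 min 36 s
    "S3 buckets": yearly_downtime_budget("99.99"),    # 52 min 33.6 s
    "MySQL": yearly_downtime_budget("99.5"),          # 1 day 19 h 48 min
    "MongoDB": yearly_downtime_budget("99.99"),       # 52 min 33.6 s
}
```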


Given the availability objectives in Table 1, the corresponding RTO and RPO are provided in Table 2.


Table 2 Key Data Recovery Metrics

Metric | Google Drive | S3 buckets | MySQL | MongoDB
Recovery Time Objective (RTO) | <5 sec | <1 sec | ~15 min | <1 sec
Recovery Point Objective (RPO) | <1 sec | <1 sec | <1 sec | <1 sec



List of Acronyms

Table 3 List of Acronyms

Acronym | Meaning
HA | High Availability
RTO | Recovery Time Objective
RPO | Recovery Point Objective

[1] GitHub is SOC 1, SOC 2, and GDPR compliant and is an approved FedRAMP LI-SaaS provider.

[2] Google is SOC 2, SOC 3, ISO 27001, ISO 27017, ISO 27018, FedRAMP, FISC, PCI-DSS, HIPAA, and GDPR compliant.

[3] The Google SLA document can be found here: https://gsuite.google.com/terms/sla.html

[4] The AWS SLA document can be found here: https://aws.amazon.com/s3/sla/ and the objective document can be found here: https://docs.aws.amazon.com/AmazonS3/latest/dev/DataDurability.html

[5] Internally maintained system. This is ACME's availability objective.

[6] Internally maintained system. This is ACME's availability objective.