
Disaster Recovery Plan

Updated: March 25, 2021

1. General Information

Disaster Recovery (DR) is the documented and tested organizational strategy and set of procedures designed to enable the resumption of critical business operations and the recovery of IT infrastructure (including data, software, and hardware systems) following a disruptive event. These events can range from natural disasters and cyberattacks to technical failures and human errors.

Purpose: The primary purpose of this Disaster Recovery Plan (DRP) is to ensure business continuity and facilitate a swift and orderly recovery from any significant disruption, minimizing downtime and data loss to maintain operational resilience.

Scope: This DRP encompasses all critical components of the EviSmart SaaS platform that are essential for delivering services to our customers. This includes, but is not limited to:

  • Application Layer: All software applications and services that constitute the SaaS offering.
  • Infrastructure Layer: The underlying cloud infrastructure (compute, storage, networking) provided by our cloud provider.
  • Data Layer: All customer data, application data, and configuration data.
  • Network Layer: All network components and connectivity required for service delivery.
  • Third-Party Services: Any external services critical to the functionality of our platform.

2. Objectives

Recovery Time Objective (RTO): To restore critical SaaS services and functionalities within the target timeframe defined in this plan (1-2 hours).

Recovery Point Objective (RPO): To recover data to a point in time that limits data loss to the target window defined in this plan (approximately 30 minutes).

Minimize Downtime: To reduce the duration of service unavailability during and after a disruptive event, ensuring minimal impact on customers.

Minimize Data Loss: To protect the integrity and availability of critical data and ensure its recovery to the most recent possible point in time.

Ensure Customer Trust: To maintain customer confidence in the reliability and resilience of our services by demonstrating a robust recovery capability.

Maintain Regulatory Compliance: To adhere to all relevant industry regulations and compliance requirements related to data protection and business continuity.

3. Scope

This Disaster Recovery Plan (DRP) applies to all systems, services, and data that are integral to providing our SaaS offering to customers, including:

Cloud-based Applications: The core software applications delivered as a service.

Hosting Infrastructure: The underlying cloud environment where our applications and data reside (e.g., specific Azure regions and services).

Customer Support Channels: The communication methods used to support our customers during normal operations and during a disaster event.

Data Backups and Recovery Processes: The procedures and technologies used to create and restore data backups.

Network Connectivity: The network infrastructure required for internal operations and customer access.

4. Disaster Types Covered

| Risk Type | Description | Likelihood | Impact |
|---|---|---|---|
| Natural Disaster | Earthquake, flood, fire | Low | High |
| Cyber Attack | Ransomware, DDoS, data breach | High | High |
| Human Error | Accidental deletion or misconfiguration | Medium | Medium |
| System Failure | Hardware/software/network outage | High | High |
| Third-party Failure | Azure service or API downtime | Medium | High |
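One way to turn the likelihood and impact ratings above into a priority order is a simple likelihood-times-impact score. The sketch below is illustrative only: the numeric weights (Low = 1, Medium = 2, High = 3) and the function names are assumptions and are not part of this plan.

```python
# Illustrative sketch: the numeric weights are an assumption, not part of the plan.
LEVEL = {"Low": 1, "Medium": 2, "High": 3}

# (risk type, likelihood, impact) as rated in the table above
RISKS = [
    ("Natural Disaster", "Low", "High"),
    ("Cyber Attack", "High", "High"),
    ("Human Error", "Medium", "Medium"),
    ("System Failure", "High", "High"),
    ("Third-party Failure", "Medium", "High"),
]


def risk_score(likelihood: str, impact: str) -> int:
    """Score a risk as likelihood weight multiplied by impact weight."""
    return LEVEL[likelihood] * LEVEL[impact]


def rank_risks(risks):
    """Return (name, score) pairs, highest score first; ties keep table order."""
    scored = [(name, risk_score(likelihood, impact)) for name, likelihood, impact in risks]
    # sorted() is stable, so equal scores stay in their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_risks(RISKS):
        print(f"{name}: {score}")
```

Under these assumed weights, Cyber Attack and System Failure score highest and Natural Disaster lowest, which matches the ratings in the table.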

5. Key Roles and Responsibilities

| Role | Responsibility |
|---|---|
| Disaster Recovery Lead | Oversees implementation of the DRP |
| DevOps Team | Restores systems and ensures data integrity |
| Customer Success Team | Communicates updates to customers |
| Executive Team | Coordinates high-level response strategy |

6. Recovery Strategy

High Availability (HA)

High Availability is a proactive strategy focused on preventing downtime by building redundancy and fault tolerance into our systems. In our cloud environment, HA is achieved through:

Clustering: Critical application components and databases are deployed across multiple servers (nodes) that operate as a single logical unit. This ensures that if one server fails, another automatically takes over, maintaining service availability.

Shared Storage: In many cases, clustered servers access the same shared storage for critical data. This allows any active node to access the latest data, ensuring seamless failover.

Active-Passive Configuration: Our infrastructure operates primarily in an Active-Passive mode. This means one node is actively handling all incoming requests, while one or more standby nodes are continuously monitored and kept ready to take over as soon as failover is initiated if the active node fails. This minimizes the time required for failover.

Load Balancing: Traffic is distributed across multiple active servers to prevent any single server from being overwhelmed and to improve overall performance and availability.

Failover Mechanisms: Currently, our systems rely on manual intervention to initiate failover to standby nodes upon failure detection. However, we are actively developing and implementing automated failover capabilities to eliminate the need for manual intervention in the future.
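The difference between the current manual failover and the planned automated failover can be sketched as follows. This is a minimal illustration, not our production tooling: the class name, the function name, and the threshold of three consecutive failed health checks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Tracks consecutive failed health checks for the active node."""

    failure_threshold: int = 3  # hypothetical value, not taken from the plan
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one health-check result.

        Returns True once the failure streak reaches the threshold,
        meaning a failover should be recommended.
        """
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def should_fail_over(recommended: bool, operator_approved: bool,
                     automated: bool = False) -> bool:
    """Manual mode requires operator approval; automated mode does not."""
    return recommended and (automated or operator_approved)
```

In manual mode a recommendation alone does nothing until an operator approves it; with `automated=True` the same recommendation triggers failover directly.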

Disaster Recovery (DR)

Disaster Recovery is a reactive strategy designed to restore our services and data in the event of a major disruption affecting a wider scope than what HA can handle. This includes scenarios like:

  • Complete Region Outage: Failure of an entire geographic region provided by our cloud provider.
  • Global Cloud Service Offering/Provider Outages: Widespread issues affecting core services of our cloud provider.
  • Significant Natural Disasters: Events impacting the primary infrastructure location.

Our DR strategy focuses on replicating our environment and data to a secondary location (e.g., a different Azure region) to ensure business continuity. A critical component of our data recovery strategy is our near real-time data backup.

  • Near Real-Time Data Backup: We maintain backups of our active data with a target gap of approximately 30 minutes. This ensures that in the event of a disaster affecting our primary data source, we have a recent and consistent copy of our data available for recovery. This near real-time capability significantly minimizes potential data loss.
  • Backup Data Activation: If the primary data source is compromised or unavailable due to a disaster, our recovery procedures include the ability to switch on the near real-time backup as the new active data source in our secondary environment. This process is designed to be as efficient as possible to meet our RTO targets.

RPO & RTO Definition & Explanation

  • Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time. With our near real-time backups, our target RPO is approximately 30 minutes.
  • Recovery Time Objective (RTO): The maximum acceptable time to restore critical business functions after a disaster. Our target RTO is 1-2 hours.
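
As a worked illustration of these two definitions, the sketch below computes the achieved RPO and RTO for an incident and compares them with the targets. The function names are hypothetical, and the RTO target is taken as 2 hours, the upper bound of the 1-2 hour range, which is an assumption.

```python
from datetime import datetime, timedelta

# Targets from this plan; the RTO uses the upper bound of "1-2 hours" (assumption).
RPO_TARGET = timedelta(minutes=30)
RTO_TARGET = timedelta(hours=2)


def achieved_rpo(last_backup: datetime, incident_start: datetime) -> timedelta:
    """Data-loss window: time between the last good backup and the incident."""
    return incident_start - last_backup


def achieved_rto(incident_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the incident and restoration of service."""
    return service_restored - incident_start


def meets_targets(last_backup: datetime, incident_start: datetime,
                  service_restored: datetime) -> dict:
    """Report whether each objective was met for one incident."""
    return {
        "rpo_met": achieved_rpo(last_backup, incident_start) <= RPO_TARGET,
        "rto_met": achieved_rto(incident_start, service_restored) <= RTO_TARGET,
    }
```

For example, a backup taken at 10:00, an incident at 10:20, and service restored at 11:50 give a 20-minute data-loss window and 90 minutes of downtime, so both targets are met.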

DR Execution Steps:

  1. Detection & Assessment: Continuous monitoring systems and processes are in place to detect potential outages or security breaches. Upon identification of a significant event, the designated DR team will assess the impact and declare a disaster if the predefined criteria are met.
  2. Activation: Once a disaster is declared, the DR team is immediately notified through predefined communication channels. This triggers the initiation of the documented DR protocols and procedures.
  3. Failover: In the event of a regional outage or similar disaster, we will initiate a failover to our standby environment in a secondary Azure region using Azure Traffic Manager. This will redirect user traffic to the operational secondary environment.
  4. Data Recovery: Depending on the nature of the disaster, we will either restore systems from traditional Azure Backup or, if the primary data is affected, activate our near real-time data backup as the primary data source in the secondary environment. Thorough validation will be performed to ensure data integrity and service functionality.
  5. Restoration: Once the failover and data recovery are complete, the DR team will focus on restoring any remaining systems and ensuring full-service functionality in the secondary environment.
  6. Communication: Throughout the disaster recovery process, regular updates will be communicated internally to stakeholders and externally to our customers through the defined communication channels.
  7. Post-Mortem: Following the recovery and stabilization of services, a comprehensive post-mortem analysis will be conducted to identify the root cause of the event, evaluate the effectiveness of the DR plan, and implement any necessary improvements to prevent future occurrences and enhance our recovery capabilities.
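
The branch in step 4 (Data Recovery) can be expressed as a small decision function. This is a sketch under stated assumptions: the enum, the function names, and the single `primary_data_affected` flag are hypothetical simplifications of the assessment described above.

```python
from enum import Enum
from typing import List


class RecoveryPath(Enum):
    RESTORE_FROM_AZURE_BACKUP = "Restore systems from Azure Backup"
    ACTIVATE_NEAR_REAL_TIME_BACKUP = (
        "Activate near real-time backup as the primary data source"
    )


def choose_recovery_path(primary_data_affected: bool) -> RecoveryPath:
    """Pick the data recovery path described in step 4."""
    if primary_data_affected:
        return RecoveryPath.ACTIVATE_NEAR_REAL_TIME_BACKUP
    return RecoveryPath.RESTORE_FROM_AZURE_BACKUP


def build_runbook(primary_data_affected: bool) -> List[str]:
    """List the sequential steps with the step 4 decision filled in.

    Communication (step 6) runs throughout the process, so it is
    noted separately rather than placed in the sequence.
    """
    path = choose_recovery_path(primary_data_affected)
    return [
        "Detection & Assessment",
        "Activation",
        "Failover to the secondary Azure region",
        f"Data Recovery: {path.value}",
        "Restoration",
        "Post-Mortem",
    ]
```

Calling `build_runbook(True)` yields a runbook whose data recovery step activates the near real-time backup; `build_runbook(False)` restores from Azure Backup instead.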

7. Recovery Time & Point Objectives

| EviSmart Module | Backup Frequency | RPO | RTO | Recovery Strategy |
|---|---|---|---|---|
| EviSmart – LMS (Lab Management System) | Every 30 minutes | 30 minutes | < 2 hours | Switch to Azure Backup; restore from Azure Backup |
| EviSmart CAD | Hourly | 30 minutes | < 2 hours | Switch to Azure Backup; restore from Azure Backup |
| EviSmart Core, Case Downloader, QC, Case Entry | Hourly | 30 minutes | < 2 hours | Restore from Azure Backup |

8. Customer Communication Plan

In the event of a disaster:

  1. Initial notification within 30 minutes of event detection.  
  2. Regular updates every 1–2 hours or as necessary.
  3. Final incident summary report within 72 hours after recovery.
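
The timeline above can be turned into concrete deadlines for a given incident. The sketch below is illustrative: the function name is hypothetical, updates are scheduled at the 2-hour upper bound of the "1–2 hours" range, and the first update is counted from the initial notification deadline; both choices are assumptions.

```python
from datetime import datetime, timedelta

INITIAL_NOTICE_WINDOW = timedelta(minutes=30)
UPDATE_INTERVAL = timedelta(hours=2)  # upper bound of "1-2 hours" (assumption)
SUMMARY_WINDOW = timedelta(hours=72)


def communication_deadlines(detected_at: datetime, recovered_at: datetime) -> dict:
    """Compute customer communication deadlines for one incident."""
    initial_due = detected_at + INITIAL_NOTICE_WINDOW

    # Schedule regular updates until service is recovered.
    updates = []
    next_update = initial_due + UPDATE_INTERVAL
    while next_update < recovered_at:
        updates.append(next_update)
        next_update += UPDATE_INTERVAL

    return {
        "initial_notice_due": initial_due,
        "updates_due": updates,
        "final_summary_due": recovered_at + SUMMARY_WINDOW,
    }
```

For an incident detected at 09:00 and recovered at 14:00 the same day, this gives an initial notice due by 09:30, updates due at 11:30 and 13:30, and a final summary due by 14:00 three days later.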

Channels include:

  • Email
  • Customer support phone hotline

9. Test and Maintenance

Frequency: Quarterly DR drills that test the DRP and backup strategies

Metrics: RTO/RPO met, recovery time, resolution time

Review: Quarterly and after major incidents

10. Contact Us