Updated: March 25, 2021
Disaster Recovery (DR) is the documented and tested organizational strategy and set of procedures designed to enable the resumption of critical business operations and the recovery of IT infrastructure (including data, software, and hardware systems) following a disruptive event. These events can range from natural disasters and cyberattacks to technical failures and human errors.
Purpose: The primary purpose of this Disaster Recovery Plan (DRP) is to ensure business continuity and facilitate a swift and orderly recovery from any significant disruption, minimizing downtime and data loss to maintain operational resilience.
Scope: This DRP encompasses all critical components of the EviSmart SaaS platform that are essential for delivering services to our customers.
Recovery Time Objective (RTO): To restore critical SaaS services and functionalities within a predefined timeframe.
Recovery Point Objective (RPO): To recover data to a point in time that minimizes data loss.
Minimize Downtime: To reduce the duration of service unavailability during and after a disruptive event, ensuring minimal impact on customers.
Minimize Data Loss: To protect the integrity and availability of critical data and ensure its recovery to the most recent possible point in time.
Ensure Customer Trust: To maintain customer confidence in the reliability and resilience of our services by demonstrating a robust recovery capability.
Maintain Regulatory Compliance: To adhere to all relevant industry regulations and compliance requirements related to data protection and business continuity.
This Disaster Recovery Plan (DRP) applies to all systems, services, and data that are integral to providing our SaaS offering to customers, including:
Cloud-based Applications: The core software applications delivered as a service.
Hosting Infrastructure: The underlying cloud environment where our applications and data reside (e.g., specific Azure regions and services).
Customer Support Channels: The communication methods used to support our customers during normal operations and during a disaster event.
Data Backups and Recovery Processes: The procedures and technologies used to create and restore data backups.
Network Connectivity: The network infrastructure required for internal operations and customer access.
High Availability is a proactive strategy focused on preventing downtime by building redundancy and fault tolerance into our systems. In our cloud environment, HA is achieved through:
Clustering: Critical application components and databases are deployed across multiple servers (nodes) that operate as a single logical unit. This ensures that if one server fails, another automatically takes over, maintaining service availability.
Shared Storage: In many cases, clustered servers access the same shared storage for critical data. This allows any active node to access the latest data, ensuring seamless failover.
Active-Passive Configuration: Our infrastructure operates primarily in an Active-Passive mode. One node actively handles all incoming requests, while one or more standby nodes are continuously monitored and kept ready to take over if the active node fails. This minimizes the time required for failover.
Load Balancing: Traffic is distributed across multiple active servers to prevent any single server from being overwhelmed and to improve overall performance and availability.
Failover Mechanisms: Currently, our systems rely on manual intervention to initiate failover to standby nodes upon failure detection. However, we are actively developing and implementing automated failover capabilities to eliminate the need for manual intervention in the future.
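The Active-Passive arrangement and failure detection described above can be sketched as follows. This is a minimal illustration, not our production tooling: the `Node` and `ActivePassiveMonitor` names, the threshold of three consecutive failed health checks, and the node names are all assumptions made for the example. Detection only raises a flag; promotion is a separate step, which mirrors the current manual failover process.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str            # hypothetical node identifier
    healthy: bool = True


class ActivePassiveMonitor:
    """Tracks consecutive failed health checks on the active node."""

    def __init__(self, active, standbys, failure_threshold=3):
        self.active = active
        self.standbys = list(standbys)
        self.failure_threshold = failure_threshold  # assumed value, not from the plan
        self.consecutive_failures = 0

    def record_check(self, ok):
        """Record one health-check result; return True once failover is warranted."""
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failure_threshold

    def promote_standby(self):
        """Promote the first healthy standby to active.

        Under a manual process an operator triggers this step;
        an automated process would call it when record_check returns True.
        """
        for i, node in enumerate(self.standbys):
            if node.healthy:
                failed = self.active
                failed.healthy = False
                self.active = self.standbys.pop(i)
                self.standbys.append(failed)  # failed node rejoins as an unhealthy standby
                self.consecutive_failures = 0
                return self.active
        raise RuntimeError("no healthy standby available")


# Example: one good check followed by three failures crosses the threshold.
monitor = ActivePassiveMonitor(Node("node-a"), [Node("node-b")])
flags = [monitor.record_check(ok) for ok in (True, False, False, False)]
```

Requiring several consecutive failures before flagging a failover is one common way to avoid reacting to a single transient error; the right threshold depends on the health-check interval and the RTO.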
Disaster Recovery is a reactive strategy designed to restore our services and data in the event of a major disruption affecting a wider scope than what HA can handle. This includes scenarios like:
Our DR strategy focuses on replicating our environment and data to a secondary location (e.g., a different Azure region) to ensure business continuity. A critical component of our data recovery strategy is our near real-time data backup.
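As an illustration of recovering to the most recent possible point in time, the sketch below picks the latest backup taken at or before a failure. The function name `select_restore_point` and the sample timestamps are hypothetical; the plan does not specify how backups are catalogued, so a plain list of backup times is assumed.

```python
from bisect import bisect_right
from datetime import datetime


def select_restore_point(backup_times, failure_at):
    """Return the most recent backup time that is not later than the failure."""
    ordered = sorted(backup_times)
    idx = bisect_right(ordered, failure_at)  # count of backups taken at or before failure_at
    if idx == 0:
        raise LookupError("no backup predates the failure")
    return ordered[idx - 1]


# Hypothetical backup catalogue and failure time.
backups = [
    datetime(2021, 3, 25, 10, 0),
    datetime(2021, 3, 25, 10, 5),
    datetime(2021, 3, 25, 10, 10),
]
failure = datetime(2021, 3, 25, 10, 7)

restore_point = select_restore_point(backups, failure)
data_loss_window = failure - restore_point  # work performed after the restore point is lost
```

The gap between the failure and the chosen restore point is the data loss actually incurred, which is the quantity the RPO bounds.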
In the event of a disaster:
Channels include:
Frequency: Quarterly DR drills that test the DRP and backup strategies
Metrics: Whether RTO and RPO were met, recovery time, and resolution time
Review: Quarterly and after major incidents
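The drill metrics listed above could be computed from four timestamps recorded during each exercise. The sketch below is one possible way to do so; the `DrillResult` fields, the `evaluate_drill` function, and the RTO and RPO targets of 4 hours and 15 minutes are illustrative assumptions, since this plan does not state specific target values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    outage_start: datetime            # when the simulated disruption began
    service_restored: datetime        # when critical services were available again
    incident_resolved: datetime       # when the drill was fully closed out
    last_recoverable_point: datetime  # timestamp of the data actually restored


def evaluate_drill(result, rto, rpo):
    """Compute recovery time, resolution time, and whether RTO/RPO were met."""
    recovery_time = result.service_restored - result.outage_start
    resolution_time = result.incident_resolved - result.outage_start
    data_loss = result.outage_start - result.last_recoverable_point
    return {
        "recovery_time": recovery_time,
        "resolution_time": resolution_time,
        "data_loss": data_loss,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss <= rpo,
    }


# Hypothetical drill with assumed targets (not values from this plan).
drill = DrillResult(
    outage_start=datetime(2021, 3, 25, 9, 0),
    service_restored=datetime(2021, 3, 25, 11, 30),
    incident_resolved=datetime(2021, 3, 25, 14, 0),
    last_recoverable_point=datetime(2021, 3, 25, 8, 50),
)
report = evaluate_drill(drill, rto=timedelta(hours=4), rpo=timedelta(minutes=15))
```

Recording the same fields for every drill and for every real incident makes results comparable from one quarterly review to the next.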