🚨 Disaster Recovery Plan

Propeter's enterprise-grade DR plan defines how we detect, respond to, and recover from disasters — ensuring your hotel's revenue engine stays online.

Version: v3.0
Effective Date: 1 January 2026
Owner: CTO + CISO
Review Cycle: Annual
Standard RPO: 1 hour
Standard RTO: 4 hours

Section 01

Purpose & Scope

This Disaster Recovery Plan (DRP) defines Propeter's preparedness, response, and recovery procedures for significant service disruptions. It exists to ensure that, in the event of a disaster, Propeter's engineering teams have a clear, rehearsed playbook — and that hotel clients can trust their revenue engine will be restored within committed timeframes.

Scope

  • All production systems and services comprising the Propeter platform across all AWS regions (ap-south-1, eu-west-1, ap-southeast-2, and their respective DR regions)
  • All sub-processor integrations where a failure would impact client service delivery
  • Personnel and communication procedures for all disaster scenarios covered in Section 2

Out of Scope

  • Development and staging environments (separate runbooks apply)
  • Disasters affecting only individual client configurations (handled via standard support P1 process)
  • Business continuity planning beyond IT systems (office access, physical security) — covered in separate BCP document

Relationship to Other Documents: This DRP works in conjunction with the Backup & Data Retention Policy (source of truth for RPO/RTO data) and the Uptime SLA (client-facing commitments that this DRP is designed to fulfil).

Section 02

Disaster Scenarios Covered

This DRP covers the following six categories of disaster. Section 4 sets out step-by-step recovery procedures for the AWS regional outage, database corruption, DDoS, and sub-processor outage scenarios.

🏢 AWS Regional Outage

Complete or partial failure of an AWS region (e.g., ap-south-1 Mumbai). Triggers full DR failover to designated secondary region. Highest severity, most complex recovery.

🗃️ Database Corruption / Deletion

Accidental or malicious corruption, truncation, or deletion of PostgreSQL data. Recovery via Point-in-Time Recovery (PITR) or snapshot restore. May be caused by bug, operational error, or attack.

🔒 Ransomware Attack

Malicious encryption of production systems by an attacker. Requires containment, eradication, and restore from clean, pre-attack backups. Involves law enforcement notification and forensic investigation.

🌊 DDoS Attack

Volumetric or application-layer Distributed Denial of Service attack targeting Propeter's public endpoints. Primary mitigation is AWS Shield + WAF; DR failover used if mitigation insufficient.

👤 Key Personnel Unavailability

Sudden, unplanned unavailability of critical staff (CTO, CISO, Infrastructure Lead) due to illness, accident, or resignation during an active incident. Succession plan and documented runbooks cover this scenario.

🔌 Sub-Processor Outage

Outage affecting a critical sub-processor: Twilio (SMS notifications), SendGrid (email), Firebase (mobile push), Windcave/Qvalent (payment tokenisation). Each has documented fallback or degraded-mode operation.

Section 03

DR Team & Escalation

Incident Commander: Chief Technology Officer (CTO)
Overall disaster response authority
Makes final decisions on DR invocation, resource allocation, and client communication timing. Authorises RTO/RPO exception requests. Chairs post-incident review.

Technical Lead: Lead Infrastructure Engineer
Hands-on recovery execution
Executes technical recovery procedures. Coordinates with AWS Support if needed. Provides status updates to Incident Commander every 30 minutes during active incident.

Communications Lead: Chief Executive Officer (CEO)
External communications authority
Approves all client-facing communications. Manages strategic partner and media communications. Coordinates with Customer Success team on individual client outreach for Enterprise clients.

Security Lead: Chief Information Security Officer (CISO)
Security assessment and breach determination
Determines if incident constitutes a security breach requiring regulatory notification. Manages forensic evidence preservation. Coordinates with legal counsel. Leads post-incident security review.

Escalation Matrix

Severity | Detection | Initial Response SLA | Escalation Path | Client Communication
P1 - Critical | Automated CloudWatch + PagerDuty page | 15 minutes | On-call Engineer → Infrastructure Lead (30 min) → CTO + CISO (1 hour) | Status page + proactive email within 1 hour
P2 - High | Automated alert + manual report | 1 hour | On-call Engineer → Infrastructure Lead (2 hours) | Status page update within 2 hours
P3 - Medium | Monitoring alert or client report | 4 hours | Support Engineer → Infrastructure Engineer (8 hours) | Status page if >20% clients affected
P4 - Low | Client report or internal discovery | 1 business day | Support ticket queue → Engineering team | In-app notice only if broadly relevant
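
The escalation paths in the matrix above can be expressed as data, which makes it easy to answer "who should be engaged by now?" during an incident. The sketch below is illustrative only: the names `EscalationStep`, `ESCALATION`, and `roles_engaged` are hypothetical, and it assumes the times in parentheses are measured from detection. P4 is omitted because the matrix gives it no timed escalation.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationStep:
    role: str
    after: timedelta  # assumed: time since detection at which this role is engaged


# Paths transcribed from the escalation matrix. The first responder is
# engaged at detection (T+0); later roles join at the stated offsets.
ESCALATION = {
    "P1": [
        EscalationStep("On-call Engineer", timedelta(0)),
        EscalationStep("Infrastructure Lead", timedelta(minutes=30)),
        EscalationStep("CTO + CISO", timedelta(hours=1)),
    ],
    "P2": [
        EscalationStep("On-call Engineer", timedelta(0)),
        EscalationStep("Infrastructure Lead", timedelta(hours=2)),
    ],
    "P3": [
        EscalationStep("Support Engineer", timedelta(0)),
        EscalationStep("Infrastructure Engineer", timedelta(hours=8)),
    ],
}


def roles_engaged(severity: str, elapsed: timedelta) -> list[str]:
    """Return every role that should be engaged `elapsed` after detection."""
    return [step.role for step in ESCALATION[severity] if elapsed >= step.after]
```

For example, `roles_engaged("P1", timedelta(minutes=45))` returns the on-call engineer and the Infrastructure Lead, but not yet the CTO and CISO.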
Section 04

Recovery Procedures

🏢
Scenario A: AWS Regional Outage
P1 — Critical

Complete loss of primary AWS region. RTO target: 4 hours (Tier 1: 1 hour). RPO: 1 hour (Tier 1: 15 minutes).

  1. Detection (T+0, automated)
    CloudWatch Synthetic Monitoring detects failed health checks across all primary region endpoints. GuardDuty and Route 53 health checks confirm. PagerDuty pages on-call Infrastructure Engineer immediately. AWS Service Health Dashboard checked for regional incident declaration.
  2. Assessment (T+10 min)
    On-call engineer confirms: regional failure vs. application-level failure vs. network issue. Attempts to reach primary region control plane via alternative network path. Checks AWS Personal Health Dashboard for confirmed regional event. Documents assessment outcome in #dr-incidents Slack channel.
  3. DR Decision (T+15 min)
    Incident Commander (CTO) makes GO/NO-GO decision to invoke DR failover. If primary region unavailable for >15 minutes with no AWS ETA: GO decision. If AWS ETA <1 hour: monitor and reassess. GO decision triggers immediate notification to entire DR team.
  4. Failover Execution (T+15 to T+75 min)
    Infrastructure Lead executes DR runbook:
      (1) Promote RDS read replica in DR region to standalone primary.
      (2) Update application configuration in DR region to point to new DB endpoint.
      (3) Scale up ECS tasks in DR region (Terraform apply from pre-prepared DR config).
      (4) Update Route 53 DNS records to DR region load balancer (TTL pre-set to 60 seconds for fast propagation).
      (5) Verify application health via DR region endpoints, testing all critical paths.
      (6) Confirm rate engine is processing correctly.
  5. Client Communication (T+60 min)
    CEO approves P1 communication template. Status page (status.propeter.com) updated with incident status, affected regions, and ETA. Proactive email sent to all affected clients via SendGrid (or Twilio SMS if SendGrid affected). Enterprise clients receive direct call from Customer Success Manager.
  6. DR Operation & Primary Restoration (T+4 hrs onward)
    Platform operates on DR region. Propeter monitors primary region recovery. When primary region available: run integrity checks on primary DB, compare with DR region DB (row counts, audit trail). Plan failback to primary during next maintenance window to minimise client impact. Failback is transparent to hotel clients.
  7. Post-Incident Review (within 5 business days)
    Blameless post-mortem conducted by CTO, CISO, and Infrastructure Lead. Timeline, contributing factors, and impact quantified. Action items assigned with owners and deadlines. Post-mortem report shared with affected Enterprise clients on request.
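
The GO/NO-GO rule in step 3 can be sketched as a small decision function. This is an aid for tabletop exercises rather than a substitute for the Incident Commander's judgement. The names `Decision` and `dr_decision` are hypothetical. The plan states two cases (more than 15 minutes down with no AWS ETA means GO; an AWS ETA under 1 hour means monitor). The sketch assumes an ETA of 1 hour or more is treated like no ETA, and that outages of 15 minutes or less stay in monitoring.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional


class Decision(Enum):
    GO = "invoke DR failover"
    MONITOR = "monitor and reassess"


OUTAGE_THRESHOLD = timedelta(minutes=15)
ETA_THRESHOLD = timedelta(hours=1)


def dr_decision(outage: timedelta, aws_eta: Optional[timedelta]) -> Decision:
    """Recommend GO or MONITOR following the step 3 rule.

    `aws_eta` is the restoration estimate published by AWS, or None if
    AWS has not given one.
    """
    if aws_eta is not None and aws_eta < ETA_THRESHOLD:
        # Stated in the plan: a short AWS ETA means wait and reassess.
        return Decision.MONITOR
    if outage > OUTAGE_THRESHOLD:
        # Stated in the plan for "no ETA"; assumed for an ETA of 1 hour or more.
        return Decision.GO
    # Assumption: below the 15-minute threshold, keep monitoring.
    return Decision.MONITOR
```

Under these assumptions, a 20-minute outage with no ETA yields `Decision.GO`, while the same outage with a 30-minute AWS ETA yields `Decision.MONITOR`.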
🗃️
Scenario B: Database Corruption or Deletion
P1 — Critical

Accidental or malicious corruption/deletion of data in PostgreSQL. PITR is the primary recovery mechanism.

  1. Detection (T+0)
    Automated data integrity checks run every 15 minutes and compare row counts against rolling baseline. Anomalous change (>10% row count drop in any table) triggers immediate PagerDuty P1 alert. Alternatively: client or internal report of data inconsistency.
  2. Immediate Containment (T+5 min)
    Put database into read-only mode immediately to prevent further writes. If corruption appears write-path related: disable application write endpoints via feature flag. Snapshot current (corrupted) state for forensic analysis; do not overwrite it. Revoke all application database credentials and rotate them immediately to prevent further corruption if credentials are compromised.
  3. Root Cause Assessment (T+15 min)
    CISO and Infrastructure Lead assess: accidental operational error, application bug, or malicious act. If malicious: invoke full incident response (Section 6 of Security Overview). If operational error: proceed directly to PITR. Identify the exact timestamp of corruption from CloudWatch Logs and RDS audit trail.
  4. Point-in-Time Recovery (PITR) (T+30 to T+90 min)
    Initiate RDS PITR restore to 1 second before identified corruption timestamp. PITR restores to a new RDS instance (original instance preserved for forensics). Validate restored DB: run data integrity suite, spot-check specific records, confirm audit trail continuity. Identify and quantify data loss window (the period between the PITR restore point and the moment writes were halted in step 2; writes accepted in that period are absent from the restored instance).
  5. Application Cutover to Restored DB (T+90 min to T+3 hrs)
    Update application DB connection string to point to restored instance. Gradually re-enable write endpoints. Monitor application error rate for 30 minutes before declaring recovery complete. Issue new application DB credentials scoped to restored instance.
  6. Client Impact Assessment and Notification (T+2 hrs)
    Identify which tenants had data in the corruption window. Quantify: which bookings, rate changes, or forecasts may need to be re-entered. Notify affected hotel clients directly with specific details of the data loss window (if any). Assist clients in reconciling their PMS data against Propeter if needed.
  7. Post-Incident Review (within 5 business days)
    Full root cause analysis. Where an application bug was the cause, the fix is tested and deployed before write traffic resumes in production. Post-mortem distributed to affected clients. Preventive controls evaluated and implemented.
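
Two calculations in this procedure are easy to get wrong under pressure: the row-count anomaly check from step 1 and the restore target from step 4. The sketch below shows one way to express both. It is a sketch under stated assumptions, not Propeter's implementation: the function names are hypothetical, the rolling baseline is assumed to be supplied as a per-table expected row count, and a table missing from the current counts is assumed to count as zero rows.

```python
from datetime import datetime, timedelta, timezone

DROP_THRESHOLD = 0.10  # a drop of more than 10% triggers a P1 alert


def tables_with_anomalous_drop(
    baseline: dict[str, float], current: dict[str, int]
) -> list[str]:
    """Return tables whose row count fell more than 10% below baseline."""
    flagged = []
    for table, expected in baseline.items():
        if expected <= 0:
            continue  # no meaningful baseline to compare against
        observed = current.get(table, 0)  # assumption: missing table = 0 rows
        drop = (expected - observed) / expected
        if drop > DROP_THRESHOLD:
            flagged.append(table)
    return sorted(flagged)


def pitr_restore_time(corruption_at: datetime) -> datetime:
    """Restore target: 1 second before the identified corruption timestamp."""
    return corruption_at - timedelta(seconds=1)
```

With a baseline of 1,000 bookings and 200 rates, current counts of 850 and 195 flag only `bookings` (a 15% drop against 2.5%). Timestamps should be timezone-aware UTC values so the restore target is unambiguous.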
🌊
Scenario C: DDoS Attack
P1/P2 — Depends on Impact

Volumetric or application-layer DDoS. Primary mitigation is automated; DR failover is last resort.

  1. Automated Mitigation
    AWS Shield Standard automatically detects and mitigates volumetric network-layer DDoS. WAF rate-limiting blocks high-volume IP sources. CloudFront absorbs traffic; its edge network scrubs malicious requests before they reach the origin.
  2. Manual Assessment
    If automated mitigation insufficient: Infrastructure Lead analyses WAF logs to identify attack pattern. Applies targeted IP or geo-based block rules. Engages AWS Shield Response Team (SRT) if available (requires AWS Shield Advanced, which is on the roadmap).
  3. Rate Engine Priority Mode
    Implement read-only mode for non-critical dashboard features to reduce origin load. Rate engine API given priority compute allocation. Cache TTLs extended to reduce origin hits during attack period.
  4. Regional Failover (Last Resort)
    If primary region completely saturated: Route 53 DNS failover to DR region. DR region serves traffic while origin region is protected. Client notifications sent via email (not affected by web DDoS).
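
To illustrate the idea behind the rate-limiting in step 1, here is a minimal sliding-window limiter keyed by source IP. It shows the general technique only. It is not how AWS WAF is implemented or configured, and the class name, limit, and window are hypothetical. One simplification to note: rejected requests are not counted against the window, so a blocked source is readmitted as soon as its earlier requests age out.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `limit` requests per source within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        # Timestamps of accepted requests, oldest first, per source IP.
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, source_ip: str, now: float) -> bool:
        hits = self._hits[source_ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: reject without recording
        hits.append(now)
        return True
```

With `limit=3` and a 60-second window, a fourth request from the same address inside the window is rejected, while requests from other addresses are unaffected.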
🔌
Scenario D: Sub-Processor Outage
P2/P3 — Depends on Processor

Critical sub-processor unavailable. Each has a defined fallback mode.

Twilio (SMS) Outage

Fallback: Email notifications via SendGrid. If both unavailable: in-app notification only. Severity: P3 (non-critical path). No rate engine impact.

SendGrid (Email) Outage

Fallback: Queue emails; retry when restored (up to 24 hours). Critical alerts rerouted via Twilio SMS. Severity: P3. No rate engine impact.

Firebase (Push) Outage

Fallback: In-app notifications only; SMS for critical alerts. Mobile app degraded (push disabled). Severity: P3. Revenue engine unaffected.

Windcave/Qvalent (Payments) Outage

Fallback: Payment processing temporarily unavailable. Rate engine continues unaffected. Clients notified. Payment SLA clock paused for the duration of the outage. Severity: P2 for payment features, P3 for overall platform.
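
The notification fallbacks above (for example SMS, then email, then in-app) follow a common pattern: try channels in priority order and stop at the first one that accepts the message. A minimal sketch, assuming each channel is wrapped in a function that returns `True` on success or raises on failure; `send_with_fallback` and the channel wrappers are hypothetical names, not Twilio, SendGrid, or Firebase APIs.

```python
from typing import Callable, Optional

# A sender takes the message text and returns True if the provider accepted it.
Sender = Callable[[str], bool]


def send_with_fallback(
    message: str, channels: list[tuple[str, Sender]]
) -> Optional[str]:
    """Try each channel in order; return the name of the one that succeeded.

    Returns None if every channel failed, so the caller can record the
    delivery failure.
    """
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # Treat any provider error as an outage and try the next channel.
            continue
    return None
```

A caller would pass the channels in the order this section prescribes, for instance `[("sms", send_sms), ("email", send_email), ("in-app", send_in_app)]`, where each wrapper is supplied by the integration code.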

Section 05

Communication Plan

Internal Communication Channels

  • Primary: Slack #dr-incidents channel — all DR team members and on-call engineers
  • Secondary: PagerDuty incident bridge — voice call if Slack unavailable
  • Backup: Pre-shared WhatsApp group (DR Team Core 5) — for total system unavailability
  • Status updates: Infrastructure Lead posts to #dr-incidents every 30 minutes during active P1

Client-Facing Communication Channels

  • Status page: status.propeter.com — first channel updated; automated and manual posting capability
  • Email: SendGrid broadcast to affected client accounts (or all clients for P1)
  • SMS: Twilio for Enterprise clients with SMS alerts enabled
  • Direct call: Customer Success Manager calls Enterprise clients for P1 incidents within 2 hours
  • In-app banner: Displayed in dashboard for degraded/maintenance states

Communication Templates

P1 — Initial Incident Notice (within 1 hour of detection)
Subject: [ACTION REQUIRED] Propeter Service Incident - [REGION], [DATE]

Dear [Property Name] team,

We are writing to notify you of an active service incident affecting Propeter's
platform in the [affected region].

Status: [Service degraded / Service unavailable]
Affected services: [Revenue Engine / Dashboard / Rate Push]
Incident start time: [TIME] UTC
Current status: Our engineering team is actively working on restoration.
Estimated resolution: [TIME or "Under investigation"]

What this means for your hotel:
[Specific impact — e.g., "Rates are currently frozen at last pushed values.
No new rate recommendations will be generated until service is restored."]

Live status: https://status.propeter.com
We will provide an update within [60 / 120] minutes.

The Propeter Reliability Team
P1 — Resolution Notice (after service restoration)
Subject: [RESOLVED] Propeter Service Incident — [DATE]

Dear [Property Name] team,

We are pleased to confirm that the service incident affecting Propeter's platform
has been resolved.

Resolution time: [TIME] UTC
Total incident duration: [X hours Y minutes]
Root cause (preliminary): [Brief description]
Data impact: [No data loss / Data loss window: TIME to TIME]

All services are now operating normally. Rate recommendations are being generated
and distributed to your connected channels.

A full post-incident report will be available within 5 business days.
Any SLA credits will be applied to your next invoice automatically.

We sincerely apologise for the disruption to your operations.

The Propeter Reliability Team
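
A notice sent with a placeholder such as `[TIME]` still in it undermines client confidence, so it can help to fill templates programmatically and refuse to send when anything is left unfilled. The sketch below assumes placeholders are the bracketed tokens used in the templates above and that `[ACTION REQUIRED]` and `[RESOLVED]` are literal subject tags. `render_notice` and `LITERAL_TAGS` are hypothetical names.

```python
import re

BRACKETED = re.compile(r"\[([^\[\]]+)\]")
# Assumption: these subject-line tags are literal text, not placeholders.
LITERAL_TAGS = {"ACTION REQUIRED", "RESOLVED"}


def render_notice(template: str, values: dict[str, str]) -> str:
    """Fill bracketed placeholders; raise if any placeholder has no value."""
    unfilled: list[str] = []

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key in LITERAL_TAGS:
            return match.group(0)  # keep the tag exactly as written
        if key in values:
            return values[key]
        unfilled.append(key)
        return match.group(0)

    rendered = BRACKETED.sub(substitute, template)
    if unfilled:
        raise ValueError("unfilled placeholders: " + ", ".join(unfilled))
    return rendered
```

Multi-option placeholders such as `[Service degraded / Service unavailable]` are matched by their full text, so the caller supplies the chosen wording under that exact key.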
Section 06

DR Testing Schedule

Quarterly: Tabletop Exercise (zero production impact)
Facilitated discussion-based walkthrough of DR scenarios. DR team members talk through their roles, decisions, and procedures without any system changes. Identifies gaps in the plan and team knowledge.

Semi-annual (twice a year): Partial Failover Test (staging environment only)
Actual failover of specific Tier 2 services to DR region in staging environment. Tests DNS failover, database promotion, and application startup in DR region. Measures actual vs. target RTO.

Annual: Full Failover Drill (during a maintenance window)
Complete simulation of Scenario A (AWS Regional Outage) during a scheduled maintenance window. Full DR failover executed on production infrastructure. Measures end-to-end RTO and validates all procedures.

Test Results: All DR test results are documented, reviewed by the CISO, and stored in the security audit log. Any test that fails to meet RTO/RPO targets triggers an immediate remediation project. Enterprise clients may request the most recent DR test results summary (under NDA).

Section 07

RTO / RPO Commitments by Service Tier

Cross-reference with Backup & Data Retention Policy for full methodology.

Service Tier | Services Included | RPO | RTO
Tier 1 - Mission Critical | Revenue Engine, Rate API, Rate Push, PMS Sync, Booking Ingestion | 15 minutes | 1 hour
Tier 2 - High Priority | Dashboard, User Auth, Property Config, Real-time Occupancy, Revenue Reports | 1 hour | 4 hours
Tier 3 - Standard Priority | Historical Analytics, Long-term Forecast Trends, Marketing Analytics, Xero Sync | 24 hours | 24 hours
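
When reviewing a drill or a real incident, the measured data loss and downtime can be checked against these targets mechanically. A minimal sketch, with the targets transcribed from the table above; `TierTarget`, `TARGETS`, and `breaches` are hypothetical names, and the sketch assumes a result exactly equal to the target counts as meeting it.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TierTarget:
    rpo: timedelta  # maximum tolerated data loss
    rto: timedelta  # maximum tolerated downtime


# Targets per service tier, as committed in this section.
TARGETS = {
    1: TierTarget(rpo=timedelta(minutes=15), rto=timedelta(hours=1)),
    2: TierTarget(rpo=timedelta(hours=1), rto=timedelta(hours=4)),
    3: TierTarget(rpo=timedelta(hours=24), rto=timedelta(hours=24)),
}


def breaches(tier: int, data_loss: timedelta, downtime: timedelta) -> list[str]:
    """Return which targets ("RPO", "RTO") the measured values exceeded."""
    target = TARGETS[tier]
    found = []
    if data_loss > target.rpo:
        found.append("RPO")
    if downtime > target.rto:
        found.append("RTO")
    return found
```

For example, a Tier 1 service restored after 75 minutes with 10 minutes of data loss returns `["RTO"]`: it met its RPO but missed its 1-hour RTO.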
Section 08

DR Plan Maintenance

Review Schedule

  • Full plan review: annual (Q1 each year)
  • Triggered review: after any P1 incident, after any significant infrastructure change, after any DR team personnel change
  • Review includes: scenario coverage, contact details, runbook accuracy, and lessons learned from any real incidents or tests

Version History

  • v3.0 — January 2026: Added sub-processor outage scenario; updated DR regions; added PITR procedure detail
  • v2.0 — March 2025: Added AU (Sydney) region; updated escalation matrix; added communication templates
  • v1.0 — September 2024: Initial DR Plan; India and EU regions only

Continuous Improvement: This DR Plan is a living document. Every incident, test, and near-miss produces learnings that are incorporated at the next review cycle or immediately if the finding is critical. Propeter's reliability team reviews this document before every quarterly tabletop exercise.