Security & Disaster Recovery | Propeter Platform Security

Section 01

Purpose & Scope

This Disaster Recovery Plan (DRP) defines Propeter's preparedness, response, and recovery procedures for significant service disruptions. It exists to ensure that, in the event of a disaster, Propeter's engineering teams have a clear, rehearsed playbook — and that hotel clients can trust their revenue engine will be restored within committed timeframes.

Scope

All production systems and services comprising the Propeter platform across all AWS regions (ap-south-1, eu-west-1, ap-southeast-2, and their respective DR regions)
All sub-processor integrations where a failure would impact client service delivery
Personnel and communication procedures for all disaster scenarios covered in Section 2

Out of Scope

Development and staging environments (separate runbooks apply)
Disasters affecting only individual client configurations (handled via standard support P1 process)
Business continuity planning beyond IT systems (office access, physical security) — covered in separate BCP document

Relationship to Other Documents: This DRP works in conjunction with the Backup & Data Retention Policy (source of truth for RPO/RTO data) and the Uptime SLA (client-facing commitments that this DRP is designed to fulfil).

Section 02

Disaster Scenarios Covered

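This DRP covers the following six categories of disaster. Section 4 sets out step-by-step recovery procedures for four of them: AWS regional outage, database corruption or deletion, DDoS attack, and sub-processor outage.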

🏢 AWS Regional Outage

Complete or partial failure of an AWS region (e.g., ap-south-1 Mumbai). Triggers full DR failover to designated secondary region. Highest severity, most complex recovery.

🗃️ Database Corruption / Deletion

Accidental or malicious corruption, truncation, or deletion of PostgreSQL data. Recovery via Point-in-Time Recovery (PITR) or snapshot restore. May be caused by bug, operational error, or attack.

🔒 Ransomware Attack

Malicious encryption of production systems by an attacker. Requires containment, eradication, and restore from clean, pre-attack backups. Involves law enforcement notification and forensic investigation.

🌊 DDoS Attack

Volumetric or application-layer Distributed Denial of Service attack targeting Propeter's public endpoints. Primary mitigation is AWS Shield + WAF; DR failover used if mitigation insufficient.

👤 Key Personnel Unavailability

Sudden, unplanned unavailability of critical staff (CTO, CISO, Infrastructure Lead) due to illness, accident, or resignation during an active incident. Succession plan and documented runbooks cover this scenario.

🔌 Sub-Processor Outage

Outage affecting a critical sub-processor: Twilio (SMS notifications), SendGrid (email), Firebase (mobile push), Windcave/Qvalent (payment tokenisation). Each has documented fallback or degraded-mode operation.

Section 03

DR Team & Escalation

Incident Commander

Chief Technology Officer (CTO)

Overall disaster response authority

Makes final decisions on DR invocation, resource allocation, and client communication timing. Authorises RTO/RPO exception requests. Chairs post-incident review.

Technical Lead

Lead Infrastructure Engineer

Hands-on recovery execution

Executes technical recovery procedures. Coordinates with AWS Support if needed. Provides status updates to Incident Commander every 30 minutes during active incident.

Communications Lead

Chief Executive Officer (CEO)

External communications authority

Approves all client-facing communications. Manages strategic partner and media communications. Coordinates with Customer Success team on individual client outreach for Enterprise clients.

Security Lead

Chief Information Security Officer (CISO)

Security assessment and breach determination

Determines if incident constitutes a security breach requiring regulatory notification. Manages forensic evidence preservation. Coordinates with legal counsel. Leads post-incident security review.

Escalation Matrix

Severity	Detection	Initial Response SLA	Escalation Path	Client Communication
P1 — Critical	Automated CloudWatch + PagerDuty page	15 minutes	On-call Engineer → Infrastructure Lead (30 min) → CTO + CISO (1 hour)	Status page + proactive email within 1 hour
P2 — High	Automated alert + manual report	1 hour	On-call Engineer → Infrastructure Lead (2 hours)	Status page update within 2 hours
P3 — Medium	Monitoring alert or client report	4 hours	Support Engineer → Infrastructure Engineer (8 hours)	Status page if >20% clients affected
P4 — Low	Client report or internal discovery	1 business day	Support ticket queue → Engineering team	In-app notice only if broadly relevant
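
The matrix can also be held as data so that escalation deadlines are computed rather than remembered. The following is a minimal Python sketch, not Propeter's actual tooling: the names `ESCALATION` and `escalation_deadlines` are hypothetical, and it assumes each duration in the matrix is measured from the moment of detection. P4 is left out because "1 business day" is not a fixed duration.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

# (role, elapsed time since detection by which that role should be engaged).
# Durations are copied from the escalation matrix; treating them as offsets
# from detection is an assumption of this sketch.
ESCALATION = {
    "P1": [("On-call Engineer", timedelta(minutes=15)),
           ("Infrastructure Lead", timedelta(minutes=30)),
           ("CTO + CISO", timedelta(hours=1))],
    "P2": [("On-call Engineer", timedelta(hours=1)),
           ("Infrastructure Lead", timedelta(hours=2))],
    "P3": [("Support Engineer", timedelta(hours=4)),
           ("Infrastructure Engineer", timedelta(hours=8))],
}


def escalation_deadlines(severity: str, detected_at: datetime) -> List[Tuple[str, datetime]]:
    """Return (role, deadline) pairs for a severity, in escalation order."""
    return [(role, detected_at + offset) for role, offset in ESCALATION[severity]]


if __name__ == "__main__":
    detected = datetime(2026, 1, 10, 12, 0, tzinfo=timezone.utc)
    for role, deadline in escalation_deadlines("P1", detected):
        print(f"{deadline:%H:%M} UTC  {role}")
```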

Section 04

Recovery Procedures

🏢

Scenario A: AWS Regional Outage

P1 — Critical

Complete loss of primary AWS region. RTO target: 4 hours (Tier 1: 1 hour). RPO: 1 hour (Tier 1: 15 minutes).

1

Detection T+0 — Automated

CloudWatch Synthetic Monitoring detects failed health checks across all primary region endpoints, and Route 53 health checks confirm the failure. PagerDuty pages the on-call Infrastructure Engineer immediately. The AWS Health Dashboard is checked for a regional incident declaration.
2

Assessment T+10 min

On-call engineer confirms: regional failure vs. application-level failure vs. network issue. Attempts to reach primary region control plane via alternative network path. Checks AWS Personal Health Dashboard for confirmed regional event. Documents assessment outcome in #dr-incidents Slack channel.
3

DR Decision T+15 min

Incident Commander (CTO) makes GO/NO-GO decision to invoke DR failover. If primary region unavailable for >15 minutes with no AWS ETA: GO decision. If AWS ETA <1 hour: monitor and reassess. GO decision triggers immediate notification to entire DR team.
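
The decision rule in this step can be written down as a small function, which is useful for tabletop exercises. This is a sketch only: `dr_decision` is a hypothetical name, and the `COMMANDER_JUDGEMENT` result is an assumption covering the cases the rule above does not spell out (for example, an AWS ETA of one hour or more).

```python
from datetime import timedelta
from typing import Optional


def dr_decision(outage: timedelta, aws_eta: Optional[timedelta]) -> str:
    """Apply the GO/NO-GO rule from the DR Decision step.

    outage  -- how long the primary region has been unavailable
    aws_eta -- AWS's estimated time to recovery, or None if no ETA was given
    """
    # AWS expects recovery in under an hour: keep monitoring and reassess.
    if aws_eta is not None and aws_eta < timedelta(hours=1):
        return "MONITOR"
    # More than 15 minutes down and AWS has given no ETA: invoke failover.
    if aws_eta is None and outage > timedelta(minutes=15):
        return "GO"
    # Not covered by the written rule; left to the Incident Commander (assumption).
    return "COMMANDER_JUDGEMENT"


if __name__ == "__main__":
    print(dr_decision(timedelta(minutes=20), None))
    print(dr_decision(timedelta(minutes=20), timedelta(minutes=40)))
```
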
4

Failover Execution T+15 to T+75 min

Infrastructure Lead executes DR runbook: (1) Promote RDS read replica in DR region to standalone primary. (2) Update application configuration in DR region to point to new DB endpoint. (3) Scale up ECS tasks in DR region (Terraform apply from pre-prepared DR config). (4) Update Route 53 DNS records to DR region load balancer (TTL pre-set to 60 seconds for fast propagation). (5) Verify application health via DR region endpoints — all critical paths tested. (6) Confirm rate engine is processing correctly.
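
Because the six runbook steps depend on each other, a runner that executes them in order and stops at the first failure keeps a partial failover from continuing unnoticed. The sketch below is illustrative: `run_runbook` and `FAILOVER_STEPS` are hypothetical names, and every action is a placeholder where real tooling (RDS promotion, Terraform apply, Route 53 update) would be called.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order; stop at the first one that returns False or raises.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            ok = action()
        except Exception:
            ok = False
        if not ok:
            return completed, name
        completed.append(name)
    return completed, None


# Step names follow the runbook above; the lambdas are placeholders only.
FAILOVER_STEPS: List[Step] = [
    ("promote RDS read replica in DR region", lambda: True),
    ("point DR application config at new DB endpoint", lambda: True),
    ("scale up ECS tasks in DR region", lambda: True),
    ("update Route 53 records to DR load balancer", lambda: True),
    ("verify application health on DR endpoints", lambda: True),
    ("confirm rate engine is processing", lambda: True),
]

if __name__ == "__main__":
    done, failed = run_runbook(FAILOVER_STEPS)
    print(f"completed {len(done)} steps, failed step: {failed}")
```
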
5

Client Communication T+60 min

CEO approves P1 communication template. Status page (status.propeter.com) updated with incident status, affected regions, and ETA. Proactive email sent to all affected clients via SendGrid (or Twilio SMS if SendGrid affected). Enterprise clients receive direct call from Customer Success Manager.
6

DR Operation & Primary Restoration T+4 hrs onward

The platform operates from the DR region while Propeter monitors recovery of the primary region. Once the primary region is available, run integrity checks on the primary DB and compare it with the DR region DB (row counts, audit trail). Schedule failback to the primary region for the next maintenance window to minimise client impact. Failback is transparent to hotel clients.
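
The row-count comparison between the primary and DR databases can be sketched as follows. The function name `row_count_mismatches` and the table names are hypothetical, and the counts are assumed to have been collected already (for example with `SELECT count(*)` per table); this covers only the row-count part of the check, not the audit trail comparison.

```python
from typing import Dict, Optional, Tuple


def row_count_mismatches(
    primary: Dict[str, int], dr: Dict[str, int]
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {table: (primary_count, dr_count)} for tables whose counts differ.

    A table missing on one side is reported with None for that side.
    """
    tables = sorted(set(primary) | set(dr))
    return {
        t: (primary.get(t), dr.get(t))
        for t in tables
        if primary.get(t) != dr.get(t)
    }


if __name__ == "__main__":
    # Illustrative table names and counts only.
    primary_counts = {"bookings": 1200, "rate_changes": 540}
    dr_counts = {"bookings": 1215, "rate_changes": 540}
    print(row_count_mismatches(primary_counts, dr_counts))
```
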
7

Post-Incident Review Within 5 business days

Blameless post-mortem conducted by CTO, CISO, and Infrastructure Lead. Timeline, contributing factors, and impact quantified. Action items assigned with owners and deadlines. Post-mortem report shared with affected Enterprise clients on request.

🗃️

Scenario B: Database Corruption or Deletion

P1 — Critical

Accidental or malicious corruption/deletion of data in PostgreSQL. PITR is the primary recovery mechanism.

1

Detection T+0

Automated data integrity checks run every 15 minutes and compare row counts against rolling baseline. Anomalous change (>10% row count drop in any table) triggers immediate PagerDuty P1 alert. Alternatively: client or internal report of data inconsistency.
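
The detection rule (a row-count drop of more than 10% against a rolling baseline) might look like the sketch below. It assumes the rolling baseline is the mean of recent samples per table, which the plan does not specify; `dropped_tables` and the table names are hypothetical.

```python
from statistics import mean
from typing import Dict, List

DROP_THRESHOLD = 0.10  # alert when the drop is strictly greater than 10%


def dropped_tables(
    baseline_history: Dict[str, List[int]],
    current: Dict[str, int],
    threshold: float = DROP_THRESHOLD,
) -> List[str]:
    """Return tables whose current row count fell more than `threshold`
    below the mean of their recent counts. A table absent from `current`
    is treated as having zero rows."""
    alerts = []
    for table, history in baseline_history.items():
        baseline = mean(history)
        if baseline <= 0:
            continue  # nothing to compare against
        drop = (baseline - current.get(table, 0)) / baseline
        if drop > threshold:
            alerts.append(table)
    return sorted(alerts)


if __name__ == "__main__":
    history = {"bookings": [1000, 1000, 1000], "rates": [200, 200, 200]}
    print(dropped_tables(history, {"bookings": 850, "rates": 195}))
```
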
2

Immediate Containment T+5 min

Put database into read-only mode immediately to prevent further writes. If corruption appears write-path related: disable application write endpoints via feature flag. Snapshot current (corrupted) state for forensic analysis — do not overwrite. Revoke all application database credentials; rotate immediately to prevent further corruption if credentials are compromised.
3

Root Cause Assessment T+15 min

CISO and Infrastructure Lead assess: accidental operational error, application bug, or malicious act. If malicious: invoke full incident response (Section 6 of Security Overview). If operational error: proceed directly to PITR. Identify the exact timestamp of corruption from CloudWatch Logs and RDS audit trail.
4

Point-in-Time Recovery (PITR) T+30 to T+90 min

Initiate an RDS PITR restore to 1 second before the identified corruption timestamp. PITR restores to a new RDS instance (the original instance is preserved for forensics). Validate the restored DB: run the data integrity suite, spot-check specific records, and confirm audit trail continuity. Identify and quantify the data loss window: writes committed between the PITR restore timestamp and the moment the database was placed in read-only mode are not present in the restored instance.
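
The restore target can be derived from the corruption timestamp as in the sketch below. `pitr_request` and the instance identifiers are hypothetical; the dictionary keys match the parameters of the boto3 RDS call `restore_db_instance_to_point_in_time`, to which the result could be passed as keyword arguments. The sketch builds the request only and does not call AWS.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict


def pitr_request(source_id: str, target_id: str, corruption_at: datetime) -> Dict[str, Any]:
    """Build restore parameters for a point 1 second before the corruption.

    The restore goes to a new instance (target_id), so the original
    instance stays available for forensics.
    """
    if corruption_at.tzinfo is None:
        raise ValueError("corruption_at must be timezone-aware")
    restore_time = corruption_at.astimezone(timezone.utc) - timedelta(seconds=1)
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time,
    }


if __name__ == "__main__":
    # Illustrative identifiers and timestamp only.
    corrupted = datetime(2026, 1, 10, 12, 0, 0, tzinfo=timezone.utc)
    print(pitr_request("prod-db", "prod-db-restored", corrupted))
```
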
5

Application Cutover to Restored DB T+90 min to T+3 hrs

Update application DB connection string to point to restored instance. Gradually re-enable write endpoints. Monitor application error rate for 30 minutes before declaring recovery complete. Issue new application DB credentials scoped to restored instance.
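
The 30-minute observation period before declaring recovery can be checked mechanically. In this sketch `recovery_complete` is a hypothetical name and the 1% error-rate ceiling is an assumed value, since the plan does not state a threshold.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

OBSERVATION_WINDOW = timedelta(minutes=30)  # from the cutover step above


def recovery_complete(
    cutover_at: datetime,
    now: datetime,
    samples: Iterable[Tuple[datetime, float]],
    max_error_rate: float = 0.01,  # assumed threshold, not from the plan
) -> bool:
    """True once 30 minutes have passed since cutover and every error-rate
    sample taken since cutover is at or below the threshold."""
    if now - cutover_at < OBSERVATION_WINDOW:
        return False
    rates = [rate for ts, rate in samples if cutover_at <= ts <= now]
    # No samples means no evidence of health, so recovery is not declared.
    return bool(rates) and max(rates) <= max_error_rate


if __name__ == "__main__":
    t0 = datetime(2026, 1, 10, 14, 0)
    samples = [(t0 + timedelta(minutes=m), 0.002) for m in range(0, 31, 5)]
    print(recovery_complete(t0, t0 + timedelta(minutes=30), samples))
```
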
6

Client Impact Assessment and Notification T+2 hrs

Identify which tenants had data in the corruption window. Quantify: which bookings, rate changes, or forecasts may need to be re-entered. Notify affected hotel clients directly with specific details of the data loss window (if any). Assist clients in reconciling their PMS data against Propeter if needed.
7

Post-Incident Review Within 5 business days

Full root cause analysis. If an application bug caused the corruption, the fix is tested and deployed before the affected write path is fully re-enabled in production. Post-mortem distributed to affected clients. Preventive controls evaluated and implemented.

🌊

Scenario C: DDoS Attack

P1/P2 — Depends on Impact

Volumetric or application-layer DDoS. Primary mitigation is automated; DR failover is last resort.

1
Automated Mitigation
AWS Shield Standard automatically detects and mitigates volumetric network-layer DDoS. WAF rate-limiting blocks high-volume IP sources. CloudFront absorbs traffic — edge network scrubs malicious requests before reaching origin.
2
Manual Assessment
If automated mitigation is insufficient, the Infrastructure Lead analyses WAF logs to identify the attack pattern and applies targeted IP or geo-based block rules. The AWS Shield Response Team (SRT) can be engaged only once AWS Shield Advanced is in place (on roadmap).
3
Rate Engine Priority Mode
Implement read-only mode for non-critical dashboard features to reduce origin load. Rate engine API given priority compute allocation. Cache TTLs extended to reduce origin hits during attack period.
4
Regional Failover (Last Resort)
If primary region completely saturated: Route 53 DNS failover to DR region. DR region serves traffic while origin region is protected. Client notifications sent via email (not affected by web DDoS).
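
For the manual assessment step, a first pass over WAF logs is often a count of requests per source IP. The sketch below assumes the logs are available as JSON lines in the AWS WAF log format, where each record carries the source address under `httpRequest.clientIp`; `top_client_ips` is a hypothetical helper and the addresses are documentation examples.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def top_client_ips(log_lines: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Count requests per client IP in JSON-lines WAF logs and return the
    n most frequent as (ip, count) pairs. Unparseable lines are skipped."""
    counts: Counter = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        ip = (record.get("httpRequest") or {}).get("clientIp")
        if ip:
            counts[ip] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ['{"httpRequest": {"clientIp": "198.51.100.7"}}'] * 3
    sample.append('{"httpRequest": {"clientIp": "203.0.113.9"}}')
    print(top_client_ips(sample, n=5))
```

The resulting list is only a starting point for block rules; application-layer attacks spread across many addresses need pattern-based analysis as well.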

🔌

Scenario D: Sub-Processor Outage

P2/P3 — Depends on Processor

Critical sub-processor unavailable. Each has a defined fallback mode.

Twilio (SMS) Outage

Fallback: Email notifications via SendGrid. If both unavailable: in-app notification only. Severity: P3 (non-critical path). No rate engine impact.

SendGrid (Email) Outage

Fallback: Queue emails; retry when restored (up to 24 hours). Critical alerts rerouted via Twilio SMS. Severity: P3. No rate engine impact.

Firebase (Push) Outage

Fallback: In-app notifications only; SMS for critical alerts. Mobile app degraded (push disabled). Severity: P3. Revenue engine unaffected.

Windcave/Qvalent (Payments) Outage

Fallback: Payment processing is temporarily unavailable (degraded mode); the rate engine continues unaffected. Clients are notified, and the payment SLA clock is paused. Severity: P2 for payment features, P3 for the overall platform.
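
The notification fallbacks for Twilio, SendGrid, and Firebase can be summarised as a routing function. This is a sketch under assumptions: `route_notification` and the return labels are hypothetical, and queueing a critical email when both SendGrid and Twilio are down is an assumed behaviour that the plan does not state.

```python
from typing import Set


def route_notification(channel: str, critical: bool, up: Set[str]) -> str:
    """Pick a delivery action for a notification intended for `channel`
    ('sms', 'email' or 'push'), given the set of providers currently up
    (any of 'twilio', 'sendgrid', 'firebase')."""
    if channel == "sms":
        if "twilio" in up:
            return "sms"
        # Twilio outage: fall back to email, then to in-app only.
        return "email" if "sendgrid" in up else "in_app"
    if channel == "email":
        if "sendgrid" in up:
            return "email"
        # SendGrid outage: critical alerts go by SMS, the rest is queued.
        if critical and "twilio" in up:
            return "sms"
        return "queue_email"
    if channel == "push":
        if "firebase" in up:
            return "push"
        # Firebase outage: SMS for critical alerts, otherwise in-app only.
        if critical and "twilio" in up:
            return "sms"
        return "in_app"
    raise ValueError(f"unknown channel: {channel}")


if __name__ == "__main__":
    print(route_notification("email", True, {"twilio", "firebase"}))
```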

Section 05

Communication Plan

Internal Communication Channels

Primary: Slack #dr-incidents channel — all DR team members and on-call engineers
Secondary: PagerDuty incident bridge — voice call if Slack unavailable
Backup: Pre-shared WhatsApp group (DR Team Core 5) — for total system unavailability
Status updates: Infrastructure Lead posts to #dr-incidents every 30 minutes during active P1

Client-Facing Communication Channels

Status page: status.propeter.com — first channel updated; automated and manual posting capability
Email: SendGrid broadcast to affected client accounts (or all clients for P1)
SMS: Twilio for Enterprise clients with SMS alerts enabled
Direct call: Customer Success Manager calls Enterprise clients for P1 incidents within 2 hours
In-app banner: Displayed in dashboard for degraded/maintenance states

Communication Templates

P1 — Initial Incident Notice (within 1 hour of detection)

Subject: [ACTION REQUIRED] Propeter Service Incident — [REGION] — [DATE]

Dear [Property Name] team,

We are writing to notify you of an active service incident affecting Propeter's
platform in the [affected region].

Status: [Service degraded / Service unavailable]
Affected services: [Revenue Engine / Dashboard / Rate Push]
Incident start time: [TIME] UTC
Current status: Our engineering team is actively working on restoration.
Estimated resolution: [TIME or "Under investigation"]

What this means for your hotel:
[Specific impact — e.g., "Rates are currently frozen at last pushed values.
No new rate recommendations will be generated until service is restored."]

Live status: https://status.propeter.com
We will provide an update within [60 / 120] minutes.

The Propeter Reliability Team

P1 — Resolution Notice (after service restoration)

Subject: [RESOLVED] Propeter Service Incident — [DATE]

Dear [Property Name] team,

We are pleased to confirm that the service incident affecting Propeter's platform
has been resolved.

Resolution time: [TIME] UTC
Total incident duration: [X hours Y minutes]
Root cause (preliminary): [Brief description]
Data impact: [No data loss / Data loss window: TIME to TIME]

All services are now operating normally. Rate recommendations are being generated
and distributed to your connected channels.

A full post-incident report will be available within 5 business days.
Any SLA credits will be applied to your next invoice automatically.

We sincerely apologise for the disruption to your operations.

The Propeter Reliability Team
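
The "Total incident duration" placeholder in the resolution notice can be filled from the incident start and resolution times. A minimal sketch, with `format_duration` as a hypothetical helper; it truncates to whole minutes.

```python
from datetime import datetime


def format_duration(start: datetime, end: datetime) -> str:
    """Format the elapsed time between two timestamps as 'X hours Y minutes'."""
    total_minutes = int((end - start).total_seconds() // 60)
    if total_minutes < 0:
        raise ValueError("end must not be earlier than start")
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


if __name__ == "__main__":
    started = datetime(2026, 1, 10, 12, 0)
    resolved = datetime(2026, 1, 10, 15, 42)
    print("Total incident duration:", format_duration(started, resolved))
```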

Section 06

DR Testing Schedule

Quarterly

Tabletop Exercise

Facilitated discussion-based walkthrough of DR scenarios. DR team members talk through their roles, decisions, and procedures without any system changes. Identifies gaps in the plan and team knowledge.

Zero production impact

Semi-Annual (twice a year)

Partial Failover Test

Actual failover of specific Tier 2 services to DR region in staging environment. Tests DNS failover, database promotion, and application startup in DR region. Measures actual vs. target RTO.

Staging env only

Annual

Full Failover Drill

Complete simulation of Scenario A (AWS Regional Outage) during a scheduled maintenance window. Full DR failover executed on production infrastructure. Measures end-to-end RTO and validates all procedures.

Maintenance window

Test Results: All DR test results are documented, reviewed by the CISO, and stored in the security audit log. Any test that fails to meet RTO/RPO targets triggers an immediate remediation project. Enterprise clients may request the most recent DR test results summary (under NDA).

Section 07

RTO / RPO Commitments by Service Tier

Cross-reference with Backup & Data Retention Policy for full methodology.

Service Tier	Services Included	RPO	RTO
Tier 1 — Mission Critical	Revenue Engine, Rate API, Rate Push, PMS Sync, Booking Ingestion	15 minutes	1 hour
Tier 2 — High Priority	Dashboard, User Auth, Property Config, Real-time Occupancy, Revenue Reports	1 hour	4 hours
Tier 3 — Standard Priority	Historical Analytics, Long-term Forecast Trends, Marketing Analytics, Xero Sync	24 hours	24 hours
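
Holding the tier targets as data makes it straightforward to compare a measured drill or incident result against the commitment. In this sketch the values are copied from the table above, while `TIER_TARGETS` and `check_targets` are hypothetical names.

```python
from datetime import timedelta
from typing import Dict

# RPO/RTO targets per service tier, as listed in the table above.
TIER_TARGETS = {
    "Tier 1": {"rpo": timedelta(minutes=15), "rto": timedelta(hours=1)},
    "Tier 2": {"rpo": timedelta(hours=1), "rto": timedelta(hours=4)},
    "Tier 3": {"rpo": timedelta(hours=24), "rto": timedelta(hours=24)},
}


def check_targets(tier: str, measured_rpo: timedelta, measured_rto: timedelta) -> Dict[str, bool]:
    """Report whether measured data loss and recovery time meet the tier targets."""
    target = TIER_TARGETS[tier]
    return {
        "rpo_met": measured_rpo <= target["rpo"],
        "rto_met": measured_rto <= target["rto"],
    }


if __name__ == "__main__":
    # Example: a drill that lost 10 minutes of data and took 75 minutes to recover.
    print(check_targets("Tier 1", timedelta(minutes=10), timedelta(minutes=75)))
```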

Section 08

DR Plan Maintenance

Review Schedule

Full plan review: annual (Q1 each year)
Triggered review: after any P1 incident, after any significant infrastructure change, after any DR team personnel change
Review includes: scenario coverage, contact details, runbook accuracy, and lessons learned from any real incidents or tests

Version History

v3.0 — January 2026: Added sub-processor outage scenario; updated DR regions; added PITR procedure detail
v2.0 — March 2025: Added AU (Sydney) region; updated escalation matrix; added communication templates
v1.0 — September 2024: Initial DR Plan; India and EU regions only

Continuous Improvement: This DR Plan is a living document. Every incident, test, and near-miss produces learnings that are incorporated at the next review cycle or immediately if the finding is critical. Propeter's reliability team reviews this document before every quarterly tabletop exercise.
