IT Disaster Recovery (DR) Governance

Mahindra Satyam

Malaysia, Kuala Lumpur

7-12 Years

Job Description

Tech Mahindra represents the connected world, offering innovative and customer-centric information technology experiences, enabling Enterprises, Associates, and the Society to Rise. It has 150,000+ professionals working for 1000+ Global Customers (including Fortune 500 companies) in 90 Countries. We're part of the esteemed Mahindra group, headquartered in India. Under a new CEO, Tech Mahindra is committed to a transformative journey with Scale @ Speed as our guiding principle.

Job Description (JD): IT Disaster Recovery (DR) Governance

1) Role Title

IT Disaster Recovery (DR) Governance / DR Governance Lead

(Alternate titles: IT Resilience Governance Manager, DR Program Governance Lead, IT Continuity Governance Lead)

3) Key Responsibilities

A. DR Governance Framework & Standards

Define, implement, and maintain the DR governance model (policies, standards, procedures, controls, decision rights).
Establish DR lifecycle governance: strategy → design → implementation → test → review → improve.
Ensure alignment with enterprise BCM/ITSCM, cybersecurity, risk, compliance, and architecture standards.
Maintain DR documentation control: versioning, approvals, evidence retention, and audit-ready repositories.

B. DR Strategy, Scope & Service Criticality

Lead business impact alignment with BCM teams to confirm critical services, dependencies, and recovery objectives.
Own and maintain the DR scope register (tiering, service criticality, DR patterns, recovery method, dependencies).
Ensure RTO/RPO, recovery sequencing, and minimum service levels are measurable and contract/governance aligned.

C. DR Plans, Runbooks & Readiness

Govern creation and upkeep of:
DR Plans (service-based and site-based)
Technical Runbooks (step-by-step recovery procedures)
Dependency maps (apps ↔ infra ↔ network ↔ identity ↔ storage/backup)
Communication trees & escalation models for DR events
Validate plans are feasible (people, process, technology) and aligned to real operational capabilities.

D. DR Testing Program (Design, Execution & Closure)

Build and run the annual/quarterly DR test calendar (table-top, technical failover/failback, partial and full-scale).
Chair DR test rehearsals, coordinate participants, define entry/exit criteria, and manage test execution governance.
Capture outcomes: test evidence, results, deviations, gaps, and improvement actions.
Enforce closure: action owners, target dates, risk acceptance processes, and retest requirements.

E. Risk, Issue & Exception Management

Maintain DR risk register and issue log; assess impacts, prioritize remediation, track closure.
Govern exceptions/waivers (e.g., RTO/RPO not met): require business justification, risk acceptance, and compensating controls.
Drive continual improvement: lessons learned, maturity assessment, and roadmap updates.

F. DR Tooling, Monitoring & Metrics

Define DR governance reporting: KRIs/KPIs, dashboards, and executive summaries.
Ensure monitoring/telemetry exists for key resilience controls (backup health, replication status, failover readiness).
Validate configuration integrity: DR environment parity, patching/versions, access control, and change alignment.

G. Change Management & Release Governance

Integrate DR requirements into Change/Release processes:
ensure DR impact assessment is mandatory for major changes
validate DR plan updates for significant architecture/service changes
enforce testing following high-impact deployments
Provide governance sign-off for DR-related changes and ensure rollback/failback readiness.

H. Stakeholder & Vendor Management

Act as primary governance interface across:
Service Owners, Infrastructure/Cloud teams, Network/Security, App teams, Service Desk/ITOps, BCM, Audit/Risk
Govern supplier participation in DR tests and ensure contract/SLA clauses are met (evidence and reporting).
Facilitate SteerCo / governance forums: prepare packs, decisions, and action tracking.

I. DR Event Governance (During Actual Disaster)

Support incident leadership during major disruptions:
ensure DR invocation criteria are met and documented
coordinate governance communications, logging, approvals, and evidence capture
oversee controlled recovery and post-event review

4) Key Deliverables (Audit-Ready)

DR Governance Policy, Standards, and Control Framework
DR Scope & Tiering Register (service criticality, RTO/RPO, dependency mapping)
DR Plans and Technical Runbooks (service/site/platform level)
DR Test Strategy, Test Calendar, Test Scripts, Evidence Packs, and Test Reports
DR Risk Register, Issue Log, Waiver/Exception Register
DR Readiness Dashboard (KRIs/KPIs) and Executive SteerCo Pack
Post-Test / Post-Incident Lessons Learned and Improvement Roadmap
Annual DR Maturity Assessment & Program Plan

5) Success Measures (KPIs / KRIs)

These measures may be tailored to the contract/SLA model:

Governance & Coverage

% of critical services with approved DR plans/runbooks (target: ≥ 95–100%)
% of DR documentation updated within defined cadence (e.g., quarterly)

Testing & Assurance

% planned DR tests executed on time (target: ≥ 90–95%)
% DR tests meeting RTO/RPO (target: baseline then improve QoQ)
Number of high/critical findings open beyond SLA (target: trending down)

Risk & Control

Time to close DR test actions (median days)
Number of active waivers/exceptions and their aging
Reduction in repeat findings across consecutive tests

Operational Readiness

Backup/replication success rates, restore success rates (where measurable)
DR environment parity compliance (patch level/version drift)

6) Required Skills & Competencies

Core Knowledge

End-to-end DR/IT Service Continuity governance and execution
Recovery design patterns: active-active, active-passive, warm standby, cold standby, backup/restore, pilot light
Strong understanding of enterprise IT: compute/virtualization, storage/backup, network, IAM/AD, databases, cloud services

Governance & Delivery Skills

Program governance, stakeholder management, and executive communication
Strong documentation discipline and audit evidence management
Risk management, exception handling, and control-based thinking
Experience integrating DR with ITIL processes (Incident/Problem/Change/Release)

Tools (examples)

ITSM tools (ServiceNow/Jira), CMDB, monitoring tools
Backup and replication tools (e.g., Veeam/NetBackup/Commvault/Zerto), cloud DR services (AWS/Azure)
Documentation repositories (SharePoint/Confluence), dashboarding (Power BI)

7) Experience & Qualifications

7–12+ years in IT operations / infrastructure / service continuity / resilience, with 3–5+ years in DR governance or resilience leadership.
Proven track record in:
governing DR programs across multi-tower teams (cloud + on‑prem + apps)
leading DR tests and driving closure of findings
preparing audit-ready evidence and handling internal/external audits

Preferred Certifications (any combination)

ITIL (Foundation/Intermediate), ISO 22301/BCM exposure
ISO 27001 / security governance awareness
Cloud certifications (AWS/Azure), DR/BCP certifications (DRI/BCI) – optional but valuable

8) Behavioral Competencies

High ownership and persistence (drives closure)
Structured communicator (can write both concise executive summaries and detailed runbooks)
Comfortable challenging risk acceptance and ensuring accountability
Calm under pressure (major incident / DR invocation scenarios)

9) Working Conditions / On-Call

Primarily business hours; on-call / extended hours during DR tests, major incidents, or DR invocation events.
May require coordination across time zones (onshore/offshore model).