Tech Mahindra represents the connected world, offering innovative and customer-centric information technology experiences, enabling Enterprises, Associates, and the Society to Rise. It has 150,000+ professionals working for 1000+ Global Customers (including Fortune 500 companies) in 90 Countries. We're part of the esteemed Mahindra group, headquartered in India. Under a new CEO, Tech Mahindra is committed to a transformative journey with Scale @ Speed as our guiding principle.
Job Description (JD): IT Disaster Recovery (DR) Governance
1) Role Title
IT Disaster Recovery (DR) Governance / DR Governance Lead
(Alternate titles: IT Resilience Governance Manager, DR Program Governance Lead, IT Continuity Governance Lead)
3) Key Responsibilities
A. DR Governance Framework & Standards
- Define, implement, and maintain the DR governance model (policies, standards, procedures, controls, decision rights).
- Establish DR lifecycle governance: strategy → design → implementation → test → review → improve.
- Ensure alignment with enterprise BCM/ITSCM, cybersecurity, risk, compliance, and architecture standards.
- Maintain DR documentation control: versioning, approvals, evidence retention, and audit-ready repositories.
B. DR Strategy, Scope & Service Criticality
- Lead business impact alignment with BCM teams to confirm critical services, dependencies, and recovery objectives.
- Own and maintain the DR scope register (tiering, service criticality, DR patterns, recovery method, dependencies).
- Ensure RTO/RPO, recovery sequencing, and minimum service levels are measurable and contract/governance aligned.
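As an illustration of what "measurable" recovery objectives can look like in practice, the sketch below compares a measured test result against a service's RTO/RPO. The class names, field names, and sample values are hypothetical and are not taken from any Tech Mahindra tooling; a real scope register would typically live in a CMDB or ITSM tool.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjective:
    service: str
    tier: int
    rto_minutes: int  # maximum tolerable time to restore the service
    rpo_minutes: int  # maximum tolerable data-loss window


@dataclass
class TestResult:
    service: str
    recovery_minutes: int   # measured time to restore during the DR test
    data_loss_minutes: int  # measured data-loss window during the DR test


def objective_breaches(objective: RecoveryObjective, result: TestResult) -> list[str]:
    """Return the objectives ('RTO', 'RPO') that the test result failed to meet."""
    breaches = []
    if result.recovery_minutes > objective.rto_minutes:
        breaches.append("RTO")
    if result.data_loss_minutes > objective.rpo_minutes:
        breaches.append("RPO")
    return breaches


# Hypothetical sample data for a Tier 1 service.
payments = RecoveryObjective("payments", tier=1, rto_minutes=60, rpo_minutes=15)
result = TestResult("payments", recovery_minutes=75, data_loss_minutes=10)

print(objective_breaches(payments, result))  # ['RTO']
```

A non-empty result would normally feed the exception/waiver process or a retest requirement.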
C. DR Plans, Runbooks & Readiness
- Govern creation and upkeep of:
  - DR Plans (service-based and site-based)
  - Technical Runbooks (step-by-step recovery procedures)
  - Dependency maps (apps ↔ infra ↔ network ↔ identity ↔ storage/backup)
  - Communication trees & escalation models for DR events
- Validate plans are feasible (people, process, technology) and aligned to real operational capabilities.
D. DR Testing Program (Design, Execution & Closure)
- Build and run the annual/quarterly DR test calendar (table-top, technical failover/failback, partial and full-scale).
- Chair DR test rehearsals, coordinate participants, define entry/exit criteria, and manage test execution governance.
- Capture outcomes: test evidence, results, deviations, gaps, and improvement actions.
- Enforce closure: action owners, target dates, risk acceptance processes, and retest requirements.
E. Risk, Issue & Exception Management
- Maintain DR risk register and issue log; assess impacts, prioritize remediation, track closure.
- Govern exceptions/waivers (e.g., RTO/RPO not met): require business justification, risk acceptance, and compensating controls.
- Drive continual improvement: lessons learned, maturity assessment, and roadmap updates.
F. DR Tooling, Monitoring & Metrics
- Define DR governance reporting: KRIs/KPIs, dashboards, and executive summaries.
- Ensure monitoring/telemetry exists for key resilience controls (backup health, replication status, failover readiness).
- Validate configuration integrity: DR environment parity, patching/versions, access control, and change alignment.
G. Change Management & Release Governance
- Integrate DR requirements into Change/Release processes:
  - ensure DR impact assessment is mandatory for major changes
  - validate DR plan updates for significant architecture/service changes
  - enforce testing following high-impact deployments
- Provide governance sign-off for DR-related changes and ensure rollback/failback readiness.
H. Stakeholder & Vendor Management
- Act as primary governance interface across:
  - Service Owners, Infrastructure/Cloud teams, Network/Security, App teams, Service Desk/ITOps, BCM, Audit/Risk
- Govern supplier participation in DR tests and ensure contract/SLA clauses are met (evidence and reporting).
- Facilitate SteerCo / governance forums: prepare packs, decisions, and action tracking.
I. DR Event Governance (During Actual Disaster)
- Support incident leadership during major disruptions:
  - ensure DR invocation criteria are met and documented
  - coordinate governance communications, logging, approvals, and evidence capture
  - oversee controlled recovery and post-event review
4) Key Deliverables (Audit-Ready)
- DR Governance Policy, Standards, and Control Framework
- DR Scope & Tiering Register (service criticality, RTO/RPO, dependency mapping)
- DR Plans and Technical Runbooks (service/site/platform level)
- DR Test Strategy, Test Calendar, Test Scripts, Evidence Packs, and Test Reports
- DR Risk Register, Issue Log, Waiver/Exception Register
- DR Readiness Dashboard (KRIs/KPIs) and Executive SteerCo Pack
- Post-Test / Post-Incident Lessons Learned and Improvement Roadmap
- Annual DR Maturity Assessment & Program Plan
5) Success Measures (KPIs / KRIs)
Targets can be tailored to the applicable contract/SLA model:
Governance & Coverage
- % of critical services with approved DR plans/runbooks (target: 95–100%)
- % of DR documentation updated within defined cadence (e.g., quarterly)
Testing & Assurance
- % of planned DR tests executed on time (target: 90–95% or higher)
- % of DR tests meeting RTO/RPO (target: establish a baseline, then improve quarter over quarter)
- Number of high/critical findings open beyond SLA (target: trending down)
Risk & Control
- Time to close DR test actions (median days)
- Number of active waivers/exceptions and their aging
- Reduction in repeat findings across consecutive tests
Operational Readiness
- Backup/replication success rates, restore success rates (where measurable)
- DR environment parity compliance (patch level/version drift)
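For illustration, a few of the measures above can be computed from simple register extracts. This is a minimal sketch under stated assumptions: the record layouts, field names, and all sample values are hypothetical, and "on time" is assumed to mean executed on or before the planned date.

```python
from datetime import date
from statistics import median


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place; 0.0 when the denominator is empty."""
    return round(100.0 * part / whole, 1) if whole else 0.0


# Hypothetical extracts from a DR scope register, test calendar, and action log.
services = [
    {"name": "payments", "critical": True, "plan_approved": True},
    {"name": "identity", "critical": True, "plan_approved": True},
    {"name": "reporting", "critical": True, "plan_approved": False},
    {"name": "intranet", "critical": False, "plan_approved": False},
]
tests = [
    {"planned": date(2024, 3, 15), "executed": date(2024, 3, 14)},
    {"planned": date(2024, 6, 15), "executed": date(2024, 6, 20)},  # late
    {"planned": date(2024, 9, 15), "executed": None},               # not run
    {"planned": date(2024, 12, 15), "executed": date(2024, 12, 15)},
]
action_close_days = [12, 30, 45, 7, 21]  # days taken to close each test action

# % of critical services with approved DR plans/runbooks
critical = [s for s in services if s["critical"]]
plan_coverage = pct(sum(s["plan_approved"] for s in critical), len(critical))

# % of planned DR tests executed on time
on_time_count = sum(
    1 for t in tests if t["executed"] is not None and t["executed"] <= t["planned"]
)
on_time = pct(on_time_count, len(tests))

# Time to close DR test actions (median days)
median_close = median(action_close_days)

print(f"Plan coverage: {plan_coverage}%")          # Plan coverage: 66.7%
print(f"Tests on time: {on_time}%")                # Tests on time: 50.0%
print(f"Median days to close: {median_close}")     # Median days to close: 21
```

In practice these figures would be drawn from the ITSM tool or CMDB and surfaced on the DR readiness dashboard rather than computed by hand.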
6) Required Skills & Competencies
Core Knowledge
- End-to-end DR/IT Service Continuity governance and execution
- Recovery design patterns: active-active, active-passive, warm standby, cold standby, backup/restore, pilot light
- Strong understanding of enterprise IT: compute/virtualization, storage/backup, network, IAM/AD, databases, cloud services
Governance & Delivery Skills
- Program governance, stakeholder management, and executive communication
- Strong documentation discipline and audit evidence management
- Risk management, exception handling, and control-based thinking
- Experience integrating DR with ITIL processes (Incident/Problem/Change/Release)
Tools (examples)
- ITSM tools (ServiceNow/Jira), CMDB, monitoring tools
- Backup and replication tools (e.g., Veeam/NetBackup/Commvault/Zerto), cloud DR services (AWS/Azure)
- Documentation repositories (SharePoint/Confluence), dashboarding (Power BI)
7) Experience & Qualifications
- 7–12+ years in IT operations / infrastructure / service continuity / resilience, with 3–5+ years in DR governance or resilience leadership.
- Proven track record in:
  - governing DR programs across multi-tower teams (cloud + on‑prem + apps)
  - leading DR tests and driving closure of findings
  - preparing audit-ready evidence and handling internal/external audits
Preferred Certifications (any combination)
- ITIL (Foundation/Intermediate), ISO 22301/BCM exposure
- ISO 27001 / security governance awareness
- Cloud certifications (AWS/Azure), DR/BCP certifications (DRI/BCI) – optional but valuable
8) Behavioral Competencies
- High ownership and persistence (drives closure)
- Structured communicator (can write concise exec summaries + detailed runbooks)
- Comfortable challenging risk acceptance and ensuring accountability
- Calm under pressure (major incident / DR invocation scenarios)
9) Working Conditions / On-Call
- Primarily business hours; on-call / extended hours during DR tests, major incidents, or DR invocation events.
- May require coordination across time zones (onshore/offshore model).