Service Stability Lead

Nasstar

Cyberjaya, Selangor, Malaysia

Fresher

Posted 21 hours ago

Job Description

Application Deadline: 29 May 2026

Department: Connectivity

Location: Cyberjaya

Description

The Service Stability Lead owns the end‑to‑end management of high‑severity incidents and underlying problems, ensuring rapid service restoration and permanent resolution of root causes.

The role combines real‑time incident leadership with proactive and reactive problem management to ensure that:

Major Incidents are effectively led, controlled, and communicated
Root causes are identified, understood, and eliminated
Trends and risks are proactively identified and mitigated
Improvements are implemented through to completion via Change Management

The role acts as the operational lead during Major Incidents and the owner of service stability, ensuring incidents are resolved quickly and do not recur.

Key Responsibilities

Major Incident Management (P1 / MI / MSI)

Lead and coordinate all Critical Incidents to drive rapid service restoration
Act as the single point of control during incidents, directing resolver groups, technical teams, and stakeholders
Chair incident bridge calls and maintain pace, direction, and accountability
Ensure clear, structured, and timely communications, aligned to customer and business expectations
Maintain primary focus on service restoration, with structured follow‑up for root cause analysis
Empowered to:

Drive prioritization and actions during Major Incidents
Challenge delays or inadequate responses
Escalate where required to protect service and customer outcomes
Influence technical and operational decision‑making to protect service and customer outcomes

Problem Management (End-to-End Ownership)

Own the end‑to‑end Problem Management lifecycle from identification through to closure
Ensure all Major Incidents transition into Problem records where required
Drive and quality assure Root Cause Analysis (RCA) using structured methodologies
Ensure all outputs are clear, fit for purpose (including for customer‑facing use where required), actionable, outcome‑driven, and tracked through to completion
Produce and govern:

Post-Incident Reviews (PIRs)
Service Incident Reports (SIRs)
Root Cause Analysis Reports (RCAs)

End-to-End Lifecycle Ownership (Incident → Problem → Change)

Ensure clear linkage and traceability between:

Incidents
Problems
Known Errors
Changes

Track all remediation actions through to successful implementation via Change Management
Prevent RCAs from being closed without resolution by enforcing accountability for delivery of permanent fixes
Work closely with Change Management to ensure fixes are:

Prioritized appropriately
Implemented safely
Delivering intended outcomes

Continuous Improvement, Risk & Prevention

Analyze incident data to identify:

Recurring issues and trends
Systemic weaknesses
Service risks (including legacy or accepted risks)

Define and improve:

Problem Management methodologies
KPIs and reporting frameworks
Preventative controls

Identify and drive improvements in:

Monitoring and alerting
Early detection capabilities
Service resilience

Apply proportionate operational approaches (e.g. streamlined handling for known or repeat issues) to balance efficiency with effectiveness

Communication, Reporting & Insight

Develop and deliver insight-led reporting, including:

P1 / MI / MSI trends
Root cause categorization
Service partner performance
Recurrence and stability metrics

Provide clear, insight‑led narratives for SLT, linking incidents to customer impact, root cause, and improvement actions
Ensure all communications are: