DevOps Lead

Tech Mahindra Limited

Kuala Lumpur

8-12 Years

Posted 28 days ago

Job Description

Job Title: Site Reliability Engineer (SRE) – CI/CD & Observability

Role Overview

We are seeking a highly motivated Site Reliability Engineer (SRE) responsible for ensuring the reliability, performance, and scalability of critical enterprise applications and infrastructure. The role focuses on managing CI/CD pipelines and proactive monitoring to maintain high service availability and operational excellence.

Key Responsibilities

CI/CD Pipeline Management

Design, implement, and maintain robust CI/CD pipelines to support automated build, test, and deployment processes.
Optimize pipeline performance, reliability, and security.
Integrate pipelines with version control systems, artifact repositories, and automated testing frameworks.
Support release management and continuous delivery practices.

Observability & Monitoring

Implement and manage application and infrastructure monitoring using Dynatrace.
Configure dashboards, alerts, and performance baselines to enable proactive issue detection.
Analyze system performance metrics, logs, and traces to identify optimization opportunities.
Drive observability best practices across applications and middleware layers.

AIOps & Incident Intelligence

Manage and configure BigPanda for event correlation, noise reduction, and incident prioritization.
Integrate monitoring tools with BigPanda to provide unified operational visibility.
Automate incident response workflows and improve Mean Time to Resolution (MTTR).

Reliability Engineering

Establish SRE practices including SLIs, SLOs, and error budgets.
Perform root cause analysis (RCA) for incidents and implement preventive measures.
Drive automation initiatives to reduce manual operational tasks.
Support capacity planning, resilience engineering, and high-availability architecture.

Collaboration

Work closely with DevOps, application teams, infrastructure teams, and ITSM teams.
Participate in incident response and on-call rotations.
Contribute to continuous improvement initiatives for platform reliability.

Required Skills

Technical Skills

Experience with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, Azure DevOps, etc.)
Strong hands-on experience with Dynatrace monitoring and observability
Experience using BigPanda or other AIOps platforms
Experience with cloud environments (Azure / GCP)
Knowledge of container platforms (Docker, Kubernetes)
Strong scripting skills (Python, Bash, or PowerShell)

Platform & Infrastructure

Linux/Unix system administration
Experience with middleware and application servers
Familiarity with microservices architecture and distributed systems

Preferred Skills

Knowledge of SRE frameworks and reliability engineering practices
Experience with ITSM tools (ServiceNow / Remedy)
Experience with Infrastructure as Code (Terraform / ARM)
Understanding of logging platforms (Azure Log Analytics, Application Insights, Azure Monitor, Splunk)