Position Title: Lead Platform Engineer (SRE Lead)
Location: Kuala Lumpur, Malaysia (near Bukit Bintang)
Industry: Insurance
Open to: Malaysian citizens only
About the Role:
We are seeking an experienced and driven Lead Site Reliability Engineer to join our technology organization. In this role, you will be responsible for ensuring the reliability, scalability, and performance of critical systems and applications. You will lead SRE initiatives, champion automation, and collaborate closely with development and operations teams to build resilient, high-performing platforms that support our business and customers.
Key Responsibilities:
- Lead SRE efforts to maintain and improve system reliability, availability, and performance across production and non-production environments.
- Design, implement, and maintain monitoring, alerting, and observability frameworks to proactively detect and resolve incidents.
- Drive incident management processes, including root cause analysis, post-incident reviews, and implementation of preventive measures.
- Champion automation across infrastructure provisioning, deployment, and operational tasks to reduce manual effort and improve consistency.
- Collaborate with engineering teams to define and enforce service level objectives (SLOs), service level indicators (SLIs), and error budgets.
- Lead capacity planning, performance tuning, and scalability assessments to ensure systems meet growing business demands.
- Manage and optimize cloud infrastructure (Azure, AWS) and containerized environments (Docker, Kubernetes).
- Establish and promote SRE best practices, including chaos engineering, disaster recovery planning, and resilience testing.
- Mentor and guide junior SRE team members, fostering a culture of operational excellence and continuous improvement.
- Work closely with development teams to embed reliability considerations into the software development lifecycle.
Required Skills & Experience:
- Strong knowledge of Linux/Unix systems and networking fundamentals.
- Proficiency in programming, scripting, or automation technologies such as Python, PowerShell, Java, .NET, or Ansible.
- Hands-on experience with cloud platforms (e.g., Azure, AWS).
- Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
- Expertise in monitoring and observability tools such as AppDynamics, Application Insights, Dynatrace, Grafana, or the ELK Stack.
- Strong understanding of CI/CD pipelines and automation frameworks.
- Proven problem-solving skills and ability to perform root cause analysis.
- Excellent communication and collaboration skills.
- Analytical mindset with a focus on reliability, scalability, and performance.
- Passion for automation and reducing manual toil.
- Ability to work under pressure and handle critical incidents effectively.
- Commitment to continuous learning and staying updated on industry trends.
Desired Qualifications:
- Experience with distributed systems and microservices architecture.
- Knowledge of database systems (both SQL and NoSQL).
- Familiarity with incident management frameworks (e.g., ITIL, SRE best practices).
- Certifications in cloud technologies or DevOps tools.
Why Join Us:
- Lead SRE strategy for a major organization within the insurance industry.
- Work with modern cloud technologies, containerization, and observability tools.
- Collaborate with cross-functional teams to drive reliability and operational excellence.
- Be part of a culture that values automation, innovation, and continuous learning.
- Play a key role in shaping resilient systems that directly impact business and customer outcomes.