Platform Reliability Engineering Manager

Hays

Malaysia, Kuala Lumpur

5-7 Years

Posted 15 hours ago

Job Description

Senior Platform Reliability Engineer with a notable multinational company in the financial services sector

Your new company

We are looking for an experienced Senior Platform Reliability Engineer to join a high-performing infrastructure team responsible for delivering and supporting a large-scale, containerised platform environment.
In this role, you will contribute to building and operating distributed platform solutions, ensuring high levels of reliability, security, and performance. You will collaborate with cross-functional teams to resolve complex technical issues, enhance system efficiency, and promote engineering best practices across the organisation.

Your new role

Oversee the availability, performance, and resilience of an enterprise container platform and its supporting infrastructure, including capacity planning, monitoring, and incident management.
Proactively identify system reliability risks by evaluating dependencies, troubleshooting recurring issues, and addressing performance bottlenecks to improve overall stability and cost efficiency.
Provide advanced-level production support, working closely with engineering teams to investigate and resolve platform and application-related incidents.
Participate in a 24/7 on-call rotation, responding to monitoring alerts and restoring services promptly to ensure minimal disruption.
Continuously improve operational processes by identifying manual tasks and implementing automation to increase efficiency and reduce human error.
Perform regular system upgrades, patching, and maintenance activities to maintain security and platform integrity.
Work with modern engineering tools and frameworks, including open-source technologies, CI/CD pipelines, version control systems, and container orchestration platforms (e.g., Kubernetes, Docker).
Stay updated on emerging technologies and recommend enhancements to improve system capabilities and engineering practices.

What you'll need to succeed

Degree in Computer Science, Engineering, or a related discipline.
5-7 years of IT experience, including at least 3 years in a Site Reliability Engineering (SRE) or Platform Engineering role managing containerised environments.
Hands-on experience with container orchestration platforms (Kubernetes or similar enterprise solutions).
Strong background in automation using tools such as Ansible and scripting languages like Python or Bash.
Experience developing and maintaining Helm charts and repositories.
Familiarity with software-defined networking or similar enterprise networking solutions.
Relevant certifications (e.g., Kubernetes certifications) are advantageous.
Proven experience in high-pressure, fast-paced environments supporting mission-critical systems.
Strong understanding of reliability engineering concepts, including scalability, observability, and performance tuning.
Experience with monitoring and observability tools (e.g., dashboards, metrics tracking, SLIs/SLOs/SLAs).
Hands-on experience with CI/CD pipelines and tools for build, deployment, and version control.
Ability to troubleshoot infrastructure, application, and networking issues effectively.

What you'll get in return

Work on enterprise-scale, high-availability systems
Exposure to modern cloud-native technologies and DevOps practices
Opportunity to influence platform engineering standards and innovation
Collaborative environment with strong focus on growth and improvement

What you need to do now

If you're interested in this role, click apply now to forward an up-to-date copy of your CV, or call us now.

If this job isn't quite right for you, but you are looking for a new position, please contact us for a confidential discussion about your career.