Site Reliability Engineer

IntelliPro

Kuala Lumpur, Malaysia

3-5 Years

Posted 7 hours ago

Job Description

This role combines software and systems engineering with the art of machine learning to build and run large-scale, massively distributed, and fault-tolerant systems. You will have the opportunity to sharpen your expertise in coding, performance analysis, and large-scale system design while making a tangible impact on the future of the Company's Infrastructure services and AML systems.

Responsibilities:

Design, build, and maintain highly available, scalable, and fault-tolerant systems. Collaborate with software engineering teams to ensure applications are designed with reliability and performance in mind.
Develop and maintain automation procedures to maximize system efficiency, minimize human intervention, and optimize routine tasks.
Monitor and analyze system performance to identify and address bottlenecks before they impact users. Ensure the infrastructure can handle rapid growth in web traffic and ML data processing.
Participate in 24/7 on-call rotations (including scheduled shifts and holidays). Practice sustainable on-call response, conduct root-cause analysis, and lead blameless post-mortems to prevent recurrence.
Define service-level indicators and objectives (SLIs/SLOs) in support of SLAs, implement monitoring tools, and set up automated alerting and metrics to track system health and performance.
Implement and maintain security best practices and ensure all systems meet regulatory requirements.

Job Requirements:

Education: Bachelor's or Master's degree in Computer Science, Information Technology, Computer Engineering, or a related field.
Experience: 3+ years of experience as a Site Reliability Engineer, Systems Engineer, or Software Engineer.
Coding: Proficient in at least one high-level programming language (e.g., Python, Go, C++, or Java) and shell scripting. Strong understanding of data structures and algorithms.
Systems: Strong understanding of Linux operating systems, open-source technologies, and network architecture.
Databases: Working knowledge of relational database systems and database modeling.

Preferred Qualifications:

Experience with containers and container orchestration platforms such as Docker and Kubernetes.
Proficiency in or exposure to machine learning frameworks such as TensorFlow, PyTorch, MXNet, or PaddlePaddle.
Hands-on experience with monitoring tools and methodologies (e.g., Prometheus, Grafana).
Soft Skills: Strategic thinking, exceptional communication, and the ability to collaborate effectively with cross-functional teams in a fast-paced environment.