System Operations Engineer

YTL AI Cloud

Malaysia, Kulai

2-4 Years


Posted 3 months ago

Job Description

Are you passionate about data center operations and cutting-edge AI infrastructure?

We're looking for a DC System Operations Engineer to help us power the backbone of next-gen GPU clusters in our state-of-the-art AI Cloud facility.

In this role, you'll be on the front line of maintaining the stability, performance, and security of our high-performance computing systems. From hands-on hardware replacement to system diagnostics and support for GPU-based workloads, you'll be key to the infrastructure behind advanced AI development.

Key Responsibilities

Oversee daily operations of GPU clusters and critical data center systems.
Perform preventative maintenance and hardware diagnostics for GPU/CPU/storage.
Monitor systems using tools like Prometheus & Grafana.
Collaborate with cross-functional teams to support scalable AI infrastructure.
Maintain documentation, enforce security standards, and troubleshoot issues.

Who We're Looking For

Min. 2 years of experience in system operations, data centers, or cloud infrastructure.
Strong understanding of Linux fundamentals, Kubernetes environments, and server hardware.
Comfortable with hands-on IT hardware replacement and diagnostics.
Familiarity with monitoring tools and basic networking concepts.
Advantage: Experience with GPU servers, NVIDIA GPUs, high-performance computing, or bare metal infrastructure.

Preferred Background

Degree in Computer Science, Information Technology, Electrical Engineering, or equivalent experience.

Why Join Us

Be part of a high-growth AI infrastructure initiative under YTL.
Work in a fast-paced, forward-looking environment with state-of-the-art GPU clusters.
Opportunities for growth, upskilling, and cutting-edge tech exposure.