Are you passionate about data center operations and cutting-edge AI infrastructure?
We're looking for a DC System Operations Engineer to help us power the backbone of next-gen GPU clusters in our state-of-the-art AI Cloud facility.
In this role, you'll be on the front line of maintaining the stability, performance, and security of our high-performance computing systems. From hands-on hardware replacement to system diagnostics and support for GPU-based workloads, you'll play a key part in running the infrastructure behind advanced AI development.
Key Responsibilities
- Oversee daily operations of GPU clusters and critical data center systems.
- Perform preventative maintenance and hardware diagnostics for GPU/CPU/storage.
- Monitor systems using tools like Prometheus and Grafana.
- Collaborate with cross-functional teams to support scalable AI infrastructure.
- Maintain documentation, enforce security standards, and troubleshoot issues.
Who We're Looking For
- Minimum of 2 years of experience in system operations, data centers, or cloud infrastructure.
- Strong understanding of Linux fundamentals, Kubernetes environments, and server hardware.
- Comfortable with hands-on IT hardware replacement and diagnostics.
- Familiarity with monitoring tools and basic networking concepts.
- Advantage: Experience with GPU servers, NVIDIA GPUs, high-performance computing, or bare metal infrastructure.
Preferred Background
- Degree in Computer Science, Information Technology, Electrical Engineering, or equivalent experience.
Why Join Us
- Be part of a high-growth AI infrastructure initiative under YTL.
- Work in a fast-paced, forward-looking environment with state-of-the-art GPU clusters.
- Opportunities for growth, upskilling, and cutting-edge tech exposure.