Genesis Networks Pte Ltd

Senior AI Infrastructure & Networking Engineer

5-8 Years
SGD 5,300 - 6,000 per month
  • Posted 2 days ago

Job Description

We are seeking a Senior AI Infrastructure & Networking Engineer to lead the architecture, deployment, and optimization of our next-generation AI Factory. In this role, you will build and scale high-density GPU supercomputing clusters (512 nodes and beyond) featuring NVIDIA Blackwell Ultra B300 systems. You will bridge the gap between heavy physical infrastructure (liquid cooling and busbar power) and advanced logical fabrics, ensuring predictable, line-rate, lossless transport for massive generative AI training and reasoning workloads.

Key Responsibilities

  • AI Fabric Architecture & Deployment: Design, build, and optimize high-throughput, ultra-low-latency East-West compute networks using NVIDIA Spectrum-X Ethernet platforms (Spectrum-4 ASICs) and/or NVIDIA Quantum-X800 InfiniBand switching.
  • Performance Tuning for Lossless Networking: Configure and fine-tune critical Layer 2/3 lossless transport mechanisms, including RDMA over Converged Ethernet (RoCE v2), Priority Flow Control (PFC), Explicit Congestion Notification (ECN), and Data Center Quantized Congestion Notification (DCQCN).
  • Rail-Optimized Topologies: Implement and maintain non-blocking, multi-plane, full fat-tree network topologies mapped to 8-GPU server architectures to maximize collective communication performance via NCCL (NVIDIA Collective Communications Library).
  • SmartNIC & DPU Management: Deploy and manage high-speed compute network interfaces, including ConnectX-8 SuperNICs (800 Gb/s) and BlueField-3 DPUs for isolated infrastructure management, storage acceleration, and multi-tenant security.
  • Full-Stack Orchestration & Automation: Drive infrastructure-as-code deployments using Ansible and Terraform. Initialize and monitor the NVIDIA Network Operator within core Kubernetes orchestration layers.
  • Telemetry & Validation: Utilize deep network telemetry tools such as NVIDIA NetQ and What Just Happened (WJH) to stream real-time switch diagnostics. Conduct line-rate cluster benchmarking using ib_write_bw and ib_write_lat to eliminate physical layer bottlenecks.
  • Cross-Functional Infrastructure Alignment: Collaborate closely with data center facility teams on high-density environment requirements (15-20 kW+ per rack, liquid-cooled rows, Coolant Distribution Units (CDUs), and Rear Door Heat Exchangers). Ensure operational verification aligns with international standards (e.g., IDCA G-Grade or Uptime Institute).

Required Technical Skills & Qualifications

  • Education: Bachelor's or Master's degree in Computer Science, Network Engineering, Systems Engineering, or a related technical discipline.
  • AI Networking Expertise: Proven track record of configuring RoCE v2, adaptive routing, and traffic optimization specifically for machine learning/HPC workloads.
  • Hardware Familiarity: Deep understanding of high-density scale-up and scale-out systems (NVIDIA HGX/DGX architectures, PCIe switching, OSFP/QSFP112 optical and copper assemblies).
  • Software & Cluster Management: Experience with cluster deployment suites like NVIDIA Mission Control, Base Command Manager, Run:ai, or similar enterprise MLOps frameworks.
  • Routing Protocols: Strong proficiency with advanced datacenter networking protocols, particularly eBGP IPv6 unnumbered underlays and EVPN/VXLAN overlays for multi-tenant isolation.
  • Cabling & Layer 1 Validation: Experience managing complex structured fiber trunking (MPO-12/MPO-24 APC) and executing layer-1 diagnostics (ibdiagnet, iblinkinfo).

Preferred Certifications

  • NVIDIA Certified Professional - AI Networking (NCP-AIN) (Highly Preferred)
  • NVIDIA Certified Expert - Cloud End-to-End Fabric (NCE-CEF)
  • Advanced networking tracks from major vendors (e.g., CCIE, JNCIE, or Nokia Service Routing Architect) combined with proven data center fabric experience.

What We Offer

  • Opportunity to work with first-of-its-kind, world-class AI supercomputing technologies (NVIDIA Blackwell Ultra).
  • High-impact role shaping the foundational architecture for enterprise generative AI and large-scale LLM initiatives.
  • Competitive salary, comprehensive benefits package, and continuous learning paths for advanced AI operations certifications.

Job ID: 149283579

Singapore, Kaki Bukit

Skills:

Terraform, Ansible, Infrastructure Management, Kubernetes, Full Stack Development, Priority Management, Design-Build, NVIDIA, Performance Metrics, Infrastructure Deployment, High Speed Ethernet