Key Responsibilities
- Consultative Solution Design: Partner with clients to understand their specific Gen AI workloads, dissecting business goals to propose tailored on-premise infrastructure solutions (Compute, Storage, and Networking).
- Technical Architecture & Collaboration: Act as the bridge between the client and our internal Engineering/R&D teams to architect robust AI clusters. Translate client requirements into technical specifications and feasible system designs.
- Infrastructure Sizing & Proposal: Lead the creation of technical proposals, including the Bill of Materials (BOM), capacity planning (storage/compute sizing), and total cost of ownership (TCO) analysis.
- On-Premise Deployment & Integration: Oversee and assist with hardware installation, rack configuration, and software-stack deployment for high-performance AI systems and storage servers.
- Technical Troubleshooting: Diagnose complex interoperability issues between AI accelerators (GPUs), storage fabrics, and software layers with the assistance of the Engineering team.
- Documentation & Knowledge Transfer: Maintain detailed documentation of solution architectures, proof-of-concept (PoC) results, and technical resolutions.
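The capacity-planning work above usually starts from back-of-the-envelope sizing. A minimal sketch in Python, assuming mixed-precision training with Adam (roughly 16 bytes of model state per parameter, an assumed rule of thumb) and 80 GB GPUs (H100/A100-class); the function name and overhead factor are illustrative, not a standard formula:

```python
import math

def training_gpus_needed(params_billion: float,
                         gpu_mem_gb: float = 80.0,
                         bytes_per_param: int = 16,
                         overhead: float = 1.2) -> int:
    """Rough minimum GPU count to hold training state in memory.

    bytes_per_param=16 assumes fp16 weights and gradients plus fp32
    master weights and Adam moments; overhead=1.2 pads for
    activations and fragmentation. Both values are assumptions.
    """
    # billions of params * bytes/param conveniently equals gigabytes
    state_gb = params_billion * bytes_per_param * overhead
    return math.ceil(state_gb / gpu_mem_gb)

print(training_gpus_needed(70))  # 70B-parameter model -> 17 GPUs
```

A real proposal would also factor in the parallelism strategy (tensor/pipeline/data), sequence length, and batch size; this only bounds the memory footprint.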
Requirements
- Education: Bachelor's degree or equivalent in Computer Science, Data Science, Computer Engineering, or a related field.
- Gen AI & AI Server Knowledge: Solid understanding of the Generative AI landscape (LLMs and multi-modal models) and the High-Performance Computing (HPC) infrastructure required to train and run them.
- Communication: Ability to articulate complex architectural concepts (e.g., cluster networking, storage throughput) to both C-level executives (in plain terms) and IT Directors (in technical depth).
- Hardware Fluency: Deep familiarity with server components, including Server Motherboards, Enterprise CPUs (AMD EPYC/Intel Xeon), Data Center GPUs (NVIDIA H100/A100/L40S), High-speed RAM, and PCIe/NVLink interconnects.
- Storage Expertise: Proven understanding of storage requirements for AI, including the differences between Block, File, and Object storage and the importance of IOPS/throughput in model training.
- Problem Solving: Strong analytical skills to troubleshoot bottlenecks in hardware performance or software compatibility.
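The IOPS/throughput concern above can be made concrete with a checkpoint-sizing sketch (assuming 2 bytes per parameter for fp16/bf16 weights only, no optimizer state, and a 60-second write window; all three figures are assumed targets, not standards):

```python
def checkpoint_bandwidth_gbs(params_billion: float,
                             bytes_per_param: int = 2,
                             window_s: float = 60.0) -> float:
    """Sustained write bandwidth (GB/s) needed to land a full
    checkpoint within window_s seconds. Assumes weights only at
    bytes_per_param each -- both are assumptions."""
    size_gb = params_billion * bytes_per_param  # billions * bytes = GB
    return size_gb / window_s

# A 70B-parameter model -> a 140 GB checkpoint
print(round(checkpoint_bandwidth_gbs(70), 2))  # ~2.33 GB/s sustained
```

Numbers like this are what drive the choice between a single NAS head and a parallel file system in the storage sections below.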
Technical Skills
1. AI Server & Compute Infrastructure:
- GPU Architecture: Knowledge of multi-GPU configurations and GPU interconnect topologies (e.g., NVLink, PCIe).
- Cluster Management: Familiarity with HPC scheduling tools (Slurm) or container orchestration (Kubernetes/K8s) for AI workloads.
- Linux Mastery: Advanced Linux command-line proficiency (RHEL, Ubuntu Server), including kernel tuning and driver installation (NVIDIA drivers, CUDA Toolkit).
2. Storage Server & Data Management:
- High-Performance Storage: Understanding of NVMe and NVMe-oF (NVMe over Fabrics) for low-latency data access.
- File Systems: Familiarity with parallel file systems used in AI (e.g., Lustre, GPFS/IBM Spectrum Scale, BeeGFS) or high-performance NAS (ZFS).
- Object Storage: Knowledge of S3-compatible object storage for large datasets (e.g., MinIO, Ceph).
- RAID & Data Protection: Configuration of HW/SW RAID (0, 1, 5, 6, 10) for redundancy and performance optimization.
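As a quick reference for the RAID levels listed above, usable capacity can be sketched as follows (a simplified model: it ignores hot spares, filesystem overhead, and controller specifics; the function name is illustrative):

```python
def raid_usable_tb(level: str, drives: int, drive_tb: float) -> float:
    """Usable capacity for common RAID levels (simplified model)."""
    if level == "0":                        # pure striping, no redundancy
        return drives * drive_tb
    if level == "1":                        # n-way mirror: one drive's worth
        return drive_tb
    if level == "5" and drives >= 3:        # one drive of distributed parity
        return (drives - 1) * drive_tb
    if level == "6" and drives >= 4:        # two drives of parity
        return (drives - 2) * drive_tb
    if level == "10" and drives >= 4 and drives % 2 == 0:
        return (drives // 2) * drive_tb     # striped mirror pairs
    raise ValueError(f"unsupported: RAID {level} with {drives} drives")

print(raid_usable_tb("6", 12, 7.68))  # e.g., 12x 7.68 TB NVMe drives
```

The capacity/redundancy trade-off this encodes (RAID 0 fastest but unprotected, RAID 6 surviving two drive failures) is exactly the discussion a BOM proposal has to settle.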
3. DevOps & MLOps:
- Docker/Containerization (building and deploying AI containers).
- Basic understanding of CI/CD pipelines for model deployment.