We are seeking a skilled and proactive MLOps / LLM DevOps Engineer to support the deployment, monitoring, optimization, and governance of Large Language Model (LLM) systems in production environments. This role bridges AI engineering and DevOps, ensuring scalable, secure, and reliable LLM-powered services.
The ideal candidate has hands-on experience managing ML/AI infrastructure, containerized deployments, CI/CD pipelines, and production monitoring. You will work closely with AI Engineers and software teams to operationalize LLM applications such as RAG systems, AI agents, document intelligence platforms, and conversational AI services.
This role focuses on production reliability, performance tuning, observability, cost optimization, and responsible AI governance across the full LLM lifecycle.
Responsibilities
- Design, implement, and maintain production-grade LLM infrastructure and deployment pipelines.
- Build CI/CD workflows for AI model training, fine-tuning, evaluation, and deployment.
- Deploy and manage LLM services using Docker and Kubernetes in cloud or hybrid environments.
- Implement scalable RAG pipelines, vector databases (e.g., FAISS, Chroma, Pinecone), and inference endpoints (see the sketch after this list).
- Monitor LLM systems for latency, throughput, cost, hallucination rate, and model drift.
- Establish observability frameworks (logging, tracing, metrics) for AI services.
- Optimize GPU/CPU resource utilization and inference performance.
- Manage model versioning, experiment tracking, and artifact storage using tools such as MLflow or Weights & Biases.
- Ensure secure API management, authentication, and data privacy compliance in LLM systems.
- Implement guardrails, rate limiting, caching (e.g., cache-augmented generation, CAG), and fallback mechanisms.
- Support responsible AI practices, including prompt logging, bias monitoring, and audit trails.
- Collaborate with AI Engineers to transition prototypes into stable production systems.
- Maintain documentation for architecture, workflows, and operational procedures.
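For illustration only, below is a minimal sketch of the kind of RAG retrieval component this role would operationalize, using FAISS as one of the vector stores named above. The embedding function, dimensionality, and sample documents are placeholders for the sketch, not a prescribed stack.

```python
# Minimal sketch of a RAG retrieval component of the kind this role operationalizes.
# The embedding function, dimension, and sample documents are placeholders.
import faiss
import numpy as np

DIM = 384  # embedding dimensionality (assumed)

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder embedder; in practice this would call a real encoder
    # (e.g., a Hugging Face sentence-embedding model).
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(texts), DIM)).astype("float32")

docs = ["refund policy text ...", "shipping times text ...", "warranty terms text ..."]
index = faiss.IndexFlatIP(DIM)   # exact inner-product index; swap for an ANN index at scale
index.add(embed(docs))           # ingest document embeddings

def retrieve(query: str, k: int = 2) -> list[tuple[str, float]]:
    scores, ids = index.search(embed([query]), k)
    return [(docs[i], float(s)) for i, s in zip(ids[0], scores[0])]

if __name__ == "__main__":
    for doc, score in retrieve("How long does shipping take?"):
        print(f"{score:.3f}  {doc}")
```

In production this component would sit behind an authenticated inference endpoint with the monitoring, caching, and guardrail responsibilities listed above.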
Qualifications
- Bachelor's degree in Computer Science, Artificial Intelligence, Data Science, or a related field.
- 3–5 years of experience in MLOps, DevOps, or AI infrastructure roles.
- Experience managing containerized applications using Docker and Kubernetes.
- Experience building CI/CD pipelines for ML workflows.
- Familiarity with cloud environments (AWS, Azure, GCP, or Huawei Cloud).
- Strong proficiency in Python and shell scripting (Bash).
- Familiarity with LLM frameworks (Hugging Face, OpenAI API, vLLM, LangChain).
- Experience managing vector databases and RAG architectures.
- Knowledge of API gateways, microservices architecture, and RESTful services.
- Understanding of GPU management and model inference optimization.
- Experience with monitoring tools (Prometheus, Grafana, ELK stack, etc.); see the sketch after this list.
- Familiarity with CI/CD tools (GitHub Actions, GitLab CI, Jenkins).
- Experience with experiment tracking tools (MLflow, Weights & Biases).
- Knowledge of security best practices, data governance, and secrets management.
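As an illustration of the monitoring experience described above, the following minimal sketch exposes request latency and token-count metrics for an LLM endpoint via prometheus_client, suitable for scraping into Grafana dashboards. The metric names and the stubbed generate() call are assumptions for the sketch, not a standard.

```python
# Illustrative sketch: exposing LLM request latency and token-count metrics
# with prometheus_client. Metric names and the stubbed generate() are assumed.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("llm_request_latency_seconds", "End-to-end LLM request latency")
TOKENS_GENERATED = Counter("llm_tokens_generated_total", "Total tokens returned to clients")

@REQUEST_LATENCY.time()           # records each call's duration in the histogram
def generate(prompt: str) -> str:
    time.sleep(random.uniform(0.05, 0.2))   # stand-in for a real inference call
    completion = "stubbed model completion"
    TOKENS_GENERATED.inc(len(completion.split()))
    return completion

if __name__ == "__main__":
    start_http_server(9100)       # metrics scrapeable by Prometheus at :9100/metrics
    while True:
        generate("synthetic traffic for the demo")
```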