Job Responsibilities
- Design, build, and maintain highly available, scalable, and reliable production systems.
- Ensure system uptime, performance, and reliability by proactively monitoring systems and by troubleshooting and resolving incidents.
- Implement and manage monitoring, alerting, and observability solutions (metrics, logs, traces).
- Automate operational tasks to reduce manual effort and improve system reliability.
- Lead incident response, root cause analysis (RCA), and post-incident reviews.
- Collaborate with development teams to define SLIs, SLOs, and error budgets.
- Improve CI/CD pipelines to enable safe, fast, and reliable deployments.
- Manage capacity planning, performance tuning, and cost optimization.
- Enforce security best practices across infrastructure and application layers.
- Participate in on-call rotations and provide production support.
Required Skills & Qualifications
Technical Skills
- 5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles.
- Strong experience with Linux/Unix systems and system internals.
- Proficiency in at least one programming/scripting language: Python, Go, Java, or Bash.
- Hands-on experience with cloud platforms: AWS, Azure, or GCP.
- Strong experience with containerization and orchestration: Docker, Kubernetes.
- Experience with CI/CD tools: Jenkins, GitHub Actions, GitLab CI, Azure DevOps.
- Expertise in monitoring and observability tools: Prometheus, Grafana, ELK/EFK, Datadog, New Relic.
- Experience with infrastructure as code (IaC): Terraform, CloudFormation, ARM templates.
- Strong understanding of networking concepts: TCP/IP, DNS, load balancing, firewalls.