Site Reliability Engineer

5-7 Years

Job Description

Job Responsibilities

Design, build, and maintain highly available, scalable, and reliable production systems.
Ensure system uptime, performance, and reliability by proactively monitoring, troubleshooting, and resolving incidents.
Implement and manage monitoring, alerting, and observability solutions (metrics, logs, traces).
Automate operational tasks to reduce manual effort and improve system reliability.
Lead incident response, root cause analysis (RCA), and post-incident reviews.
Collaborate with development teams to define SLIs, SLOs, and error budgets.
Improve CI/CD pipelines to enable safe, fast, and reliable deployments.
Manage capacity planning, performance tuning, and cost optimization.
Enforce security best practices across infrastructure and application layers.
Participate in on-call rotations and provide production support.

Required Skills & Qualifications

Technical Skills

5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles.
Strong experience with Linux/Unix systems and system internals.
Proficiency in at least one programming or scripting language: Python, Go, Java, or Bash.
Hands-on experience with cloud platforms: AWS, Azure, or GCP.
Strong experience with containerization and orchestration: Docker, Kubernetes.
Experience with CI/CD tools: Jenkins, GitHub Actions, GitLab CI, Azure DevOps.
Expertise in monitoring and observability tools: Prometheus, Grafana, ELK/EFK, Datadog, New Relic.
Experience with infrastructure as code (IaC): Terraform, CloudFormation, ARM templates.
Strong understanding of networking concepts: TCP/IP, DNS, load balancing, firewalls.