Serve as the Site Reliability Engineer (SRE) for assigned cloud services, owning the KPIs for service reliability, issue-to-resolution, service deployment, business continuity management, security policy planning, capacity planning, automation, etc.
Automation: Automate routine and manual operations tasks to reduce toil and improve efficiency.
Monitoring & Alerting: Implement and use monitoring systems to track system health, set up alerting, and create dashboards.
Incident Management: Respond to and manage incidents to minimize downtime and resolve issues quickly, including on-call support.
System Performance: Measure, analyze, and tune system performance to ensure efficiency and stability.
Infrastructure Management: Provision and manage cloud infrastructure, sometimes using Infrastructure as Code (IaC), and assist in platform management and capacity planning.
Reliability & Resilience: Build sustainable and reliable systems through software engineering practices, which can include resilience testing and chaos engineering.
Key Requirements:
Bachelor's degree or above (or equivalent) in computer science or related discipline.
Familiarity with Linux, networking, and databases. Ability to program in one or more high-level languages, such as Python, Java, C/C++, or JavaScript.
Familiarity with containerization technologies such as Docker and orchestration tools such as Kubernetes.
Familiarity with configuration management and automation tools such as Ansible and Terraform, and with monitoring, logging, and alerting tools such as Splunk, Grafana, or Prometheus.