This role is responsible for the reliability, availability, user experience, capacity planning, AIOps, process enhancement, and digitalization of cloud-based internet services.
Main responsibilities:
- Serve as the SRE for assigned cloud services, owning the KPIs for service reliability, issue-to-resolution, service deployment, business continuity management, security policy planning, capacity planning, automation, etc.
- Automation: Automate routine and manual operations tasks to reduce toil and improve efficiency.
- Monitoring & Alerting: Implement and use monitoring systems to track system health, set up alerting, and create dashboards.
- Incident Management: Respond to and manage incidents to minimize downtime and resolve issues quickly, including on-call support.
- System Performance: Measure, analyze, and tune system performance to ensure efficiency and stability.
- Infrastructure Management: Provision and manage cloud infrastructure, sometimes using Infrastructure as Code (IaC), and assist in platform management and capacity planning.
- Reliability & Resilience: Build sustainable and reliable systems through software engineering practices, which can include resilience testing and chaos engineering.
Requirements:
- Full-time Bachelor's degree or above (or equivalent) in computer science or a related discipline.
- Be familiar with Linux, networking, and databases. Be able to program in one or more high-level languages, such as Python, Java, C/C++, or JavaScript.
- Be familiar with containerization technologies like Docker and orchestration tools like Kubernetes.
- Be familiar with configuration management and automation tools such as Ansible and Terraform.
- Be familiar with monitoring, logging, and alerting tools like Splunk, Grafana, or Prometheus.
- Have good communication skills, the ability to handle contingencies, and organization and coordination skills, as well as strong analytical and troubleshooting skills for complex systems.