Site Reliability Engineer

Accion Labs Sdn Bhd

Kuala Lumpur

5-8 Years

Posted 8 hours ago

Job Description

We are seeking a highly experienced and proactive Senior Site Reliability Engineer to join our engineering team in Malaysia. The successful candidate will own the reliability, scalability, and performance of our production systems, with deep expertise in Kubernetes-based infrastructure, cloud platforms (AWS/Azure), configuration management with Ansible, and full-stack observability covering logging, tracing, and metrics. Beyond hands-on platform work, this role carries technical leadership responsibilities, including mentoring engineers, defining SRE standards, and driving a culture of reliability and operational excellence across the organization. The candidate will manage cloud resources, observability pipelines, and incident response for systems that serve Vietnamese clients and integrate with Vietnamese platforms. Fluency in Vietnamese is required to collaborate with Vietnam-based engineering teams, read technical documentation written in Vietnamese, and lead incident bridges involving Vietnamese stakeholders.

Roles and Responsibilities

Kubernetes & Container Platform

o Own, operate, and continuously improve production Kubernetes clusters across cloud environments (AWS EKS and/or Azure AKS)

o Design and manage Helm charts and Kustomize configurations for scalable, repeatable Kubernetes manifest management

o Implement and maintain autoscaling, resource quotas, namespace management, RBAC, and network policies across clusters

o Manage service mesh deployments (Istio or Linkerd) for traffic management, mTLS, and inter-service observability

o Drive platform upgrades, node pool management, and cluster lifecycle operations with zero-downtime practices

o Evaluate and adopt new Kubernetes ecosystem tooling to improve platform reliability and developer experience

Cloud Infrastructure (AWS / Azure)

o Architect, provision, and maintain cloud infrastructure on AWS and/or Azure using infrastructure-as-code tools (Terraform, Ansible)

o Manage cloud networking components including VPCs, subnets, security groups, load balancers, DNS, and private endpoints

o Implement and enforce cloud security best practices including IAM policies, secrets management (AWS Secrets Manager, Azure Key Vault), and compliance controls

o Optimize cloud resource utilization and cost efficiency through right-sizing, reserved instances, and autoscaling strategies

o Operate managed cloud services including RDS, S3, Azure Blob Storage, Azure Service Bus, and equivalent AWS services as required

Configuration Management & Infrastructure as Code

o Write, maintain, and govern Ansible playbooks and roles for consistent, repeatable server configuration and application deployment

o Manage infrastructure-as-code repositories using Terraform for provisioning cloud resources across AWS and Azure environments

o Enforce GitOps principles using ArgoCD for declarative, auditable, and automated infrastructure and application deployments via GitLab CI pipelines

Observability — Logging, Tracing & Metrics

o Design and own the full observability stack covering metrics, structured logging, and distributed tracing across all production services

o Build and maintain metrics collection and alerting pipelines using Prometheus and Alertmanager, with dashboards in Grafana

o Implement and manage centralized logging infrastructure using the ELK Stack (Elasticsearch, Logstash, Kibana) or Loki/Grafana for log aggregation, search, and analysis

o Deploy and maintain distributed tracing solutions using Jaeger, Tempo, or OpenTelemetry to provide end-to-end request visibility across microservices

o Instrument applications and infrastructure for OpenTelemetry-compliant telemetry collection (metrics, logs, traces) in collaboration with development teams

o Define and implement SLOs, SLIs, and error budgets based on observability data, ensuring alerting is actionable, noise-free, and tied to real user impact

o Continuously improve dashboards, runbooks, and alert thresholds to provide meaningful, real-time visibility into system health and performance

Incident Management & Production Support

o Lead the response to major production incidents, coordinating cross-team investigation and resolution end-to-end

o Leverage observability tooling (metrics, logs, traces) to rapidly diagnose and resolve incidents with minimal MTTR

o Author detailed post-mortem reports and drive systemic preventive measures to eliminate recurring issues

o Establish and mature on-call rotations, escalation procedures, and incident response playbooks

o Participate in on-call rotations as required

Requirements and skills

Education

o Diploma in a related field with a minimum of 5 years of relevant professional experience

o Technical certificate or equivalent qualification is an advantage

Experience & Technical Skills

o 5+ years of professional experience in Site Reliability Engineering, DevOps, or Platform Engineering

o Deep hands-on expertise in Kubernetes (EKS, AKS, or self-managed) including cluster operations, RBAC, networking, and workload management

o Strong proficiency in Helm and Kustomize for Kubernetes manifest management and GitOps workflows

o Proven experience with cloud platforms AWS and/or Azure, including networking, IAM, managed services, and cost optimization

o Strong infrastructure-as-code experience using Terraform for cloud resource provisioning across AWS and/or Azure

o Solid experience with Ansible for configuration management and automated server provisioning

o Hands-on experience building and managing full observability stacks: Prometheus + Grafana for metrics, ELK Stack or Loki for logging, and Jaeger, Tempo, or OpenTelemetry for distributed tracing

o Strong proficiency with GitLab CI and ArgoCD for CI/CD pipeline design and GitOps-based deployments

o Proven experience defining and managing SLOs, SLIs, and error budgets in production environments

o Experience leading incident response, post-mortems, and implementing systemic preventive measures

o Strong proficiency in at least one scripting or programming language (Python, Go, or Bash) for automation and tooling

Language Requirement (Mandatory for this role)

o Fluent in Vietnamese (both spoken and written) – to collaborate with Vietnam-based engineering teams, lead incident responses involving Vietnamese stakeholders, and understand Vietnamese technical documentation

o Good English proficiency – for documentation and collaboration with Malaysia-based teams

Nice to Have

o Experience with service mesh technologies (Istio or Linkerd) for traffic management and mTLS

o Familiarity with chaos engineering tools (LitmusChaos, Chaos Monkey, or equivalent)

o Knowledge of Redis, Kafka, or RabbitMQ in production SRE contexts

o Experience with FinOps practices and cloud cost governance tooling

o Experience building AI-assisted automation workflows for SRE and operational use cases

o Relevant cloud certifications (AWS Solutions Architect, Azure Administrator, CKA/CKAD)

Soft Skills

o Strong technical leadership — able to guide and elevate the reliability engineering team

o Structured, specification-first thinking — defines runbooks and standards clearly before executing

o High ownership mindset — takes full accountability for production reliability and platform stability

o Collaborative and effective in Agile/Scrum environments, working seamlessly with development and product teams

o Enthusiastic about AI-assisted operations and proactive in improving team productivity through automation and tooling

o Clear communicator able to translate complex infrastructure concepts for both technical and non-technical stakeholders

More Info

Job Type: Contract Job

Role: Other Roles

Function: Others

About Company

Accion Labs Sdn Bhd

Accion Labs is headquartered in Pittsburgh. At our core is a mission to enhance lives by transforming businesses through innovation. We focus on applying next-generation technologies to solve complex challenges and accelerate enterprise transformation.

With a global presence across 23 locations and a team of 4,200+ employees, including 1,000+ trained in AI and GenAI, we help organizations modernize through a unique blend of engineering excellence, proprietary IP, and proven execution models. Our delivery methodology is built on a strong operational framework and a mature governance model that fosters true partnership and joint ownership through equal investments.

Job ID: 146538397
