Site Reliability Engineer
Why This Role Matters
At FINEXUS Group, mission-critical platforms underpin Malaysia's financial ecosystem, powering high-volume transaction processing, Tier-III data centre services, cloud platforms, and highly regulated environments.
As a Site Reliability Engineer (SRE), you are a key driver of reliability and resilience: ensuring services are available, observable, secure, scalable, and sustainable. Your work directly influences uptime, customer trust, operational risk, regulatory compliance, and the scalability of our platforms.
You bridge engineering and operations to build automation and observability practices that make reliability predictable and measurable, while operating within real-world constraints such as hybrid infrastructure, audit requirements, and mission-critical uptime expectations.
About the Role
You will design, operate, and continuously improve reliable production systems by applying software engineering principles to operational challenges. You'll own hybrid infrastructure spanning on-premise Tier-III data centres and cloud environments, automate operations using modern toolchains, strengthen observability, and lead incident response.
This is a hands-on reliability role: you will work across Linux and Windows systems, container platforms, and core infrastructure services. You will partner closely with the Application, Cloud, Network, Security, Database, and Compliance teams to maintain operational excellence in a regulated industry.
Mission / Expected Outcomes
- Ensure high availability and performance of FINEXUS platforms (services, infrastructure, and dependencies).
- Reduce operational risk through automation, proactive reliability engineering, and disciplined operations.
- Improve recovery time and incident response maturity through structured practices and continuous learning.
- Enhance observability to enable actionable monitoring, alerting, and operational insights.
- Maintain audit readiness and compliance across systems, infrastructure, and operational processes.
- Drive continuous improvement in scalability, resilience, and operational efficiency.
Key Responsibilities
Reliability Engineering & Platform Operations
- Define and measure SLIs/SLOs, maintain error budgets, and drive improvements based on those signals.
- Perform systems administration for Linux and Windows infrastructure, with strong emphasis on automation and standardisation.
- Own capacity planning, performance tuning, and infrastructure lifecycle management.
- Support backup, disaster recovery readiness, and resilience validation practices.
- Conduct fault-injection and chaos testing to validate reliability assumptions and improve resilience.
- Evaluate and implement automated remediation and self-healing solutions to reduce MTTR and operational toil.
Automation & Infrastructure Engineering
- Build and maintain CI/CD pipelines for reliable, traceable, repeatable deployments.
- Implement Infrastructure-as-Code (e.g., Terraform) and configuration automation (Ansible/Puppet or equivalent).
- Maintain container platforms (e.g., Kubernetes, SUSE Rancher Prime) to support scalable workloads.
- Deliver infrastructure automation that reduces manual toil, improves consistency, and strengthens auditability.
- Support performance and cost optimisation across hybrid environments.
Observability & Monitoring
- Design and deploy monitoring, alerting, logging, and tracing strategies across systems and services.
- Build dashboards and operational insights for engineers, operations teams, and leadership.
- Reduce alert fatigue through better thresholds, correlation, and signal-to-noise improvement.
- Integrate observability across hybrid environments (cloud + on-prem), including infrastructure and platform dependencies.
Security, Compliance & Risk Alignment
- Embed security controls into infrastructure and operational configurations (IAM, segmentation, hardening, secure access).
- Support evidence preparation for internal and external audits, including operational documentation and system reporting.
- Coordinate vulnerability management and patching with security teams, ensuring timely remediation.
- Align operations with PCI DSS / PCI 3DS, ISO 27001, BNM RMiT, SOC 2, and other relevant frameworks.
- Participate in risk assessments and security reviews to ensure reliability practices meet regulatory expectations.
Incident Management & Operational Excellence
- Participate in on-call rotations and lead structured incident response during major production events.
- Maintain and continuously improve runbooks, SOPs, and operational playbooks.
- Conduct blameless post-incident reviews and drive measurable root-cause prevention actions through to completion.
- Use AIOps and event-correlation tools to improve detection, reduce false positives, and shorten time-to-recovery.
Collaboration & Continuous Improvement
- Partner with Application, Database, Network, Cloud, Security, and Compliance teams to deliver resilient systems.
- Advocate SRE best practices: automation, observability, reliability engineering, and operational discipline.
- Continuously evaluate and introduce tools, patterns, and practices to improve FINEXUS's reliability posture.
- Mentor peers and contribute to raising the operational maturity of the broader engineering organisation.
Qualifications
Education & Experience
- Bachelor's or Master's degree in Computer Science, IT, Engineering, or a related discipline.
- Minimum 5 years' experience in SRE, Infrastructure, DevOps, Systems Engineering, or Production Operations.
- Proven experience operating mission-critical systems in enterprise and/or regulated environments.
- Experience supporting audit, compliance, and security requirements in production operations is strongly preferred.
Technical Skills
- Strong command of Linux and production systems operations; Windows administration experience is advantageous.
- Hands-on experience with hybrid environments: on-premise data centres and cloud platforms (AWS/Azure/GCP).
- Strong working knowledge of Kubernetes and container ecosystems; Rancher experience is a plus.
- Experience with CI/CD workflows, automation, and deployment reliability practices.
- Experience with Infrastructure-as-Code and configuration automation (Terraform, Ansible/Puppet or equivalent).
- Familiarity with observability stacks (Prometheus, Grafana, ELK, or equivalent).
- Solid networking fundamentals and troubleshooting (DNS, VPN, load balancing concepts, firewalls).
Professional Attributes
- Strong analytical and problem-solving skills; calm under pressure in mission-critical environments.
- Clear communicator in English (Bahasa Malaysia is advantageous).
- High ownership, operational discipline, and reliability mindset.
- Ability to work independently and collaboratively across cross-functional teams.
Certifications (Advantageous)
- CKA, AWS SysOps/DevOps, RHCE, ISO 27001 Implementer, or equivalent.