Senior Platform Reliability Engineer with a notable MNC in the Financial Services sector
Your new company
We are looking for an experienced Senior Platform Reliability Engineer to join a high-performing infrastructure team responsible for delivering and supporting a large-scale, containerised platform environment.
In this role, you will contribute to building and operating distributed platform solutions, ensuring high levels of reliability, security, and performance. You will collaborate with cross-functional teams to resolve complex technical issues, enhance system efficiency, and promote engineering best practices across the organisation.
Your new role
- Oversee the availability, performance, and resilience of an enterprise container platform and its supporting infrastructure, including capacity planning, monitoring, and incident management.
- Proactively identify system reliability risks by evaluating dependencies, troubleshooting recurring issues, and addressing performance bottlenecks to improve overall stability and cost efficiency.
- Provide advanced-level production support, working closely with engineering teams to investigate and resolve platform and application-related incidents.
- Participate in a 24/7 on-call rotation, responding to monitoring alerts and restoring services promptly to ensure minimal disruption.
- Continuously improve operational processes by identifying manual tasks and implementing automation to increase efficiency and reduce human error.
- Perform regular system upgrades, patching, and maintenance activities to maintain security and platform integrity.
- Work with modern engineering tools and frameworks, including open-source technologies, CI/CD pipelines, version control systems, and container technologies (e.g., Kubernetes, Docker).
- Stay up to date with emerging technologies and recommend enhancements to improve system capabilities and engineering practices.
What you'll need to succeed
- Degree in Computer Science, Engineering, or a related discipline.
- 5-7 years of IT experience, including 3-5 years in a Site Reliability Engineering (SRE) or Platform Engineering role managing containerised environments.
- Hands-on experience with container orchestration platforms (Kubernetes or similar enterprise solutions).
- Strong background in automation using tools such as Ansible and scripting languages like Python or Bash.
- Experience developing and maintaining Helm charts and repositories.
- Familiarity with software-defined networking or similar enterprise networking solutions.
- Relevant certifications (e.g., Kubernetes certifications) are advantageous.
- Proven experience in high-pressure, fast-paced environments supporting mission-critical systems.
- Strong understanding of reliability engineering concepts, including scalability, observability, and performance tuning.
- Experience with monitoring and observability tooling and practices (e.g., dashboards, metrics tracking, and working with SLIs, SLOs, and SLAs).
- Hands-on experience with CI/CD pipelines and tools for build, deployment, and version control.
- Ability to troubleshoot infrastructure, application, and networking issues effectively.
What you'll get in return
- Work on enterprise-scale, high-availability systems
- Exposure to modern cloud-native technologies and DevOps practices
- Opportunity to influence platform engineering standards and innovation
- Collaborative environment with a strong focus on growth and improvement
What you need to do now
If you're interested in this role, click apply now to forward an up-to-date copy of your CV, or call us now.
If this job isn't quite right for you, but you are looking for a new position, please contact us for a confidential discussion about your career.