This role combines software and systems engineering with the art of machine learning to build and run large-scale, massively distributed, and fault-tolerant systems. You will have the opportunity to sharpen your expertise in coding, performance analysis, and large-scale system design while making a tangible impact on the future of the Company's Infrastructure services and AML systems.
Responsibilities:
- Design, build, and maintain highly available, scalable, and fault-tolerant systems. Collaborate with software engineering teams to ensure applications are designed with reliability and performance in mind.
- Develop and maintain automation that improves system efficiency, reduces manual intervention, and streamlines routine tasks.
- Monitor and analyze system performance to identify and address bottlenecks before they impact users. Ensure the infrastructure can handle rapid growth in web traffic and ML data processing.
- Participate in 24/7 on-call rotations (including scheduled shifts and holidays). Practice sustainable on-call response, conduct root-cause analysis, and lead blameless post-mortems to prevent recurrence.
- Define service-level indicators and objectives (SLIs/SLOs) in support of SLAs, implement monitoring tools, and set up automated alerting and metrics to track system health and performance.
- Implement and maintain security best practices and ensure all systems meet regulatory requirements.
Job Requirements:
- Education: Bachelor's or Master's degree in Computer Science, Information Technology, Computer Engineering, or a related field.
- Experience: 3+ years as a Site Reliability Engineer, Systems Engineer, or Software Engineer.
- Coding: Proficient in at least one high-level programming language (e.g., Python, Go, C++, or Java) and shell scripting. Strong understanding of data structures and algorithms.
- Systems: Strong understanding of Linux operating systems and open-source technologies, plus a solid understanding of network architecture.
- Databases: Working knowledge of relational database systems and database modeling.
Preferred Qualifications:
- Experience with containers and container orchestration platforms such as Docker and Kubernetes.
- Proficiency in or exposure to machine learning frameworks such as TensorFlow, PyTorch, MXNet, or PaddlePaddle.
- Hands-on experience with monitoring tools and methodologies (e.g., Prometheus, Grafana).
- Soft Skills: Strategic thinking, exceptional communication, and the ability to collaborate effectively with cross-functional teams in a fast-paced environment.