IntelliPro

Site Reliability Engineer

  • Posted 7 hours ago

Job Description

This role combines software and systems engineering with the art of machine learning to build and run large-scale, massively distributed, and fault-tolerant systems. You will have the opportunity to sharpen your expertise in coding, performance analysis, and large-scale system design while making a tangible impact on the future of the Company's Infrastructure services and AML systems.

Responsibilities:

  • Design, build, and maintain highly available, scalable, and fault-tolerant systems. Collaborate with software engineering teams to ensure applications are designed with reliability and performance in mind.
  • Develop and maintain automation procedures to maximize system efficiency, minimize human intervention, and optimize routine tasks.
  • Monitor and analyze system performance to identify and address bottlenecks before they impact users. Ensure the infrastructure can handle rapid growth in web traffic and ML data processing.
  • Participate in 24/7 on-call rotations (including scheduled shifts and holidays). Practice sustainable on-call response, conduct root-cause analysis, and lead blameless post-mortems to prevent recurrence.
  • Define service-level indicators and objectives (SLIs/SLOs) in support of SLAs, and implement monitoring tools, automated alerting, and metrics to track system health and performance.
  • Implement and maintain security best practices and ensure all systems meet regulatory requirements.

Job Requirements:

  • Education: Bachelor's or Master's degree in Computer Science, Information Technology, Computer Engineering, or a related field.
  • Experience: 3+ years of experience as a Site Reliability Engineer, Systems Engineer, or Software Engineer.
  • Coding: Proficient in at least one high-level programming language (e.g., Python, Go, C++, or Java) and shell scripting. Strong understanding of data structures and algorithms.
  • Systems: Strong understanding of Linux operating systems and open-source technologies, plus a solid understanding of network architecture.
  • Databases: Working knowledge of relational database systems and database modeling.

Preferred Qualifications:

  • Experience with containers and container orchestration platforms such as Docker and Kubernetes.
  • Proficiency in or exposure to machine learning frameworks such as TensorFlow, PyTorch, MXNet, or PaddlePaddle.
  • Hands-on experience with monitoring tools and methodologies (e.g., Prometheus, Grafana).
  • Soft Skills: Strategic thinking, exceptional communication, and the ability to collaborate effectively with cross-functional teams in a fast-paced environment.

Job ID: 145692389