

As a Site Reliability Engineer (SRE), you will build and operate highly available, globally distributed advertising/monetization services. You will improve reliability, scalability, and operability through automation, observability, incident management, and sound engineering practices.
Own reliability across the service lifecycle: design reviews, capacity planning, launch, deployment, operations, and continuous improvement.
Build and operate highly available services across multiple regions/data centers; improve resilience, latency, and scalability.
Develop automation and tooling to reduce toil (deployment, remediation, runbooks, self-healing) using scripting and software engineering best practices.
Define and implement SLOs/SLIs/SLAs; create dashboards and alerting to track service health (availability, latency, errors, saturation).
Lead sustainable incident response: triage, mitigation, root-cause analysis (RCA), and blameless postmortems with actionable follow-ups.
Collaborate with software engineering, security, and compliance stakeholders to meet data governance and regulatory requirements.
3+ years of experience in SRE, DevOps, systems engineering, or production operations for large-scale services.
Strong coding skills in at least one language: Python, Go, or C++ (Java also acceptable).
Solid Linux/Unix fundamentals: processes, memory/CPU, filesystems, permissions, and troubleshooting.
Networking fundamentals in cloud environments: TCP/IP, DNS, HTTP/HTTPS, load balancing, basic security concepts.
SQL proficiency and experience with data workflows/ETL are a plus for ads/analytics-related systems.
Strong communication, ownership mindset, and ability to work effectively across global teams.
Experience supporting advertising, recommendation, or high-traffic consumer internet platforms.
Hands-on experience with cloud platforms (AWS/GCP/Azure) and infrastructure-as-code (Terraform/Ansible).
Experience with containers and orchestration (Docker, Kubernetes).
Observability experience with tools such as Prometheus, Grafana, ELK/Splunk, OpenTelemetry.
Experience operating large data systems (streaming, distributed storage/compute) and performance tuning.
With more than 800 consultants working on six continents, Verinon offers tailored solutions to clients around the globe. We specialize in systems integration around document and content management, collaboration, business intelligence, data warehousing, conversions, and interfaces/adapters, coupled with a select pool of custom development areas. We offer a mix of products, as well as services around those products, within our stated areas of specialty.
Our relationships with clients range from a single specialist provided as staff augmentation to full project rollouts in which we develop and deploy across multiple continents. We focus on quality while staying aggressive on cost and timelines by offering a mix of delivery options: on-site, off-site, off-shore, and hybrid approaches. Our off-shore development teams have earned ISO-9001 certification and have been rated CMMi Level 3 capable. We are well down the path toward CMMi Level 5 validation, which we are on schedule to earn within the next twelve months.
We have been recognized as one of the fastest-growing companies, including a place in the 2007 group that qualified for the INC 500. Our approach is unique in that we recruit both locally and globally for the best resources. Our specialists come from every inhabited continent. We embrace and practice diversity not for compliance but out of business necessity.
Job ID: 146091815