Job Description
Solution Design & Architecture:
Design and develop a robust infrastructure observability solution on platforms such as Splunk, Dynatrace, or Elasticsearch that enables real-time monitoring, logging, and tracing at scale across diverse, distributed client architectures.
Integration & Deployment:
Integrate observability tools with existing service management tools (ServiceNow) and monitoring systems (e.g., Prometheus, Dynatrace, Zabbix) using scripts, integration plugins, and similar mechanisms to collect and correlate metrics, logs, and traces for unified visibility.
Configuration & Customization:
Develop automated incident-response workflows using ServiceNow and infrastructure automation platforms to streamline issue resolution.
Configure and leverage AI and machine learning to enable predictive analytics, anomaly detection, and automated root cause analysis within observability platforms.
Design and create dashboards (using SQL-based query languages), alerts, and reports to provide actionable insights into system performance, availability, and user experience.
Advisory & Troubleshooting:
Work closely with clients to assess their observability needs, provide strategic recommendations, and implement tailored solutions aligned with business objectives.
Diagnose and resolve issues with observability platforms to ensure platform reliability and performance.
Provide ongoing support and troubleshooting to address any issues that arise post-deployment, ensuring that the tools continue to operate effectively.
Qualifications:
3-5 years of consulting experience in IT Service Management (ITSM) processes (particularly event, incident, problem, and change management), infrastructure resiliency, or IT operations automation.
Hands-on experience with observability platforms such as Splunk, Dynatrace, or Elasticsearch (ELK Stack).
Proficiency in Python, Bash, or PowerShell for scripting and automation.
Familiarity with AI/ML concepts for anomaly detection, event correlation, and predictive analytics.
Excellent problem-solving, communication, and client-facing skills to collaborate with cross-functional teams and stakeholders.
Experience in SRE roles with a focus on observability and automation is an advantage.