
Role Overview
We are seeking an experienced Site Reliability Engineer (SRE) / Platform Engineer with strong hands-on expertise in reliability engineering, observability, incident response, and AWS cloud operations. The ideal candidate has 5-7+ years of experience working in production-grade environments, with a proven track record implementing SLOs/SLIs, designing resilient platforms, and driving operational excellence through automation and infrastructure as code.
This is a senior technical role requiring deep engineering fundamentals, an ownership mindset, and the ability to collaborate across engineering, product, and security teams.
Key Responsibilities
Reliability Engineering
Define, implement, and maintain SLOs, SLIs, and SLAs across critical services.
Continuously measure service health and proactively improve reliability and performance.
Drive error budget policies and reliability-centered engineering practices.
Observability & Monitoring
Design and implement observability frameworks using Grafana (preferred), Prometheus, or equivalent tools.
Build dashboards, alerts, tracing, and log aggregation pipelines.
Ensure full visibility into system health, performance, and failure modes.
Incident Response Management
Lead major incident response and post-incident analysis.
Own on-call processes, escalation workflows, and response runbooks.
Drive root cause analysis (RCA), corrective actions, and long-term prevention strategies.
Continuously reduce MTTR and improve operational readiness.
Cloud Engineering (AWS)
Build, deploy, and maintain platform components using AWS services such as:
EC2, ECS, EKS, Lambda, DynamoDB, S3, IAM, API Gateway, CloudWatch, VPC.
Implement secure and scalable cloud architectures aligned with best practices.
Optimize cost, performance, and operational efficiency of cloud workloads.
Infrastructure as Code (IaC)
Implement and maintain IaC using Terraform, CloudFormation, or CDK.
Design reusable IaC modules, enforce standards, and ensure consistent environment provisioning.
Automate cloud infrastructure deployment, configuration, and compliance.
CI/CD & Automation
Design and maintain CI/CD pipelines using GitHub Actions, GitLab CI, Jenkins, Bitbucket Pipelines, or AWS CodePipeline.
Champion automation practices across build, test, deployment, and operational processes.
Ensure secure, reliable, and auditable deployment workflows.
Platform Engineering
Design and maintain containerized workloads (Docker, ECS, EKS).
Manage service orchestration, runtime environments, and deployment strategies.
Evaluate and implement modern DevOps/SRE tools to improve developer productivity and platform reliability.
Collaboration & Leadership
Work closely with engineering, product, and architecture teams to embed SRE best practices.
Contribute to documentation, knowledge sharing, and continuous improvement initiatives.
Provide technical guidance and mentorship to junior team members.
Required Skills & Experience (Mandatory)
5-7+ years of experience in SRE, DevOps, Platform Engineering, or Cloud Infrastructure roles.
Expert knowledge of:
o SLO/SLI/SLA design & management
o Observability & Monitoring (Grafana required; Prometheus/ELK a plus)
o Incident Response & On-Call Operations
o AWS Services (hands-on, production-grade)
o Infrastructure as Code (Terraform, CloudFormation, or CDK)
o CI/CD pipeline engineering
Strong understanding of distributed systems, networking, containers, and cloud-native architectures.
Ability to troubleshoot complex production issues across application, infrastructure, and network layers.
Strong scripting skills (Python, Bash, or Go are advantages).
Excellent communication, analytical thinking, and problem-solving skills.
Job ID: 145002721