
Job Summary
Key Responsibilities
- Define and drive the vision, strategy, and execution of SRE initiatives aligned with company goals.
- Own the uptime, latency, performance, and monitoring of all infrastructure and services.
- Partner with development, QA, and product teams to embed reliability practices early in the software development lifecycle (shift-left).
- Define and enforce SLAs, SLOs, and SLIs across all services, and drive continuous improvement against them.
- Lead incident management processes and postmortems with a focus on blameless culture and systemic improvement.
- Design and maintain CI/CD pipelines and infrastructure as code (IaC) practices.
- Identify and eliminate toil by promoting automation, self-healing systems, and tooling.
- Drive capacity planning, cost optimization, and service scalability in cloud or hybrid environments.
- Ensure compliance with security, privacy, and regulatory standards related to infrastructure.
Qualifications
- Deep knowledge of cloud platforms (AWS, GCP, or Azure), container orchestration (Kubernetes), and modern DevOps practices.
- Proficiency with monitoring tools (e.g., Prometheus, Grafana, Datadog), incident response systems, and observability platforms.
- Strong programming or scripting skills (e.g., Python, Go, or Bash).
- Demonstrated success building high-availability systems at scale.
- Exceptional leadership, communication, and stakeholder management skills.
Preferred Qualifications
- Certifications in cloud architecture or site reliability engineering (e.g., Google SRE, AWS DevOps).
- Exposure to zero-trust security, FinOps, or regulated environments (e.g., healthcare, finance).