We are seeking a strategic and technically proficient Head of Site Reliability Engineering (SRE) to lead the design, implementation, and scaling of our reliability, observability, and operational practices.
As the Head of SRE, you will play a critical role in ensuring our systems are highly available, scalable, and performant while maintaining a strong engineering culture of reliability and resilience.

Key Responsibilities

- Lead and mentor a team of SREs responsible for production systems, ensuring operational excellence and system reliability.

- Define and drive the vision, strategy, and execution of SRE initiatives aligned with company goals.

- Own the uptime, latency, performance, and monitoring of all infrastructure and services.

- Partner with development, QA, and product teams to embed reliability practices early in the software development lifecycle (shift-left).

- Build and enforce SLAs, SLOs, and SLIs across all services, ensuring continuous improvement.

- Lead incident management processes and postmortems with a focus on blameless culture and systemic improvement.

- Design and maintain CI/CD pipelines and infrastructure as code (IaC) practices.

- Identify and eliminate toil by promoting automation, self-healing systems, and tooling.

- Drive capacity planning, cost optimization, and service scalability in cloud or hybrid environments.

- Ensure compliance with security, privacy, and regulatory standards related to infrastructure.

Qualifications

- 8+ years of experience in software engineering or infrastructure roles, with 4+ years in an SRE leadership or equivalent role.

- Deep knowledge of cloud platforms (AWS, GCP, or Azure), container orchestration (Kubernetes), and modern DevOps practices.

- Proficiency with monitoring tools (e.g., Prometheus, Grafana, Datadog), incident response systems, and observability platforms.

- Strong programming and scripting skills (e.g., Python, Go, Bash).

- Demonstrated success building high-availability systems at scale.

- Exceptional leadership, communication, and stakeholder management skills.

Preferred Qualifications

- Experience managing SRE teams across multiple time zones.

- Certifications in cloud architecture or site reliability engineering (e.g., Google Cloud Professional Cloud DevOps Engineer, AWS Certified DevOps Engineer).

- Exposure to zero-trust security, FinOps, or regulated environments (e.g., healthcare, finance).

Posted By

Talent Magnet at MINFY TECHNOLOGIES PRIVATE LIMITED

Last Active: 19 August 2025

Posted in

IT & Systems

Job Code

1594797
