
Job Description:
As a Site Reliability Engineering (SRE) leader, you will be responsible for leading a team of 10 to 18 SREs to ensure the reliability, scalability, and performance of the platform. This includes managing the team, defining and tracking key metrics, collaborating with other engineering teams, and driving improvements to the development and production environment. The role also involves implementing best practices, staying abreast of industry trends, and building automation to support large-scale deployments.
You will also set the vision and roadmap for the SRE team.
Required skills and experience:
- Experience with large infrastructure and distributed systems.
- Strong understanding of AWS cloud computing infrastructure and its components.
- Experience with CI/CD pipelines, Kubernetes, and monitoring at scale.
- Proficiency in Infrastructure as Code (Terraform).
- Experience with configuration management tools like Ansible, Chef, or Puppet.
- Strong communication and stakeholder management skills.
- Knowledge of data pipelines, MongoDB, Elasticsearch, Kafka, Spark, and Samza is an advantage.