Site Reliability Engineer L5 - Live SRE

Netflix • USA - Remote

Posted 6h ago · 🏠 Remote · Senior · Site Reliability Engineer · 📍 United States

Skills & Technologies

AWS, Docker, Kubernetes, Prometheus, Grafana

Overview

Netflix is seeking a Senior Site Reliability Engineer to support live streaming events by ensuring cloud infrastructure stability and reliability. You'll work with technologies like AWS, Docker, and Kubernetes to handle API traffic during high-demand events.

Job Description

Who you are

You have 5+ years of experience in site reliability engineering, with a strong focus on cloud infrastructure and live event support. Your expertise in monitoring and observability tools allows you to ensure high availability and performance during critical events. You are skilled in implementing load tests and analyzing system behavior under stress, which helps you identify potential bottlenecks and improve system resilience.

Your background includes hands-on experience with AWS and container orchestration tools like Kubernetes and Docker. You understand the intricacies of microservices architecture and are adept at managing API traffic, ensuring seamless communication between services. You are passionate about driving improvements in observability and monitoring, always looking for ways to enhance system performance and reliability.

You thrive in collaborative environments and enjoy working closely with cross-functional teams to deliver exceptional user experiences. Your problem-solving skills enable you to tackle complex challenges, and you are committed to fostering a culture of diversity and inclusion within your team.

Desirable

Experience with real-time streaming technologies and protocols is a plus, as is familiarity with tools like Prometheus and Grafana for monitoring and visualization. You are open to learning new technologies and methodologies, and you embrace opportunities for professional growth and development.

What you'll do

In this role, you will support Netflix's live streaming events, ensuring that our cloud infrastructure can handle sudden increases in API traffic. You will prepare and execute load tests to validate the performance of critical applications and overall system stability. Your work will directly affect the success of live events, from the planning and testing phases through the event launch.

You will drive continual improvements in observability and monitoring practices, with a focus on mitigating the thundering herd problem, where many clients hit the same services at once, and on improving system scalability. Collaborating with engineering teams, you will implement end-to-end observability solutions that provide insight into system performance and user experience.

Your role will involve analyzing data to identify trends and potential issues, allowing you to proactively address challenges before they impact users. You will also contribute to the development of best practices for incident management and response, ensuring that our systems remain reliable and performant.

What we offer

At Netflix, you will be part of a dynamic team that is dedicated to delivering high-quality entertainment experiences to millions of viewers worldwide. We offer a competitive salary and benefits package, along with opportunities for professional development and growth. Our culture values innovation, collaboration, and diversity, and we encourage you to apply even if your experience doesn't match every requirement. Join us in shaping the future of entertainment and making a lasting impact on how people enjoy content.
