
Empowering creators in a vibrant gaming universe
Roblox is an online gaming and entertainment platform headquartered in San Mateo, CA, that connects over 200 million monthly active users. The platform empowers its community to create and monetize their own games, with over $500 million paid out to developers in 2022 alone. As a leader in the...
Roblox offers competitive salaries, equity options, generous PTO policies, and a flexible remote work policy to support work-life balance. Employees a...
Roblox fosters a creator-centric culture, encouraging employees to innovate and collaborate while prioritizing user safety. The company values communi...

Roblox • San Mateo, CA, United States
Roblox is hiring a Senior Site Reliability Engineer to manage and optimize its infrastructure systems. You'll work with Kubernetes, Python, and AWS to ensure reliability and performance at scale. This role requires strong programming skills and experience in site reliability engineering.
You have 5+ years of experience in site reliability engineering, with a strong background in managing large-scale infrastructure systems. Your expertise in Kubernetes and cloud technologies allows you to design and implement resilient systems that can handle millions of users. You are proficient in Python and Linux, enabling you to automate processes and troubleshoot complex issues effectively. You understand the importance of observability and have experience implementing monitoring solutions to ensure system reliability. You are a collaborative team player who enjoys working with cross-functional teams to drive best practices in reliability and performance.
Experience with AWS services is a plus, as it complements your skills in managing cloud infrastructure. Familiarity with Docker and container orchestration will help you in your role, as you work to productionize Kubernetes-based infrastructure. You are passionate about improving system reliability and have a proactive approach to identifying and resolving potential issues before they impact users.
In this role, you will be responsible for designing and developing systems that promote fault tolerance and resilience across Roblox's infrastructure. You will automate the management and lifecycle of clusters, ensuring that systems are observable and maintain high availability. Your work will involve collaborating with the Infra Compute group to institute reliability best practices and drive common reliability initiatives. You will also participate in incident management and post-mortem analysis to continuously improve system performance and reliability.
You will have the opportunity to shape the future of Roblox's infrastructure by contributing to the development of tools and processes that enhance operational efficiency. Your insights will help guide the team in making informed decisions about infrastructure investments and improvements. You will work closely with developers and other engineers to ensure that the systems you build meet the needs of the community and support the growth of the platform.
Roblox offers a dynamic work environment where you can make a significant impact on the future of human interaction through technology. You will be part of a team that values collaboration, innovation, and continuous improvement. We provide competitive compensation and benefits, along with opportunities for professional growth and development. Join us in our mission to connect a billion people with optimism and civility, and help us create safer, more civil shared experiences for everyone.