
Empowering the world through technology and information
Google LLC, headquartered in Mountain View, California, is a global leader in internet-related services and products, including its flagship search engine, Google Search, and the Android operating system. With over 100,000 employees, Google also offers cloud computing services through Google Cloud P...
Google offers competitive salaries, equity options, generous PTO policies, comprehensive health benefits, and a remote work policy that allows flexibi...
Google is known for its engineering-first culture, emphasizing innovation and collaboration. The company fosters a unique environment that encourages ...

Google • Sunnyvale, CA, USA; Kirkland, WA, USA; Seattle, WA, USA
Google is seeking a Senior Staff Software Engineer in Site Reliability Engineering to lead projects and optimize large-scale systems. You'll work with technologies like Kubernetes and Python, focusing on distributed systems and machine learning. This role requires 8+ years of experience in software development.
You have a Bachelor’s degree in Computer Science or a related field, along with at least 8 years of experience in software development across various programming languages. Your background includes at least 4 years of leading projects and providing technical leadership, demonstrating your ability to guide teams through complex challenges. You possess a deep understanding of distributed systems, with 3 years of experience designing, analyzing, and troubleshooting them. A Master’s degree or PhD in Computer Science or a related technical field is preferred.
Your experience includes infrastructure optimization, performance analysis, and cost reduction in large-scale environments, all of which are central to the role. You are familiar with Google storage systems such as Colossus, Bigtable, and Spanner, and understand resource management systems like Kubernetes and Flex. Your knowledge extends to cluster management and scheduling algorithms, which are essential for maintaining system reliability. Additionally, you are familiar with machine learning hardware accelerators, including TPUs and GPUs, and their lifecycle management.
You excel in communication and collaboration, able to build consensus across organizational boundaries. Your role as a Site Reliability Engineer combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. You understand the importance of ensuring that Google Cloud's services maintain reliability and uptime, meeting customer needs while continuously improving performance.
Experience with advanced machine learning concepts and their application in real-world scenarios would be a plus. Familiarity with CI/CD pipelines and incident management practices can further enhance your fit for this role. You are also encouraged to bring any additional skills that could contribute to optimizing existing systems and building infrastructure.
In this role, you will collaborate closely with engineering partners to design and deliver joint engineered solutions that meet customer needs. You will identify, scope, and solve broad and ambiguous challenges that impact the efficiency, reliability, and cost-effectiveness of the entire ML fleet. Your ability to turn these challenges into strategic opportunities and actionable plans will be key to your success. You will work on optimizing existing systems, ensuring they are robust and scalable to handle increasing demands.
You will be responsible for maintaining system capacity and performance, keeping a watchful eye on metrics that matter. Your contributions will directly impact the reliability of Google Cloud services, ensuring they meet the high standards expected by customers. You will also engage in incident response, troubleshooting issues as they arise, and implementing solutions to prevent future occurrences.
Google offers a dynamic work environment where innovation is encouraged. You will have the opportunity to work with cutting-edge technologies and collaborate with some of the brightest minds in the industry. The company values your contributions and provides a platform for professional growth and development. Competitive compensation and benefits are part of the package, reflecting the importance of your role in maintaining the reliability of Google Cloud services. We encourage you to apply even if your experience doesn't match every requirement, as we value diverse perspectives and backgrounds.