SRE, Chaos Engineering, Search Resilience

Amazon • Tokyo, JPN

Posted 1d ago • 🏛️ On-Site • Mid-Level • Site Reliability Engineer • 📍 Tokyo

Skills & Technologies

Python, Java, C++, Linux, Networking, Systems engineering

Overview

Amazon is hiring a Site Reliability Engineer for its Chaos Engineering team in Tokyo. You'll design and automate chaos experiments that strengthen the resilience of Amazon Search against outages. The role requires programming and systems engineering experience.

Job Description

Who you are

You have a strong background in systems engineering and a passion for improving service reliability. With programming experience in a modern language such as Python, Java, or C++, you understand the intricacies of distributed systems and can troubleshoot issues effectively. Your familiarity with Linux and networking concepts allows you to navigate complex environments and contribute to the resilience of critical services. You thrive in collaborative settings, working closely with service owners to identify vulnerabilities and implement solutions that minimize risks. Your analytical mindset drives you to research and adopt best practices in resilience engineering so that systems remain robust under various conditions. You are committed to continuous learning and improvement, always seeking ways to enhance operational efficiency and service quality.

What you'll do

In this role, you will be responsible for designing, implementing, and executing chaos experiments that test the resilience of Amazon Search against various failure scenarios. You will collaborate with service owners to identify vulnerabilities and develop strategies to mitigate risks, ensuring that customers can reliably find products when they search. Your work will involve developing and maintaining chaos experiment orchestrators and distributed load generators, as well as managing a petabyte-scale log archival and query service. You will participate in a 12/12 on-call rotation for incident response, where your expertise will be crucial in mitigating incidents and restoring service. By leveraging your skills in programming and systems engineering, you will contribute to the overall reliability and performance of Amazon's search infrastructure, making a significant impact on the customer experience.

What we offer

At Amazon, you will be part of a diverse and inclusive team that values collaboration and innovation. We provide a supportive environment where you can grow your skills and advance your career. You will have access to the resources of one of the world's leading internet companies, allowing you to work on cutting-edge technologies and projects. We encourage you to apply even if your experience doesn't match every requirement, as we believe in the potential of diverse backgrounds and perspectives. Join us in our mission to enhance the resilience of Amazon Search and deliver exceptional experiences for our customers.

