
The everything store and cloud computing leader
Amazon, headquartered in South Lake Union, Seattle, WA, is the world's largest online retailer and a leader in cloud computing through Amazon Web Services (AWS). With over 1.5 million employees globally, Amazon operates in various sectors, including AI with its Alexa devices and a vast marketplace k...
Amazon offers competitive salaries, stock options, generous PTO policies, and comprehensive health benefits. Employees also have access to a learning ...
Amazon's culture is driven by customer obsession and a focus on innovation. The company encourages employees to think big and move fast, fostering an ...

Amazon • Tokyo, JPN
Amazon is hiring a Site Reliability Engineer for its Chaos Engineering team in Tokyo. You'll design and automate chaos experiments to make Amazon Search more resilient to outages. This role requires experience in programming and systems engineering.
You have a strong background in systems engineering and a passion for improving service reliability. With experience in programming using modern languages such as Python, Java, or C++, you understand the intricacies of distributed systems and can effectively troubleshoot issues. Your familiarity with Linux and networking concepts allows you to navigate complex environments and contribute to the resilience of critical services. You thrive in collaborative settings, working closely with service owners to identify vulnerabilities and implement solutions that minimize risks. Your analytical mindset drives you to research and adopt best practices in resilience engineering, ensuring that systems remain robust under various conditions. You are committed to continuous learning and improvement, always seeking ways to enhance operational efficiency and service quality.
In this role, you will be responsible for designing, implementing, and executing chaos experiments that test the resilience of Amazon Search against various failure scenarios. You will collaborate with service owners to identify vulnerabilities and develop strategies to mitigate risks, ensuring that customers can reliably find products when they search. Your work will involve developing and maintaining chaos experiment orchestrators and distributed load generators, as well as managing a petabyte-scale log archival and query service. You will participate in a 12/12 on-call rotation for incident response, where your expertise will be crucial in mitigating incidents and restoring service. By leveraging your skills in programming and systems engineering, you will contribute to the overall reliability and performance of Amazon's search infrastructure, making a significant impact on the customer experience.
At Amazon, you will be part of a diverse and inclusive team that values collaboration and innovation. We provide a supportive environment where you can grow your skills and advance your career. You will have access to the resources of one of the world's leading internet companies, allowing you to work on cutting-edge technologies and projects. We encourage you to apply even if your experience doesn't match every requirement, as we believe in the potential of diverse backgrounds and perspectives. Join us in our mission to enhance the resilience of Amazon Search and deliver exceptional experiences for our customers.