
The all-in-one cryptocurrency trading platform
OKX is a leading cryptocurrency exchange headquartered in Mahe, Seychelles, serving over 50 million active users across 180 countries. The platform offers a comprehensive suite of services, including crypto trading, NFT marketplaces, and decentralized finance (DeFi) products. With a commitment to se...
OKX provides a comprehensive insurance package covering medical, dental, vision, disability, and life insurance. Employees enjoy paid parental leave, ...
OKX fosters a culture focused on accessibility in cryptocurrency trading, aiming to demystify the crypto space for users of all levels. The company va...

OKX • Hong Kong, Hong Kong SAR
OKX is seeking a Senior Site Reliability Engineer to design and lead the stability architecture for large-scale distributed systems. You'll work with technologies such as AWS, Docker, and Kubernetes to enhance service stability. This role requires significant experience in infrastructure management and system reliability.
You have over 5 years of experience in site reliability engineering, focusing on large-scale distributed systems. Your expertise in designing and implementing stability architectures has allowed you to proactively manage risks and enhance user experiences. You are proficient in AWS and have a strong command of containerization technologies like Docker and orchestration tools such as Kubernetes. Your background in Linux systems administration equips you with the skills to optimize performance and reliability across various environments. You are also skilled in scripting and automation, particularly with Python, which you use to streamline processes and improve system efficiency. You understand the importance of infrastructure as code and have hands-on experience with Terraform to manage and provision infrastructure effectively.
Experience with big data platforms and data warehouses is a plus, as is familiarity with monitoring and alerting tools. You are comfortable working in a fast-paced environment and can adapt to changing priorities while maintaining a focus on delivering high-quality results. You thrive in collaborative settings and enjoy mentoring junior engineers, sharing your knowledge to foster a culture of continuous improvement.
In this role, you will lead the design and implementation of stability architecture for OKX's large-scale distributed systems. You will develop and optimize infrastructure to ensure high availability and performance, focusing on proactive risk management. Your responsibilities will include creating and maintaining CI/CD pipelines to facilitate smooth deployments and updates. You will also be involved in incident management, working to quickly resolve issues and implement solutions to prevent future occurrences. Collaborating with cross-functional teams, you will ensure that service stability is a core focus in product development and deployment processes. You will analyze system performance metrics and make data-driven decisions to enhance reliability and efficiency. Additionally, you will contribute to the development of best practices and standards for site reliability engineering within the organization.
At OKX, you will be part of a dynamic team that values innovation and collaboration. We offer competitive compensation and benefits, along with opportunities for professional growth and development. Our culture emphasizes shared values and a commitment to doing the right thing, fostering an environment where every team member can thrive. You will have the chance to work on cutting-edge technologies in the crypto space, contributing to projects that have a significant impact on the industry. Join us in reshaping the future of finance through crypto and decentralized applications.