
Join the conversations that matter to you
Reddit, headquartered in San Francisco, California, is a prominent social media platform that connects over 430 million monthly active users through diverse communities known as subreddits. Founded in 2005, Reddit has raised over $1 billion in funding from investors like Andreessen Horowitz and Sequ...
Reddit offers competitive salaries, equity options, generous PTO policies, and a flexible remote work environment, allowing employees to balance work ...
Reddit fosters a culture of open communication and user-centric innovation, encouraging employees to contribute ideas and engage with the community. T...

Reddit • Remote - United States
Reddit is seeking a Senior Staff ML Infra Engineer to enhance its ML-powered ad ranking systems. You'll work on GPU-first training and serving, focusing on improving efficiency and reliability. This role requires expertise in machine learning and infrastructure.
You have 5+ years of experience in machine learning infrastructure, with a strong background in architecting and optimizing ML training and serving systems. Your expertise in GPU utilization and performance optimization allows you to drive significant improvements in model training and serving latency. You are skilled in writing high-quality design documentation and conducting design reviews, ensuring that standards for correctness and reliability are met across projects. You thrive in cross-organizational collaborations, effectively partnering with multiple teams to enhance shared infrastructure.
You possess a deep understanding of machine learning frameworks and tools, particularly in Python and TensorFlow. Your experience with MLflow and similar platforms enables you to manage the lifecycle of machine learning models efficiently. You are passionate about improving the efficiency of ML systems, focusing on aspects like data loading and feature performance. You are a proactive problem solver, always looking for ways to enhance the reliability and automation of ML systems.
Experience with large-scale ML systems, backed by a portfolio of initiatives that demonstrates your ability to influence and drive adoption of new technologies, is a plus. So is familiarity with cloud platforms and their ML offerings, or a background in ad tech or related fields.
In this role, you will architect and significantly influence the ML training and serving systems at Reddit, focusing on GPU-first approaches that unlock faster iteration and larger models. You will drive the adoption and reliability of ML systems, owning a portfolio of initiatives that span multiple teams. Your responsibilities will include improving GPU utilization, training runtime, data loading, feature performance, and serving latency. You will write high-quality design documents and run design reviews, setting standards for correctness, reliability, and velocity.
You will partner with cross-organizational teams to sequence shared infrastructure projects, ensuring that all stakeholders are aligned and informed. Your role will involve mentoring junior engineers and contributing to the overall growth of the ML Infra team. You will also be responsible for identifying and implementing best practices in ML system design and deployment, ensuring that Reddit remains at the forefront of ad technology.
Reddit offers a collaborative and inclusive work environment where you can make a significant impact on the user experience. You will have the opportunity to work on cutting-edge ML technologies and contribute to a product that serves millions of users daily. We provide competitive compensation and benefits, along with opportunities for professional growth and development. Join us in our mission to enhance community engagement through innovative technology.