Machine Learning, Platform Engineer

Together AI • San Francisco

Posted 4d ago • On-Site • Senior • Machine Learning Engineer • Platform Engineer • San Francisco

Skills & Technologies

CUDA, PyTorch, Kubernetes, APIs, Docker

Overview

Together AI is hiring a Senior Machine Learning Platform Engineer to build and optimize a container platform for custom models and inference. You'll work with technologies like CUDA, PyTorch, and Kubernetes in San Francisco.

Job Description

Who you are

You have more than 5 years of experience building large-scale, fault-tolerant distributed systems, and you have tackled performance optimization and robustness challenges in complex environments. You have worked with serverless inference platforms and are familiar with the intricacies of model bring-up and cloud operations.

You have a strong understanding of container orchestration, particularly Kubernetes. You know how to manage multi-cluster scheduling and can identify and resolve machine learning bottlenecks. Your background in profiling and optimization lets you improve both system performance and developer experience.

You write clear, maintainable software and infrastructure as code (IaC), and you understand how documentation and testing strategies contribute to robustness and fault tolerance. You thrive in collaborative environments, partnering with product teams to translate functional requirements into technical solutions.

Desirable

Experience with video or audio generation technologies is a plus, as is a keen interest in applying the latest advances in machine learning to practical problems. Familiarity with queueing theory and inference engines is also valuable.

What you'll do

In this role, you will focus on enabling custom models and dedicated inference on Together's platform. You will build a container platform that optimizes autoscaling and minimizes cold starts, and you will analyze and improve end-to-end model performance to deliver a best-in-class developer experience with great tooling.

You will work on multi-cluster orchestration and predictive autoscaling, contributing to the development of control planes and model optimization strategies. You will also write APIs for managing deployments and develop inference worker SDKs and CLI tools.

You will conduct design and code reviews, create developer documentation, and develop testing strategies that improve the robustness and scalability of existing distributed systems, APIs, databases, and infrastructure. You will collaborate closely with product teams to understand their needs and deliver solutions that meet business objectives.

What we offer

Together AI provides a dynamic work environment where innovation thrives. You will be part of a team dedicated to pushing the boundaries of machine learning technology. We encourage you to apply even if your experience doesn't match every requirement, as we value diverse perspectives and backgrounds.

You will have opportunities for professional growth and development. We believe in fostering talent and providing the resources you need to succeed in your career. Join us in shaping the future of AI and making a significant impact in the industry.
