

Together AI • San Francisco, Singapore, Amsterdam
Together AI is seeking an LLM Inference Frameworks and Optimization Engineer to design and optimize distributed inference engines for large language models. You'll work with technologies like CUDA, TensorRT, and PyTorch to enhance performance and scalability. This role requires expertise in distributed systems and machine learning.
You have a strong background in AI engineering with a focus on inference frameworks and optimization. You have designed and built distributed systems that serve high-performance AI applications, and you are proficient in CUDA, with hands-on experience using TensorRT and PyTorch to optimize model performance. You understand the intricacies of GPU and accelerator optimization and the algorithms that make inference efficient. You thrive in collaborative environments, working closely with hardware and software teams to ensure seamless integration and performance, and you are eager to push the boundaries of AI inference on innovative projects.
Experience with multimodal models and techniques such as Mixture of Experts (MoE) parallelism is a plus. Familiarity with software-hardware co-design principles will set you apart. You have a keen interest in the latest advancements in AI and are always looking to learn and apply new technologies.
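As a rough illustration of what MoE routing involves, here is a minimal PyTorch sketch of top-2 expert routing, the mechanism that MoE parallelism distributes across devices. The expert count, dimensions, and gating scheme are assumptions made for the example, not a description of any particular production system.

```python
import torch
import torch.nn.functional as F

# Illustrative top-2 Mixture-of-Experts routing. All sizes are hypothetical.
num_experts, d_model, tokens = 4, 64, 10
experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]
gate = torch.nn.Linear(d_model, num_experts)

with torch.no_grad():
    x = torch.randn(tokens, d_model)
    weights, idx = gate(x).topk(2, dim=-1)   # route each token to 2 experts
    weights = F.softmax(weights, dim=-1)     # normalize the gate scores

    # Each token's output is the gate-weighted sum of its experts' outputs.
    # In an MoE-parallel engine, experts live on different GPUs and tokens
    # are exchanged with all-to-all collectives instead of this loop.
    out = torch.zeros_like(x)
    for t in range(tokens):
        for slot in range(2):
            e = idx[t, slot].item()
            out[t] += weights[t, slot] * experts[e](x[t])
```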
In this role, you will design and develop fault-tolerant, high-concurrency distributed inference engines for text, image, and multimodal generation models. You will implement and optimize distributed inference strategies, including tensor parallelism and pipeline parallelism, for high-performance serving, and apply CUDA graph capture and TensorRT/TRT-LLM graph optimizations to improve the efficiency and scalability of large language models. You will collaborate with hardware teams so that software and hardware components work seamlessly together, and you will engage in research and development into new algorithms and techniques that further improve inference performance.
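To make the CUDA graph point concrete, the following is a minimal PyTorch sketch of capturing and replaying an inference forward pass; the model, shapes, and warmup counts are placeholders, not a production recipe.

```python
import torch

# A minimal sketch of CUDA graph capture for inference. The model and
# batch shape below are hypothetical.
model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.randn(8, 4096, device="cuda")

with torch.no_grad():
    # Warm up on a side stream so capture sees steady-state kernels.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass; replays skip per-kernel launch overhead,
    # which dominates at the small batch sizes common in LLM serving.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)

def run(new_input: torch.Tensor) -> torch.Tensor:
    static_input.copy_(new_input)  # graphs replay over fixed buffers
    g.replay()
    return static_output
```

Replaying a captured graph amortizes kernel-launch overhead across the whole forward pass, which is one reason the technique shows up in LLM serving stacks.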
Together AI provides a dynamic work environment where innovation is encouraged. You will have the opportunity to work on cutting-edge AI technologies and contribute to projects that have a significant impact on the industry. We offer competitive compensation and benefits, along with opportunities for professional growth and development. Join us in shaping the future of AI inference infrastructure and be part of a team that values creativity and collaboration.