

Together AI • San Francisco, Singapore, Amsterdam
Together AI is seeking an LLM Inference Frameworks and Optimization Engineer to design and optimize distributed inference engines for large language models. You'll work with technologies like CUDA, TensorRT, and PyTorch to enhance performance and scalability. This role requires expertise in distributed systems and machine learning.
You have a strong background in AI engineering with a focus on inference frameworks and optimization. You have designed and built distributed systems that serve high-performance AI applications, and you are proficient in CUDA, with hands-on experience using TensorRT and PyTorch to optimize model performance. You understand the intricacies of GPU and accelerator optimization and the algorithms that make inference efficient. You thrive in collaborative environments, working closely with hardware and software teams to ensure seamless integration and performance, and you are eager to push the boundaries of AI inference on innovative projects.
Experience with multimodal models and techniques such as Mixture of Experts (MoE) parallelism is a plus. Familiarity with software-hardware co-design principles will set you apart. You have a keen interest in the latest advancements in AI and are always looking to learn and apply new technologies.
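As a rough illustration of what MoE routing involves, here is a minimal PyTorch sketch of top-2 expert routing, the mechanism that MoE parallelism distributes across devices. The expert count, dimensions, and gating scheme are assumptions made for the example, not a description of any particular production system.

```python
import torch
import torch.nn.functional as F

# Illustrative top-2 Mixture-of-Experts routing. All sizes are hypothetical.
num_experts, d_model, tokens = 4, 64, 10
experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]
gate = torch.nn.Linear(d_model, num_experts)

with torch.no_grad():
    x = torch.randn(tokens, d_model)
    weights, idx = gate(x).topk(2, dim=-1)   # route each token to 2 experts
    weights = F.softmax(weights, dim=-1)     # normalize the gate scores

    # Each token's output is the gate-weighted sum of its experts' outputs.
    # In an MoE-parallel engine, experts live on different GPUs and tokens
    # are exchanged with all-to-all collectives instead of this loop.
    out = torch.zeros_like(x)
    for t in range(tokens):
        for slot in range(2):
            e = idx[t, slot].item()
            out[t] += weights[t, slot] * experts[e](x[t])
```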
In this role, you will design and develop fault-tolerant, high-concurrency distributed inference engines for text, image, and multimodal generation models. You will implement and optimize distributed inference strategies, including tensor parallelism and pipeline parallelism, for high-performance serving, and apply CUDA graph capture and TensorRT/TRT-LLM graph optimizations to improve the efficiency and scalability of large language models. You will collaborate with hardware teams so that software and hardware components work seamlessly together, and you will engage in research and development into new algorithms and techniques that further improve inference performance.
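To make the CUDA graph point concrete, the following is a minimal PyTorch sketch of capturing and replaying an inference forward pass; the model, shapes, and warmup counts are placeholders, not a production recipe.

```python
import torch

# A minimal sketch of CUDA graph capture for inference. The model and
# batch shape below are hypothetical.
model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.randn(8, 4096, device="cuda")

with torch.no_grad():
    # Warm up on a side stream so capture sees steady-state kernels.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass; replays skip per-kernel launch overhead,
    # which dominates at the small batch sizes common in LLM serving.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)

def run(new_input: torch.Tensor) -> torch.Tensor:
    static_input.copy_(new_input)  # graphs replay over fixed buffers
    g.replay()
    return static_output
```

Replaying a captured graph amortizes kernel-launch overhead across the whole forward pass, which is one reason the technique shows up in LLM serving stacks.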
Together AI provides a dynamic work environment where innovation is encouraged. You will have the opportunity to work on cutting-edge AI technologies and contribute to projects that have a significant impact on the industry. We offer competitive compensation and benefits, along with opportunities for professional growth and development. Join us in shaping the future of AI inference infrastructure and be part of a team that values creativity and collaboration.