

Together AI • San Francisco
Together AI is hiring a Senior Research Engineer focused on LLM Evaluation and Behavioral Analysis. You'll develop evaluation frameworks and pipelines to ensure model reliability and performance. The role requires deep machine learning expertise and strong programming skills.
You have a strong background in machine learning and AI, with at least 5 years of experience in research or engineering roles focused on model evaluation and behavioral analysis. Your expertise in Python and familiarity with frameworks like TensorFlow and PyTorch enable you to build robust evaluation systems. You understand the intricacies of model behavior, including reasoning, tool use, and multi-step interactions, and are adept at identifying subtle failure modes in AI systems.
You possess a solid understanding of evaluation metrics and methodologies, allowing you to design high-quality behavioral test suites that accurately measure model performance. Your experience with SQL and data manipulation equips you to shape datasets effectively and influence model improvements based on empirical evidence. You thrive in collaborative environments, working closely with cross-functional teams to ensure that models behave intelligently and consistently in production.
Experience with CI/CD pipelines and automated testing frameworks is a plus, as is familiarity with A/B testing methodologies. You are comfortable working in fast-paced settings and can adapt to evolving project requirements while maintaining a focus on quality and reliability.
In this role, you will build and iterate on evaluation frameworks that measure model performance across dimensions such as instruction following, function calling, and long-context reasoning. You will develop specialized evaluation suites that assess argument correctness, schema adherence, and tool selection, ensuring that models handle complex tasks effectively. Your work will involve creating automated CI/CD pipelines for A/B comparisons, regression detection, and behavioral drift monitoring, all of which are crucial for maintaining high standards of model quality.
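To give a flavor of the kind of behavioral checks this work involves, here is a minimal sketch in Python of a tool-use evaluation covering tool selection, schema adherence, and argument correctness. Everything in it is hypothetical: the get_weather tool, the TOOL_SCHEMAS table, and the check_tool_call helper are illustrative stand-ins, not Together AI's actual framework.

```python
"""Minimal sketch of a behavioral eval for tool use, assuming a model emits
tool calls as JSON of the form {"name": ..., "arguments": {...}}. All names
here (get_weather, TOOL_SCHEMAS, check_tool_call) are hypothetical."""
import json

# Hypothetical tool registry: expected argument names and Python types.
TOOL_SCHEMAS = {
    "get_weather": {"city": str, "unit": str},
}

def check_tool_call(raw: str, expected_tool: str) -> list[str]:
    """Return a list of failure descriptions; an empty list means pass."""
    failures = []
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    # Tool selection: did the model pick the right tool?
    if call.get("name") != expected_tool:
        failures.append(
            f"tool selection: expected {expected_tool!r}, got {call.get('name')!r}"
        )
    # Schema adherence and argument correctness: required keys, right types.
    args = call.get("arguments", {})
    for key, typ in TOOL_SCHEMAS.get(expected_tool, {}).items():
        if key not in args:
            failures.append(f"schema adherence: missing argument {key!r}")
        elif not isinstance(args[key], typ):
            failures.append(f"argument correctness: {key!r} should be {typ.__name__}")
    return failures

# Toy test cases: one well-formed call, one with a wrong type and a missing key.
CASES = [
    ('{"name": "get_weather", "arguments": {"city": "Toronto", "unit": "celsius"}}',
     "get_weather"),
    ('{"name": "get_weather", "arguments": {"city": 7}}', "get_weather"),
]
for raw, tool in CASES:
    print(check_tool_call(raw, tool) or "pass")
```

In a CI/CD setting, pass rates from suites like this could be compared across two model checkpoints to flag regressions or behavioral drift before release; this is a simplified illustration of the idea, not a description of Together AI's pipeline.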
You will collaborate with training, post-training, inference, and product teams to identify regressions and shape datasets that drive model improvements. Your insights will directly influence how Together AI measures model quality and reliability across releases, making your contributions vital to the success of the organization.
Together AI offers a dynamic work environment where innovation and collaboration are at the forefront. You will have the opportunity to work on cutting-edge open-source LLMs and inference stacks, contributing to projects with significant impact on the AI landscape. We provide competitive compensation and benefits, along with opportunities for professional growth and development. Join us in shaping the future of AI and making a difference in the world of technology.
Apply now at Together AI.