
Empowering corporate mentorship for effective learning
Together is a corporate mentorship management platform founded in 2018, headquartered in CityPlace, Toronto, ON. The platform streamlines the mentorship lifecycle, facilitating connections among employees at companies like Heineken, Reddit, and 7-Eleven. With $1.7 million in seed funding, Together a...
Together offers competitive salaries and equity packages, 4 weeks of paid vacation, and a comprehensive health, dental, and vision plan through Honeyb...
Together fosters a culture of autonomy and impact, allowing employees to take on significant responsibilities without bureaucratic constraints. The fo...

Together AI • San Francisco
Together AI is hiring a Site Reliability Engineer to ensure the reliability and performance of user-facing services and production systems. You'll work with Ansible, Terraform, and Kubernetes to build and manage infrastructure. This role requires 2+ years of experience in SRE or a related field.
You have 2+ years of professional experience as a Site Reliability Engineer or in a related field, demonstrating a strong understanding of operational discipline and engineering principles. Your educational background includes a Bachelor's degree in Computer Science or a related field, or equivalent work experience. You possess knowledge of Ansible, including roles and playbooks, as well as Terraform and Kubernetes, which are essential for building and managing infrastructure. Your proficiency in programming and scripting languages allows you to automate processes effectively. You have direct experience with monitoring and observability practices that keep systems reliable and performant. Your familiarity with cloud services enhances your ability to manage scalable infrastructure. You thrive in collaborative environments, working well with cross-functional teams to achieve common goals.
Experience with additional monitoring tools and practices would be a plus, as would familiarity with incident management systems like PagerDuty. A strong interest in algorithms and distributed systems will help you identify improvements in product architecture from reliability, performance, and availability perspectives.
As a Site Reliability Engineer at Together AI, you will be responsible for keeping all user-facing services and production systems running smoothly. You will participate in an on-call rotation and respond promptly to production incidents. You will build and run infrastructure using tools like Ansible, Terraform, and Kubernetes, scaling services to accommodate a massive number of concurrent users. You will also build monitoring systems to ensure the highest quality service for customers, and you will design and implement operational processes such as deployments and upgrades. Debugging production issues across all services and levels of the stack will be a key part of your responsibilities, as will identifying improvements to the product architecture from a reliability, performance, and availability perspective. You will also plan the growth of Together AI’s infrastructure.
Together AI offers a collaborative work environment where you can thrive as a Site Reliability Engineer. You will have the opportunity to work with cutting-edge technologies and contribute to the reliability of critical systems. The company values your input and encourages you to apply even if your experience doesn't match every requirement. It provides competitive compensation and benefits and fosters a culture of growth and development within the team. Join Together AI in making a significant impact on the reliability and performance of its services.