Site Reliability Engineer

Together AI • San Francisco

Posted 4d ago • Mid-Level • Site Reliability Engineer • 📍 San Francisco

Skills & Technologies

Ansible, Terraform, Kubernetes, Linux

Overview

Together AI is hiring a Site Reliability Engineer to ensure the reliability and performance of user-facing services and production systems. You'll work with Ansible, Terraform, and Kubernetes to build and manage infrastructure. This role requires 2+ years of experience in SRE or a related field.

Job Description

Who you are

You have 2+ years of professional experience as a Site Reliability Engineer or in a related field, demonstrating a strong understanding of operational discipline and engineering principles. Your educational background includes a Bachelor's degree in Computer Science or a related field, or equivalent work experience. You possess knowledge of Ansible, including roles and playbooks, as well as Terraform and Kubernetes, which are essential for building and managing infrastructure. Your proficiency in programming and scripting languages allows you to automate processes effectively. You have direct experience in monitoring and observability practices, ensuring that systems are reliable and performant. Your familiarity with cloud services enhances your ability to manage scalable infrastructures. You thrive in collaborative environments, working well with cross-functional teams to achieve common goals.

Desirable

Experience with additional monitoring tools and practices would be a plus, as would familiarity with incident management systems like PagerDuty. A strong interest in algorithms and distributed systems will help you identify improvements in product architecture from reliability, performance, and availability perspectives.

What you'll do

As a Site Reliability Engineer at Together AI, you will be responsible for keeping all user-facing services and production systems running smoothly. You will participate in an on-call rotation to respond to production incidents, ensuring that any issues are addressed promptly. You will build and run infrastructure using tools like Ansible, Terraform, and Kubernetes, scaling services to accommodate a massive number of concurrent users. You will also build monitoring systems to ensure the highest quality service for customers, and design and implement operational processes such as deployments and upgrades. Debugging production issues across all services and levels of the stack will be a key part of your responsibilities, as will identifying improvements to the product architecture from a reliability, performance, and availability perspective. You will also plan the growth of Together AI’s infrastructure.

What we offer

Together AI offers a collaborative work environment where you can thrive as a Site Reliability Engineer. You will have the opportunity to work with cutting-edge technologies and contribute to the reliability of critical systems. The company values your input and encourages you to apply even if your experience doesn't match every requirement. We provide competitive compensation and benefits, fostering a culture of growth and development within the team. Join us in making a significant impact on the reliability and performance of our services.
