Senior Site Reliability Engineer

Clarifai • Remote (USA)

Posted 14h ago • Remote • Senior Site Reliability Engineer • United States


Skills & Technologies

Kubernetes, Python, Go, Microservice architecture, Cloud infrastructure, Relational databases, Message queues

Overview

Clarifai is seeking a Senior Site Reliability Engineer to ensure the smooth operation and high availability of its AI platform. You'll work with Kubernetes, Python, and Golang to tackle infrastructure challenges. This role requires expertise in cloud infrastructure and microservice architecture.

Job Description

Who you are

You have 5+ years of experience in site reliability engineering, focusing on ensuring the availability and performance of distributed systems. Your background includes working with Kubernetes and cloud infrastructure, allowing you to effectively manage and orchestrate complex environments. You are proficient in programming languages such as Python and Golang, enabling you to develop tools and scripts that enhance system reliability. Your understanding of microservice architecture principles helps you design resilient systems that can scale efficiently. You are familiar with security best practices for cloud-based systems, ensuring that the infrastructure remains secure and compliant. Additionally, you have experience with relational databases and message queues, which are critical for maintaining data integrity and communication between services.

Desirable

Knowledge of developing and building custom Kubernetes operators is a plus, as it allows for greater automation and efficiency in managing Kubernetes clusters. Familiarity with various RPC frameworks can enhance your ability to implement efficient communication between microservices. You are always eager to learn and adapt to new technologies, contributing to a culture of continuous improvement within the team.

What you'll do

In this role, you will be responsible for ensuring the smooth operation and high availability of Clarifai's core services. You will monitor system performance, identify bottlenecks, and implement solutions to enhance system reliability. Collaborating with engineering teams, you will address infrastructure challenges related to serving and training large neural networks in both cloud and on-premise environments. Your expertise will guide the development of best practices for incident management and response, ensuring that the team can quickly address any issues that arise. You will also play a key role in capacity planning, helping to forecast resource needs and optimize costs associated with cloud infrastructure.

As part of your responsibilities, you will develop and maintain CI/CD pipelines to streamline deployment processes and improve the overall efficiency of the development lifecycle. You will work closely with cross-functional teams to ensure that infrastructure changes align with product goals and user needs. Your contributions will directly impact the performance and reliability of Clarifai's AI platform, enabling organizations to leverage AI technology effectively.

What we offer

Clarifai offers a collaborative and inclusive work environment where you can thrive as a Senior Site Reliability Engineer. You will have the opportunity to work on cutting-edge AI technology and contribute to projects that have a meaningful impact on various industries. We provide competitive compensation and benefits, along with opportunities for professional growth and development. Join us in our mission to empower organizations with AI-driven insights and solutions.

