Senior Site Reliability Engineer

Clarifai • Remote (Canada)

Skills & Technologies

Kubernetes, Python, Go, Microservice architecture, Cloud infrastructure

Overview

Clarifai is seeking a Senior Site Reliability Engineer to ensure the smooth operation and high availability of its AI platform. You'll work with Kubernetes, Python, and Golang to address infrastructure challenges. This role requires expertise in cloud systems and microservice architecture.

Job Description

Who you are

You have 5+ years of experience in site reliability engineering, focusing on maintaining high availability and performance of distributed systems. Your background includes working with Kubernetes and cloud infrastructure, allowing you to tackle complex challenges in a cloud-native environment. You are proficient in programming languages such as Python and Golang, which you use to develop and maintain tools that enhance system reliability. Your understanding of microservice architecture principles enables you to design scalable and efficient systems that meet the demands of modern applications. You are familiar with security best practices for cloud-based systems, ensuring that the infrastructure remains secure and compliant. You thrive in collaborative environments, working closely with development teams to implement best practices in system monitoring and incident response.

Desirable

Experience with relational databases, message queues, and key-value stores is a plus, as it allows you to optimize data flow and storage solutions. Familiarity with RPC frameworks helps you integrate services within the architecture. A keen interest in developing custom Kubernetes operators is also welcome, as it shows initiative in improving operational efficiency and automation within the infrastructure.

What you'll do

In this role, you will be responsible for ensuring the smooth operation of Clarifai's core services, which involves monitoring system performance and identifying bottlenecks. You will implement solutions to enhance system reliability and performance, addressing issues proactively before they impact users. Your work will involve collaborating with engineering teams to design and deploy scalable infrastructure that supports the training and serving of large neural networks. You will also be tasked with maintaining and improving the cloud infrastructure, ensuring that it meets the evolving needs of the organization. As part of your responsibilities, you will develop and maintain tools that facilitate the deployment and management of applications in a Kubernetes environment. You will participate in incident management processes, helping to resolve issues quickly and effectively while documenting lessons learned to prevent future occurrences. Your expertise will be critical in shaping the operational practices of the team, driving improvements in efficiency and reliability.

What we offer

Clarifai offers a dynamic work environment where innovation is at the forefront of our mission. You will be part of a diverse and inclusive team that values collaboration and creativity. We provide opportunities for professional growth and development, encouraging you to expand your skills and knowledge in the field of site reliability engineering. Our commitment to work-life balance means you can thrive both personally and professionally while contributing to cutting-edge AI solutions. Join us in transforming how organizations leverage AI to gain insights from their data, and be part of a company that is shaping the future of technology.

