Site Reliability Engineer

Crusoe • Dublin - IE

Posted 15h ago • On-Site • Mid-Level • Site Reliability Engineer • Dublin

Skills & Technologies

Linux, Networking, Automation

Overview

Crusoe is hiring a Site Reliability Engineer to ensure the reliability and performance of its cloud infrastructure. You'll work with Linux, networking, and automation to maintain high service levels. This role requires experience in SRE practices and distributed systems.

Job Description

Who you are

You have a strong background in Site Reliability Engineering (SRE) practices, with a focus on maintaining high service levels through effective monitoring and automation. Your experience with distributed systems allows you to understand the complexities involved in ensuring reliability and performance. You are proficient in Linux and have a solid understanding of networking principles, which are crucial for troubleshooting and optimizing infrastructure. Your passion for automation drives you to seek out opportunities to improve processes and reduce manual intervention, ensuring that systems run smoothly and efficiently.

You thrive in a collaborative environment, working closely with engineering teams to advise on building resilient code. Your problem-solving skills enable you to anticipate potential issues and implement proactive measures to prevent them from impacting customers. You are committed to continuous improvement and conduct thorough post-mortems to learn from incidents, sharing insights with your team to enhance overall performance. You understand the importance of a customer-centric approach and strive to ensure that clients have reliable access to the virtual machines they depend on.

Desirable

Experience with cloud infrastructure and familiarity with various cloud service providers would be a plus. Knowledge of monitoring tools and practices, as well as experience with incident management, will further enhance your ability to contribute to the team's success. A background in software development can also be beneficial, as it allows for better collaboration with engineering teams.

What you'll do

In this role, you will be responsible for ensuring the reliability and performance of Crusoe's AI platform. You will work on automation and tool development to streamline routine processes, allowing for more efficient operations. Your expertise in SRE practices will guide you in detecting, analyzing, and preventing issues that could affect service levels. You will collaborate with various engineering teams to advise them on best practices for building resilient code, ensuring that systems are designed with reliability in mind.

You will also conduct thorough post-mortems following incidents, identifying root causes and implementing solutions to prevent recurrence. Your proactive approach will help anticipate issues before they impact customers, maintaining the high standards of service that Crusoe is known for. You will play a key role in driving continuous improvement initiatives, working to enhance the overall performance of the infrastructure.

What we offer

At Crusoe, you will be part of a mission-driven team that is dedicated to accelerating the abundance of energy and intelligence through sustainable technology. We offer a collaborative work environment where innovation is encouraged, and your contributions will have a tangible impact on the future of AI and cloud infrastructure. You will have opportunities for professional growth and development, as well as the chance to work on cutting-edge projects that are shaping the industry. Join us in our commitment to responsible and transformative technology solutions.
