Senior Software Engineer in Hardware Infrastructure Observability

Nebius AI • Amsterdam, Netherlands

Posted 2w ago • On-Site • Senior • Software engineering • Amsterdam


Skills & Technologies

Python, Linux, Docker, Kubernetes, Prometheus, Grafana

Overview

Nebius AI is seeking a Senior Software Engineer to join its Hardware Infrastructure Observability team. You'll design and develop services for monitoring server fleets and data center systems, working primarily in Python on Linux. This role is based on-site in Amsterdam.

Job Description

Who you are

You have 5+ years of experience in software engineering, particularly in building and maintaining infrastructure observability systems. Your expertise in Python and Linux allows you to develop robust monitoring solutions that ensure the reliability of large-scale server fleets. You are familiar with containerization technologies such as Docker and orchestration tools like Kubernetes, which you have used to streamline deployment processes and enhance system performance. Your experience with monitoring tools like Prometheus and Grafana enables you to create insightful dashboards and alerts that help maintain system health. You thrive in collaborative environments, working closely with cross-functional teams to drive improvements and resolve incidents effectively. You are proactive in investigating issues and implementing root-cause fixes, ensuring that systems remain operational and efficient.

Desirable

Experience with cloud infrastructure and AI/ML systems is a plus, as is familiarity with incident response protocols and debugging techniques. You are comfortable working in a fast-paced environment and are eager to learn new technologies that can enhance your contributions to the team.

What you'll do

As a Senior Software Engineer at Nebius, you will be responsible for designing and developing services and agents that provide deep visibility into a large server fleet and data center engineering systems. You will evolve metrics, aggregation, and alerting pipelines to improve signal quality and ensure that the infrastructure remains healthy. Your role will involve building maintenance workflows and automation processes that facilitate safe and predictable fleet-wide changes. You will also investigate incidents hands-on, including on-host debugging, and drive root-cause fixes to enhance system reliability. Collaboration with other engineers and teams will be key as you work to improve the overall performance and efficiency of the infrastructure.

What we offer

Nebius offers a competitive salary and a comprehensive benefits package, along with opportunities for professional growth within the company. You will enjoy flexible working arrangements and be part of a dynamic and collaborative work environment that values initiative and innovation. As Nebius continues to grow and expand its products, you will have the chance to contribute to exciting projects that shape the future of AI cloud infrastructure.

