
The cloud monitoring platform engineers love
Datadog (NYSE: DDOG) is a leading cloud observability platform that provides monitoring and analytics for applications, infrastructure, and logs. Trusted by over 26,000 customers including major companies like Netflix, Samsung, and Airbnb, Datadog is headquartered in New York City. The company went ...
Datadog offers competitive salaries, equity options, generous PTO policies, and a flexible remote work policy. Employees also benefit from a learning ...
Datadog fosters an engineering-first culture, with engineers making up 70% of its workforce. The company emphasizes a strong focus on solving complex...

Datadog • Boston, Massachusetts, USA; New York, New York, USA
Datadog is hiring a Staff Software Engineer for its ML Observability team to develop tools for monitoring and improving AI systems. You'll work with technologies like Python, Java, and TensorFlow to enhance observability for LLMs. This role requires expertise in machine learning and software engineering.
You have a strong background in software engineering with a focus on machine learning, including experience building and deploying AI systems in production environments. You possess deep knowledge of large language models and generative AI, enabling you to tackle complex challenges in AI observability. Your proficiency in programming languages such as Python and Java allows you to develop robust solutions that enhance AI system performance. You are familiar with containerization and orchestration tools like Docker and Kubernetes, which are essential for deploying scalable applications. You thrive in collaborative environments, working cross-functionally with product, UX, and applied science teams to drive innovation and product-market fit. You are passionate about creating tools that make AI systems understandable and reliable, and you are eager to lead the development of new features that benefit customers.
In this role, you will drive the design and implementation of observability features for large language models. Your work will involve ideating, prototyping, and scaling new product features that provide insights into generative AI systems. You will collaborate closely with other engineering teams to iterate quickly and ensure that the tools you develop meet customer needs. Your responsibilities will include developing and extending tools for tracing, evaluating, and debugging AI models so that they perform well in production. You will also shape the product direction by applying your deep understanding of AI systems and software engineering to solve open-ended problems in the fast-moving AI landscape. Your contributions will directly affect how customers monitor, troubleshoot, and optimize their LLM-based applications, enabling them to ship AI with confidence.
At Datadog, we value our office culture and the relationships built within our teams. We operate as a hybrid workplace so that our employees can create the work-life harmony that best fits them. You will have the opportunity to work on cutting-edge technology that is shaping the future of AI observability. We offer competitive compensation and benefits, along with a supportive environment that encourages professional growth and development. Join us in building foundational tools that make AI systems observable, understandable, and reliable in the real world.