Staff Software Engineer - Infrastructure Monitoring

Datadog

Job Description

Posted on: February 11, 2025

Summary and company overview

Datadog is seeking an experienced Staff Engineer with deep GPU experience (development and operations) to join our Infrastructure Monitoring team and help build out GPU-specific observability capabilities in our Infrastructure Monitoring products. This role will directly shape Datadog's approach to building observability tooling for customers who use GPUs in their infrastructure. Example problems this person will solve include "How can we detect runtime issues across a fleet of GPUs, isolate the root cause, and provide actionable recommendations to resolve the issue?" and "How can we profile and optimize software running on GPUs?" The role involves significant cross-team collaboration with a number of Datadog product and platform teams and requires the ability to go deep across many different product stacks.

Responsibilities

  • Develop a company-wide approach to GPU observability across the three pillars: metrics, logs, and traces
  • Collaborate with cross-functional teams to design and develop GPU-centric product offerings
  • Drive high-priority, high-visibility products that expand Datadog's penetration into the GPU market
  • Lead architectural decisions for new and existing GPU-based observability products
  • Identify opportunities for Datadog product enhancements to provide coverage for GPUs
  • Contribute to short- and long-term planning and roadmap development

Job Requirements

Required Qualifications

  • You have several years of experience leading cross-team initiatives in a platform or infrastructure-focused environment
  • You have a deep understanding of GPUs and have developed for and operated them in production environments
  • You are deeply familiar with at least one of the following areas: Data Science, Graphics Programming, Large Language Models
  • You have significant back-end programming experience and have architected, built, and operated distributed systems to solve problems at high scale
  • You possess a deep understanding of the day-to-day responsibilities of an engineer and have a strong technical background
  • You have excellent verbal and written communication skills and are comfortable presenting and defending your ideas to both technical and non-technical audiences
  • You have a BS/MS/PhD in Computer Science, Engineering, or a related scientific field, or equivalent experience
