AI Research Engineer - Datadog AI Research (DAIR)

Datadog · Enterprise · New York, NY · Dev Eng

A Research Engineer role focused on building multimodal foundation models for observability and training autonomous agents for SRE incident response. The work spans data pipelines, training and evaluation infrastructure, simulation environments, and distributed RL.

What you'd actually do

  1. Build and operate multimodal data pipelines, training and evaluation infrastructure, benchmarks, and internal tooling
  2. Implement models, run experiments at scale, and profile for reliability, performance, and cost
  3. Build simulation environments and replay infrastructure for agent training and evaluation
  4. Orchestrate distributed training and distributed RL with Ray, including scheduling, scaling, and failure recovery
  5. Establish rigorous automated benchmarks and regression tests for world model predictions, agent performance, and simulation fidelity

Skills

Required

  • Python
  • distributed computing
  • RL infrastructure
  • ML systems for training and inference at scale
  • PyTorch or JAX
  • containerization
  • orchestration
  • GPU acceleration
  • large-scale model training
  • fine-tuning
  • SFT
  • RLVR
  • RLHF
  • efficient inference
  • quantization
  • speculative decoding
  • systems language (e.g., Rust, C++, or Go)
  • modern cloud and data infrastructure
  • multimodal data pipelines
  • training and evaluation infrastructure
  • benchmarks
  • internal tooling
  • simulation environments
  • replay infrastructure
  • distributed training
  • distributed RL with Ray
  • scheduling
  • scaling
  • failure recovery
  • automated benchmarks
  • regression tests
  • world model predictions
  • agent performance
  • simulation fidelity
  • research publications

Nice to have

  • Ray
  • Slurm
  • Megatron-LM
  • DeepSpeed
  • SkyRL
  • VeRL
  • TorchTitan
  • observability
  • SRE
  • security
  • bridging research prototypes and real-world product applications
  • large foundation models
  • world models
  • RL-trained agents
  • GPU programming
  • optimization
  • CUDA
  • production data pipelines
  • applications
  • simulation or sandbox environments for agent training

What the JD emphasized

  • research publications
  • large-scale model training
  • fine-tuning
  • RLHF
  • autonomous agents
  • world models

Other signals

  • multimodal foundation models
  • autonomous agents
  • RL training loops
  • simulation environments