AI Research Engineer - Datadog AI Research (DAIR)

Datadog · Enterprise · Paris, France · Dev Eng

Research Engineer role focused on building multimodal foundation models for observability and training autonomous agents for SRE incident response. Involves data pipelines, training/evaluation infrastructure, simulation environments, and distributed RL orchestration. Collaborates with researchers to turn ideas into systems and contributes to publications.

What you'd actually do

  1. Build and operate multimodal data pipelines, training and evaluation infrastructure, benchmarks, and internal tooling
  2. Implement models, run experiments at scale, and profile for reliability, performance, and cost
  3. Build simulation environments and replay infrastructure for agent training and evaluation
  4. Orchestrate distributed training and distributed RL with Ray, including scheduling, scaling, and failure recovery
  5. Establish rigorous automated benchmarks and regression tests for world model predictions, agent performance, and simulation fidelity

Skills

Required

  • Python
  • distributed computing
  • ML systems for training and inference at scale
  • implementing and operating ML training and inference systems
  • large-scale model training and fine-tuning
  • containerization
  • orchestration
  • GPU acceleration

Nice to have

  • Rust
  • C++
  • Go
  • Ray
  • Slurm
  • PyTorch
  • JAX
  • Megatron-LM
  • DeepSpeed
  • SkyRL
  • VeRL
  • TorchTitan
  • SFT
  • RLVR
  • RLHF
  • quantization
  • speculative decoding
  • observability
  • SRE
  • security
  • large foundation models
  • world models
  • RL-trained agents
  • GPU programming
  • CUDA
  • production data pipelines
  • simulation environments
  • sandbox environments

What the JD emphasized

  • partner with Research Scientists to turn research ideas into working systems
  • building the data, tooling, and infrastructure that enable rapid iteration, trustworthy evaluation, and a smooth path from prototype to production
  • Training multimodal foundation models
  • Trained Agents for Observability
  • Post-training models to operate autonomously
  • build the simulation environments, RL training loops, and evaluation infrastructure needed to train agents
  • Establish rigorous automated benchmarks and regression tests
  • Contribute to research publications at top-tier conferences
  • depth in distributed computing, RL Infra, and ML systems for training and inference at scale
  • practical experience implementing and operating ML training and inference systems
  • practical experience with large-scale model training and fine-tuning
  • experience supporting or contributing to research publications

Other signals

  • multimodal foundation models
  • autonomous agents
  • RL training loops
  • simulation environments