Research Engineer – Training Infra

Snorkel AI · Redwood City, CA (+1) · Remote · Research

The role focuses on building and operating the infrastructure for model training and evaluation: GPU clusters, training pipelines, and orchestration systems. It also involves managing ML training frameworks and experiment tracking, and ensuring that training jobs run reliably and at scale.

What you'd actually do

  1. Set up and manage GPU cluster infrastructure on major cloud providers (e.g., AWS HyperPod) for distributed model training, including networking, provisioning, and cost tracking.
  2. Build and operate job orchestration and scheduling systems (e.g., Kubernetes, Slurm, or cloud-native equivalents) to reliably launch and manage training, rollout, and evaluation jobs across multi-node clusters; a submission sketch follows this list.
  3. Integrate and maintain ML training frameworks and post-training pipelines, ensuring they run stably and reproducibly at scale.
  4. Set up and maintain experiment tracking, dataset versioning, and model artifact management to support fast iteration.
  5. Monitor and optimize cluster health, inter-node communication, and resource utilization; implement fault tolerance and auto-recovery so long-running jobs survive node failures (see the checkpoint-and-resume sketch below).
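
As a rough sketch of items 1–2 (submitting multi-node jobs to a managed cluster), the Python snippet below pipes a Slurm batch script to sbatch and launches a torchrun job across four nodes. Node counts, partitions, and file names such as train.py and configs/sft.yaml are illustrative assumptions, not details taken from the role.

```python
import subprocess
import textwrap

# Hypothetical sbatch script for a 4-node x 8-GPU post-training run.
# Partition, paths, and the entrypoint (train.py, configs/sft.yaml) are placeholders.
SBATCH_SCRIPT = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=post-train
    #SBATCH --nodes=4
    #SBATCH --gpus-per-node=8
    #SBATCH --time=48:00:00
    #SBATCH --output=logs/%x-%j.out

    # Use the first allocated node as the rendezvous endpoint for torchrun.
    head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

    srun torchrun \\
        --nnodes=4 --nproc_per_node=8 \\
        --rdzv_backend=c10d --rdzv_endpoint="${head_node}:29500" \\
        train.py --config configs/sft.yaml
""")


def submit(script: str) -> str:
    """Pipe a batch script to sbatch on stdin and return the Slurm job id."""
    result = subprocess.run(
        ["sbatch", "--parsable"],
        input=script,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("submitted job", submit(SBATCH_SCRIPT))
```

Kubernetes-based setups typically follow the same shape, with the batch script replaced by a Job (or training-operator) manifest.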

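And a minimal sketch of the fault-tolerance half of item 5: a training loop that checkpoints periodically and auto-resumes from the latest checkpoint when the job is restarted after a node failure. The checkpoint layout, model interface, and helper names are assumptions made for illustration.

```python
import glob
import os
import torch


def latest_checkpoint(ckpt_dir: str):
    """Return the most recent checkpoint path, or None on a fresh start."""
    paths = sorted(glob.glob(os.path.join(ckpt_dir, "step_*.pt")))
    return paths[-1] if paths else None


def train(model, optimizer, data_loader, ckpt_dir="checkpoints", save_every=500):
    os.makedirs(ckpt_dir, exist_ok=True)
    start_step = 0

    # Auto-resume: if the job was requeued after a node failure, reload the
    # latest model/optimizer state instead of starting over. (Resuming the
    # exact data position is a separate concern, out of scope for this sketch.)
    ckpt_path = latest_checkpoint(ckpt_dir)
    if ckpt_path is not None:
        state = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_step = state["step"] + 1

    for step, batch in enumerate(data_loader, start=start_step):
        loss = model(**batch).loss  # assumes an HF-style model output with .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if step % save_every == 0:
            # Write atomically so a crash mid-save never leaves a corrupt checkpoint.
            tmp_path = os.path.join(ckpt_dir, "tmp.pt")
            torch.save(
                {"model": model.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "step": step},
                tmp_path,
            )
            os.replace(tmp_path, os.path.join(ckpt_dir, f"step_{step:08d}.pt"))
```
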
Skills

Required

  • GPU cluster management
  • cloud provider infrastructure (AWS)
  • Kubernetes or Slurm
  • ML training frameworks
  • experiment tracking tools (see the tracking sketch after this list)
  • dataset versioning
  • model artifact management
  • Python
  • software engineering fundamentals
  • version control
  • modular design
  • automation
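
To make the tracking and artifact items above concrete, here is a minimal sketch using MLflow as one possible tracking tool (not one named by the JD); the tracking URI, experiment name, and logged fields are placeholders.

```python
import os
import mlflow

# Local file-based tracking store for the sketch; a real setup would point at
# a shared tracking server instead.
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("post-training-sft")  # experiment name is a placeholder

with mlflow.start_run(run_name="sft-baseline"):
    # Hyperparameters plus the dataset revision, so runs stay reproducible
    # and comparable across iterations.
    mlflow.log_params({
        "base_model": "example-7b",    # placeholder model name
        "learning_rate": 1e-5,
        "dataset_revision": "abc123",  # e.g., a Git/DVC revision of the dataset
    })

    # Metrics logged per training step (dummy values here).
    for step, loss in enumerate([2.1, 1.7, 1.4]):
        mlflow.log_metric("train_loss", loss, step=step)

    # Model artifacts (checkpoints, tokenizer files, eval reports) attach to the run.
    ckpt = "checkpoints/step_00000500.pt"
    if os.path.exists(ckpt):
        mlflow.log_artifact(ckpt)
```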

Nice to have

  • AWS HyperPod
  • Slurm
  • cloud-native equivalents
  • distributed training concepts (see the data-parallel sketch after this list)
  • parallelism strategies
  • memory optimization techniques
  • inter-node communication
  • fault tolerance
  • auto-recovery
  • supervised fine-tuning (SFT)
  • reinforcement learning (RLHF, GRPO, or similar)
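
To ground "distributed training concepts" and "parallelism strategies", below is a minimal data-parallel sketch using PyTorch DistributedDataParallel, one strategy among several; the toy model and environment-variable initialization are assumptions, and a real job would be launched via torchrun.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, and the rendezvous env vars.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model standing in for a real network; DDP replicates it on every
    # worker and all-reduces gradients during backward().
    model = torch.nn.Linear(1024, 1024).to(f"cuda:{local_rank}")
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

It would typically be launched with something like `torchrun --nnodes=2 --nproc_per_node=8 ddp_sketch.py`; tensor/pipeline parallelism and memory techniques such as activation checkpointing or ZeRO-style sharding build on this same process-group setup.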

What the JD emphasized

  • own the infrastructure
  • build and operate GPU cluster infrastructure
  • training pipelines
  • tooling that allows our research and engineering teams to run experiments reliably and at scale
  • translating training requirements into robust, reproducible systems
  • proactively removing infrastructure blockers
  • operational excellence
  • complex distributed systems problems
  • real ownership
  • distributed model training
  • job orchestration and scheduling systems
  • ML training frameworks
  • experiment tracking
  • dataset versioning
  • model artifact management
  • cluster health
  • inter-node communication
  • resource utilization
  • fault tolerance
  • auto-recovery
  • long-running jobs survive node failures
  • understand requirements
  • unblock experiments
  • evolve infrastructure
  • training workload needs change
  • Hands-on experience managing GPU clusters
  • provisioning
  • network configuration
  • cost management
  • distributed compute orchestration tools
  • cluster management systems
  • distributed training concepts
  • parallelism strategies
  • memory optimization techniques
  • inter-node communication
  • setting up, managing, and integrating ML experiment tracking
  • data/model versioning tools
  • Python proficiency
  • solid software engineering fundamentals
  • version control
  • modular design
  • automation
  • fast-moving, iterative environment
  • end-to-end ownership
  • ambiguous infrastructure problems
  • post-training workflows
  • supervised fine-tuning (SFT)
  • reinforcement learning (RLHF, GRPO, or similar)
  • building reliable systems
  • frontier of AI research
  • distributed infrastructure at scale
