Software Engineer - Training Infrastructure

Baseten · Data AI · San Francisco, CA · EPD

Software Engineer on the Training Infrastructure team responsible for architecting and leading development of the ML training platform, focusing on scheduling, storage, networking, reliability, and observability for research engineers and model developers.

What you'd actually do

  1. Design and architect scalable infrastructure systems for our ML training platform (e.g. scheduling, storage, and networking)
  2. Partner closely with developers and research engineers to translate complex training requirements into technical solutions
  3. Design and architect a global training scheduler
  4. Design and architect reinforcement learning systems and continuous learning pipelines
  5. Drive long-term improvements to system reliability and development velocity

Skills

Required

  • Go
  • Kubernetes
  • cloud providers (AWS, GCP)
  • distributed systems concepts
  • performance tuning
  • observability systems

Nice to have

  • Python
  • neo-cloud providers (Crusoe, DigitalOcean, Nebius)
  • distributed storage systems
  • workload orchestration platforms like Temporal or Airflow
  • open source training stack and frameworks (NCCL, PyTorch, Megatron, NemoRL, VeRL, Axolotl, HF Trainer)
  • distributed training techniques (FSDP, DeepSpeed)
  • developing AI products, tooling, or agents

What the JD emphasized

  • ML/AI workloads and MLOps platforms highly valued

Other signals

  • ML training platform
  • deploy, scale, and monitor their workloads
  • scheduling, storage, networking, reliability, and observability of technical systems in the training stack