Senior Software Engineer - Cortex Training

Snowflake · Data AI · WA-Bellevue, United States · Engineering

A Senior Software Engineer role on Snowflake's Cortex Training platform, focused on scaling LLM post-training infrastructure. The role involves designing and building full-stack solutions, optimizing distributed systems for GPU compute, and productionizing research into reliable components for enterprise customers.

What you'd actually do

  1. Design and build across the full stack — from the public training APIs and SDK through the control plane to the GPU data plane.
  2. Scale the distributed systems that make GPU compute serverless — multi-tenant scheduling, placement, and capacity-aware routing across regional GPU pools, with fault tolerance built in.
  3. Drive end-to-end performance at scale — keep the training, inference, and RL loops fast and the data plane responsive under heavy concurrent load, with GPUs kept saturated.
  4. Productionize research building blocks — partner with Snowflake Research to turn state-of-the-art training and inference techniques into reliable, composable components customers can run at enterprise scale.

Skills

Required

  • distributed systems
  • infrastructure
  • Kubernetes
  • production ML systems
  • GPU infrastructure
  • LLM infrastructure
  • PyTorch
  • DeepSpeed/FSDP
  • Ray
  • CUDA/NCCL
  • vLLM
  • reliability
  • throughput
  • cost efficiency

Nice to have

  • MS/PhD in Computer Science
  • Hands-on LLM post-training / modeling experience

What the JD emphasized

  • 5+ years building and shipping production ML systems
  • Strong distributed systems and infrastructure foundation
  • Familiarity with GPU and LLM infrastructure
  • Demonstrated ability to harden complex systems for reliability, throughput, and cost efficiency

Other signals

  • LLM post-training platform
  • scale distributed systems
  • productionize research