Senior Software Engineer - Together Cloud Infrastructure

Together AI · Data AI · San Francisco, CA · Engineering

Senior Software Engineer focused on building and operating a high-performance, global AI cloud infrastructure platform. This includes designing and maintaining backend services for hardware management, the IaaS software layer for GPU data centers, high-performance object storage for pretraining datasets, and advanced observability stacks for distributed pretraining. The role also involves architecture and research for decentralized AI workloads and contributing to the open-source platform.

What you'd actually do

  1. Design, build, and maintain performant, secure, and highly available backend services/operators that run in our data centers and automate hardware management, such as InfiniBand partitioning, in-DC parallel storage provisioning, and VM provisioning.
  2. Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs.
  3. Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining.
  4. Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining.
  5. Perform architecture and research work for decentralized AI workloads.

Skills

Required

  • 5+ years of professional software development experience
  • Proficiency in at least one backend programming language (Golang desired)
  • 5+ years of experience writing high-performance, well-tested, production-quality code
  • Demonstrated experience building and operating high-performance and/or globally distributed microservice architectures across one or more cloud providers (AWS, Azure, GCP)
  • Excellent communication skills – able to write clear design docs and work effectively with both technical and non-technical team members
  • Strong systems knowledge across compute, networking, and storage, including concurrency, memory management, performant I/O, and scale
  • Experience with infrastructure automation tools (Terraform, Ansible), monitoring/observability stacks (Prometheus, Grafana), and CI/CD pipelines (GitHub Actions, ArgoCD)

Nice to have

  • Deep experience with Kubernetes internals a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, or custom schedulers, or contributing patches to these or to Kubernetes itself
  • Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIe passthrough, KubeVirt, SR-IOV
  • Deep experience with DC networking tech + solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
  • Experience with Cluster API or similar a big plus
  • Experience working on high-performance compute, networking, and/or storage a big plus
  • Experience virtualizing GPUs and/or InfiniBand a big plus
  • Experience building IaaS or PaaS systems at scale a plus
  • Experience with DPUs/SmartNICs a plus
  • GPU programming, NCCL, CUDA knowledge a plus

What the JD emphasized

  • high-performance
  • highly-available
  • global
  • pretraining
  • distributed pretraining
  • decentralized AI workloads

Other signals

  • AI Acceleration Cloud
  • AI cloud infrastructure
  • ML hardware
  • ML practitioners
  • self-serve AI cloud services
  • inference
  • fine-tuning
  • pretraining
  • decentralized AI workloads
  • open-source Together AI platform