Together Cloud Infrastructure Engineer

Together AI · Data AI · Amsterdam, Netherlands · Engineering

This role focuses on building and maintaining Together AI's cloud infrastructure, including services for hardware management, the IaaS software layer for GPU data centers, high-performance object storage for pretraining, and advanced observability stacks. The engineer will work on the core Together AI platform, create services and tools, and develop testing frameworks for robustness and fault tolerance.

What you'd actually do

  1. Design, build, and maintain performant, secure, and highly available backend services/operators that run in our data centers and automate hardware management, such as InfiniBand partitioning, in-DC parallel storage provisioning, and VM provisioning.
  2. Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs.
  3. Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining.
  4. Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining.
  5. Perform architecture and research work for decentralized AI workloads.

Skills

Required

  • 5+ years of professional software development experience
  • Proficiency in at least one backend programming language (Golang desired)
  • 5+ years of experience writing high-performance, well-tested, production-quality code
  • Demonstrated experience building and operating high-performance and/or globally distributed microservice architectures across one or more cloud providers (AWS, Azure, GCP)
  • Excellent communication skills
  • Strong systems knowledge across compute, networking, and storage, including concurrency, memory management, performant I/O, and scale
  • Experience with infrastructure automation tools (Terraform, Ansible), monitoring/observability stacks (Prometheus, Grafana), and CI/CD pipelines (GitHub Actions, ArgoCD)

Nice to have

  • Deep experience with Kubernetes internals is a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, custom schedulers, or patches to these or to Kubernetes itself
  • Deep experience with VMs/hypervisors is a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIe passthrough, KubeVirt, SR-IOV
  • Deep experience with DC networking technologies and solutions is a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
  • Experience with Cluster API or similar is a big plus
  • Experience working on high-performance compute, networking, and/or storage is a big plus
  • Experience virtualizing GPUs and/or InfiniBand is a big plus
  • Experience with DPUs/SmartNICs is a plus
  • GPU programming, NCCL, and CUDA knowledge is a plus

What the JD emphasized

  • highly available
  • blazing-fast
  • high-performance
  • fault-tolerant

Other signals

  • building the next generation AI cloud platform
  • highly available, global, blazing-fast cloud infrastructure
  • virtualizes cutting-edge ML hardware
  • self-serve AI cloud services
  • serving both our internal SaaS products (inference, fine-tuning) and our external cloud customers