Forward Deployed Engineer (GPU Clusters)

Together AI · Data AI · San Francisco, CA · Customer Success

The Forward Deployed Engineer (FDE) is a technical partner to customers building large-scale AI models, with a focus on GPU cluster infrastructure: networking, storage, and orchestration. The goal is to keep clusters stable, optimize performance, and drive platform adoption. In practice, the role involves hardening clusters, tuning orchestration layers (Kubernetes/SLURM), debugging low-level bottlenecks, building reference designs, and leading benchmarking exercises.

What you'd actually do

  1. Cluster Hardening & Validation: Design and execute rigorous pre-handover test suites (NCCL, DCGM, GPU Burn) to ensure clusters are stable under the extreme stress of multi-node training.
  2. Technical Partnership: Act as the primary technical point of contact for model labs, helping them tune their orchestration layer (Kubernetes or SLURM) for maximum throughput.
  3. Infrastructure Optimization: Profile and debug low-level bottlenecks in InfiniBand (IB) fabrics, NVLink topologies, and high-performance storage systems.
  4. Opinionated Onboarding: Build reference designs and "out-of-the-box" configurations for training frameworks to reduce customer time-to-train.
  5. Benchmarking & Migration: Lead complex benchmarking exercises to demonstrate the performance impact of migrating to new hardware families or Together AI’s optimized infrastructure.
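The pre-handover validation in item 1 could be sketched roughly as follows. This is a minimal, hypothetical checklist runner; the specific commands (`dcgmi diag` from DCGM, `all_reduce_perf` from nccl-tests, `gpu_burn`) and their flags are illustrative assumptions drawn from common tooling, not Together AI's actual test suite.

```python
import subprocess

# Illustrative per-node checks; commands and flags are assumptions based on
# common tools (DCGM, nccl-tests, gpu-burn), not a real handover suite.
CHECKS = [
    # DCGM level-3 diagnostics: ECC, PCIe, memory, and stress tests.
    ("dcgm_diag", ["dcgmi", "diag", "-r", "3"]),
    # gpu_burn: sustained full-load stress to surface thermal/power faults.
    ("gpu_burn", ["gpu_burn", "3600"]),
    # nccl-tests all-reduce across 8 local GPUs; multi-node runs would
    # typically be launched under SLURM (srun) or mpirun instead.
    ("nccl_allreduce",
     ["./all_reduce_perf", "-b", "8", "-e", "8G", "-f", "2", "-g", "8"]),
]

def run_checks(checks, runner=subprocess.run):
    """Run each check, recording pass/fail by exit code."""
    results = {}
    for name, cmd in checks:
        proc = runner(cmd, capture_output=True, text=True)
        results[name] = (proc.returncode == 0)
    return results

def cluster_ready(results):
    """A node is handover-ready only if every check passed."""
    return all(results.values())
```

In real use this would loop over every node (via SSH or `srun`) and also parse bandwidth numbers out of the NCCL output rather than trusting exit codes alone; the injectable `runner` keeps the sketch testable without GPUs.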

Skills

Required

  • Large-Scale GPU Infrastructure
  • Kubernetes (GPU-operator, device plugins)
  • SLURM
  • InfiniBand
  • RoCE
  • NVLink
  • NCCL
  • Parallel file systems
  • Object storage
  • Python
  • Shell scripting
  • Ansible

Nice to have

  • VAST
  • Weka

What the JD emphasized

  • large-scale GPU infrastructure
  • cluster orchestration
  • high-performance networking
  • storage systems
  • training frameworks
  • customer deployments
  • customer's stack to solve hard problems
  • high-stakes, fast-paced environment of frontier model labs
