Platform Engineer, Model Shaping

Together AI · Data AI · San Francisco, CA · Engineering

Platform Engineer focused on building and operating the foundational infrastructure for Together AI's model customization and evaluation platform. The work spans backend services, the systems that scale production workflows, and a job orchestration platform running across multiple datacenters and heterogeneous hardware. The role emphasizes reliability, CI/CD, and cloud/hybrid environment management.

What you'd actually do

  1. Design and build Together’s systems and infrastructure for model customization, including user-facing features and internal improvements
  2. Contribute to reliability improvements for the platform, participating in an on-call rotation and improving processes for incident response
  3. Create and improve internal tooling for deployment, continuous integration, and observability
  4. Build a job orchestration platform spanning multiple datacenters, supporting a highly heterogeneous hardware landscape
  5. Partner with teams developing internal services, co-designing these services and incorporating them into systems built within Together

Skills

Required

  • 3+ years of experience building infrastructure or backend components of production services
  • Extensive experience designing, operating, and troubleshooting production Linux environments and Kubernetes-based platforms
  • Strong software engineering background in Python or Go
  • Experience with infrastructure automation tools (Terraform, Ansible), monitoring/observability stacks (Prometheus, Grafana), and CI/CD pipelines (GitHub Actions, ArgoCD)
  • Cloud environment (e.g., AWS/GCP/Azure) administration experience, preferably with a hybrid bare-metal/cloud environment
  • Strong communication skills, including a willingness to document systems and processes and to collaborate with peers of varying technical expertise
  • Comfortable operating across the stack, from cluster operations and infrastructure automation to backend service development

Nice to have

  • Developing large-scale production systems with high reliability requirements
  • Pipeline orchestration frameworks (e.g., Kubeflow, Argo Workflows, Flyte)
  • Managing GPU workloads on HPC clusters, ideally with hands-on experience in operating NVIDIA’s networking stack (e.g., NCCL, Mellanox firmware, GPUDirect RDMA)
  • Deployment of services for AI training or inference
  • Networking fundamentals, including TCP/IP, DNS, routing, load balancing, TLS, and network debugging tools
  • Maintaining or contributing to open-source projects

What the JD emphasized

  • building infrastructure or backend components of production services
  • designing, operating, and troubleshooting production Linux environments and Kubernetes-based platforms
  • infrastructure automation tools (Terraform, Ansible), monitoring/observability stacks (Prometheus, Grafana), and CI/CD pipelines (GitHub Actions, ArgoCD)
  • Cloud environment (e.g., AWS/GCP/Azure) administration experience, preferably with a hybrid bare-metal/cloud environment
  • operating across the stack, from cluster operations and infrastructure automation to backend service development
  • Managing GPU workloads on HPC clusters, ideally with hands-on experience in operating NVIDIA’s networking stack (e.g., NCCL, Mellanox firmware, GPUDirect RDMA)
  • Deployment of services for AI training or inference

Other signals

  • building foundational layers of Together’s platform for model customization and evaluation
  • design, develop, and operate both the backend services and the underlying systems that enable us to sustainably and reliably scale production workflows
  • build a job orchestration platform spanning multiple datacenters, supporting a highly heterogeneous hardware landscape