Technical Account Manager (TAM), AI Factory

Together AI · Data AI · San Francisco, CA · Customer Success

This Technical Account Manager (TAM) role focuses on the infrastructure behind large-scale AI GPU deployments for a strategic enterprise customer. The TAM is the primary technical point of contact and owns the end-to-end technical relationship across compute, networking, storage, and facilities, ensuring smooth delivery and operational health. Responsibilities include issue lifecycle management, hardware lifecycle management, advising on infrastructure stack best practices, owning the observability strategy, coordinating operations, and managing capacity expansions. The role requires deep expertise in GPU infrastructure, large-scale networking, enterprise storage, and data center (DC) operations, along with experience in customer-facing technical roles and AI/HPC infrastructure.

What you'd actually do

  1. Serve as the named technical point of contact for a dedicated strategic customer, owning the end-to-end technical relationship across compute, networking, storage, and facilities
  2. Lead issue lifecycle management, escalation, and root cause analysis (RCA) authorship across all infrastructure domains in partnership with Support, SRE, DC Ops, and Engineering teams
  3. Own end-to-end RMA coordination and hardware lifecycle management, including acceptance testing, spare inventory management, and hardware health reporting for large-scale GPU deployments
  4. Maintain deep technical expertise across the customer's infrastructure stack — GPU compute, high-speed fabric, and large-scale storage systems — advising on configuration, operational best practices, and incident resolution
  5. Own the observability strategy for the customer estate, including alert policy definition, dashboard development, and proactive health management across all infrastructure layers

Skills

Required

  • 5+ years in a customer-facing technical role
  • 2+ years in dedicated technical account management or solutions architecture for large-scale AI or HPC infrastructure
  • Deep expertise in GPU infrastructure — GPU health diagnostics, RMA workflows, and hardware acceptance testing
  • Hands-on experience with large-scale Ethernet and InfiniBand fabric architecture
  • Working knowledge of enterprise storage systems, including high-density NVMe, parallel file systems, and metadata infrastructure
  • Experience with DC operations, facilities coordination, and hosting provider SLA management
  • Strong ownership mindset for incident management, RCA authorship, and executive-level customer communication
  • Proficiency in infrastructure monitoring and observability tooling (Prometheus, Grafana, or equivalent)
  • Proven ability to manage multiple concurrent workstreams with hyperscaler-level rigor and communication standards

Nice to have

  • Proficiency in Python, Bash, or infrastructure automation tools

What the JD emphasized

  • dedicated technical account management
  • large-scale AI or HPC infrastructure
  • GPU infrastructure
  • large-scale GPU deployments
  • hyperscaler-level rigor