Senior Staff AI Platform Engineer

NVIDIA · Semiconductors · Santa Clara, CA

The Senior Staff AI Platform Engineer at NVIDIA builds, supports, and maintains AI-native infrastructure for enterprise products. The role involves architecting and scaling LLM/ML infrastructure, designing observability for AI models, developing automation, and troubleshooting complex distributed systems. The engineer also drives AI-assisted engineering practices and partners with product teams to deliver scalable AI solutions.

What you'd actually do

  1. Define and lead AI-native infrastructure roadmaps and cross-organizational initiatives.
  2. Architect and scale LLM/ML infrastructure across cloud-native clusters and on-premises hardware.
  3. Design and implement observability for infrastructure health and AI model performance.
  4. Build LLM-aware monitoring and leverage AI to improve incident response and reduce toil.
  5. Develop automation and tooling to ensure reliability, scalability, and developer self-service.

Skills

Required

  • Python
  • systems language (C++, Go, or Rust)
  • distributed systems debugging expertise
  • building and scaling distributed systems
  • Kubernetes
  • bare-metal infrastructure
  • observability design (metrics, logging, tracing, AI quality signals)
  • operating AI/ML platforms
  • MLOps
  • model serving
  • GPU-accelerated environments
  • infrastructure and application security practices
  • identity/auth
  • network segmentation
  • supply chain security
  • vulnerability management in cloud-native environments
  • AI-assisted development tools
  • coding agents
  • data structures
  • algorithms
  • complexity analysis

Nice to have

  • AI/ML platforms (e.g., Hugging Face, Weights & Biases, NVIDIA NIM)
  • AI agents and LLM tooling to enhance observability, incident response, or developer productivity
  • artifact management
  • AI supply chain security
  • trusted model distribution systems
  • AI-specific threat models (OWASP Top 10 for LLMs, model poisoning, adversarial inputs)
  • FedRAMP, SOC 2, or other compliance frameworks
  • red-teaming or security evaluation of LLM systems
  • structured, automation-first approach
  • AI-first engineering practices

What the JD emphasized

  • deeply technical
  • Senior AI Platform Engineer
  • build, support, and maintain
  • AI-powered enterprise products
  • AI-native infrastructure roadmaps
  • Architect and scale LLM/ML infrastructure
  • observability for infrastructure health and AI model performance
  • LLM-aware monitoring
  • reduce toil
  • complex distributed systems
  • deep Kubernetes
  • AI/ML scaling challenges
  • AI-assisted engineering practices
  • AI-first culture
  • AI platform capabilities
  • 10+ years in cloud, platform, or SRE roles
  • proven distributed systems debugging expertise
  • Deep experience building and scaling distributed systems
  • Strong observability design
  • Hands-on experience operating AI/ML platforms
  • MLOps, model serving, and GPU-accelerated environments
  • infrastructure and application security practices
  • Practical use of AI-assisted development tools and coding agents
  • AI/ML platforms (e.g., Hugging Face, Weights & Biases, NVIDIA NIM)
  • AI agents and LLM tooling
  • AI supply chain security
  • trusted model distribution systems
  • AI-specific threat models
  • FedRAMP, SOC 2, or other compliance frameworks
  • security evaluation of LLM systems
  • structured, automation-first approach
  • AI-first engineering practices

Other signals

  • AI-native infrastructure roadmaps
  • LLM/ML infrastructure across cloud-native clusters and on-premises hardware
  • observability for infrastructure health and AI model performance
  • LLM-aware monitoring and leverage AI to improve incident response
  • automation and tooling to ensure reliability, scalability, and developer self-service
  • troubleshoot complex distributed systems, including deep Kubernetes and AI/ML scaling challenges
  • AI-assisted engineering practices and mentor engineers to foster an AI-first culture
  • translate AI platform capabilities into reliable, scalable solutions