NCX Engineer, AI Accelerator

NVIDIA · Semiconductors · Santa Clara, CA +1

This role focuses on engineering and deploying AI infrastructure and solutions for strategic customers, optimizing large-scale training and inference workloads on NVIDIA's AI platform. It involves MLOps, Kubernetes, GPU scheduling, and performance tuning, with a strong emphasis on customer-facing technical support and collaboration.

What you'd actually do

  1. Build and deploy custom AI solutions on NVIDIA Cloud Partner (NCP) and Neo Cloud platforms, including distributed training, inference optimization, and MLOps pipelines built on NVIDIA reference architectures.
  2. Act as the main technical contact for strategic NCPs, offer remote and on-site support, troubleshoot complex production problems, and guide partner engineering teams on NVIDIA platform guidelines.
  3. Deploy and manage AI workloads across DGX Cloud, NCP data centers, and major CSP environments using Kubernetes, containers, and GPU scheduling systems aligned to NCP builds.
  4. Profile and tune large-scale training and inference workloads on NCP platforms. Implement observability and SLO/SLA monitoring. Lead detailed efforts to reduce latency, cost, and operational risk.
  5. Build detailed implementation guides, runbooks, and post‑mortem documentation that codify standard methodologies for running NVIDIA AI workloads at scale on NCP platforms.

Skills

Required

  • Linux systems
  • distributed computing
  • Kubernetes
  • containers
  • GPU scheduling
  • Python
  • Go
  • PyTorch
  • TensorFlow
  • customer-facing technical roles
  • Solutions Engineering
  • DevOps
  • Site Reliability
  • ML Infrastructure Engineering
  • large-scale cloud or service provider environments
  • AI/ML experience supporting large-scale training and inference workloads
  • collaboration with customer and partner engineering teams
  • technical presentation skills

Nice to have

  • NVIDIA ecosystem (DGX systems, CUDA, NeMo, Triton, NIM)
  • NVIDIA networking (InfiniBand, RoCE)
  • NVIDIA Cloud Partners
  • hyperscale CSPs
  • managed AI cloud platforms
  • MLOps
  • cloud-native practices
  • containerization
  • CI/CD pipelines
  • observability stacks (Prometheus, Grafana, OpenTelemetry)
  • GitOps workflows
  • infrastructure as code (Terraform, Ansible)
  • integrating AI platforms with enterprise systems (Salesforce, ServiceNow)

What the JD emphasized

  • customer-facing technical roles
  • large-scale training and inference workloads
  • production or critically important environments
  • collaborate with customer and partner engineering teams
  • guide intricate technical investigations
  • bring issues to root cause and resolution

Other signals

  • customer-facing technical roles
  • large-scale AI deployments
  • distributed systems
  • NVIDIA AI platform
  • advanced AI workloads
  • MLOps pipelines
  • Kubernetes
  • GPU scheduling
  • training and inference optimization
  • observability and SLO/SLA monitoring
  • reduce latency, cost, and operational risk
  • Python/Go
  • PyTorch or TensorFlow