Senior Software Engineer - HPC

NVIDIA · Semiconductors · Santa Clara, CA +2

Senior Software Engineer role focused on improving HPC infrastructure for AI applications, spanning distributed systems, cloud operations, and reliability.

What you'd actually do

  1. Apply modern distributed systems patterns to push the limits of scale, latency, and reliability.
  2. Continuously improve infrastructure provisioning and operations with automation, APIs, and self‑service platforms.
  3. Operate in a globally distributed, hybrid multi‑cloud environment (AWS, GCP, on‑prem), building systems that are cloud‑native and location‑agnostic.
  4. Build strong cross-functional relationships and align with collaborators across various business units.
  5. Improve uptime and Quality of Service (QoS) through data-driven operations, strong SLOs, and robust incident practices.

Skills

Required

  • Go
  • Java
  • C/C++
  • Scala
  • Python
  • Elixir
  • backend
  • systems
  • infrastructure engineering
  • scalability
  • consistency
  • performance trade-offs
  • server-side systems
  • horizontally scalable
  • resilient
  • low-latency services
  • end-to-end service ownership
  • architecture
  • build reviews
  • implementation
  • testing
  • rollout
  • observability
  • iterative improvement
  • GCP
  • AWS
  • Azure
  • cloud-native primitives
  • CI/CD
  • GitOps workflows
  • Infrastructure as Code
  • problem-solving skills
  • simplifying complex systems
  • B.S. in Computer Science or related field
  • 5+ years of relevant experience
  • communication skills
  • collaboration skills
  • guiding technical decisions

Nice to have

  • HPC clusters
  • large-scale AI/ML platforms
  • job schedulers
  • Slurm
  • Kubernetes
  • open source component maintainer

What the JD emphasized

  • core infrastructure or control planes for HPC clusters, large-scale AI/ML platforms, or systems managed by job schedulers