Senior Software Engineer, Infrastructure Automation and Distributed Systems

NVIDIA NVIDIA · Semiconductors · SC +4 · Remote

NVIDIA is seeking experienced Systems Engineers and Software Engineers to build and run reliable large-scale infrastructure platform services, ensuring internal and external facing EDA services atop NVIDIA hardware are running reliably. The role involves designing, deploying, and managing infrastructure services, defining SLOs and error budgets, eliminating toil through automation, practicing incident response, and consulting on systems design best practices.

What you'd actually do

  1. Design, build, deploy, and run infrastructure services & manage the software life cycle in scope to meet our business goals.
  2. Participate in the definition of our internal facing service level objectives and error budgets as part of our overall observability strategy.
  3. Eliminate toil or automate it where the ROI of building and maintaining automation is worth it.
  4. Practice sustainable blameless incident prevention and incident response while being a member of an oncall rotation.
  5. Consult with and provide consultation for peer teams on systems design best practices.

Skills

Required

  • BS degree in Computer Science or related technical field or equivalent experience
  • 12+ years of relevant experience
  • Infrastructure automation
  • Distributed systems design
  • Python
  • Go
  • Perl
  • Ruby
  • Linux
  • Networking
  • Storage
  • Containers

Nice to have

  • Experience accelerating positive impact to the business using coding assistant(s), MCP servers, or AI agents
  • Bare metal as a service (BMaaS)
  • Multi-cloud infrastructure services
  • Kubernetes
  • OpenStack
  • Docker
  • Slurm
  • Teaching reliability (SRE)
  • NVIDIA Collective Communication Library (NCCL)

What the JD emphasized

  • 12+ years of relevant experience
  • Experience with infrastructure automation and distributed systems design developing tools for running large scale private or public cloud system in production.