Lead Software Engineer, Fleet Management - DGX Cloud

NVIDIA · Semiconductors · Seattle, WA +2 · Remote

Lead Software Engineer for NVIDIA's DGX Cloud team, focusing on building foundational systems for high-performance GPU infrastructure. The role involves designing scalable cloud services, ingesting telemetry from AI datacenters, managing data pipelines, and optimizing cloud operations. Requires expertise in cloud infrastructure, distributed systems, and API design, with a focus on operational automation and reliability.

What you'd actually do

  1. Act as technical lead for a team of software engineers designing cloud services backed by databases and data warehouses.
  2. Design and develop RESTful APIs to ingest telemetry from AI datacenters.
  3. Build scalable cloud services for high-volume ingestion, processing, and storage of large datasets.
  4. Build and manage data pipelines for online and offline data storage.
  5. Collaborate across teams to codify business processes into scalable, self-measuring systems.

Skills

Required

  • Go or Python
  • cloud infrastructure (AWS, GCP, Azure, etc.)
  • Docker
  • Kubernetes
  • high-scale distributed systems
  • RESTful APIs
  • PostgreSQL-compatible data stores
  • Linux operating systems

Nice to have

  • PhD degree
  • modern JavaScript frameworks (e.g., React, Angular, Next.js)
  • leading engineers to successful delivery and operations of high-performance cloud services at Internet scale
  • operating NVIDIA datacenter GPUs
  • debugging and problem-solving skills in distributed environments

What the JD emphasized

  • high-performance GPU infrastructure
  • scalable cloud services
  • AI datacenters
  • high-volume ingestion
  • large datasets
  • online and offline data storage
  • scalable, self-measuring systems
  • reliability and efficiency
  • quality and scalability
  • high-scale distributed systems
  • architectural patterns for APIs and data pipelines
  • complex operational challenges
  • delivering scalable and efficient cloud services
  • high-performance cloud services at Internet scale