Manager, Software Architecture

NVIDIA · Semiconductors · Santa Clara, CA

Manager for a systems and networking engineering team that builds distributed AI communication systems (libraries, frameworks, system integrations) spanning GPUs, nodes, and storage. The role involves setting technical direction, leading execution, and fostering technical excellence within the team, with a focus on AI infrastructure problems.

What you'd actually do

  1. Lead and develop a team of systems and networking engineers building distributed AI communication systems—libraries, frameworks, and system integrations.
  2. Set the technical roadmap in partnership with principal engineers and architects, balancing near-term delivery with long-term research bets.
  3. Create a culture of technical excellence and open collaboration, and handle project planning, resource allocation, and delivery timelines across concurrent workstreams.

Skills

Required

  • 8+ years of overall software engineering experience
  • Advanced knowledge of systems software, networking, or distributed systems
  • 3+ years of direct people management
  • BS, MS, or PhD in Computer Science, Computer Engineering, or a related field, or equivalent experience
  • Ability to scope a problem, set a plan, and deliver results in a fast-paced R&D environment
  • Strong communication skills
  • Good understanding of computer architecture, memory hierarchies, DMA engines, and networking
  • Proficiency in programming languages such as C, C++, Rust, and Python
  • Understanding of ML systems concepts

Nice to have

  • Knowledge of ML inference frameworks (vLLM, SGLang, TensorRT-LLM) and their communication requirements
  • Familiarity with NVIDIA’s hardware and software ecosystem
  • Experience with agile methodologies adapted for research-focused engineering teams

What the JD emphasized

  • systems software
  • networking
  • distributed systems
  • ML systems concepts
  • transformer architectures
  • KV cache mechanics
  • model parallelism
  • distributed training and inference patterns

Other signals

  • AI infrastructure
  • distributed systems
  • GPU communication