System Software Engineer, Distributed Systems

NVIDIA · Semiconductors · Santa Clara, CA

This System Software Engineer role focuses on building tools and platforms for chip design engineers, with an emphasis on distributed systems and operational excellence in a bare-metal Linux environment. The work spans designing and developing core components for productivity platforms, building reliable userspace infrastructure for long-running workflows, coordinating state over NFS, and modernizing legacy codebases.

What you'd actually do

  1. Design, build, and deliver core components of our next-generation productivity platforms
  2. Develop reliable userspace infrastructure for long-running engineering workflows at scale on bare-metal Linux hosts
  3. Build state coordination over NFS (atomicity, idempotency/dedup, partial-write recovery, without privileged ops)
  4. Build and improve orchestration around IBM LSF (submission/tracking, retries/cancel, log capture, fairness/backpressure)
  5. Convert legacy codebases into modern powerhouses using incremental migration techniques (e.g., Perl to Go), with stage gates, parity strategies, and strong observability
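
To make item 3 concrete, here is a minimal Go sketch of one common way to get atomic, idempotent state publication on a shared filesystem without privileged operations: write to a temporary file in the same directory, sync it, then hard-link it to its final name. It is an illustration under assumptions, not the team's actual design. The names PublishOnce and demo, the key "job-42.json", and the JSON payload are all hypothetical, and real NFS behavior depends on protocol version and mount options.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// PublishOnce (hypothetical helper) writes data to dir/key at most once.
// It returns true if this call published the file and false if the key
// already existed. key is assumed to be a plain file name with no slashes.
func PublishOnce(dir, key string, data []byte) (bool, error) {
	final := filepath.Join(dir, key)

	// Fast path: a previous attempt already published this key.
	if _, err := os.Stat(final); err == nil {
		return false, nil
	} else if !errors.Is(err, fs.ErrNotExist) {
		return false, err
	}

	// Write to a temp file in the same directory so readers never
	// observe a partially written final file.
	tmp, err := os.CreateTemp(dir, key+".tmp-*")
	if err != nil {
		return false, err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // drops the temp name; the linked final name survives

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return false, err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return false, err
	}
	if err := tmp.Close(); err != nil {
		return false, err
	}

	// Link fails if the final name already exists, so concurrent
	// publishers cannot overwrite each other.
	if err := os.Link(tmpName, final); err != nil {
		if errors.Is(err, fs.ErrExist) {
			return false, nil // another writer won; treat as a duplicate
		}
		return false, err
	}
	return true, nil
}

// demo publishes the same key twice in a scratch directory and reports
// whether each call actually wrote the file.
func demo() (first, second bool, err error) {
	dir, err := os.MkdirTemp("", "state-demo-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	first, err = PublishOnce(dir, "job-42.json", []byte(`{"status":"done"}`))
	if err != nil {
		return first, false, err
	}
	second, err = PublishOnce(dir, "job-42.json", []byte(`{"status":"done"}`))
	return first, second, err
}

func main() {
	first, second, err := demo()
	fmt.Println(first, second, err)
}
```

A crash before the link step leaves only a stray temp file, which a periodic sweep can delete. A retried call after a successful publish is a no-op, which is what makes the operation safe to repeat.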

Skills

Required

  • Go
  • Python
  • Linux fundamentals
  • distributed systems
  • production software development
  • operational rigor
  • batch schedulers
  • build systems

Nice to have

  • NFS
  • batch job scheduling
  • shared compute fleets
  • incremental modernization
  • metadata-heavy systems optimization
  • incident/debug tactics
  • LLM-generated code comprehension

What the JD emphasized

  • 5+ years developing and operating production software in Go and/or Python, ideally in large codebases
  • Strong Linux fundamentals: processes, filesystems, permissions, synchronization/locks, concurrency, and debugging
  • Solid distributed-systems thinking: failures, retries/timeouts, backoff, idempotency, and operational rigor
  • Experience building long-runtime automation or services on shared compute clusters (batch schedulers, build systems)
  • Ability to translate ambitious, high-level goals into a safe delivery plan (instrumentation, staged rollout, measurable outcomes)
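
As an illustration of the retries/timeouts and backoff bullet above, the following Go sketch shows a bounded retry loop with capped exponential backoff and context cancellation. It is a generic example under assumptions, not code from the posting. The names backoffDelay, retry, and demo are hypothetical, and jitter, which is commonly added in production, is left out to keep the behavior deterministic.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// backoffDelay returns the wait before retry number attempt (0-based):
// base doubled once per attempt, capped at maxDelay.
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay // returning early also avoids overflow
		}
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

// retry calls op until it succeeds, the attempt budget is spent,
// or ctx is cancelled or times out.
func retry(ctx context.Context, attempts int, base, maxDelay time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(backoffDelay(i, base, maxDelay)):
		case <-ctx.Done():
			return fmt.Errorf("%w (last error: %v)", ctx.Err(), err)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

// demo runs an operation that fails twice and then succeeds,
// returning how many times it was called.
func demo() (calls int, err error) {
	err = retry(context.Background(), 5, time.Millisecond, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("transient failure")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demo()
	fmt.Println(calls, err)
}
```

Retrying is only safe when the operation is idempotent, which is why the bullet lists idempotency alongside backoff.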