Senior Software Engineer, AI Frameworks

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

Senior Software Engineer role integrating the NVIDIA Grove project into AI frameworks such as Dynamo, Ray, and PyTorch, with a focus on production-grade software that lets Grove be adopted, scaled, and operated. Responsibilities include building adapters, optimizing performance for distributed training and inference, and improving observability.

What you'd actually do

  1. Design and implement end-to-end integrations of Grove with open-source AI frameworks (e.g., Dynamo, llm-d, Ray, PyTorch, and related ecosystem projects).
  2. Build and maintain adapters, plugins, operators, and/or runtime components that enable Grove features to work smoothly across training and inference stacks.
  3. Partner with framework owners to upstream changes, contribute patches, and ensure long-term maintainability of integrations.
  4. Develop reference workflows, sample apps, and best-practice guides that accelerate adoption by users and partners.
  5. Optimize performance, scalability, and reliability for distributed training/inference, including multi-node and multi-GPU environments.

Skills

Required

  • BS/MS/PhD in Computer Science, Electrical Engineering, or a related field (or equivalent experience).
  • 5+ years of proven experience in a related field.
  • Hands-on experience integrating with at least one major AI framework/runtime (e.g., PyTorch, Ray, Triton Inference Server ecosystem, distributed runtimes, model serving stacks).
  • Solid understanding of AI workloads: model development basics, training vs. inference tradeoffs, and performance considerations (throughput/latency, batching, memory).
  • Experience with distributed systems concepts (RPC, scheduling, fault tolerance, resource management).
  • Practical Kubernetes experience: deploying and operating services/jobs, Helm/Kustomize, operators/controllers (nice to have), and debugging clusters.
  • Familiarity with containers and cloud-native tooling (Docker, container registries, CI/CD pipelines).
  • Strong software engineering experience in Go, C++ and/or Python, with a track record of shipping reliable systems.
  • Strong interpersonal skills and ability to collaborate across teams and with open-source communities.
  • Exceptional collaboration, communication, and documentation habits.

Nice to have

  • Experience with Kubernetes operators/controllers.
  • Open-source contributions to Dynamo, PyTorch, Ray, llm-d, Kubernetes ecosystem, or related ML infrastructure projects.
  • Experience with large-scale model serving, distributed inference, or multi-tenant AI platforms.
  • Experience building SDKs/APIs or developer tooling that improves integration usability.
  • Knowledge of GPU performance profiling and optimization (Nsight tools or similar), and/or kernel-level performance tuning.
  • Experience with reproducibility, packaging, versioning, and compatibility testing across fast-moving dependencies.

What the JD emphasized

  • proven experience in related field
  • Hands-on experience integrating with at least one major AI framework/runtime
  • Solid understanding of AI workloads
  • Experience with distributed systems concepts
  • Practical Kubernetes experience
  • Strong software engineering experience in Go, C++ and/or Python, with a track record of shipping reliable systems

Other signals

  • integrating NVIDIA Grove project within Dynamo and across leading open-source AI frameworks
  • develop production-grade software enabling Grove capabilities to be adopted, scaled, and operated smoothly
  • build production-grade software that enables seamless adoption, scaling, and operation of Grove capabilities across environments such as Dynamo, llm-d, Ray, PyTorch, and other emerging frameworks in the AI ecosystem
  • collaborate across engineering teams and the open-source community to deliver robust integrations, reference implementations, and developer-focused tooling