Senior Dataflow Development Engineer - LPU

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

Develop, build, and improve dataflow systems at the hardware–software boundary. The role focuses on runtime–accelerator interactions: implementing and tuning dataflow pipelines, writing host-side drivers and runtimes, and co-designing hardware and software for deterministic, low-latency execution. You will implement dataflow graphs and streaming pipelines in hardware, build efficient host–device interfaces, and collaborate with compiler and architecture teams to map high-level dataflow onto FPGA and accelerator fabrics. The work directly affects latency, efficiency, and resource usage for inference at scale.

What you'd actually do

  1. Design and implement dataflow pipelines and streaming architectures.
  2. Develop host-side software, drivers, and runtimes that interface with our accelerator hardware (e.g. PCIe, DMA, VFIO; FPGA/LPU/GPU).
  3. Partner with compiler and hardware groups to map dataflow graphs onto hardware resources; improve latency, processing efficiency, and area/utilization.
  4. Build and maintain hardware–software co-design flows: from high-level dataflow specs to synthesis, place-and-route, and validation.
  5. Build tooling and methodologies for debugging, profiling, and validating dataflow behavior in hardware; participate in design reviews and cross-team alignment across EMEA and globally.

Skills

Required

  • FPGA development
  • RTL/HDL (Verilog, VHDL) or high-level synthesis (HLS)
  • C/C++ for host drivers, runtimes, or tooling
  • hardware interfaces (e.g. PCIe, DMA, memory-mapped I/O)
  • dataflow and streaming concepts

Nice to have

  • FPGA dataflow for machine learning inference
  • networking
  • high-throughput streaming
  • Xilinx/AMD FPGA
  • Intel FPGA
  • FPGA toolchains (synthesis, P&R, timing closure)
  • Linux
  • scripting
  • version control
  • VFIO, SR-IOV, or other passthrough/virtualization for accelerators
  • low-level driver or BSP development
  • ASIC or custom-silicon dataflow design
  • RTL development for dataflow or network-on-chip (NoC)
  • compiler backends
  • MLIR or IR-level optimization for hardware mapping
  • multi-FPGA or FPGA–GPU systems
  • distributed dataflow across programmable logic and accelerators

What the JD emphasized

  • hardware–software boundary
  • deterministic, low-latency execution
  • inference at scale
  • proven hardware approach
  • FPGA development
  • hardware/software co-design
  • RTL to runtime
  • dataflow architectures in silicon and programmable logic
  • BS or higher degree (or equivalent experience) in CS/EE/CE, with more than 12 years in FPGA development, hardware dataflow, or hardware/software co-design
  • Hands-on experience with RTL/HDL (Verilog, VHDL) or high-level synthesis (HLS)
  • build and debug dataflow-style pipelines in hardware
  • hardware interfaces (e.g. PCIe, DMA, memory-mapped I/O)
  • dataflow and streaming concepts: pipelining, backpressure, buffering, and resource/area trade-offs