Senior Principal Network Development Engineer (IC5) – Backend NIC Qualification & NPI (OCI AI2 – Performance & NIC Engineering)

Oracle · Enterprise · Austin, TX +1

Senior Principal Network Development Engineer focused on backend NIC qualification and NPI for AI superclusters. The role requires deep expertise in NIC architecture, distributed systems networking for AI/ML, and performance tuning, with a focus on RDMA performance, cluster scale, and workload isolation. Responsibilities include leading qualification strategies, defining validation methodologies, driving performance characterization, collaborating with vendors and internal teams, building automated validation frameworks, and establishing qualification gates for AI infrastructure.

What you'd actually do

  1. Own the end-to-end qualification strategy and execution for backend NICs supporting OCI AI clusters (RDMA/RoCE-based fabrics)
  2. Lead NIC NPI for AI infrastructure, from early silicon bring-up through fleet-wide deployment across OCI regions
  3. Define validation methodologies for high-performance, low-latency distributed training workloads (e.g., GPU collectives, east-west traffic patterns)
  4. Drive deep performance characterization and tuning of NICs in AI cluster environments (latency, throughput, tail latency, congestion behavior)
  5. Partner with NIC and silicon vendors (e.g., NVIDIA/Mellanox, Broadcom, Intel) to resolve complex hardware/firmware issues and influence feature design
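The performance-characterization work in item 4 centers on latency distributions, especially the tail. As a minimal sketch (the helper name and simulated data are hypothetical, not tied to any actual OCI tooling, which would ingest output from benchmarks such as `ib_write_lat`), computing tail-latency percentiles from raw samples might look like:

```python
import math

def latency_percentiles(samples_us, percentiles=(50.0, 99.0, 99.9)):
    """Summarize latency samples (microseconds) into {percentile: value}
    using the nearest-rank method on the sorted samples."""
    if not samples_us:
        raise ValueError("no latency samples")
    ordered = sorted(samples_us)
    n = len(ordered)
    result = {}
    for p in percentiles:
        # Nearest-rank: ceil(p/100 * n), 1-indexed into the sorted samples.
        rank = max(1, math.ceil(p / 100.0 * n))
        result[p] = ordered[rank - 1]
    return result

if __name__ == "__main__":
    import random
    random.seed(0)
    # Simulated RDMA-like distribution: a tight body plus rare tail events.
    samples = [2.0 + random.random() for _ in range(10_000)]
    samples += [50.0 + random.random() * 10 for _ in range(10)]
    for p, v in sorted(latency_percentiles(samples).items()):
        print(f"p{p:g}: {v:.2f} us")
```

The p50/p99 gap is the signal here: congestion or firmware issues often show up only in p99/p99.9 while median latency looks healthy.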

Skills

Required

  • Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or related field
  • 8–12+ years of experience in networking, systems engineering, or hardware validation in large-scale distributed environments
  • Deep expertise in NIC architecture and advanced features (RDMA/RoCE, congestion control, SR-IOV, queueing, offloads)
  • Strong understanding of distributed systems networking for AI/ML workloads (e.g., collective communication patterns, east-west traffic scaling)
  • Advanced knowledge of Linux networking stack and kernel-level debugging
  • Proven experience leading hardware qualification and/or NPI efforts in data center or cloud environments
  • Strong debugging skills across hardware, firmware, driver, and system layers
  • Proficiency in automation and tooling (Python, Bash, or similar)
  • Experience with performance benchmarking and traffic analysis in high-scale environments

Nice to have

  • Experience with AI/HPC networking (e.g., RoCEv2, InfiniBand concepts, GPU cluster networking)
  • Familiarity with distributed training frameworks (e.g., NCCL) and their network behavior
  • Knowledge of PCIe, NUMA, and GPU/accelerator interconnect considerations
  • Experience in hyperscale cloud environments
  • Exposure to SmartNICs, DPUs, or offload-driven architectures
  • Experience building validation pipelines integrated with CI/CD systems
  • Background in large-scale cluster bring-up and production operations
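The validation-pipeline and qualification-gate themes above boil down to comparing measured metrics against acceptance criteria inside CI. A minimal sketch of such a gate check (metric names and thresholds are illustrative assumptions, not actual OCI criteria):

```python
# Hypothetical qualification gate: compare measured NIC metrics against
# acceptance thresholds and report pass/fail, as a CI pipeline step might.
ACCEPTANCE_CRITERIA = {
    # metric name: (threshold, comparison mode)
    "throughput_gbps":  (380.0, "min"),  # must be at least this
    "p99_latency_us":   (8.0,   "max"),  # must be at most this
    "pfc_pause_frames": (0,     "max"),  # no pause storms during the run
}

def evaluate_gate(measured, criteria=ACCEPTANCE_CRITERIA):
    """Return (passed, failures) for a dict of measured metric values."""
    failures = []
    for name, (threshold, mode) in criteria.items():
        value = measured.get(name)
        if value is None:
            failures.append(f"{name}: missing measurement")
        elif mode == "min" and value < threshold:
            failures.append(f"{name}: {value} < required {threshold}")
        elif mode == "max" and value > threshold:
            failures.append(f"{name}: {value} > allowed {threshold}")
    return (not failures, failures)

if __name__ == "__main__":
    passed, failures = evaluate_gate({
        "throughput_gbps": 392.5,
        "p99_latency_us": 6.4,
        "pfc_pause_frames": 0,
    })
    print("PASS" if passed else "FAIL: " + "; ".join(failures))
```

Treating a missing measurement as a failure (rather than a silent skip) is the kind of release-readiness discipline the role's "qualification gates and acceptance criteria" language points at.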

What the JD emphasized

  • Deep expertise in NIC architecture and advanced features (RDMA/RoCE, congestion control, SR-IOV, queueing, offloads)
  • Strong understanding of distributed systems networking for AI/ML workloads (e.g., collective communication patterns, east-west traffic scaling)
  • Proven experience leading hardware qualification and/or NPI efforts in data center or cloud environments
  • Experience with performance benchmarking and traffic analysis in high-scale environments
  • Experience with AI/HPC networking (e.g., RoCEv2, InfiniBand concepts, GPU cluster networking)
  • Familiarity with distributed training frameworks (e.g., NCCL) and their network behavior
  • Experience building validation pipelines integrated with CI/CD systems

Other signals

  • enabling OCI’s AI superclusters
  • RDMA performance, cluster scale, and workload isolation
  • NIC technologies meet stringent requirements
  • leading hardware qualification and/or NPI efforts
  • performance benchmarking and traffic analysis in high-scale environments
  • AI/HPC networking
  • distributed training frameworks (e.g., NCCL)
  • GPU cluster networking
  • low-latency distributed training workloads
  • deep performance characterization and tuning of NICs in AI cluster environments
  • resolve complex hardware/firmware issues
  • optimal integration with drivers, kernel, and user-space stacks
  • automated validation frameworks for continuous qualification
  • root cause analysis (RCA) for systemic issues impacting cluster performance or reliability
  • qualification gates, acceptance criteria, and release readiness processes
  • production telemetry and workload insights
  • next-generation AI workloads