Senior Principal Engineer - AI Networking

Oracle · Enterprise · Seattle, WA +1

This Senior Principal Engineer role focuses on AI networking infrastructure, specifically RDMA and distributed communication systems for large-scale GPU clusters that support AI training and inference. It covers the architecture, design, implementation, and performance optimization of software components.

What you'd actually do

  1. Architect and develop high-performance networking software for large-scale AI and HPC environments.
  2. Design and implement RDMA-based services and infrastructure that enable low-latency, high-throughput communication across GPU clusters.
  3. Drive the evolution of collective communication frameworks and transport layers used by distributed AI training and inference workloads.
  4. Develop congestion management, traffic engineering, load balancing, and resiliency mechanisms for large-scale RDMA networks.
  5. Optimize end-to-end communication performance across networking, GPU, and software stacks.

Skills

Required

  • RDMA technologies (RoCE, InfiniBand)
  • C/C++
  • distributed communication frameworks
  • transport protocols
  • operating systems
  • networking stacks
  • memory management
  • performance optimization
  • large-scale production systems troubleshooting
  • Linux systems
  • low-level systems programming

Nice to have

  • NCCL, RCCL, MPI, UCC, UCX, XCCL
  • AI infrastructure for distributed training and inference
  • GPU networking (GPUDirect RDMA, GPU-aware communication)
  • congestion management
  • adaptive routing
  • traffic shaping
  • network resiliency
  • large-scale GPU clusters
  • services over RDMA transports
  • PyTorch, DeepSpeed, Megatron-LM, TensorFlow, JAX
  • cloud infrastructure
  • large-scale production service deployment
  • Kubernetes
  • containerized environments
  • cloud-native infrastructure
  • highly available and performance-critical systems architecture

What the JD emphasized

  • deep expertise in RDMA technologies
  • strong track record of delivering production-grade infrastructure at scale
  • highly complex technical challenges spanning networking, distributed systems, and AI infrastructure
  • large-scale AI and HPC environments
  • low-latency, high-throughput communication across GPU clusters
  • distributed AI training and inference workloads
  • large-scale RDMA networks
  • end-to-end communication performance
  • scalable infrastructure solutions
  • large-scale distributed systems
  • mission-critical AI infrastructure
  • complex cross-functional initiatives
  • highly available and performance-critical systems

Other signals

  • AI infrastructure
  • distributed systems
  • networking
  • GPU clusters
  • RDMA