Principal Engineer - AI Networking

Oracle · Enterprise · Seattle, WA +1

Principal Engineer focused on designing, implementing, and optimizing the RDMA-based networking infrastructure that large-scale AI training and inference workloads depend on. The role combines deep systems programming, high-performance networking, and distributed communication systems work to improve the performance and scalability of AI infrastructure.

What you'd actually do

  1. Design, develop, and optimize RDMA-based software components and services for large-scale AI infrastructure.
  2. Build and enhance collective communication frameworks, transport layers, and communication libraries used by distributed AI workloads.
  3. Develop congestion management, load balancing, resiliency, and failover capabilities for RDMA-based networks.
  4. Analyze and improve communication performance across networking, GPU, and software stacks.
  5. Design and implement scalable distributed systems supporting AI training and inference environments.

Skills

Required

  • RDMA software engineering
  • high-performance networking
  • distributed communication systems
  • systems programming
  • C/C++
  • Linux systems programming
  • debugging and optimizing performance-critical software systems
  • networking fundamentals
  • operating systems concepts
  • distributed systems concepts

Nice to have

  • collective communication frameworks (NCCL, RCCL, MPI, UCX, UCC, XCCL)
  • AI/ML infrastructure support
  • GPUDirect RDMA
  • GPU-aware communication technologies
  • congestion management
  • traffic engineering
  • network resiliency solutions
  • large-scale GPU clusters
  • high-performance computing environments
  • distributed training frameworks (PyTorch, DeepSpeed, Megatron-LM, TensorFlow, JAX)
  • Kubernetes
  • containers
  • cloud infrastructure platforms
  • performance profiling and benchmarking tools

What the JD emphasized

  • RDMA software engineer
  • high-performance networking
  • distributed communication systems
  • systems programming
  • large-scale AI training and inference workloads
  • performance and scalability challenges
  • RDMA technologies
  • RDMA-enabled software
  • distributed training environments

Other signals

  • large-scale AI training and inference workloads
  • RDMA-based software components and services
  • collective communication frameworks
  • distributed AI workloads
  • scalable distributed systems supporting AI training and inference environments