Senior Manager, AI Infrastructure Network Operations

Oracle · Enterprise · Austin, TX +1

Senior Manager role leading the team responsible for developing, operating, and improving large-scale RDMA network fabrics and supporting systems for AI infrastructure. The role requires deep networking expertise (RDMA/RoCE, Clos fabrics, congestion control, telemetry) and software engineering experience, with a focus on reliability, observability, and efficiency at cloud scale. It involves managing engineers, driving operational and software engineering efforts, and collaborating with networking, software, hardware, and operations teams to resolve escalations and improve performance.

What you'd actually do

  1. Manage and develop a team of engineers responsible for RDMA/RoCE fabric operations, performance, automation, and troubleshooting.
  2. Lead operational and software engineering efforts that improve the reliability, availability, observability, and performance of OCI AI/HPC networking fabrics.
  3. Apply deep networking knowledge, including RDMA, RoCE, Ethernet fabrics, congestion control, QoS, telemetry, and large-scale troubleshooting.
  4. Apply software architecture and development experience to guide the design, debugging, and enhancement of operational tools, automation platforms, monitoring systems, and infrastructure services.
  5. Drive improvements within existing software and network architectures, and identify opportunities to simplify, automate, and scale operational workflows.

Skills

Required

  • People leadership
  • managing engineers
  • RDMA/RoCE fabric operations
  • performance
  • automation
  • troubleshooting
  • reliability
  • availability
  • observability
  • performance of OCI AI/HPC networking fabrics
  • Ethernet fabrics
  • congestion control
  • QoS
  • telemetry
  • large-scale troubleshooting
  • software architecture
  • development experience
  • debugging
  • enhancement of operational tools
  • automation platforms
  • monitoring systems
  • infrastructure services
  • simplifying operational workflows
  • scaling operational workflows
  • customer escalations
  • NOC events
  • production incidents
  • technical investigation
  • networking
  • software
  • hardware
  • operations teams
  • team roadmaps
  • engineering efficiency
  • operational excellence
  • network performance
  • service availability
  • data-driven metrics
  • fabric health
  • operational backlog
  • customer impact
  • performance trends
  • business-critical service status
  • Network Availability
  • Network Automation
  • Network Monitoring
  • GNOC
  • deployment
  • hardware teams
  • operational planning
  • staffing
  • readiness
  • execution
  • corporate and service-level expectations
  • manager on-call rotation
  • high-severity incidents
  • attract engineers
  • mentor engineers
  • grow engineers
  • distributed systems experience

Nice to have

  • operating or building networks for large-scale cloud environments
  • GPU/HPC networking
  • Clos fabrics
  • congestion management
  • performance debugging
  • leading teams that build automation
  • monitoring
  • remediation
  • infrastructure management systems
  • communication skills
  • work across engineering, operations, and customer-facing teams

What the JD emphasized

  • large-scale RDMA network fabrics
  • supporting systems
  • deep networking expertise
  • RDMA/RoCE
  • Clos fabrics
  • congestion control
  • telemetry
  • performance troubleshooting
  • software engineering experience
  • tools, automation, monitoring, and operational systems
  • global cloud scale
  • AI/HPC networking fabrics