Software Developer 5, AI Infrastructure

Oracle · Enterprise · Seattle, WA +1

Software Developer 5 on the Oracle Cloud Infrastructure (OCI) AI Infrastructure team, which builds and operates ultra-high-performance GPU platforms for AI/ML/HPC workloads. The role involves designing and developing fundamental architectural changes for GPU delivery, health monitoring, testing, triage automation, and diagnostic services. Responsibilities include owning software design and development for major components, diving deep into low-level systems, and working with bare metal hardware and orchestration frameworks. The team launches, configures, tests, and validates server platforms across OCI's fleet, interfacing with both hardware and cloud services.

What you'd actually do

  1. Own the software design and development for major components of Oracle's Cloud Infrastructure
  2. Dive deep into any part of the stack and low-level systems to design broad distributed system interactions
  3. Build high-performance, scalable services and tooling that launch, configure, test, and validate server platforms across OCI’s massive fleet of Compute and GPU Infrastructure
  4. Partner closely with other teams in Compute, Networking, Security, Data Center Engineering, and Hardware Development to ensure OCI can launch, scale, and maintain new server platforms with minimal operational overhead and high reliability
  5. Provide leadership and expertise in the development of new products, services, and processes, frequently operating at the leading edge of technology

Skills

Required

  • BS or MS degree in Computer Science or a relevant technical field involving coding, or equivalent practical experience
  • Deep understanding of operating systems, computer networks, and high-performance applications
  • 6+ years of experience delivering and operating large-scale production systems (thousands of server instances)
  • Proficiency in multiple programming languages (Java, Python, C, C++, Go, shell scripting)
  • Systematic problem-solving approach, strong communication skills, a sense of ownership, and drive
  • Proven ability to deliver products and experience with the full software development lifecycle
  • 8 or more years of software engineering or related experience

Nice to have

  • Strong background in Linux systems
  • Familiarity with system-level architecture, data synchronization, fault tolerance, and state management
  • General enterprise storage, networking, or computing experience
  • Experience with server/GPU hardware architecture and system management
  • Experience with InfiniBand or RoCE networking technologies
  • Hands-on experience designing, developing, and operating public cloud service data planes

What the JD emphasized

  • ultra-high-performance GPU platforms
  • AI/ML/HPC workloads
  • thousands of GPUs
  • distributed systems
  • Linux engineer
  • Systems triage experience
  • bare metal hardware
  • full-stack orchestration frameworks
  • Compute AI Infrastructure In-Band Engineering team
  • automated testing
  • hardware bring-up
  • benchmarking
  • debugging
  • NICs, SmartNICs, ILOMs, and GPUs
  • massive fleet
  • Compute and GPU Infrastructure
  • cutting-edge GPU hardware
  • large-scale production systems (thousands of server instances)
  • public cloud service data planes

Other signals

  • GPU platforms
  • AI/ML/HPC workloads
  • distributed systems
  • Linux engineer
  • bare metal hardware
  • full-stack orchestration
  • compute infrastructure
  • server platforms