Principal Software Developer - AI Infra Compute

Oracle · Enterprise · Austin, TX +1

The Principal Software Developer will focus on building and operating a high-performance GPU platform for AI/ML/HPC workloads within Oracle Cloud Infrastructure. This role involves designing and developing architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services to support large-scale distributed AI/ML/HPC workloads.

What you'd actually do

  1. Designing, implementing, and delivering software and firmware for managing GPU-based AI servers
  2. Working closely with partner teams to deliver high-quality software to manage, triage, and repair GPU systems
  3. Working closely with product teams to debug and resolve customer issues

Skills

Required

  • BS or MS degree in Computer Science or a relevant technical field involving coding, or equivalent practical experience
  • Deep understanding of operating systems, computer networks, and high-performance applications
  • 6+ years’ experience delivering and operating large-scale production systems (1000+ server instances)
  • Proficient in one programming language (Java, Python, C, C++, Go, or shell scripting)
  • Systematic problem-solving approach, strong communication skills, a sense of ownership, and drive
  • Proven ability to deliver products and experience with the full software development lifecycle

Nice to have

  • Strong background in Linux systems
  • Familiarity with system-level architecture, data synchronization, fault tolerance, and state management
  • General enterprise storage, networking, or computing experience
  • Experience with server/GPU hardware architecture and system management
  • Experience with InfiniBand or RoCE networking
  • Hands-on experience designing, developing, and operating public cloud service data planes
  • Good understanding of databases and SQL (MySQL) and caching technologies (Redis, Memcached, etc.)

What the JD emphasized

  • large-scale production systems (1000+ server instances)
  • full software development lifecycle

Other signals

  • GPU platform for AI/ML/HPC
  • distributed systems
  • high-performance GPU platform