Senior Core Infrastructure Engineer

Oracle · Enterprise · Austin, TX +1

Senior Core Infrastructure Engineer for Oracle Cloud Infrastructure (OCI) AI Infrastructure, focusing on building and operating a high-performance GPU platform for AI/ML/HPC workloads. Responsibilities include designing architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services for large-scale distributed systems using technologies like RoCE and InfiniBand.

What you'd actually do

  1. Designing and developing fundamental architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services.
  2. Running distributed AI/ML/HPC workloads across thousands of GPUs, leveraging technologies like RoCE and InfiniBand.
  3. Diving deep into any part of the stack, as well as software debugging and low-level systems troubleshooting.
  4. Translating requests into prioritized work or features.

Skills

Required

  • 4+ years of backend software development experience
  • BS or MS degree in Computer Science or relevant technical field involving coding or equivalent practical experience
  • Proficient in Java or a similar object-oriented language (e.g., GoLang)
  • Experience with at least one scripting language (Python, Shell) for automating tasks, proof of concept work, or command line tools.
  • Strong working experience on Git/Bitbucket
  • Hands-on experience building operational tools and dashboards

Nice to have

  • Hands-on experience developing services on a public cloud platform (e.g., AWS, Azure, Oracle)
  • Experience and understanding of multi-AD/AZ and regional data centers
  • Work with large distributed systems
  • Building continuous integration/deployment pipelines with robust testing and deployment schedules
  • Experience working with internal customers and translating requests into prioritized work or features

What the JD emphasized

  • ultra-high-performance GPU platform
  • AI/ML/HPC workloads
  • scale from tens to thousands of GPUs
  • distributed AI/ML/HPC workloads
  • thousands of GPUs
  • RoCE and InfiniBand
  • deep understanding of distributed systems and algorithms
  • low-level systems troubleshooting

Other signals

  • GPU platform for AI/ML/HPC
  • distributed systems
  • high-performance computing