Senior Member Technical Staff (joinoci-sde)

Oracle · Enterprise · Austin, TX +1

This role focuses on designing and developing large-scale distributed software services and solutions to manage AI infrastructure for Oracle Cloud Infrastructure (OCI). The primary goal is to ensure the reliability, performance, and availability of GPU clusters, which are crucial for AI workloads. While the company is heavily involved in AI, this specific role is centered on the engineering and infrastructure management aspects rather than direct AI model development or research.

What you'd actually do

  1. Design and develop large-scale distributed software services and solutions to manage OCI's AI infrastructure.
  2. Write high-quality, maintainable code by leveraging design reviews, code reviews, unit tests, and integration tests.
  3. Develop complete solutions by ensuring that the services and the components are well-defined and modularized, secure, reliable, diagnosable, actively monitored, compliant and reusable.
  4. Focus on customer needs through a data-driven approach.
  5. Collaborate with other team members working on the same project to meet customer requirements.

Skills

Required

  • 3+ years of experience in software development with programming languages including, but not limited to, C, C++, C#, Java, Go, and Rust.
  • 1+ year of experience designing and developing distributed systems and services.
  • Strong problem-solving and troubleshooting skills, with the ability to analyze complex systems and identify areas for improvement.
  • Excellent communication and collaboration skills, with the ability to work effectively in cross-functional teams.

Nice to have

  • Experience in managing cloud infrastructure with hundreds of thousands of servers.
  • Experience in containerization technologies such as Docker and Kubernetes.
  • Experience in scheduling high-performance workloads on Kubernetes or Slurm.

What the JD emphasized

  • Deliver trusted, fast health determinations and customer‑initiated diagnostics that reduce false positives for GPU clusters, prevent unnecessary node returns, increase capacity for customers, protect revenue, and improve uptime, all by providing a safe, OCI‑supported diagnostic experience.