Principal Infrastructure Engineer

Oracle · Enterprise · Tokyo, Japan

Principal Infrastructure Engineer focused on designing, deploying, and supporting large-scale GPU/HPC infrastructure and AI platforms on Oracle Cloud Infrastructure (OCI). The role combines pre-sales technical consulting, customer enablement, and solution engineering for AI workloads, agentic systems, and robotic AI platforms.

What you'd actually do

  1. Architect and deploy large-scale GPU/HPC infrastructure on OCI using tools such as Terraform, Ansible, Slurm, and Kubernetes.
  2. Build automated solutions for cluster provisioning, software deployment, and infrastructure as code.
  3. Collaborate with Oracle’s largest enterprise customers to define and tailor solutions that meet high-performance compute and AI requirements.
  4. Support LLM-based solutions, agentic AI systems, and robotic AI platforms from design through deployment.
  5. Act as a trusted technical advisor, guiding customers on best practices, cloud migration strategies, and deployment patterns.

Skills

Required

  • HPC infrastructure
  • GPU infrastructure
  • AI platform engineering
  • Scripting and automation (Python, Bash, PowerShell)
  • Terraform
  • Ansible
  • Kubernetes
  • Cluster managers (Slurm, PBS, Bright)
  • Container orchestration
  • RDMA
  • InfiniBand
  • MPI
  • Distributed file systems
  • Cloud-native experience
  • AI/ML platforms
  • Large language models (LLMs)
  • Inference serving stacks
  • Pre-sales technical consulting
  • Solution architecture
  • Communication and presentation skills

Nice to have

  • Slurm
  • PowerShell
  • Oracle Cloud Infrastructure (OCI)
  • Bachelor’s or Master’s degree in Computer Science, Engineering, Mathematics, or related field
  • Thought leadership through publications, speaking engagements, or community contributions

What the JD emphasized

  • deep expertise in HPC, GPU infrastructure, and AI platform engineering
  • design and deploy large-scale accelerated computing solutions
  • lead customer engagements
  • drive adoption of cutting-edge AI workloads
  • architect and deploy complex HPC and GPU clusters, AI platforms, and intelligent agentic solutions
  • pre-sales technical consulting, solution engineering, and AI transformation strategy
  • deep technical skills
  • consultative approach
  • develop scalable AI architectures
  • large-scale GPU/HPC infrastructure
  • AI platforms
  • intelligent agentic solutions
  • LLM-based solutions, agentic AI systems, and robotic AI platforms
  • trusted technical advisor
  • best practices, cloud migration strategies, and deployment patterns
  • technical gaps
  • influence product roadmaps
  • key AI Partners
  • Hands-on expertise with GPU and HPC architecture
  • Proficiency in scripting and automation
  • Experience with cluster managers
  • Knowledge of RDMA, InfiniBand, MPI, and distributed file systems
  • Core Cloud Native experience
  • Familiarity with AI/ML platforms, large language models (LLMs), and inference serving stacks
  • 5+ years in pre-sales, technical consulting, or customer-facing solution architecture
  • Strong communication and presentation skills
  • deliver innovative cloud solutions
  • translate complex technical capabilities into business-aligned strategies
  • design, deployment, and support of large-scale AI, GPU, and HPC infrastructure solutions
  • partners closely with customers throughout the entire engagement lifecycle
  • solution architecture and Proof of Concept (POC) through production deployment, optimization, and ongoing operational support
  • technical leadership
  • developing reusable assets, automation, reference architectures, and technical enablement content

Other signals

  • design and deploy large-scale accelerated computing solutions
  • drive adoption of cutting-edge AI workloads on Oracle Cloud Infrastructure (OCI)
  • architect and deploy complex HPC and GPU clusters, AI platforms, and intelligent agentic solutions
  • support LLM-based solutions, agentic AI systems, and robotic AI platforms from design through deployment
  • familiarity with AI/ML platforms, large language models (LLMs), and inference serving stacks