Principal Machine Learning Engineer

Oracle · Enterprise · United States

This Principal Machine Learning Engineer role at Oracle Cloud Infrastructure (OCI) focuses on building state-of-the-art training infrastructure for massive GPU clusters and designing agentic systems for enterprise-scale deployment. The role involves the entire software and model development lifecycle, including training, fine-tuning, model serving, and evaluation, with a strong emphasis on distributed systems, cloud architecture, and scaling heterogeneous CPU/GPU training infrastructure for large multimodal frontier models.

What you'd actually do

  1. Design and develop AI software in Java, Python, and other languages.
  2. Participate in the entire software lifecycle: development, testing, CI/CD, and production operations.
  3. Participate in the entire model development cycle: training, fine-tuning, model serving, evaluation/benchmarking, and human preference learning.
  4. Apply engineering principles to define robust, maintainable architectures and designs.
  5. Build cloud services on top of the modern Infrastructure as a Service (IaaS) building blocks at OCI.

Skills

Required

  • BS/MS in Computer Science or equivalent experience
  • 6–10+ years building and shipping enterprise distributed or cloud-native systems
  • Strong foundation in system design, distributed systems, and cloud architecture best practices
  • Proficiency in Java, Python, or similar object-oriented languages
  • Experience building highly available services using service-oriented design patterns and service-to-service communication protocols
  • Proven ability to deliver impact in collaborative, fast-paced environments
  • Strong verbal and written communication skills, including technical design documentation
  • Hands-on experience with containers and orchestration technologies such as Kubernetes and Docker

Nice to have

  • Production experience with cloud and ML technologies
  • Generative AI modeling: customizing LLMs, and building and deploying LLMs at scale for large-scale data generation
  • Algorithms: transformer models, attention mechanisms, prompt tooling

What the JD emphasized

  • Experience scaling heterogeneous CPU/GPU training infrastructure for large multimodal frontier models

Other signals

  • building state-of-the-art training infrastructure for massive GPU clusters
  • designing agentic systems deployed on OCI infrastructure at enterprise scale
  • innovating on leading principles of agentic software development
  • driving the frontier on infrastructure that maximizes the potential of bleeding-edge GPU clusters
  • multimodal data generation
  • leading the framework across Oracle