Machine Learning Engineer, API Multicloud

OpenAI · AI Frontier · San Francisco, CA · Applied AI

We're looking for a Machine Learning Engineer to build and improve AI systems for strategic partners, adapting OpenAI models to cloud-native environments. The role spans post-training workflows, evaluation, data pipelines, model behavior, and API/infrastructure integration, with a focus on customizing and deploying models safely and reliably.

What you'd actually do

  1. Partner with strategic customers and internal teams to define target model behaviors, diagnose failure modes, and translate real-world needs into training, evaluation, and system requirements.
  2. Build and scale production ML systems for model customization, post-training, and fine-tuning-as-a-service workflows.
  3. Investigate whether training and customization workflows are producing the intended outcomes, and identify changes to data, evaluation, training, or infrastructure that improve performance.
  4. Partner with backend and infrastructure engineers to integrate ML capabilities into AWS-native API environments.
  5. Feed learnings from partner deployments back into the platform by proposing and implementing improvements to post-training systems, tooling, APIs, and developer workflows.

Skills

Required

  • Machine Learning Engineering
  • Production AI Systems
  • Deep Learning
  • Transformer Models
  • PyTorch
  • TensorFlow
  • Supervised Fine-Tuning
  • Distillation
  • Preference Optimization
  • Reinforcement Learning
  • Post-training Techniques
  • Software Engineering Fundamentals
  • Data Structures
  • Algorithms
  • Systems Design
  • Python
  • Rust
  • Model Customization
  • Evaluation Systems
  • Data Pipelines
  • Distributed Systems
  • Cloud Infrastructure
  • Production ML Platform Tradeoffs
  • Model Behavior
  • APIs
  • Infrastructure
  • Collaboration
  • Ambiguity
  • End-to-end Ownership
  • Learning Agility

Nice to have

  • AWS
  • Kubernetes
  • Agents
  • Tool Use
  • Runtime Environments
  • AI Developer Platforms
  • Speech Models

What the JD emphasized

  • 7+ years of professional engineering experience
  • Strong ML engineering experience building, training, fine-tuning, evaluating, or deploying production AI systems
  • Familiarity with training and fine-tuning large language models, including methods like supervised fine-tuning, distillation, preference optimization, reinforcement learning, or other post-training techniques
  • Strong software engineering fundamentals
  • Experience with model customization, evaluation systems, data pipelines, distributed systems, cloud infrastructure, or production ML platform tradeoffs
  • Comfort moving quickly through ambiguity, owning problems end-to-end, and learning whatever is needed to get the job done

Other signals

  • Extending API platform into strategic cloud environments
  • Enabling key API technologies in AWS-native environments
  • Bringing core developer and enterprise capabilities into cloud-native environments
  • Model customization / post-training as a service
  • New stateful runtime environments for agentic workloads
  • Production ML systems, developer platforms, model behavior, and large-scale infrastructure
  • Build and scale production ML systems for model customization, post-training, and fine-tuning-as-a-service workflows
  • Investigate whether training and customization workflows are producing the intended outcomes
  • Partner with backend and infrastructure engineers to integrate ML capabilities into AWS-native API environments
  • Feed learnings from partner deployments back into the platform
  • Proposing and implementing improvements to post-training systems, tooling, APIs, and developer workflows
  • Bring model improvements, training workflows, and evaluation best practices into production
  • Design systems that allow strategic partners and enterprise customers to safely customize OpenAI models for high-value use cases
  • Debug and improve complex systems spanning model behavior, training data, APIs, distributed infrastructure, and customer-facing product surfaces
  • Operate with high ownership in a 0→1 environment where requirements are ambiguous, systems are evolving quickly, and reliability matters