IC Agentic Engineering Manager - Stargate

OpenAI · AI Frontier · San Francisco, CA · Scaling

This role focuses on applying agent-based systems to infrastructure delivery and operations within OpenAI's Stargate Infrastructure team. The goal is to use AI-driven approaches for deployment orchestration, system bring-up, issue triage, debugging, and capacity management, rather than building general-purpose agent platforms. It's a player-coach role involving direct system design and implementation while leading a small team.

What you'd actually do

  1. Design and build agent-based systems to support infrastructure deployment and operations
  2. Identify high-impact opportunities to apply agents across workflows such as:
     • cluster bring-up and deployment readiness
     • incident triage and root cause analysis
     • system validation and health monitoring
     • capacity management and operational decision-making
  3. Lead a small team while contributing directly as an IC across system design, development, and integration
  4. Partner with infrastructure, hardware, and networking teams to integrate agentic systems into production workflows
  5. Develop systems that leverage telemetry, logs, and system signals to enable closed-loop automation
  6. Define evaluation frameworks to measure system effectiveness, reliability, and operational impact
  7. Drive iteration from prototype to production, ensuring robustness and scalability

Skills

Required

  • Strong software engineering background in distributed systems, infrastructure, or platform engineering
  • Experience building production automation systems or data-driven operational tooling
  • Experience applying AI, ML, or agent-based approaches to real-world systems or workflows
  • Ability to operate as a hands-on IC while leading a small team
  • Experience working cross-functionally with infrastructure, hardware, or systems teams
  • Strong problem-solving skills in complex, ambiguous environments

Nice to have

  • Experience with LLM-based systems, agents, or autonomous workflows
  • Background in infrastructure operations, SRE, or large-scale system deployment
  • Experience working on cluster bring-up, debugging, or data center infrastructure systems
  • Familiarity with telemetry, monitoring systems, and observability pipelines
  • Experience building internal tools or platforms for engineering productivity and operations

What the JD emphasized

  • applying agent-based approaches to real-world systems or workflows
  • Experience with LLM-based systems, agents, or autonomous workflows

Other signals

  • applying AI-driven approaches to real-world infrastructure workflows
  • applying agentic systems to solve concrete infrastructure problems
  • develop systems that leverage telemetry, logs, and system signals to enable closed-loop automation