Senior Support Engineer - Tokyo

OpenAI · AI Frontier · Tokyo, Japan · User Operations

This Senior Support Engineer role focuses on providing technical guidance and resolving complex issues for enterprise customers using OpenAI's API platform. It involves designing and running operational processes, using automation and AI to scale support, configuring monitoring and alerting, contributing to reliability reviews, refining incident response, and analyzing operational metrics. It requires strong troubleshooting skills, an SRE/NOC background, experience with monitoring and incident response, scripting ability (Python), knowledge of cloud infrastructure, and cross-functional collaboration skills.

What you'd actually do

  1. Be among the foremost technical and troubleshooting experts for our API platform at OpenAI. You are the last line of defense before the core Engineering team.
  2. Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies. Contribute to shaping the future of technical support in an AI-driven era.
  3. Configure and use advanced monitoring and alerting workflows to proactively detect customer-impacting issues in real time.
  4. In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates. Ensure that operational readiness (monitoring, alerting, and fallback plans) is in place for any such changes.
  5. Design and refine incident response processes and documentation across strategic customers, engineering, and support teams.

Skills

Required

  • Bachelor’s degree in Computer Science or a related field
  • 8+ years of experience in technical operations roles such as SRE/NOC
  • Experience designing monitoring systems
  • Experience resolving production issues in fast-paced, mission-critical environments
  • Skilled at troubleshooting complex technical problems at the systems level
  • Deep familiarity with modern monitoring, alerting, and observability practices
  • Hands‑on experience setting up or managing metrics, logging, and tracing for distributed systems
  • Proven experience leading incident response for high‑severity outages or service disruptions
  • Able to perform real‑time incident coordination, root cause analysis, and drive follow‑ups
  • Knowledge of industry best practices for incident management and fault diagnosis
  • Strong skills in scripting or software engineering (e.g., Python or similar)
  • Solid understanding of cloud infrastructure and distributed systems fundamentals
  • Comfortable working with cloud services, load balancers, databases, and containerized applications
  • Effective at working cross‑functionally in a high‑trust environment
  • Strong communication skills to explain technical issues and resolutions to both engineering and non‑technical stakeholders
  • Able to coordinate efforts across teams
  • Comfortable providing updates during an ongoing incident

Nice to have

  • Strong software engineering foundation

What the JD emphasized

  • last line of defense
  • core Engineering team
  • scale support operations
  • advancements in AI technologies
  • customer-impacting issues
  • strategic customer requirement updates
  • incident response processes
  • troubleshooting complex technical problems at the systems level
  • high-severity outages
  • service disruptions
  • real-time incident coordination
  • root cause analysis
  • incident management
  • fault diagnosis