What you'd actually do

Be among the foremost technical and troubleshooting experts for our API platform at OpenAI. You are the last line of defense before the core Engineering team.

Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies. Contribute to shaping the future of technical support in an AI-driven era.

Configure and use advanced monitoring and alerting workflows to proactively detect customer-impacting issues in real time.

In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates. Ensure that operational readiness (monitoring, alerting, and fallback plans) is in place for any such changes.

Design and refine incident response processes and documentation across strategic customers, engineering and support teams.

Skills

Required

Bachelor’s degree in Computer Science or a related field
8+ years of experience in technical operations roles such as SRE/NOC
Designing monitoring systems
Resolving production issues in fast-paced and mission-critical environments
Troubleshooting complex technical problems at the systems level
Deep familiarity with modern monitoring, alerting, and observability practices
Hands‑on experience setting up or managing metrics, logging, and tracing for distributed systems
Proven experience leading incident response for high‑severity outages or service disruptions
Able to perform real‑time incident coordination, root cause analysis, and drive follow‑ups
Knowledge of industry best practices for incident management and fault diagnosis
Strong skills in scripting or software engineering (e.g., Python or similar)
Solid understanding of cloud infrastructure and distributed systems fundamentals
Comfortable working with cloud services, load balancers, databases, and containerized applications
Effective at working cross‑functionally in a high‑trust environment
Strong communication skills to explain technical issues and resolutions to both engineering and non‑technical stakeholders
Able to coordinate efforts across teams
Comfortable providing updates in the midst of an ongoing incident

Nice to have

Strong software engineering foundation

What the JD emphasized

last line of defense

core Engineering team

scale support operations

advancements in AI technologies

customer impacting issues

strategic customer requirement updates

incident response processes

troubleshooting complex technical problems at the systems level

high-severity outages

service disruptions

real-time incident coordination

root cause analysis

incident management

fault diagnosis

About the Team

The Technical Support team is responsible for ensuring that developers and enterprises can reliably build mission-critical solutions using OpenAI models. We provide technical guidance, resolve complex issues, and support customers in maximizing value and adoption from deploying our highly capable models. We work closely with Technical Success, Product, Engineering, and others to deliver the best possible experience to our customers at scale. We work from an automation-first mindset and leverage the latest in AI to scale our support operations. Join the Senior Support Engineering (SSE) team at OpenAI and help shape the future of Technical Support in the age of AI.

About the Role

We are looking for a Senior Support Engineer to collaborate directly with our strategic enterprise accounts and product teams, helping solve some of the most difficult problems faced by our Customers. You will be part of the best technical troubleshooting team at OpenAI, and our Customers and Engineering teams will look to you for technical guidance in addressing the most technically difficult issues in our environment.

As a Senior Support Engineer, you will design and run operational processes to monitor our top strategic customers and a 24x7 response team. You’ll work closely with our Infrastructure and Engineering teams to deliver the best possible experience to customers at scale. Working directly with our most strategic Customers, you will be crucial to the success of the most innovative, disruptive, and high-scale AI solutions being built with the OpenAI API platform.

The nature of this role will be low volume, high difficulty.

This role is based in Tokyo, Japan. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

Be among the foremost technical and troubleshooting experts for our API platform at OpenAI. You are the last line of defense before the core Engineering team.
Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies. Contribute to shaping the future of technical support in an AI-driven era.
Configure and use advanced monitoring and alerting workflows to proactively detect customer-impacting issues in real time.
In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates. Ensure that operational readiness (monitoring, alerting, and fallback plans) is in place for any such changes.
Design and refine incident response processes and documentation across strategic customers, engineering and support teams.
Analyze operational metrics and incident RCAs to identify areas for improvement. Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows.
Provide support coverage during holidays and weekends based on business needs.

You might thrive in this role if you:

Have a Bachelor’s degree in Computer Science or a related field. A strong software engineering foundation is important for this role’s success.
Have 8+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments, with a strong track record of troubleshooting complex technical problems at the systems level.
Have deep familiarity with modern monitoring, alerting, and observability practices. Hands‑on experience setting up or managing metrics, logging, and tracing for distributed systems (e.g., understanding of SLIs/SLOs, alert tuning, dashboard creation).
Have proven experience leading incident response for high‑severity outages or service disruptions. You can perform real‑time incident coordination and root cause analysis, and drive follow‑ups (post‑mortems, action items) to prevent recurrence. You know industry best practices for incident management and fault diagnosis.
Have strong skills in scripting or software engineering (e.g., Python or similar) to automate repetitive tasks and integrate tools.
Have a solid understanding of cloud infrastructure and distributed systems fundamentals. You are comfortable working with cloud services, load balancers, databases, and containerized applications.
Are effective at working cross‑functionally in a high‑trust environment. Strong communication skills to explain technical issues and resolutions to both engineering and non‑technical stakeholders. You can coordinate efforts across teams and are comfortable providing updates in the midst of an ongoing incident.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.

For additional information, please see OpenAI’s Affirmative Action and Equal Employment Opportunity Policy Statement.

Background checks for applicants will be administered in accordance with applicable law, and qualified applicants with arrest or conviction records will be considered for employment consistent with those laws, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act, for US-based candidates. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.