Infrastructure Ops Engineer

Baseten · Data AI · San Francisco, CA · G&A

This role is for an Infrastructure Ops Engineer at Baseten, a company that provides inference infrastructure for AI companies. The engineer will manage the operational aspects of global infrastructure, focusing on hardware lifecycles, Kubernetes, and cloud-native tools. Key responsibilities include fleet maintenance, fulfilling customer capacity requests, improving system observability, orchestrating maintenance, documenting GPU-specific issues, and building automation to reduce manual intervention. The role acts as a bridge between customers, SRE, and infrastructure teams to ensure platform reliability and readiness for AI deployments.

What you'd actually do

  1. Fleet Maintenance: Manage daily node operations including tainting/untainting, node draining, and PVC repairs to ensure GPU fleet health and operational cost control
  2. GTM & Capacity Fulfillment: Partner with Sales and account teams to scope and fulfill customer capacity requests, translating complex timelines into concrete infrastructure actions and clear ETAs
  3. Process & Observability Engineering: Identify recurring gaps in the capacity lifecycle (intake, triage, comms) and drive fixes by defining lightweight processes and improving system observability
  4. Technical Orchestration: Act as the operational bridge between SRE and Infra teams, executing discrete changes and verifying system status during high-stakes maintenance windows
  5. Technical Documentation: Contribute to the internal knowledge base for GPU-specific issues (H100/A100/B200) to accelerate future incident resolution
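The node operations in item 1 map onto standard `kubectl` workflows. A minimal sketch of a maintenance-window runbook step, assuming a hypothetical node name `gpu-node-01` and a hypothetical taint key `maintenance` (the `$DRY_RUN` prefix prints each command instead of executing it, since no real cluster is assumed):

```shell
# Hypothetical node taken out of rotation for hardware repair.
NODE="gpu-node-01"
# Prefix commands with echo so this sketch prints rather than executes.
# Set DRY_RUN="" to run against a real cluster.
DRY_RUN="echo"

# 1. Cordon: mark the node unschedulable so no new pods land on it.
$DRY_RUN kubectl cordon "$NODE"

# 2. Drain: evict running pods (DaemonSet pods stay; emptyDir data is lost).
$DRY_RUN kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# 3. Taint: record why the node is out of rotation.
$DRY_RUN kubectl taint nodes "$NODE" maintenance=hardware-repair:NoSchedule

# 4. After repair: remove the taint and return the node to service.
$DRY_RUN kubectl taint nodes "$NODE" maintenance-
$DRY_RUN kubectl uncordon "$NODE"
```

Tainting and cordoning are distinct: a cordon blocks all scheduling, while a taint blocks only pods without a matching toleration, which is why both appear in GPU fleet runbooks.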

Skills

Required

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field
  • 2+ years of professional work experience, ideally in a customer-facing technical role or as a junior SRE/Cloud Engineer
  • Strong familiarity with Kubernetes and the lifecycle of cloud-based container workloads
  • Strong ownership mindset and attention to detail, demonstrated through fast detection, clear communication, and reliable follow-through
  • Demonstrated ability to communicate complex technical blockers clearly to both internal engineering teams and external vendors

Nice to have

  • Exposure to a variety of ML startups or the broader ML infrastructure ecosystem

What the JD emphasized

  • customer capacity requests
  • GPU fleet health
  • high-stakes maintenance windows
  • GPU-specific issues (H100/A100/B200)
  • customer disruption

Other signals

  • powers mission-critical inference
  • ship AI products
  • operational engine of our global infrastructure
  • technical customer success and infrastructure engineering
  • managing clusters
  • capacity strategies are translated into boots-on-the-ground results
  • project-manage the resolution of capacity puzzles
  • ensuring our platform remains reliable, observable, and ready for the next massive AI deployment
  • improving system observability
  • executing discrete changes and verifying system status during high-stakes maintenance windows
  • Automation & Tooling
  • reduce manual intervention and shorten time-to-mitigation
  • GPU-specific intelligence (H100/B200) and market moves