Command Center Operations & Governance Specialist

Weights & Biases · Data AI · Kenilworth, NJ, DC · Data Center - G&A

This role focuses on operations and governance within a global command center for a GPU cloud provider. The specialist defines, maintains, and evolves operational processes, frameworks, and standards, including SOPs, escalation matrices, change management, incident management, and KPI tracking. It is not a technical tooling role; it emphasizes operational discipline and cross-functional alignment to ensure the uptime and operational excellence of large-scale GPU clusters. The role requires experience in data center operations or mission-critical infrastructure and a proven track record of building and scaling operational frameworks.

What you'd actually do

  1. Strengthen, maintain, and govern all standard operating procedures (SOPs), methods of procedure (MOPs), and emergency operating procedures (EOPs) across the Command Center, ensuring they are accurate, accessible, and consistently followed.
  2. Enhance and own the escalation framework, defining clear paths for incident triage, cross-functional coordination, and leadership notification.
  3. Lead change management governance, ensuring all maintenance activities and infrastructure changes follow safe, structured processes.
  4. Develop and manage shift structure, handover protocols, and staffing frameworks to ensure 24/7 operational continuity.
  5. Own the incident management lifecycle, from response coordination to post-incident reviews, RCA facilitation, and corrective action tracking.

Skills

Required

  • 5+ years of experience in data center operations, operations management, or mission-critical infrastructure in a 24/7 environment.
  • Proven track record of building and scaling operational frameworks, SOPs, escalation matrices, and change management in a high-growth environment.
  • Strong project and program management skills with the ability to drive cross-functional alignment.
  • Excellent written and verbal communication; able to translate complex operational requirements into clear, usable documentation.
  • Experience facilitating root cause analysis and driving corrective action to closure.
  • Comfortable working with operational metrics and reporting; able to build dashboards or work with data teams to do so.

Nice to have

  • Lean, Six Sigma, or other process improvement certification.
  • Experience in hyperscale, cloud, or AI infrastructure environments.
  • Background in training program development or operational enablement.
  • Familiarity with ITSM or ticketing platforms (Jira, ServiceNow) for workflow and change management.

What the JD emphasized

  • operational excellence
  • uptime
  • GPU clusters
  • escalation framework
  • change management
  • incident management
  • operational KPIs
  • data center operations
  • mission-critical infrastructure
  • 24/7 environment
  • building and scaling operational frameworks
  • high-growth environment
  • cross-functional alignment
  • root cause analysis