Sr Staff Engineer - Core Infrastructure

Uber · Consumer · New York, NY +2 · Engineering

We are seeking a Senior Staff Engineer to lead the technical strategy and evolution of Uber’s Core Infrastructure Platform, focusing on compute, foundations, and software networking. The role involves architecting for extreme scale, optimizing fleet utilization, scaling GPU pools for Generative AI, modernizing the data plane, and integrating AI-driven automation (AIOps) into the infrastructure.

What you'd actually do

  1. Own the technical vision for improving fleet-wide CPU utilization and unit-cost efficiency through ARM adoption (targeting XM+ cores) and silicon diversity.
  2. Define the architecture for shared GPU pools and high-performance clusters to support 300x larger ranking models and Autonomous Vehicle data ingestion.
  3. Drive the convergence of Uber’s networking stack toward industry standards (Kubernetes, Envoy, CNI) while enhancing "SkyEdge" for active-active multi-cloud resilience.
  4. Lead the "100% Done-Done" initiative, ensuring every service follows the standardized safe-deployment process (Starship) and reaches 100% zero-trust authorization.
  5. Integrate AI-driven "Minions" and AIOps into the infrastructure to automate 80% of alerts and unlock thousands of developer-years of productivity.

Skills

Required

  • 12+ years of software engineering experience
  • massive-scale distributed systems or infrastructure
  • Kubernetes internals
  • container runtimes
  • Linux kernel
  • cloud-native networking (Envoy, CNI, Service Mesh)
  • multi-cloud (AWS/GCP) architecture
  • Go, Java, or C++
  • proven ability to lead 40+ person technical initiatives
  • ability to influence VPs and GMs on infrastructure investment

Nice to have

  • optimizing software for ARM architecture
  • specialized AI hardware (GPUs/TPUs)
  • Kubernetes
  • CNCF projects
  • major infrastructure open-source communities
  • building self-healing infrastructure
  • using LLMs/ML to automate infrastructure operations and incident response
  • Zero-Trust Security
  • S2S/P2S security models
  • ransomware-resilient infrastructure
  • driving XXM+ in annual P&L savings
  • resource scheduling
  • operating systems
  • Linux kernel performance tuning
  • eBPF

What the JD emphasized

  • massive-scale ML workloads
  • Platform Engineering 2.0
  • extreme scale
  • scaling GPU pools for Generative AI
  • 300x larger ranking models
  • AI-driven "Minions" and AIOps
  • massive-scale distributed systems
  • petabyte-scale data processing
  • AIOps & Automation
  • LLMs/ML to automate infrastructure operations and incident response
