Engineering Manager, Data Infrastructure

Weights & Biases · Data AI · Bellevue, WA +1 · Information Technology

CoreWeave is seeking an Engineering Manager for its Data Infrastructure team. The role leads a team of Software Engineers and Site Reliability Engineers responsible for the performance, reliability, scalability, and security of the company's data platform, which spans ingestion, transformation, analytics, and AI workloads. The manager will own core systems such as compute engines, orchestration frameworks, and storage layers, and will partner with other engineering teams to keep the platform robust and secure. The role requires strong leadership and technical expertise in data platform infrastructure, Kubernetes, and distributed systems.

What you'd actually do

  1. Lead a team of Software Engineers and Site Reliability Engineers responsible for the infrastructure that powers CoreWeave’s data platform
  2. Own the reliability, scalability, and performance of core systems such as compute engines, orchestration frameworks, and storage layers
  3. Partner closely with Data Engineering teams and with cross-functional groups, including Production Engineering, Developer Experience, Security Engineering, and IT Operations, to ensure the platform is robust, secure, and easy to operate
  4. Balance people leadership with deep technical ownership, stepping in hands-on when needed to support critical initiatives
  5. Set team goals and metrics (e.g., OKRs) and hold teams accountable to outcomes

Skills

Required

  • 7+ years of experience in software engineering, infrastructure engineering, or data platform engineering roles
  • 2+ years of experience managing engineering teams, including hiring, coaching, performance management, and career development
  • Experience leading teams through the full software development lifecycle (SDLC), including planning, execution, and delivery of complex technical initiatives
  • Experience running and evolving engineering processes (e.g., agile development, backlog management) to drive predictable execution and continuous improvement
  • Experience setting team goals and metrics (e.g., OKRs) and holding teams accountable to outcomes
  • Strong hands-on experience operating and scaling data platform infrastructure (e.g., Spark, Airflow, Iceberg, StarRocks) in production environments
  • Deep expertise in Kubernetes and containerized software development, including cluster design, operations, and scaling in production environments
  • Experience building and operating distributed systems with high availability and performance requirements, including SLOs and incident management
  • Strong understanding of data platform architecture (compute, orchestration, storage) and experience driving reliability, performance, and cost optimization at the platform level
  • Ability to contribute code and technical solutions when needed, with proficiency in at least one programming language (Python, Java, Go, Rust)
  • Experience partnering with cross-functional engineering teams (e.g., Production Engineering, Developer Experience, Security, IT) and data engineering teams to deliver cohesive platform solutions

Nice to have

  • Experience supporting high-scale data workloads (e.g., large-scale Spark clusters, real-time ingestion platforms)
  • Experience working in environments with strict uptime and reliability requirements (e.g., ≥99.99% uptime)
  • Experience working in regulated environments with compliance frameworks such as GDPR, SOC 2, HIPAA, or SOX
  • Experience building internal platforms that enable self-service analytics or developer productivity

What the JD emphasized

  • production-grade discipline
  • stringent uptime requirements
  • high availability and performance requirements
  • SLOs and incident management
  • regulated environments with compliance frameworks such as GDPR, SOC 2, HIPAA, or SOX