Sr. Director of Software Engineering - AI Infrastructure Platform

JPMorgan Chase · Banking · Palo Alto, CA +1 · Corporate Sector

Senior Director of Software Engineering for AI Infrastructure Platform at JPMorgan Chase, leading multiple departments to deliver a unified AI infrastructure layer across on-premises, public cloud, and accelerated-compute vendors. Owns training and experimentation on a Kubernetes-standardized platform, acting as a design partner for architectural trade-offs and ensuring reliable, secure, and operable systems at enterprise scale. Responsibilities include standardizing AI developer workflows, building platform APIs, improving GPU availability and utilization, defining scheduling and placement strategies, embedding security and controls, driving operational excellence, and leading engineering teams.

What you'd actually do

  1. Lead multiple technology and platform implementations across departments to deliver firmwide AI infrastructure objectives, with a primary focus on training and experimentation platforms operating at enterprise scale.
  2. Own the design, delivery, and evolution of a Kubernetes‑first training and experimentation platform, including Kubernetes‑native support for batch and distributed training jobs, lifecycle management, retry semantics, and failure recovery patterns.
  3. Standardize AI developer workflows for experimentation, enabling self‑service job submission, reusable templates and golden paths, reproducibility mechanisms, and consistent runtime behavior across hybrid deployment environments.
  4. Build and evolve platform APIs and automation, including Kubernetes controllers and operators where appropriate, to ensure the platform is safe, scalable, and easy to adopt across teams.
  5. Drive measurable improvements in GPU availability and utilization through reliability engineering, fleet readiness patterns, and accelerated capacity onboarding.
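The retry semantics and failure-recovery patterns named in item 2 have a simple core that can be sketched generically. The Python sketch below is illustrative only (it is not JPMorgan's implementation, and the `max_retries` and `base_delay` defaults are assumed); it mirrors the bounded-retry pattern Kubernetes batch Jobs expose via `backoffLimit`.

```python
import time

def run_with_retries(job, max_retries=3, base_delay=1.0):
    """Run a job callable, retrying on failure with exponential backoff.

    Mirrors the Kubernetes Job pattern of a bounded retry budget
    (cf. `backoffLimit`): after max_retries failed attempts the
    job is surfaced as failed rather than retried forever.
    """
    for attempt in range(max_retries + 1):
        try:
            return job()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted: fail the job
            # exponential backoff before the next attempt
            time.sleep(base_delay * (2 ** attempt))
```

In a real platform the equivalent policy lives in the Job controller or training operator rather than in user code, so that retry budgets are enforced uniformly across tenants.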

Skills

Required

  • 15+ years of engineering experience, including 8+ years of senior engineering leadership experience with responsibility for managing managers
  • Demonstrated experience delivering platform products (beyond foundational infrastructure) with strong adoption, reliability, and operational maturity
  • Experience developing and leading large, cross-functional engineering teams within highly matrixed and complex enterprise environments
  • Proven track record of leading complex initiatives supporting distributed system design, testing, and operational stability at scale
  • Deep hands-on expertise with Kubernetes-based platforms, including:
    ◦ Multi-tenancy, RBAC, admission control, and network policy
    ◦ Multi-cluster operations, upgrades, and cluster lifecycle management
    ◦ Controllers, operators (CRDs), and platform API design patterns
  • Experience supporting AI training and experimentation platforms, including:
    ◦ PyTorch and distributed training concepts such as scaling, orchestration, and failure modes
    ◦ Ray or similar frameworks for distributed experimentation execution
    ◦ Familiarity with Slurm or equivalent HPC or batch schedulers and core concepts such as queues, fair-share, reservations, and preemption
  • Understanding of modern AI inference stacks (for example, vLLM) and how serving constraints (latency, throughput, batching, KV cache behavior, and GPU memory limits) influence training and experimentation platform design
  • Strong understanding of GPU infrastructure fundamentals, including NVIDIA ecosystem capabilities, health and telemetry signals, and scheduling and placement constraints
  • Extensive practical experience with cloud-native technologies and hybrid infrastructure environments spanning on‑premises and public cloud
  • Experience hiring, developing, coaching, and retaining high‑performing engineering talent
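One of the scheduler concepts listed above, fair-share, can be made concrete. The sketch below implements the classic Slurm fair-share formula, F = 2^(-usage/shares); the inputs are illustrative, not taken from the posting.

```python
def fair_share_factor(effective_usage: float, norm_shares: float) -> float:
    """Classic Slurm fair-share factor: F = 2 ** (-usage / shares).

    An account that has consumed exactly its allocated share gets
    F = 0.5; under-served accounts approach 1.0 and over-served
    accounts decay toward 0, raising or lowering job priority.
    """
    return 2.0 ** (-effective_usage / norm_shares)
```

For example, an account holding 25% of the shares that has consumed 25% of recent usage gets a factor of 0.5, while one that has consumed nothing gets 1.0.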

Nice to have

  • Experience operating large‑scale GPU fleets, including heterogeneous accelerator environments
  • Experience delivering hybrid AI platforms across on‑premises infrastructure, public cloud, and specialized accelerated‑compute vendors
  • Experience working at the code level within large‑scale distributed systems

What the JD emphasized

  • Kubernetes-first training and experimentation platform
  • GPU availability and utilization
  • 15+ years of engineering experience, including 8+ years of senior engineering leadership experience with responsibility for managing managers
  • platform products
  • Kubernetes-based platforms
  • AI training and experimentation platforms

Other signals

  • leading multiple technical areas and managing multiple departments
  • delivering a unified AI infrastructure layer
  • Kubernetes-standardized platform
  • enterprise scale
  • GPU availability and utilization
  • governance-based scheduling and placement strategies
  • enterprise-grade security, risk, and control requirements
  • operational excellence
  • SLIs and SLOs
  • error budgets
  • incident management
  • capacity forecasting
  • end-to-end platform observability
  • senior leaders, stakeholders, and executives
  • competing priorities
  • complex initiatives
  • high-performing organization
  • strong engineering standards
  • scalable operating models
  • accountability
  • continuous improvement
  • diversity, opportunity, inclusion, and respect
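Several of the reliability signals above (SLOs, error budgets) reduce to simple arithmetic. As a hedged illustration with assumed numbers: a 99.9% availability SLO over a 30-day window permits roughly 43.2 minutes of downtime.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    budget = (1 - SLO) * window length; e.g. a 99.9% SLO over
    30 days permits 0.1% of 43,200 minutes, about 43.2 minutes.
    """
    return (1.0 - slo) * window_days * 24 * 60
```

Teams typically spend this budget deliberately (releases, migrations) and freeze risky changes once incidents have consumed it.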