Sr. ML Platform Engineer (hybrid)

CrowdStrike · Enterprise · Bangalore, India

Senior ML Platform Engineer focused on reliability and debugging of distributed ML infrastructure (Ray, Spark, Airflow, MLflow, Kubeflow, SLURM) for training and inference pipelines, optimizing performance, and building observability solutions.

What you'd actually do

  1. Diagnose and resolve issues across Ray, Spark, Airflow, MLflow, JupyterHub, Kubeflow, and SLURM
  2. Perform root cause analysis on production incidents affecting training and inference pipelines
  3. Debug performance bottlenecks, resource contention, memory leaks, and scheduling conflicts
  4. Develop debugging tools and diagnostic frameworks
  5. Profile and optimize Ray clusters and Spark jobs on K8s and cloud platforms (EMR/Dataproc)

Skills

Required

  • 12+ years in distributed systems engineering
  • 5+ years debugging ML platforms in production
  • Deep expertise in 3+ of: Ray, Spark, JupyterHub, SLURM, K8s
  • Performance profiling, optimization, and capacity planning
  • Expert Python debugging
  • Multi-language proficiency
  • Linux/Unix

Nice to have

  • Open-source ML infrastructure contributions
  • Experience with high-throughput inference systems and reducing MTTR
  • Published debugging guides or tools
  • Chaos engineering and GPU/CUDA debugging experience
  • On-call and incident management experience

What the JD emphasized

  • debugging ML platforms in production
  • Performance profiling, optimization, and capacity planning
  • Expert Python debugging

Other signals

  • ML infrastructure
  • distributed systems
  • production debugging
  • performance optimization