Sr. Manager – Data & AI Support Engineering

Databricks · Data & AI · TX · Support

Lead and manage a team of Technical Solutions Engineers who resolve complex customer issues across Databricks' Data & AI platforms. The role places strong emphasis on building and scaling AI-enabled support workflows, agentic AI systems, and operational automations to improve issue resolution speed, platform reliability, and customer outcomes.

What you'd actually do

  1. Lead and manage a team of Technical Solutions Engineers responsible for driving deep technical resolutions for complex customer issues across Spark, AI/ML, Streaming, and Lakehouse platforms.
  2. Build AI-enabled support workflows and reusable automations to improve resolution speed and support quality.
  3. Use agentic AI systems, logs, telemetry, observability platforms, and internal systems to accelerate troubleshooting and root-cause analysis safely.
  4. Create reusable runbooks, prompts, and agentic workflows that scale operational efficiency across teams.
  5. Ensure strong AI governance, customer data safety, validation practices, auditability, and human-in-the-loop controls.

Skills

Required

  • 10+ years of experience designing, building, troubleshooting, and supporting large-scale Data & AI applications using Python, Java, Scala, Spark, or related distributed technologies.
  • Strong working experience with AI-enabled support workflows, agentic AI systems, Claude Skills workflows, RAG architectures, vector databases, and other operational automation frameworks.
  • Experience using AI tools for troubleshooting, root-cause analysis, observability analysis, and support workflow acceleration.
  • Strong hands-on expertise in Apache Spark, Spark SQL, Structured Streaming, Delta Lake, and distributed data processing systems.
  • Experience leading production-scale workloads across Big Data, Hadoop, AI/ML, Kafka, Streaming, Data Science, or Analytics platforms.
  • Strong troubleshooting and performance tuning experience for Spark and JVM-based distributed systems, including memory management, garbage collection, heap analysis, and thread dump analysis.
  • Hands-on experience with AWS, Azure, or GCP cloud platforms.
  • Proven experience managing globally distributed technical teams and handling high-severity customer escalations.
  • Strong analytical, debugging, problem-solving, and distributed systems troubleshooting skills.
  • Excellent written and verbal communication skills with strong customer-facing leadership abilities.
  • Strong organizational, multitasking, stakeholder management, and operational leadership capabilities.

Nice to have

  • Proven production-scale development and delivery experience with Databricks technologies such as Model Serving, Lakehouse, Delta, DLT, Lakeflow, and Lakebase is a strong plus.

What the JD emphasized

  • AI-enabled support workflows
  • agentic AI systems
  • troubleshooting
  • root-cause analysis
  • operational efficiency
  • customer outcomes
  • AI governance
  • customer data safety
  • validation practices
  • auditability
  • human-in-the-loop controls

Other signals

  • AI-enabled support workflows
  • Agentic AI systems
  • troubleshooting and root-cause analysis
  • operational efficiency
  • customer outcomes