Senior AI Platform Engineer - Data and Systems

Adobe · Enterprise · San Jose, CA

Senior AI Platform Engineer focused on building the foundational infrastructure for AI, analytics, and autonomous agents at scale. The role involves designing and building streaming-first data pipelines, extending the ML Attribute Store with low-latency serving, developing Agent Data APIs and tool servers for autonomous AI agents, and creating an agentic framework for automation and self-healing pipelines. Experience with LLM integration, agentic AI frameworks, and cloud platforms is required.

What you'd actually do

  1. Design and build streaming-first data pipelines that use event-driven architectures to collapse end-to-end latency from hours to minutes.
  2. Own and extend the ML Attribute Store: build low-latency online serving alongside batch feature computation, with unified batch/streaming aggregation to prevent training-serving skew.
  3. Build MCP-compatible Agent Data APIs and tool servers that make the lakehouse discoverable and queryable by autonomous AI agents through standardized protocols, semantic layers, and catalog-driven data discovery.
  4. Develop an agentic framework covering automated anomaly detection, duplicate event cleanup, transient event lifecycle management with audit trails, pipeline self-healing, and root cause analysis automation.
  5. Drive operational excellence: observability, incident detection and response automation, performance tuning, cost optimization, and on-call ownership for mission-critical platform services.

Skills

Required

  • 6+ years of experience in data platform engineering, distributed systems, or backend infrastructure at scale.
  • Deep hands-on experience with Apache Spark, Databricks, Delta Lake, or equivalent lakehouse technologies (Iceberg, Hudi).
  • Proven track record building and operating large-scale pipelines processing billions of events daily with sub-hour latency SLAs.
  • Strong experience with streaming systems: Kafka, Kinesis, Flink, Spark Structured Streaming, or Delta Live Tables.
  • Proficiency in Python and/or Scala; SQL fluency required.
  • Experience with cloud platforms (AWS or Azure), containerization (Docker, Kubernetes), and CI/CD for data pipelines.
  • Production experience integrating LLMs into engineering workflows — not prototypes, but systems running against real data with real users. Includes prompt engineering, tool-use/function-calling, structured output parsing, and context window management.
  • Hands-on experience with agentic AI frameworks and multi-agent orchestration (LangChain, LangGraph, CrewAI, AutoGen, or custom agent loops with memory, planning, and tool routing).
  • Understanding of MCP (Model Context Protocol) and/or A2A (Agent2Agent) protocols for exposing platform capabilities as agent-consumable tool servers, or demonstrable ability to build equivalent agent-tool integration surfaces.
  • Experience building or operating ML Feature Stores (online and/or offline), including training-serving skew mitigation, feature freshness trade-offs, and real-time feature computation.
  • Familiarity with RAG architectures: embedding generation, vector databases (FAISS, Pinecone, Weaviate, Databricks Vector Search), document chunking strategies, and retrieval evaluation.
  • Exposure to semantic layers, knowledge graphs, or metadata-driven data discovery systems (Unity Catalog, DataHub, OpenMetadata) that enable agents to autonomously navigate enterprise data catalogs.
  • Ability to build evaluation and feedback pipelines for AI systems — measuring agent accuracy, latency, cost attribution per workflow, and reliability at scale.

Nice to have

  • Experience with Java or Go.
  • Demonstrated use of AI-powered developer tools (Claude Code, Cursor, GitHub Copilot, or similar) to accelerate engineering velocity.

What the JD emphasized

  • systems-first engineering role
  • agentic-first approach
  • production experience integrating LLMs into engineering workflows
  • agentic AI frameworks and multi-agent orchestration
  • MCP (Model Context Protocol) and/or A2A protocols
  • Agentic-first instinct

Other signals

  • building foundational infrastructure for AI and autonomous agents
  • evolving lakehouse into a streaming-first, self-healing, agent-ready platform
  • developing agentic framework for automation and self-healing pipelines
  • building MCP-compatible Agent Data APIs and tool servers
  • integrating LLMs into engineering workflows and agentic AI frameworks