Sr. Applied AI Engineer

Zapier · Enterprise · NAMER · Engineering

Zapier is seeking a Sr. Applied AI Engineer to build and evolve shared AI Platform capabilities, focusing on LLM Ops and ML Ops. This role involves creating common foundations for AI/ML development across the company, including LLM proxy servers, observability tooling, and ML Ops platform capabilities. The engineer will improve how models are accessed, monitored, evaluated, deployed, governed, and operated in production, defining standards and best practices for building AI-driven experiences.

What you'd actually do

  1. Build and evolve shared AI Platform capabilities that serve as the foundation for teams building with machine learning and generative AI across Zapier.
  2. Improve our LLM Ops and ML Ops capabilities, including observability, monitoring, evaluation, deployment workflows, and operational guardrails.
  3. Design and implement systems that help teams measure and improve the performance, reliability, safety, and cost efficiency of AI-powered experiences.
  4. Proactively identify tooling gaps and work across teams to standardize best practices for building, deploying, and monitoring AI-driven experiences.
  5. Evaluate emerging tools, models, and patterns in the AI ecosystem, and help determine which ones should be incorporated into Zapier’s shared platform.

Skills

Required

  • 7+ years of experience in software engineering
  • 3+ years of experience building distributed, scalable, cloud-based ML/AI systems in production
  • 2+ years of experience in LLM Ops, ML Ops, or adjacent platform/infrastructure work
  • Experience building shared services, internal platforms, or reusable developer tooling
  • Experience working through the full lifecycle of building, testing, deploying, and scaling ML/LLM architectures
  • Experience building with cloud infrastructure technologies
  • TypeScript
  • Python
  • Modern backend practices

Nice to have

  • Comfort with typed languages

What the JD emphasized

  • LLM Ops
  • ML Ops
  • building shared services, internal platforms, or reusable developer tooling
  • full lifecycle of building, testing, deploying, and scaling ML/LLM architectures
  • measure and improve the performance, reliability, safety, and cost efficiency of AI-powered experiences

Other signals

  • LLM Ops
  • ML Ops
  • platform engineering
  • developer experience
  • internal tooling