Engineering Manager - Agentic Systems

Moveworks · Enterprise · Mountain View, CA · Machine Learning

Engineering Manager for the Machine Learning Infrastructure team, responsible for leading the development, optimization, and scaling of end-to-end systems for the entire ML/LLM lifecycle, including distributed training, inference, model evaluation, and LLM latency optimization. The role focuses on building foundational infrastructure for agentic AI experiences and supporting hundreds of production models.

What you'd actually do

  1. Lead, Mentor, and Grow a world-class team of ML and Systems Engineers, fostering a culture of innovation, ownership, and operational excellence that aligns with Moveworks' core principles.
  2. Own the Technical Vision and roadmap for the end-to-end ML platform that powers the entire lifecycle—from data synthesis and distributed training to ultra-low-latency inference and serving—for hundreds of production models, including our proprietary MoveLM series.
  3. Drive the Strategy for model performance and efficiency, making critical architectural decisions to optimize our GPU infrastructure for latency, throughput, and cost at massive scale.
  4. Partner with Leaders across agentic platform, search platform, product engineering, and core infrastructure teams to define and deliver the foundational infrastructure that will power the next generation of agentic AI experiences.
  5. Champion a Product Mindset for your platform, building powerful abstractions and tools that accelerate the velocity of machine learning engineers and researchers across the organization.

Skills

Required

  • Master's or Ph.D. in Computer Science, Machine Learning, or a related field
  • 5+ years of industry experience with a proven track record of leading or managing high-performing machine learning or infrastructure teams
  • Deep technical expertise in designing, building, and scaling end-to-end machine learning systems in production environments
  • Strong command of Python and experience with performant languages such as C++ or Go
  • Extensive experience with deep learning frameworks like PyTorch or Hugging Face
  • Hands-on experience with modern LLM infrastructure, including distributed training frameworks (e.g., DeepSpeed) and inference/serving frameworks (e.g., vLLM, TensorRT-LLM, Kubernetes)
  • A strategic mindset with experience balancing the demands of operating robust, scalable infrastructure with the need for forward-looking research and development
  • Excellent communication and collaboration skills, with experience working cross-functionally to deliver complex projects

Nice to have

  • Experience working with Machine Learning products

What the JD emphasized

  • absolutely critical to the long-term scalability of our core AI product
  • end-to-end systems for the entire ML/LLM lifecycle
  • distributed training and inference
  • model evaluation frameworks
  • LLM latency optimization
  • ultra-low-latency inference and serving
  • foundational infrastructure that will power the next generation of agentic AI experiences

Other signals

  • leading a team
  • building and optimizing infrastructure
  • scaling ML/LLM lifecycle systems
  • distributed training and inference
  • LLM latency optimization
  • foundational infrastructure for agentic AI