Engineering Manager, Agentic Systems - Moveworks

ServiceNow · Enterprise · Mountain View, CA +1 · Engineering

Engineering Manager for Machine Learning Infrastructure at Moveworks (acquired by ServiceNow), focusing on building and scaling the end-to-end ML/LLM lifecycle platform, including distributed training, inference, and evaluation frameworks. The role involves leading a team, defining technical vision, optimizing GPU infrastructure, and partnering with other teams to power agentic AI experiences.

What you'd actually do

  1. Lead, Mentor, and Grow a world-class team of ML and Systems Engineers, fostering a culture of innovation, ownership, and operational excellence that aligns with Moveworks' core principles.
  2. Own the Technical Vision and roadmap for the end-to-end ML platform that powers the entire lifecycle—from data synthesis and distributed training to ultra-low-latency inference and serving—for hundreds of production models, including our proprietary MoveLM series.
  3. Drive the Strategy for model performance and efficiency, making critical architectural decisions to optimize our GPU infrastructure for latency, throughput, and cost at massive scale.
  4. Partner with Leaders across agentic platform, search platform, product engineering, and core infrastructure teams to define and deliver the foundational infrastructure that will power the next generation of agentic AI experiences.
  5. Champion a Product Mindset for your platform, building powerful abstractions and tools that accelerate the velocity of machine learning engineers and researchers across the organization.

Skills

Required

  • Master's or Ph.D. in Computer Science, Machine Learning, or a related field
  • 5+ years of industry experience with a proven track record of leading or managing high-performing machine learning or infrastructure teams
  • Deep technical expertise in designing, building, and scaling end-to-end machine learning systems in production environments
  • Strong command of Python and experience with performant languages such as C++ or Go
  • Extensive experience with deep learning frameworks and libraries such as PyTorch or Hugging Face
  • Hands-on experience with modern LLM infrastructure, including distributed training frameworks (e.g., DeepSpeed), inference/serving frameworks (e.g., vLLM, TensorRT-LLM), and container orchestration (e.g., Kubernetes)
  • A strategic mindset with experience balancing the demands of operating robust, scalable infrastructure with the need for forward-looking research and development
  • Excellent communication and collaboration skills, with experience working cross-functionally to deliver complex projects

What the JD emphasized

  • absolutely critical
  • end-to-end ML/LLM lifecycle
  • distributed training
  • inference
  • model evaluation frameworks
  • LLM latency optimization
  • massive scale
  • next generation of agentic AI experiences

Other signals

  • ML Infrastructure
  • LLM lifecycle
  • distributed training
  • inference optimization
  • model evaluation