Senior Machine Learning Operations Engineer

Smartsheet · Seattle · India · Engineering - Developers

Senior Machine Learning Operations Engineer responsible for architecting and maintaining the ML production lifecycle, including model deployment, automation, monitoring, and infrastructure management in a cloud environment. The role focuses on ensuring AI products are scalable, reliable, and cost-effective, acting as a liaison between data scientists and software engineers.

What you'd actually do

  1. Automate the deployment and retraining of ML models, from training through to production inference, by building and managing complete CI/CD/CT (Continuous Training) pipelines, adhering to MLOps best practices.
  2. Implement model versioning, lineage tracking, and auditing to ensure compliance with security and ethical standards.
  3. Continuously monitor the health and performance of production machine learning models, proactively identifying and correcting model drift, staleness, and performance degradation.
  4. Provision and manage scalable cloud infrastructure using Infrastructure as Code (IaC).
  5. Act as the "glue" between Data Scientists (who build models) and Software Engineers (who consume them).

Skills

Required

  • 5+ years of experience creating, deploying, and scaling machine learning solutions in a cloud environment (e.g., AWS, GCP, Azure).
  • Ability to use tools such as SageMaker, Glue, Lambda, and Docker to create ML models and data pipelines.
  • 7+ years of programming experience in languages used in AI/ML (e.g., Python, Scala).
  • 4+ years of experience developing deep learning and traditional ML models using common frameworks such as PyTorch, TensorFlow, Hugging Face, and scikit-learn.
  • Strong applied data science skills: ability to recognize data patterns, understand how and when to use various machine learning approaches (e.g., supervised/unsupervised learning, deep learning), and evaluate the performance of ML algorithms.
  • Proven ability to remain up to date with the latest advancements in Generative AI approaches (e.g., OpenAI, LangChain, Stable Diffusion APIs).
  • Experience developing, documenting, and supporting REST APIs.
  • A degree in Computer Science, Engineering, or a related field or equivalent practical experience.

What the JD emphasized

  • MLOps best practices
  • production inference
  • model drift
  • performance degradation
  • scalable cloud infrastructure

Other signals

  • MLOps
  • CI/CD/CT
  • model deployment
  • inference
  • monitoring
  • cloud infrastructure