Sr. Machine Learning Operations Engineer

Smartsheet · Seattle · India · Business Intelligence & Ops

This role focuses on engineering and automating the machine learning production lifecycle, including model deployment, retraining, inference, and monitoring. It involves managing cloud infrastructure and ensuring the scalability, reliability, and cost-effectiveness of AI products, acting as a bridge between data scientists and software engineers.

What you'd actually do

  1. Automate the deployment and retraining of ML models, from training through to production inference, by building and managing complete CI/CD/CT (Continuous Training) pipelines, adhering to MLOps best practices.
  2. Build, fine-tune, or use pre-trained LLMs, deep learning models, or traditional machine learning models.
  3. Implement model versioning, lineage tracking, and auditing to ensure compliance with security and ethical standards.
  4. Continuously monitor the health and performance of production machine learning models, proactively identifying and correcting model drift, staleness, and performance degradation.
  5. Provision and manage scalable cloud infrastructure using Infrastructure as Code (IaC).

Skills

Required

  • 5+ years of experience creating, deploying, and scaling machine learning solutions in a cloud environment (e.g., AWS, GCP, Azure)
  • Ability to use tools such as SageMaker, Glue, Lambda, and Docker to create ML models and data pipelines
  • 7+ years of programming experience in languages used in AI/ML (e.g., Python, Scala)
  • 4+ years of experience developing deep learning and traditional ML models using common frameworks such as PyTorch, TensorFlow, Hugging Face, and scikit-learn
  • Strong applied data science skills: ability to recognize data patterns, understand how and when to use various machine learning approaches (e.g., supervised/unsupervised learning, deep learning), and evaluate the performance of ML algorithms.
  • Proven ability to stay up to date with the latest advancements in Generative AI approaches (e.g., OpenAI, LangChain, Stable Diffusion APIs).
  • Experience developing, documenting, and supporting REST APIs
  • A degree in Computer Science, Engineering, or a related field or equivalent practical experience.

What the JD emphasized

  • architect our machine learning production lifecycle
  • maintain and deploy ML models to a scalable, reliable, and secure production environment
  • design and maintain the infrastructure, automation, and monitoring systems
  • ensure our AI products are high-performing and cost-effective
  • Automate the deployment and retraining of ML models
  • building and managing complete CI/CD/CT (Continuous Training) pipelines
  • adhering to MLOps best practices
  • Build, fine-tune, or use pre-trained LLMs
  • Evaluate and recommend AI or ML solutions
  • Implement model versioning, lineage tracking, and auditing
  • ensure compliance with security and ethical standards
  • Continuously monitor the health and performance of production machine learning models
  • proactively identifying and correcting model drift, staleness, and performance degradation
  • manage necessary model retraining cycles
  • Act as the "glue" between Data Scientists (who build models) and Software Engineers (who consume them).
  • Partner effectively with software engineers, product managers and business functions to integrate the machine learning solutions
  • Provision and manage scalable cloud infrastructure using Infrastructure as Code (IaC).
  • Provide architectural guidance and mentorship
  • Distill complex ML concepts into easy-to-follow technical documentation.

Other signals

  • MLOps
  • CI/CD/CT
  • model deployment
  • inference
  • monitoring
  • cloud infrastructure