Senior AI/ML Ops Engineer II (Hybrid in Bangalore)

Smartsheet · Seattle · India · Business Intelligence & Ops

Senior AI/ML Ops Engineer responsible for designing, developing, and overseeing scalable and reliable AI/ML Ops platforms and pipelines. This includes model deployment, CI/CD automation, infrastructure management for training and serving, monitoring, resource optimization, and managing foundation models, fine-tuning, and RAG stacks. The role requires close collaboration with data scientists and engineers, and experience with MLOps workflows on Databricks, MLflow, and AI/ML frameworks like LangChain.

What you'd actually do

  1. Design, develop, and oversee the strategy and architecture of scalable and reliable AI/ML Ops platforms and pipelines
  2. Model Deployment: Package and deploy AI/ML services to production, ensuring they are reproducible and interpretable
  3. CI/CD Pipeline Development: Design and implement automated CI/CD (Continuous Integration/Continuous Deployment) pipelines to accelerate model deployment
  4. Infrastructure Management: Provision and optimize infrastructure for training and serving, utilizing Docker, Kubernetes, or serverless platforms
  5. Monitoring & Observability: Implement post-deployment monitoring for model performance, data drift, and latency

Skills

Required

  • Experience in building and maintaining AI/ML Ops platform systems
  • System design, AI/ML frameworks and tools
  • AI/MLOps workflows on Databricks, MLflow, Mosaic AI Agent Framework, Unity Catalog, Vector Search, Knowledge Graph
  • AI/ML frameworks like LangChain, LangGraph
  • Cloud Platforms: Hands-on experience with at least one major cloud provider (AWS, Azure, or GCP)
  • Programming languages like Python and SQL
  • Modern software engineering practices like Kubernetes, CI/CD, IaC tools (preferably Terraform), observability, monitoring, and alerting
  • Solution cost optimisation and design-to-cost

Nice to have

  • Experience with Monte Carlo is preferable
  • Experience with an AWS-hosted data platform is preferable
  • Experience with AWS Bedrock is preferable
  • Experience with Terraform

What the JD emphasized

  • AI/ML Ops platform systems ensuring scalability, reliability, efficiency and security
  • In-depth experience in system design, AI/ML frameworks, and tools involving petabytes of data with the Databricks Lakehouse ecosystem
  • AI/MLOps workflows on Databricks, MLflow, Mosaic AI Agent Framework, Unity Catalog, Vector Search, Knowledge Graph
  • Knowledge of AI/ML frameworks like LangChain, LangGraph for AI/ML Ops pipeline integration

Other signals

  • AI/ML Ops platforms / pipelines
  • Model Deployment
  • CI/CD Pipeline Development
  • Infrastructure Management for training and serving
  • Monitoring & Observability for model performance
  • Automation of retraining and data pipeline workflows
  • Deployment of foundation models, fine-tuning workflows, and RAG stacks
  • Resource Optimization for GPU/CPU utilization
  • MLOps workflows on Databricks
  • AI/ML frameworks like LangChain, LangGraph