Senior AI/ML Ops Engineer (Hybrid in Bangalore)

Smartsheet · Seattle · India · Business Intelligence & Ops

The Senior AI/ML Ops Engineer is responsible for designing, developing, and overseeing scalable and reliable AI/ML Ops platforms and pipelines. This includes model deployment, CI/CD pipeline development, infrastructure management for training and serving, monitoring, automated retraining, deployment of foundation models and RAG stacks, resource optimization, and collaboration with data science and engineering teams. The role requires experience with enterprise SaaS, large-scale data, AI/MLOps workflows on platforms like Databricks and MLflow, cloud platforms, and modern software engineering practices.

What you'd actually do

  1. Design, develop, and oversee the strategy and architecture of scalable and reliable AI/ML Ops platforms and pipelines
  2. Model Deployment: Package and deploy AI/ML services to production, ensuring they are reproducible and interpretable
  3. CI/CD Pipeline Development: Design and implement automated CI/CD (Continuous Integration/Continuous Deployment) pipelines to accelerate model deployment using tools
  4. Infrastructure Management: Provision and optimize infrastructure for training and serving, utilizing Docker, Kubernetes, or serverless platforms
  5. Monitoring & Observability: Implement post-deployment monitoring for model performance, data drift, and latency using tools

Skills

Required

  • Enterprise SaaS software solutions with high availability and scalability
  • Solutions handling large-scale structured and unstructured data from varied data sources
  • Experience in building and maintaining AI/ML Ops platform systems, ensuring scalability, reliability, efficiency, and security
  • Working with the product engineering team to influence designs with data, AI, and analytics use cases in mind
  • In-depth experience in system design and in AI/ML frameworks and tools involving petabytes of data within the Databricks Lakehouse ecosystem
  • AI/MLOps workflows on Databricks, MLflow, Mosaic AI Agent Framework, Unity Catalog, Vector Search, Knowledge Graph
  • Knowledge of AI/ML frameworks like LangChain, LangGraph for AI/ML Ops pipeline integration
  • Cloud Platforms: Hands-on experience with at least one major cloud provider (AWS, Azure, or GCP).
  • Programming languages like Python and SQL
  • Modern software engineering practices like Kubernetes, CI/CD, IaC tools (preferably Terraform), observability, monitoring, and alerting
  • Solution cost optimisation and design-to-cost

Nice to have

  • Experience with Monte Carlo
  • Experience with AWS Bedrock
  • Experience with an AWS-hosted data platform
  • Terraform as the preferred IaC tool

What the JD emphasized

  • high availability and scalability
  • AI/ML Ops platform systems ensuring scalability, reliability, efficiency and security
  • large scale structured and unstructured data
  • Petabytes of data

Other signals

  • AI/ML Ops platforms / pipelines
  • Model Deployment
  • CI/CD Pipeline Development
  • Infrastructure Management for training and serving
  • Monitoring & Observability for model performance
  • Automation of retraining and data pipeline workflows
  • Deployment of foundation models, fine-tuning workflows, and RAG stacks
  • Resource Optimization for GPU/CPU utilization
  • AI/MLOps workflows on Databricks, MLflow, Mosaic AI Agent Framework
  • Cloud Platforms (AWS, Azure, GCP)