MLOps Engineer

Augury · Vertical AI · Bengaluru, India · R&D

This MLOps Engineer role focuses on building and operating scalable ML and AI systems for Industrial AI. Responsibilities include designing and evolving MLOps capabilities across the full ML lifecycle, building systems for experiment tracking and reproducibility, developing reusable platform tooling, and building operational infrastructure for LLM and agentic systems. The role requires strong Python engineering skills and experience with ML platform and MLOps frameworks.

What you'd actually do

  1. Design and evolve production MLOps capabilities across the full ML lifecycle, including datasets, features, models, evaluations, deployments, monitoring, retraining, and feedback signals.
  2. Build systems for experiment tracking, artifact management, reproducibility, versioning, lineage, promotion workflows, and production readiness.
  3. Develop reusable platform tooling, golden paths, and engineering standards that improve consistency and delivery velocity across teams.
  4. Build operational infrastructure for LLM and agentic systems, including prompts, tools, traces, evaluations, observability, safety boundaries, and production monitoring.
  5. Design evaluation and monitoring frameworks for AI systems, covering answer quality, latency, grounding, reliability, and operational regressions.

Skills

Required

  • 5+ years of professional software engineering, MLOps, or ML platform engineering experience in production environments.
  • Significant experience building or owning production ML infrastructure and lifecycle systems.
  • Strong Python engineering skills with production-grade architecture, modular design, testing, packaging, and robust error handling.
  • Strong understanding of the end-to-end ML lifecycle, including training, deployment, monitoring, retraining, reproducibility, and lineage.
  • Experience working with large-scale data platforms such as Databricks, Spark, Delta Lake, or equivalent ecosystems.
  • Experience with ML platform and MLOps frameworks such as MLflow, Metaflow, Kubeflow, or equivalent ML lifecycle-management systems.
  • Proven ability to design reusable workflow orchestration using Airflow, Metaflow, or Databricks, covering automation, scheduling, dependency management, and production reliability.
  • Familiarity with operational patterns for LLMOps, AgentOps, and production AI systems.
  • Strong written and verbal communication skills in English.

Nice to have

  • Experience with industrial, IoT, or manufacturing platforms.
  • Experience with feature stores, model registries, dataset versioning, and lineage systems.
  • Experience with AI agents, RAG systems, production GenAI applications, or evaluation frameworks.

What the JD emphasized

  • production engineering experience building and operating scalable ML and AI systems
  • software-first MLOps platform role focused on production reliability
  • ML lifecycle management
  • large-scale training infrastructure
  • operational AI systems
  • reusable platform capabilities
  • production MLOps capabilities
  • ML lifecycle
  • production readiness
  • reusable platform tooling
  • operational infrastructure for LLM and agentic systems
  • production monitoring
  • evaluation and monitoring frameworks for AI systems
  • large-scale training pipelines
  • production-grade Python services
  • engineering quality through automated testing
  • CI/CD
  • observability
  • deployment standards
  • operational best practices
  • production environments
  • production ML infrastructure and lifecycle systems
  • production-grade architecture
  • robust error handling
  • end-to-end ML lifecycle
  • large-scale data platforms
  • ML platform and MLOps frameworks
  • reusable workflow orchestration
  • production reliability
  • LLMOps
  • AgentOps
  • production AI systems
  • production foundation
  • production-grade AI platforms
  • scaling ML systems
  • operational backbone of Industrial AI
