Staff Software Engineer (MLOps)

Toast · Enterprise · Bangalore, India · R&D : Engineering : Fintech Data Science & AI

The Staff Machine Learning Engineer will design and lead the evolution of a unified MLOps platform that supports diverse needs across Toast, ensuring high availability, scalability, and security of ML services. The role involves championing MLOps best practices, leading collaboration across Data Engineering, DevOps, and Product teams, defining the technical vision for AI infrastructure, setting standards for observability and incident response for ML systems, and mentoring engineers.

What you'd actually do

  1. Design and lead the evolution of a unified MLOps platform that supports diverse needs across Toast, ensuring high availability, scalability, and security of ML services.
  2. Champion and institutionalize best practices for CI/CD for ML (MLOps), automated testing, and infrastructure-as-code (Terraform).
  3. Lead collaborative efforts across Data Engineering, DevOps, and Product teams to bridge the gap between model prototyping and production-grade reliability.
  4. Partner with leadership and Product Managers to define the 1-2 year technical vision for AI infrastructure, prioritizing long-term stability over short-term fixes.
  5. Set the standard for observability and incident response for ML systems, driving root-cause analysis for complex system failures.

Skills

Required

  • Python, Scala, or Java/Kotlin
  • PySpark
  • high-performance computing
  • Vector Databases
  • LangChain/LangGraph
  • A2A (agent-to-agent) protocols
  • distributed systems design
  • AWS tooling (SageMaker, DynamoDB, Athena, Glue, etc.)
  • microservice-based architecture
  • software engineering best practices
  • object-oriented programming
  • test-driven development
  • CI/CD
  • Git
  • shell scripting
  • task orchestration
  • MLflow
  • model deployment
  • orchestration (Apache Airflow, Prefect)
  • scaling
  • managing CPU/GPU resources efficiently
  • problem-solving
  • analytical skills
  • critical thinking
  • communication
  • interpersonal skills
  • collaborative work

Nice to have

  • Feature Stores (e.g., Tecton, Feast)
  • real-time streaming frameworks (Apache Flink or Ray)
  • React
  • kernel/networking layer debugging
  • statistical concepts (e.g., classification, regression)
  • deep learning algorithms (e.g., CNNs, RNNs)

What the JD emphasized

  • productionizing and scaling ML systems at the enterprise level
  • taking LLM applications from research to production

Other signals

  • MLOps platform
  • high-performance inference at the edge
  • feature stores
  • automated retraining