Machine Learning Infrastructure Engineer - Model Inference

Abridge · Vertical AI · San Francisco, CA · Builder

This role focuses on building and optimizing the core inference infrastructure for AI models in a healthcare setting. The engineer will design, deploy, and maintain scalable Kubernetes clusters; develop and optimize ML model serving infrastructure for high performance and low latency; and scale backend infrastructure for AI-driven products. Key responsibilities include optimizing compute-heavy workflows, enhancing GPU utilization, and building a robust model API orchestration system, in close collaboration with research and product teams.

What you'd actually do

  1. Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training (a minimal scaling sketch follows this list).
  2. Develop, optimize, and maintain ML model serving infrastructure, ensuring high performance and low latency.
  3. Collaborate with ML and product teams to scale backend infrastructure for AI-driven products, focusing on model deployment, throughput optimization, and compute efficiency.
  4. Optimize compute-heavy workflows and enhance GPU utilization for ML workloads.
  5. Build a robust model API orchestration system.
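
To make the first responsibility concrete, here is a minimal sketch of scaling an inference Deployment with the official Kubernetes Python client. The deployment name, namespace, and replica count are illustrative placeholders, not actual Abridge resources.

```python
# Minimal sketch: scaling a hypothetical inference Deployment with the
# official Kubernetes Python client. "triton-inference" and "ml-serving"
# are illustrative placeholder names.
from kubernetes import client, config

def scale_inference_deployment(name: str, namespace: str, replicas: int) -> None:
    # Patch the Deployment's replica count, e.g. ahead of a traffic spike.
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=name, namespace=namespace, body={"spec": {"replicas": replicas}}
    )

if __name__ == "__main__":
    scale_inference_deployment("triton-inference", "ml-serving", replicas=4)
```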

Skills

Required

  • building and deploying machine learning models in production environments
  • container orchestration and distributed systems architecture
  • Kubernetes administration
  • developing APIs and managing distributed systems for both batch and real-time workloads (see the micro-batching sketch after this list)
  • communication skills, with the ability to interface between research and product engineering
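
As a concrete illustration of the "batch and real-time workloads" bullet, below is a minimal asyncio micro-batching sketch: real-time requests are queued and served together in one model call, trading a few milliseconds of wait for throughput. `run_model`, the batch cap, and the wait budget are hypothetical stand-ins, not part of the role's actual stack.

```python
# Minimal micro-batching sketch: many real-time callers, one batched model call.
import asyncio

MAX_BATCH = 8      # illustrative batch-size cap
MAX_WAIT = 0.005   # illustrative 5 ms budget for filling a batch

async def run_model(inputs):
    # Hypothetical stand-in for a real batched model call (e.g. a GPU forward pass).
    return [f"result:{x}" for x in inputs]

async def batcher(queue):
    # Drain the queue into micro-batches so one model call serves many callers.
    while True:
        batch = [await queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT
        while len(batch) < MAX_BATCH and (left := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(queue.get(), left))
            except asyncio.TimeoutError:
                break
        outputs = await run_model([x for x, _ in batch])
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)

async def infer(queue, x):
    # Real-time entry point: enqueue one input and await its batched result.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    asyncio.create_task(batcher(queue))
    print(await asyncio.gather(*(infer(queue, f"req{i}") for i in range(20))))

asyncio.run(main())
```

Serving frameworks like Triton (dynamic batching) and vLLM (continuous batching) implement this same idea natively, which is why they appear under the nice-to-haves below.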

Nice to have

  • model serving frameworks such as NVIDIA Triton Inference Server, vLLM, or TensorRT-LLM
  • ML toolchains such as PyTorch or TensorFlow, and distributed training and inference libraries
  • GPU cluster management and CUDA optimization
  • infrastructure as code (Terraform, Ansible) and GitOps practices
  • container registries, image optimization, and multi-stage builds for ML workloads
  • orchestrating across ASR and LLM models to build GenAI applications (a minimal pipeline sketch follows this list)
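
To ground the last bullet, here is a minimal ASR-to-LLM pipeline sketch. `transcribe` is a hypothetical stub standing in for a real ASR model, and the LLM call assumes an OpenAI-compatible endpoint such as the one a vLLM server exposes; the URL and model name are illustrative only.

```python
# Minimal ASR -> LLM orchestration sketch. Assumes an OpenAI-compatible
# chat endpoint (e.g. a vLLM server); URL and model name are illustrative.
import requests

LLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint

def transcribe(audio_path: str) -> str:
    # Hypothetical ASR stub; a real system would call an ASR model or service.
    return "patient reports mild headache for three days"

def summarize(transcript: str) -> str:
    # Send the transcript to the chat endpoint and return the completion text.
    resp = requests.post(LLM_URL, json={
        "model": "my-clinical-llm",  # illustrative model name
        "messages": [
            {"role": "system", "content": "Summarize the clinical conversation."},
            {"role": "user", "content": transcript},
        ],
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(summarize(transcribe("visit_001.wav")))
```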

What the JD emphasized

  • scalable Kubernetes clusters
  • high-performance and low-latency
  • model deployment, throughput optimization, and compute efficiency
  • optimize compute-heavy workflows and enhance GPU utilization
  • robust model API orchestration system

Other signals

  • optimize inference infrastructure
  • scale backend infrastructure for AI-driven products