Sr. ML Engineer

Visa · Fintech · Austin, TX

This Sr. ML Engineer role focuses on designing, building, and managing scalable cloud infrastructure for AI/ML applications, and calls for expertise in MLOps, AWS, Kubernetes, and system design. The role involves owning ML platform modules, performing architectural reviews, and implementing deployment standards for both traditional ML and Generative AI (LLM) workloads. Key responsibilities include building secure, scalable pipelines and serving infrastructure, acting as a design authority for model deployment and automation, and enabling a seamless transition of models from research to production.

What you'd actually do

  1. Design, build, and maintain scalable, highly available Machine Learning infrastructure on AWS and Visa OnPrem.
  2. Deploy, configure, and manage Kubernetes clusters and Kubeflow to orchestrate complex ML training and deployment pipelines.
  3. Build robust serving infrastructure to productionize machine learning models and Large Language Models (LLMs) using modern serving frameworks (e.g., vLLM, TensorRT-LLM, KServe, Triton).
  4. Design secure platform architectures utilizing AWS IAM (roles, policies, least privilege), VPCs, and security groups to ensure data and model security.
  5. Architect scalable cloud systems and automate infrastructure provisioning using tools like Terraform or AWS CloudFormation.

Skills

Required

  • MLOps
  • AWS cloud architecture
  • Kubernetes
  • system design
  • ML platform architecture
  • Kubeflow
  • model deployment
  • LLMOps
  • serving frameworks (vLLM, TensorRT-LLM, KServe, Triton)
  • AWS IAM
  • VPCs
  • security groups
  • Terraform or AWS CloudFormation
  • CI/CD pipelines
  • observability and monitoring tools (CloudWatch, Prometheus, Grafana)

Nice to have

  • Visa OnPrem
  • Cloud-agnostic experience

What the JD emphasized

  • scalable cloud infrastructure
  • MLOps
  • AWS cloud architecture
  • Kubernetes
  • system design
  • ML platform
  • deployment standards
  • secure, scalable pipelines
  • serving infrastructure
  • Generative AI (LLM) workloads
  • model deployment
  • infrastructure automation
  • platform security
  • data scientists and AI engineers
  • model development lifecycle
  • productionize machine learning models
  • Large Language Models (LLMs)
  • modern serving frameworks
  • secure platform architectures
  • data and model security
  • scalable cloud systems
  • automate infrastructure provisioning
  • automated CI/CD pipelines
  • model training, testing, and deployment
  • continuous integration
  • compute and tooling needs
  • logging, monitoring, and alerting
  • system health
  • model drift
  • latency
  • modernize legacy deployment pipelines
  • emerging AI infrastructure technologies
  • ML serving infrastructure
  • secure architectures
  • automating infrastructure
  • CI/CD pipelines for ML model deployment
  • monitoring tools

Other signals

  • MLOps
  • Kubernetes
  • AWS
  • LLMOps
  • serving infrastructure