AI Infrastructure Engineer

Together AI · Data AI · San Francisco, CA · Engineering

AI Infrastructure Engineer responsible for keeping user-facing services and production systems running smoothly, with a focus on systems, availability, reliability, and scalability, and an interest in algorithms and distributed systems. Builds and runs infrastructure using Ansible, Terraform, and Kubernetes, and develops monitoring systems.

What you'd actually do

  1. Participate in the on-call rotation (PagerDuty) to respond to production incidents
  2. Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
  3. Build monitoring systems to ensure the highest quality service for our customers
  4. Design and implement operational processes (such as deployments and upgrades)
  5. Identify improvements to the product architecture from reliability, performance, and availability perspectives

Skills

Required

  • Ansible
  • Terraform
  • Kubernetes
  • monitoring and observability practices
  • cloud services
  • programming/scripting languages

Nice to have

  • systems (operating systems, storage subsystems, networking)
  • algorithms
  • distributed systems

What the JD emphasized

  • 5+ years of professional AI Infra or related experience

Other signals

  • AI infrastructure
  • production systems
  • scaling
  • reliability
  • performance
  • availability