Machine Learning Infrastructure Engineer

Character AI · AI Frontier · Redwood City, CA · Technical Staff - ML

The Machine Learning Infrastructure Engineer designs, builds, and maintains training and serving infrastructure for ML research, with a focus on GPU allocation and utilization, cluster issue diagnosis, and deployment monitoring.

What you'd actually do

  1. Provide infrastructure support for our ML research and product efforts
  2. Build tooling to diagnose cluster issues and hardware failures
  3. Monitor deployments, manage experiments, and support our research more broadly
  4. Maximize GPU allocation and utilization for both serving and training

Skills

Required

  • ML infrastructure
  • GPU support
  • Cloud platforms (Compute Engine, Kubernetes, Cloud Storage)
  • Tool development for infrastructure diagnosis

Nice to have

  • Large GPU clusters
  • High-performance computing/networking
  • Large language model training support
  • ML frameworks (PyTorch/TensorFlow/JAX)
  • GPU kernel development

What the JD emphasized

  • 4+ years of experience supporting infrastructure in an ML environment
  • Experience with large GPU clusters and high-performance computing/networking
  • Experience supporting large language model training

Other signals

  • ML Infrastructure
  • GPU utilization
  • Training and Serving