Member of Technical Staff (AI Infrastructure Engineer)

Perplexity · AI Frontier · San Francisco, CA · AI

AI Infrastructure Engineer responsible for building, deploying, and optimizing large-scale AI training and inference clusters using Kubernetes and Slurm on AWS. The role involves managing HPC environments, developing orchestration systems, implementing resource scheduling, and building monitoring solutions for ML workloads.

What you'd actually do

  1. Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  2. Manage and optimize Slurm-based HPC environments for distributed training of large language models
  3. Develop robust APIs and orchestration systems for both training pipelines and inference services
  4. Implement resource scheduling and job management systems across heterogeneous compute environments
  5. Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure

Skills

Required

  • Kubernetes administration
  • Slurm workload management
  • Python
  • C++
  • PyTorch
  • distributed systems architecture
  • GPU cluster management
  • systems and infrastructure automation
  • networking
  • storage
  • compute resource management for ML workloads
  • batch and real-time workloads
  • debugging
  • monitoring
  • observability tools for containerized environments

Nice to have

  • Kubernetes operators
  • custom controllers for ML workloads
  • multi-cluster federation
  • advanced scheduling policies
  • CUDA optimization
  • TensorFlow
  • distributed training libraries
  • HPC environments
  • parallel computing
  • high-performance networking
  • Terraform
  • Ansible
  • GitOps practices
  • container registries
  • image optimization
  • multi-stage builds for ML workloads

What the JD emphasized

  • Expert-level Kubernetes administration
  • Proficiency with Slurm job scheduling
  • Experience supporting both long-running training jobs and high-availability inference services

Other signals

  • AI training and inference clusters
  • Kubernetes
  • Slurm
  • GPU clusters
  • distributed training
  • inference services