Member of Technical Staff (AI Infrastructure Engineer)

Perplexity · AI Frontier · London, United Kingdom

AI Infrastructure Engineer responsible for building, deploying, and optimizing large-scale AI training and inference clusters using Kubernetes and Slurm on AWS. The role involves managing HPC environments, developing orchestration systems, implementing resource scheduling, and building monitoring solutions for ML workloads.

What you'd actually do

  1. Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  2. Manage and optimize Slurm-based HPC environments for distributed training of large language models
  3. Develop robust APIs and orchestration systems for both training pipelines and inference services
  4. Implement resource scheduling and job management systems across heterogeneous compute environments
  5. Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
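
Responsibility 4 above (resource scheduling across heterogeneous compute) can be sketched as a toy GPU-aware first-fit placer. This is a minimal illustration only: `Node`, `Job`, and `first_fit` are hypothetical names invented here, and production schedulers (Slurm, the Kubernetes scheduler) implement far richer versions with preemption, fairness, and topology awareness.

```python
from dataclasses import dataclass

# Hypothetical types for illustration; real cluster schedulers track
# many more dimensions (memory, topology, priority, preemption).
@dataclass
class Node:
    name: str
    free_gpus: int
    free_cpu: int  # cores

@dataclass
class Job:
    name: str
    gpus: int
    cpu: int

def first_fit(jobs, nodes):
    """Assign each job to the first node with enough free GPUs and CPU.

    Returns a {job_name: node_name} placement; jobs that fit nowhere
    are simply left unplaced.
    """
    placement = {}
    for job in jobs:
        for node in nodes:
            if node.free_gpus >= job.gpus and node.free_cpu >= job.cpu:
                node.free_gpus -= job.gpus
                node.free_cpu -= job.cpu
                placement[job.name] = node.name
                break
    return placement

nodes = [Node("gpu-a", free_gpus=8, free_cpu=64),
         Node("gpu-b", free_gpus=4, free_cpu=32)]
jobs = [Job("train-llm", gpus=8, cpu=32), Job("serve-api", gpus=2, cpu=8)]
print(first_fit(jobs, nodes))  # → {'train-llm': 'gpu-a', 'serve-api': 'gpu-b'}
```

First-fit is deliberately naive: it fragments GPU capacity under mixed workloads, which is exactly the kind of bottleneck responsibility 5 (benchmarking and diagnosis) would surface.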

Skills

Required

  • Kubernetes administration and YAML configuration management
  • Slurm job scheduling, resource management, and cluster configuration
  • Python and C++ for systems and infrastructure automation
  • PyTorch and distributed training
  • Networking, storage, and compute resource management for ML workloads
  • APIs and distributed systems for batch and real-time workloads
  • Debugging, monitoring, and observability tooling in containerized environments
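
The Kubernetes and YAML skills above typically take the form of manifests like the following sketch of a GPU inference Deployment. The name, image, and instance type are placeholders invented for illustration, and the `nvidia.com/gpu` limit assumes the NVIDIA device plugin is installed on the cluster's nodes.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference            # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: server
          image: registry.example.com/llm-server:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1  # requires the NVIDIA device plugin
              memory: 32Gi
      nodeSelector:
        node.kubernetes.io/instance-type: p4d.24xlarge   # example AWS GPU instance
```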

Nice to have

  • Kubernetes operators and custom controllers for ML workloads
  • Advanced Slurm administration: multi-cluster federation and advanced scheduling policies
  • GPU cluster management and CUDA optimization
  • TensorFlow and distributed training libraries
  • HPC environments, parallel computing, and high-performance networking
  • Infrastructure as code: Terraform, Ansible, and GitOps practices
  • Container registries, image optimization, and multi-stage builds for ML workloads
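
On the Slurm side, the day-to-day artifact is a batch script. A minimal sketch, assuming a hypothetical `gpu` partition and a placeholder `train.py`, requesting multi-node GPU resources for distributed training:

```bash
#!/bin/bash
#SBATCH --job-name=llm-pretrain   # illustrative job name
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8       # one task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=48:00:00
#SBATCH --partition=gpu           # assumed partition name

# srun launches one worker per task; PyTorch distributed init can read
# ranks and world size from the SLURM_* environment variables.
srun python train.py --config pretrain.yaml   # train.py is a placeholder
```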

What the JD emphasized

  • Kubernetes administration
  • Slurm workload management
  • distributed training systems at scale
  • ML workloads
  • GPU clusters

Other signals

  • large-scale AI training and inference clusters
  • Kubernetes, Slurm, Python, C++, PyTorch
  • distributed training and inference services
  • GPU clusters