Member of Technical Staff - Cloud Infrastructure

xAI · AI Frontier · Palo Alto, CA · Engineering

Seeking a Senior Infrastructure Engineer to design, build, and operate secure, scalable infrastructure for US Government AI projects. This role involves managing training and inference clusters, applications, Kubernetes, and GPU hardware, with a strong emphasis on federal compliance, automation, and observability in a high-security environment.

What you'd actually do

  1. Develop and optimize software to provision and manage xAI’s infrastructure across on-premise, virtual machine, and classified cloud environments, enabling efficient scaling for US government initiatives.
  2. Enhance the reliability, performance, and cost-effectiveness of infrastructure to support large-scale AI and application workloads in secure, classified settings.
  3. Collaborate with xAI engineers to understand workload requirements and design tailored solutions that meet government-specific needs and compliance standards.
  4. Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems, adhering to federal protocols.
  5. Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible, with a focus on secure data handling.

Skills

Required

  • Active Top Secret (TS) security clearance
  • 5+ years of experience as an Infrastructure Engineer, Site Reliability Engineer, or similar role
  • Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible
  • Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components
  • Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs
  • Excellent communication and documentation skills

Nice to have

  • Deep familiarity with installing and using GPU hardware
  • Experience with high-traffic web or mobile application workloads
  • Experience optimizing Kubernetes for large-scale deployments in classified or federal settings
  • Familiarity with chaos engineering and capacity planning
  • Proficiency with tools such as Kyverno or ArgoCD, or with Go programming
  • Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges in secure environments
  • Passion for problem-solving and a proactive drive to deliver impactful results while adhering to security protocols
  • Certifications in security-related fields (e.g., CISSP)
  • Experience in secure federal environments

What the JD emphasized

  • Active Top Secret (TS) security clearance
  • stringent federal compliance requirements
  • secure, classified settings
  • secure data handling
  • security and compliance