Member of Technical Staff

xAI · AI Frontier · Memphis, TN · Engineering

Seeking a Member of Technical Staff to manage and enhance reliability for multi-data center AI infrastructure. The role focuses on automating processes, building observability solutions, and ensuring seamless operations for mission-critical AI infrastructure. The ideal candidate combines strong coding ability with hands-on data center experience to build scalable reliability services, optimize system performance, and partner closely with facility operations. The primary objective is to minimize downtime and end-user impact through proactive automation, robust observability, and integrated software-physical reliability strategies, keeping the AI infrastructure resilient and scalable.

What you'd actually do

  1. Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning. We value adaptability to new tools and paradigms in the fast-evolving AI space; a minimal sketch of one such automation loop follows this list.
  2. Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers, staying open to innovative stacks beyond traditional ones like ELK (see the metrics-export sketch after this list).
  3. Collaborate with cross-functional teams, including software development, network engineering, site operations, and facility operations (critical facilities, mechanical/electrical teams, and data center infrastructure management), to identify reliability bottlenecks and automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation (e.g., power redundancy, cooling efficiency, and environmental monitoring integration). This role encourages broad skill sets from diverse technical backgrounds to foster innovation.
  4. Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems, while adhering to reliability principles like error budgets and SLAs (the underlying error-budget arithmetic is sketched after this list).
  5. Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration (e.g., Kubernetes or emerging alternatives), and scripting for automation; a small sysctl-audit sketch closes out the examples after this list.
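
To make the first responsibility concrete, here is a minimal sketch of an automated health-check and alerting loop in Python, assuming a fleet of HTTP health endpoints and a webhook-style alert receiver. The service names, URLs, and thresholds are hypothetical placeholders, not part of any actual stack.

```python
"""Minimal health-check and alerting loop (illustrative sketch only).

All endpoints, thresholds, and the webhook URL below are hypothetical
placeholders; a real deployment would pull these from configuration.
"""
import json
import time
import urllib.error
import urllib.request

# Hypothetical services to probe and where to send alerts.
SERVICES = {
    "inference-gateway": "http://inference-gateway.internal/healthz",
    "scheduler": "http://scheduler.internal/healthz",
}
ALERT_WEBHOOK = "http://alertmanager.internal/api/alerts"  # placeholder
CHECK_INTERVAL_S = 30
FAILURE_THRESHOLD = 3  # consecutive failures before alerting

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def send_alert(service: str, failures: int) -> None:
    """POST a JSON alert payload to the webhook."""
    payload = json.dumps(
        {"service": service, "consecutive_failures": failures}
    ).encode()
    req = urllib.request.Request(
        ALERT_WEBHOOK, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        urllib.request.urlopen(req, timeout=5.0)
    except urllib.error.URLError:
        pass  # alerting path itself is down; a real system would escalate

def main() -> None:
    failures = {name: 0 for name in SERVICES}
    while True:
        for name, url in SERVICES.items():
            if probe(url):
                failures[name] = 0
            else:
                failures[name] += 1
                if failures[name] == FAILURE_THRESHOLD:
                    send_alert(name, failures[name])
        time.sleep(CHECK_INTERVAL_S)

if __name__ == "__main__":
    main()
```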
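
For the observability work, one common approach is exporting custom metrics for Prometheus-style scraping. Below is a minimal sketch using the prometheus_client library (an assumption; the JD does not name a metrics stack), with illustrative metric names and simulated readings.

```python
"""Expose custom reliability metrics over HTTP for Prometheus scraping.

Illustrative only: metric names and the simulated readings are
placeholders, and prometheus_client is assumed as the metrics library
(pip install prometheus-client).
"""
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Metric and label names below are hypothetical examples.
INCIDENTS = Counter("dc_incidents_total", "Incidents detected", ["datacenter"])
INLET_TEMP = Gauge(
    "dc_inlet_temp_celsius", "Rack inlet temperature", ["datacenter", "rack"]
)
PROBE_LATENCY = Histogram("dc_probe_latency_seconds", "Health-probe latency")

def collect_once() -> None:
    """Record one round of (simulated) readings."""
    with PROBE_LATENCY.time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for a real probe
    INLET_TEMP.labels(datacenter="mem1", rack="r42").set(random.uniform(20.0, 30.0))

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        collect_once()
        time.sleep(5)
```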
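
Error budgets come down to simple arithmetic: an availability SLO of 99.9% leaves a 0.1% budget of permitted downtime per window, roughly 43 minutes per 30 days. A small sketch (the SLO figure is an example, not a stated target):

```python
"""Error-budget arithmetic: budget = (1 - SLO) * window.

The 99.9% SLO below is an example figure, not a stated target.
"""
from datetime import timedelta

def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed downtime for a given availability SLO over a window."""
    return window * (1.0 - slo)

def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    return 1.0 - downtime / error_budget(slo, window)

if __name__ == "__main__":
    window = timedelta(days=30)
    budget = error_budget(0.999, window)  # 0.1% of 30 days = 43.2 minutes
    print(f"30-day budget at 99.9%: {budget}")
    remaining = budget_remaining(0.999, window, timedelta(minutes=10))
    print(f"Remaining after 10 min of downtime: {remaining:.1%}")
```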
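
Finally, for Linux system optimization, here is a minimal sketch that audits a few kernel parameters against desired values by reading /proc/sys directly. The parameter choices and target values are illustrative, not a recommended production baseline.

```python
"""Audit selected sysctl values against desired settings (illustrative).

The parameters and targets are example choices, not a recommended
baseline. Reads /proc/sys directly, so this is Linux-only.
"""
from pathlib import Path

# Hypothetical desired settings; a real tool would load these from config.
DESIRED = {
    "net.core.somaxconn": "4096",
    "vm.swappiness": "10",
    "fs.file-max": "2097152",
}

def read_sysctl(key: str) -> str | None:
    """Read a sysctl value from /proc/sys, or None if absent."""
    path = Path("/proc/sys") / key.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None

def audit() -> list[str]:
    """Return a list of drift findings, one per mismatched parameter."""
    findings = []
    for key, want in DESIRED.items():
        got = read_sysctl(key)
        if got is None:
            findings.append(f"{key}: not present on this kernel")
        elif got != want:
            findings.append(f"{key}: have {got}, want {want}")
    return findings

if __name__ == "__main__":
    drift = audit()
    print("\n".join(drift) if drift else "all audited sysctls match desired values")
```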

Skills

Required

  • Python
  • Rust
  • site reliability engineering (SRE)
  • infrastructure engineering
  • DevOps
  • systems engineering
  • data center operations
  • Linux system optimization
  • container orchestration (e.g., Kubernetes)
  • scripting
  • network troubleshooting
  • observability tools (metrics, logging, tracing)
  • incident response
  • automation

Nice to have

  • facility operations
  • mechanical/electrical systems
  • data center infrastructure management
  • error budgets
  • SLAs
  • Kubernetes or emerging alternatives
  • ELK stack

What the JD emphasized

  • mission-critical AI infrastructure
  • AI workloads demand near-zero downtime
  • reduce mean time to recovery (MTTR) by up to 50%
  • minimize impact to end-users
  • AI infrastructure remains resilient, scalable
  • automate reliability workflows
  • implement and maintain observability tools
  • reliability bottlenecks
  • automate solutions for fault tolerance
  • disaster recovery
  • capacity planning
  • physical/environmental risk mitigation
  • troubleshoot and resolve complex issues
  • adhering to reliability principles like error budgets and SLAs
  • optimize Linux-based systems for performance, security, and reliability
  • container orchestration
  • scripting for automation
  • network topologies and concepts in large-scale, multi-data center environments
  • troubleshoot connectivity, routing, redundancy, and performance issues
  • integrate observability into data center interconnects and facility-level controls
  • automated failover mechanisms
  • handle both digital and physical disruptions
  • seamless continuity for end-users
  • on-call rotations
  • post-incident reviews (blameless postmortems)
  • continuous improvement initiatives
  • enhance overall site reliability
  • joint exercises with facility teams for physical failover and recovery scenarios
  • mentor junior team members
  • document processes
  • foster a culture of automation, knowledge sharing, and adaptability to new technologies

Other signals

  • ensuring our AI infrastructure remains resilient, scalable, and at the cutting edge of innovation