Sr. Software Engineer (Data Center Automation)

xAI · AI Frontier · Memphis, TN +1 · Data Center

This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure within a multi-data center environment. The engineer will combine strong coding abilities with data center experience to build scalable reliability services, optimize system performance, and minimize downtime, with a focus on reducing MTTR and ensuring resilience for AI workloads.

What you'd actually do

  1. Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning.
  2. Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers. Innovative stacks beyond traditional ones such as ELK are welcome.
  3. Collaborate with cross-functional teams, including software development, network engineering, site operations, and facility operations (critical facilities, mechanical/electrical teams, and data center infrastructure management), to identify reliability bottlenecks and automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation (e.g., power redundancy, cooling efficiency, and environmental monitoring integration).
  4. Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems, while adhering to reliability principles like error budgets and SLAs.
  5. Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration (e.g., Kubernetes or emerging alternatives), and scripting for automation.
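
For a sense of the reliability arithmetic behind terms such as error budgets and MTTR, here is a minimal Python sketch. It is illustrative only: the function names, the 99.9% availability target, and the 30-day window are assumptions rather than anything stated in the role description, and MTTR is read here as mean time to recovery.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed in `window` for an availability target such as 0.999."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def budget_remaining(budget: timedelta, outages: list[timedelta]) -> timedelta:
    """Budget left after subtracting observed outages (negative means overspent)."""
    return budget - sum(outages, timedelta())


def mean_time_to_recovery(outages: list[timedelta]) -> timedelta:
    """Average duration from detection to recovery across incidents."""
    if not outages:
        raise ValueError("need at least one incident")
    return sum(outages, timedelta()) / len(outages)


# Hypothetical numbers: a 99.9% target over 30 days and three outages.
budget = error_budget(0.999, timedelta(days=30))
outages = [timedelta(minutes=10), timedelta(minutes=20), timedelta(minutes=6)]
remaining = budget_remaining(budget, outages)
mttr = mean_time_to_recovery(outages)
```

With these assumed inputs the budget is roughly 43 minutes per 30 days, so three outages totalling 36 minutes leave only a few minutes of headroom. Shortening each recovery stretches the same budget across more incidents.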

Skills

Required

  • Python
  • Rust
  • observability tools
  • metrics collection
  • logging
  • tracing
  • dashboards
  • Linux
  • container orchestration
  • Kubernetes
  • site reliability engineering (SRE)
  • infrastructure engineering
  • DevOps
  • systems engineering
  • data center automation
  • network topologies

Nice to have

  • emerging languages
  • innovative stacks beyond traditional ones like ELK
  • emerging alternatives to Kubernetes
  • facility operations
  • mechanical/electrical teams
  • data center infrastructure management
  • error budgets
  • SLAs
  • kernel tuning
  • scripting for automation
  • connectivity
  • routing
  • redundancy
  • performance issues
  • data center interconnects
  • facility-level controls
  • blameless postmortems
  • continuous improvement initiatives
  • joint exercises with facility teams for physical failover and recovery scenarios
  • mentoring junior team members
  • process documentation

What the JD emphasized

  • mission-critical AI infrastructure
  • AI workloads
  • reliability principles
  • critical AI services

Other signals

  • AI infrastructure
  • automation
  • observability
  • reliability
  • data center