Member of Technical Staff, Cluster Management

Fireworks AI · Data AI · San Mateo, CA · Engineering

This role focuses on managing and ensuring the reliability, performance, and efficiency of a large-scale virtual AI cloud infrastructure, including GPU clusters. It involves system reliability, incident management, observability, automation, capacity planning, and performance tuning, with a strong emphasis on SRE principles and cloud platforms.

What you'd actually do

  1. Ensure systems are designed and implemented for high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure.
  2. Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability.
  3. Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance.
  4. Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management.
  5. Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization.

Skills

Required

  • Site Reliability Engineering
  • DevOps
  • large-scale production systems
  • SRE principles and practices
  • SLOs
  • SLIs
  • operational automation
  • incident management
  • post-mortems
  • public cloud platforms (AWS, GCP, Azure)
  • compute
  • networking
  • storage
  • database services
  • containerization technologies (Docker)
  • Kubernetes
  • monitoring
  • logging
  • alerting systems
  • Prometheus
  • Grafana
  • ELK stack
  • distributed tracing
  • Python
  • Go
  • Linux operating systems
  • networking fundamentals
  • system debugging
  • troubleshooting complex issues across the entire stack
  • on-call rotations

Nice to have

  • managing data-center-grade GPU clusters
  • GPU monitoring
  • troubleshooting GPU clusters
  • machine learning infrastructure
  • model serving
  • distributed AI frameworks
  • security
  • data protection

What the JD emphasized

  • world-scale virtual AI cloud
  • scale cutting-edge AI platforms
  • multi-cloud infrastructure
  • GPU clusters