Senior Manager, Engineering

Crusoe · Data AI · Dublin - IE · Cloud Engineering

Crusoe is an AI infrastructure company that owns and operates its own power and data centers to accelerate AI workloads. This role is for a Senior Engineering Manager, Production Engineering, to lead a 24/7 team responsible for the reliability and operational health of Crusoe's production GPU infrastructure. The role involves people management, incident response, monitoring, automation, and strategic planning for the production engineering function.

What you'd actually do

  1. Manage, coach, and grow a team of production engineers across shifts and time zones. Run structured 1:1s focused on career development, deliver candid performance feedback, and build a team culture grounded in ownership and continuous improvement.
  2. Partner with engineering leadership and recruiting to grow the team, owning the full hiring lifecycle from interview design to offer. Build and continuously improve onboarding and training programs that ramp new engineers quickly and effectively.
  3. Serve as an escalation point for high-severity incidents. Lead postmortems with a focus on systemic fixes, ensure action items are tracked and completed, and drive down MTTR over time.
  4. Define, monitor, and report on SLIs, SLOs, and SLAs across Crusoe's production systems. Surface trends proactively and partner with engineering teams to address reliability gaps before they become customer issues.
  5. Oversee the design and maintenance of alerting and observability systems across bare-metal and cloud infrastructure, ensuring the team has the signal it needs to detect and respond to issues fast.

Skills

Required

  • 6+ years of experience managing 24/7 technical operations or SRE teams
  • Strong Linux and infrastructure fundamentals
  • Hands-on experience with containerization, Kubernetes, and virtualization in production environments
  • Observability and monitoring expertise
  • Familiarity with messaging and workflow systems such as RabbitMQ, Kafka, NATS, or Temporal
  • Working proficiency in Golang or Python
  • Demonstrated people management skills
  • SLA/SLO ownership experience
  • A track record of influencing cross-functional strategy and driving alignment across engineering leadership on operational priorities

Nice to have

  • Experience with GPU infrastructure, HPC, or AI/ML cloud environments
  • Familiarity with infrastructure-as-code tooling such as Terraform or Ansible
  • Experience scaling an operations team and function

What the JD emphasized

  • accelerate the abundance of energy and intelligence
  • AI infrastructure
  • energy-first approach
  • scale of our ambition
  • path not fully paved
  • meaningful work of your career
  • advance their AI strategies
  • high-performing team
  • Production Engineering
  • GPU infrastructure
  • reliability and operational health
  • deep technical leadership
  • organizational impact
  • reliability strategy
  • building a high-performing team
  • complex systems reliable at scale
  • significant ownership
  • strategic impact
  • critical moment
  • Team Leadership & Development
  • Incident Management
  • Reliability & SLO Ownership
  • Monitoring & Alerting
  • Automation & Toil Reduction
  • Cross-Functional Partnership
  • Operational Cadence
  • 6+ years of experience managing 24/7 technical operations or SRE teams
  • demonstrated success developing senior engineers
  • building organizational capability
  • improving operational outcomes at scale
  • Strong Linux and infrastructure fundamentals
  • hands-on experience with containerization, Kubernetes, and virtualization in production environments
  • Observability and monitoring expertise
  • Prometheus, VictoriaMetrics, and custom exporters
  • bare-metal endpoints
  • messaging and workflow systems
  • RabbitMQ, Kafka, NATS, or Temporal
  • distributed production environments
  • Golang or Python
  • review production code
  • technical design discussions
  • support your engineers' work
  • Demonstrated people management skills
  • structured performance management
  • individualized coaching
  • building or improving onboarding and training programs
  • SLA/SLO ownership experience
  • customer-facing environment
  • influencing cross-functional strategy
  • driving alignment across engineering leadership
  • operational priorities
  • GPU infrastructure
  • HPC
  • AI/ML cloud environments
  • infrastructure-as-code tooling
  • Terraform or Ansible
  • scaling an operations team and function