Production Engineer (Kubernetes)

Crusoe · Data AI · Dublin - IE · Cloud Engineering

Crusoe is an AI infrastructure company that provides AI compute, with a stated focus on accelerating energy and intelligence. This Production Engineer role centers on building and maintaining its Managed Kubernetes and Managed VM platforms for external customers, keeping them reliable and performant through monitoring, incident response, and automation. The work spans distributed systems, Kubernetes, and observability tooling in support of AI workloads.

What you'd actually do

  1. Building the Kubernetes Platform: Scale the tooling and features behind Crusoe's Managed Kubernetes and Managed VM platforms for external customers.
  2. Collaboration and Planning: Join morning stand-up meetings to discuss ongoing projects, recent incidents, and priorities for the day. Help draw up action plans for deploying new data centers or retrofitting existing ones. Work closely with software engineers, advising on best practices for resilient code and reviewing changes before deployment.
  3. System Monitoring and Alerting: Review overnight alerts and system performance metrics to ensure everything is running smoothly. Analyze system logs and develop tools to enhance our monitoring capabilities.
  4. Incident Response and Problem Solving: Take part in incident response drills, post-mortems, and root cause analysis sessions to learn from past issues and prevent future ones. Automate the resolution of common errors through proactive remediation.
  5. Performance Monitoring and Optimization: Track SLIs and keep the platform within its SLOs, so that the infrastructure remains robust and reliable for customers.

Skills

Required

  • Production Engineering Experience: 3-6 years of professional experience as a Production Engineer.
  • Kubernetes: Experience building Kubernetes platforms or Kubernetes controllers.
  • Server Hardware and Provisioning: Exposure to server-class hardware & provisioning.
  • Distributed Systems Architecture: Understanding of distributed system architecture; exposure to common design patterns, reliability, and scaling.
  • Infrastructure Design: Basic understanding of infrastructure design, including familiarity with the operational trade-offs of network, storage, and RPC serving designs.
  • Programming Proficiency: Proficiency with at least one programming language (Python, Go, or similar).
  • Observability Tooling: Exposure to observability tooling and philosophy, covering logging, monitoring, and alerting tools.
  • Operating Systems: Experience with Unix/Linux environments.
  • Networking Fundamentals: Understanding of network fundamentals, including the basics of TCP/IP and network programming.
  • Information Security Awareness: Awareness of basic information security best practices.
  • Education: Bachelor's degree in Computer Science or a related field, or self-taught knowledge of computer science fundamentals.
  • Strong communication skills.

What the JD emphasized

  • Gold-standard reliability and performance of Crusoe's AI platform.