Member of Technical Staff - Infrastructure Engineer

Black Forest Labs · Multimodal · Freiburg, San Francisco · Engineering

Infrastructure Engineer role focused on building and maintaining the large-scale training platforms and research infrastructure that power generative AI model development, including scaling compute clusters, ensuring reliability, and optimizing performance.

What you'd actually do

  1. Maintain research infrastructure, keeping it healthy and optimizing components to extract peak performance from the system, on both the application and infrastructure sides.
  2. Scale infrastructure to meet growing research demands while maintaining reliability and performance.
  3. Collaborate with research teams to deeply understand their infrastructure needs and design solutions that balance performance with cost efficiency.
  4. Identify and resolve performance bottlenecks and capacity hotspots through deep analysis of distributed systems at scale.
  5. Build and evolve telemetry and monitoring systems to provide deep visibility into infrastructure performance, utilization, and costs across our cloud and datacenter fleets.

Skills

Required

  • Python
  • Bash
  • Go
  • Kubernetes
  • NVIDIA GPU drivers and operators
  • OpenTelemetry (OTel)
  • Prometheus
  • large-scale training platforms
  • large-scale compute clusters (GPUs)
  • debugging performance and reliability issues across large distributed fleets
  • modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP
  • SLURM

What the JD emphasized

  • large-scale training platforms
  • large-scale compute clusters
  • distributed fleets

Other signals

  • research infrastructure
  • large-scale compute clusters
  • distributed systems
  • training platforms