HPC Performance Engineer

Weights & Biases · Data AI · Bellevue, WA +2 · Technology

CoreWeave is seeking an HPC Performance Engineer to optimize bare-metal systems, including the Linux kernel, virtualization, and container runtimes. The role involves developing performance baselines, regression testing, debugging fabric-level performance, and creating telemetry for distributed clusters. Collaborating with cross-functional teams and providing performance data for business decisions are key responsibilities.

What you'd actually do

  1. Play a crucial part in the design, development, and optimization of our bare-metal systems, from POST through joining a Kubernetes cluster.
  2. Collaborate closely with cross-functional teams, upstack engineering teams, and stakeholders to ensure our low-level software stack remains performant in the context of hardware updates, and provide data, metrics, dashboards, and analysis to substantiate performance assertions.
  3. Develop and maintain tools for establishing systems performance baselines
  4. Design and maintain performance regression test pipelines for HPC workloads
  5. Debug and tune fabric-level performance to ensure low-latency, high-throughput configurations

Skills

Required

  • 5+ years of professional experience in Systems/HPC Performance Engineering, Benchmarking, and/or Validation.
  • Strong experience with MPI workloads and distributed system performance analysis
  • Familiarity with RoCE, InfiniBand, GPUDirect/Data Direct I/O, NUMA, etc. in HPC workloads
  • Hands-on use of public HPC benchmarks (HPCC, HPL, OSU, MLPerf-HPC, STREAM, IO500)
  • Extensive, deep experience in Linux internals
  • Fluency with a programming language geared toward automation (Python preferred, but others possible)
  • Experience writing robust, testable code
  • Experience diagnosing and fixing systems performance issues
  • Experience implementing automation testing
  • Ability to effectively prioritize and communicate proposed features and fixes in a remote-employee environment
  • Strong passion for automation, with a commitment to automating processes comprehensively
  • Excellent documentation skills and attention to detail
  • Strong analytical and problem-solving abilities

Nice to have

  • Familiarity with QA/QE best practices
  • Familiarity with Golang
  • Opinions about software version control and team collaboration
  • Experience working in Cloud environments
  • Experience as a software engineer writing large-scale applications
  • Experience in open-source community software development
  • Experience with machine learning is a huge bonus

What the JD emphasized

  • HPC Performance Engineer
  • bare-metal systems
  • Linux kernel
  • virtualization stack
  • container/pod runtime stack
  • performance assertions
  • HPC workloads
  • low-latency high throughput
  • performance analysis
  • Linux internals
  • automation testing
  • machine learning is a huge bonus