Operations Engineer, HPC Networking

Weights & Biases · Data AI · Bellevue, WA +4 · Technology

Operations Engineer focused on deploying, monitoring, troubleshooting, and maintaining large-scale InfiniBand fabrics for AI workloads. Requires experience with networking, Linux, scripting, and monitoring tools.

What you'd actually do

  1. Regularly monitor the performance and health of InfiniBand fabrics, including switches, host adapters, and nodes.
  2. Investigate and resolve operational issues within InfiniBand fabrics, such as network connectivity problems and performance bottlenecks.
  3. Assist with the installation and operational bring-up of large InfiniBand fabrics in collaboration with onsite personnel and customer teams.
  4. Perform routine maintenance and upgrades on InfiniBand switches and control plane components.
  5. Collaborate with HPC cluster operations teams to provide troubleshooting and operational expertise.

Skills

Required

  • InfiniBand or similar networking technologies
  • Networking concepts
  • Linux system administration
  • Python scripting
  • Ansible
  • Monitoring and visualization platforms (Grafana, Prometheus)

Nice to have

  • NVIDIA UFM or similar fabric management tools
  • Data center operations
  • Server racks
  • Cabling
  • Python or Bash scripting

What the JD emphasized

  • At least 1 year of experience with InfiniBand or similar networking technologies.
  • Experience with monitoring and visualization platforms such as Grafana or Prometheus.
  • Applicants must have work authorization that does not require sponsorship from the company now or in the future.