Distributed Software Engineer

Cerebras · Semiconductors · Headquarters +2 · Software

This role is for a Distributed Software Engineer at Cerebras, a company that builds large AI chips and supercomputers. The engineer will automate bare-metal configuration, develop push-button workflows for cluster management, and build an orchestration and scheduler system for resource allocation in a multi-user environment. The role also involves supporting both on-premise and cloud deployments, implementing robust monitoring and failure handling, and developing user- and administrator-facing tools for cluster management.

What you'd actually do

  1. Automate bare-metal configuration of networking, OS, and application software in large clusters of Cerebras WSE, servers, and switches.
  2. Build push-button workflows for cluster upgrades, downgrades, and security patching, with key metrics aimed at minimizing cluster downtime.
  3. Build an orchestration and scheduler system for resource allocation, job submission, and job placement in a multi-user environment on a cluster.
  4. Provide seamless support for both on-premise and cloud deployment and operations.
  5. Develop a robust system for monitoring, detecting, and handling failures across a variety of cluster resources (including high availability of clusters).

Skills

Required

  • software architecture
  • system design
  • distributed cluster development
  • Kubernetes (K8s) software ecosystem
  • Prometheus
  • Grafana
  • GoLang
  • Python
  • bash
  • debugging distributed systems
  • test development

What the JD emphasized

  • Strong track record of software architecture, system design and development.
  • Strong track record of development on distributed clusters.
  • Strong understanding of Kubernetes (K8s) software ecosystem, Prometheus and Grafana.
  • Strong development skills in GoLang, Python, bash.
  • Strong debugging skills with distributed systems.
  • Strong skills in developing tests for new features and regression tests for existing features.