Site Reliability Engineer, Metal

Tenstorrent · Semiconductors · Toronto, ON · AI Software

This Site Reliability Engineer role focuses on the reliability, observability, and production readiness of Tenstorrent's large-scale AI systems, both internal and customer-facing. The work spans troubleshooting, monitoring, and automation, in partnership with engineering teams and customers.

What you'd actually do

  1. Ensure reliability and operational health of Tenstorrent systems across internal and customer environments.
  2. Troubleshoot complex issues across compute, networking, and software layers.
  3. Partner with engineering teams and customers to resolve production incidents.
  4. Design and improve monitoring, observability, and alerting systems.
  5. Build automation to reduce operational toil and improve system reliability.

Skills

Required

  • Site reliability, infrastructure, or systems engineering in distributed environments
  • Linux systems knowledge
  • Observability tools (Prometheus, Grafana)
  • Scripting and automation (Python, Go)
  • Networking fundamentals

What the JD emphasized

  • customer deployments
  • reliability
  • observability
  • production-ready
  • large-scale AI systems