Sr. Member of Technical Staff

Cerebras · Semiconductors · Headquarters +1 · Software

This role focuses on designing and developing software features for system resiliency and high availability in distributed environments, specifically for AI inference services. Responsibilities include building scalable inference services, developing cloud deployment workflows, improving reliability through automation, and collaborating across engineering teams to deliver high-performance software. The role requires experience with infrastructure-as-code, containerization, AWS compute services, and monitoring tools, as well as programming languages such as Python.

What you'd actually do

  1. Design and develop software features that support system resiliency and high availability, including automated recovery mechanisms and fault-tolerant architecture across distributed environments.
  2. Develop and maintain cloud-based deployment workflows for AI inference software using AWS tools and services to support low-latency and scalable system performance.
  3. Develop Python-based scripts and APIs to streamline data preprocessing, inference execution, and post-processing for real-time inference tasks.
  4. Use parallel programming techniques (e.g., multi-threading, asynchronous processing) to maximize resource efficiency on AWS compute instances.
  5. Develop software components to support visualization and analysis of system performance metrics, enhancing the monitoring and usability of inference services.

Skills

Required

  • Terraform
  • AWS CloudFormation
  • AWS CDK
  • Ansible
  • Docker
  • Kubernetes
  • AWS EKS
  • AWS Elastic Container Service (ECS)
  • AWS Fargate
  • Helm
  • AWS EC2
  • AWS Lambda functions
  • Auto Scaling Groups
  • AWS CloudWatch
  • AWS X-Ray
  • ELK (Elasticsearch, Logstash, Kibana)
  • Prometheus
  • Grafana
  • Python
  • Node.js
  • JavaScript
  • Flask
  • PostgreSQL
  • Redis
  • NFS
  • Jenkins
  • Git

What the JD emphasized

  • high availability
  • low-latency
  • real-time inference
  • system reliability
  • inference services

Other signals

  • build and maintain scalable AI inference services
  • develop cloud-based deployment workflows
  • improve system reliability through automation
  • deliver high-performance software solutions