Software Development Engineer - Testing Infrastructure & Performance, Trainium/Neuron, Annapurna Labs

Amazon · Big Tech · Cupertino, CA · Software Development

Software Development Engineer role: own and evolve the testing infrastructure and performance pipeline that validates Neuron software across workloads, models, and hardware configurations. Build data pipelines that collect, process, and surface performance metrics, along with dashboards that give visibility into regression detection, performance trends, and release readiness.

What you'd actually do

  1. Design, build, and maintain automated testing infrastructure (Python) that validates the Neuron compiler, runtime, and framework integrations across hardware targets
  2. Build and operate data pipelines that ingest performance benchmarks, test results, and system metrics from distributed test runs into centralized data stores
  3. Develop performance dashboards that surface regression detection, trend analysis, and release-readiness signals to engineering teams and leadership
  4. Create and maintain integration test frameworks that validate end-to-end model compilation, execution, and performance across Trainium configurations
  5. Build automated alerting and triage tooling that identifies performance regressions early and routes them to the right owners

Skills

Required

  • Python
  • data pipelines
  • testing frameworks
  • test automation

Nice to have

  • Machine and Deep Learning toolkits (MXNet, TensorFlow, Caffe, PyTorch)
  • performance dashboards
  • observability tooling (Grafana, QuickSight, CloudWatch)
  • distributed systems testing
  • hardware-in-the-loop test infrastructure
  • AWS services (S3, DynamoDB, Lambda, Step Functions, ECS)
  • statistical methods for regression detection
  • performance analysis

What the JD emphasized

  • performance characterization systems
  • performance pipeline
  • performance metrics
  • performance dashboards
  • performance trends
  • performance benchmarks
  • performance analysis

Other signals

  • testing infrastructure
  • performance characterization
  • ML workloads
  • AWS custom AI training and inference chips