Rack Scale Serviceability & Telemetry Architect

AMD · Semiconductors · Austin, TX · Engineering

This role owns the rack-scale serviceability and telemetry architecture for AMD Instinct platforms used in AI and HPC deployments. The architect defines end-to-end manageability, observability, and serviceability across node, chassis, rack, and fleet domains, and drives the strategy and delivery of standards-based solutions for inventory, discovery, health monitoring, telemetry, eventing, diagnostics, firmware lifecycle management, and field service workflows. The role calls for deep technical system architecture skills, first-principles thinking, and a track record of delivering solutions for servers, accelerators, storage, networking, or rack-scale AI/HPC platforms, with a strong emphasis on DMTF Redfish and OpenBMC.

What you'd actually do

  1. Define and own the end-to-end rack-scale serviceability and telemetry architecture for AMD Instinct-based solutions, spanning node BMC, chassis/rack management, service processors/controllers, management network, and fleet-level observability integration.
  2. Define the standards strategy and interface architecture using DMTF Redfish, PLDM, MCTP, and related specifications, maximizing standards compliance while establishing AMD/OEM extensions only where required.
  3. Drive OpenBMC-based architecture and implementation direction for BMC and rack management controllers, including D-Bus object models, bmcweb/Redfish requirements, sensor and FRU inventory models, logging, eventing, firmware update, and debug workflows.
  4. Architect telemetry frameworks for health, power, thermal, inventory, error, utilization, and service data. Define schemas, metric taxonomies, triggers, event models, aggregation, retention, and reporting strategies required for at-scale observability and automated service operations.
  5. Define platform serviceability flows covering discovery, inventory correlation, fault isolation, diagnostics, crashdump and error capture, remote recovery, FRU replacement, firmware/driver update orchestration, and return-to-service procedures.

Skills

Required

  • System architecture
  • Rack-scale solutions
  • Serviceability
  • Telemetry
  • Manageability
  • DMTF Redfish
  • OpenBMC
  • PLDM
  • MCTP
  • Server firmware
  • Embedded software
  • Data center infrastructure
  • AI/HPC platforms

Nice to have

  • First-principles thinking
  • Influence without authority
  • Raising execution quality across teams
  • Direct, humble, collaborative, and inclusive leadership
  • Standards compliance
  • D-Bus object models
  • bmcweb/Redfish requirements
  • Sensor and FRU inventory models
  • Logging
  • Eventing
  • Firmware update
  • Debug workflows
  • Metric taxonomies
  • Triggers
  • Event models
  • Aggregation
  • Retention
  • Reporting strategies
  • Fault isolation
  • Crashdump and error capture
  • Remote recovery
  • FRU replacement
  • Firmware/driver update orchestration
  • Return-to-service procedures
  • Validation and conformance strategy
  • Interoperability
  • Fault injection
  • Scale testing
  • Field debug methodology
  • AMD Instinct platform roadmaps
  • Open-source communities

What the JD emphasized

  • deeply technical system architect
  • track record of delivering manageability, telemetry, and serviceability solutions for servers, accelerators, storage, networking, or rack-scale AI/HPC platforms
  • standards-based solutions
  • DMTF Redfish
  • OpenBMC-based architecture
  • telemetry frameworks
  • platform serviceability flows
  • standards and open-source communities