MTS Software System Design Engineer

AMD · Semiconductors · Austin, TX · Engineering

This role focuses on designing, testing, and validating reference architectures for large-scale AI training and inference clusters. The engineer will develop tools for efficient cluster management, create documentation, serve as a technical interface with customers, conduct proof-of-concept implementations, and evaluate/benchmark infrastructure performance. Expertise in Kubernetes, AI/ML workloads, datacenter networking, GPU computing, and performance optimization for inference deployments is required.

What you'd actually do

  1. Design, test, and validate reference architectures for large-scale AI training and inference clusters
  2. Develop comprehensive tools for AI training to enable efficient cluster management
  3. Create detailed reference documentation and implementation guides for customers and internal teams
  4. Serve as the primary technical interface with customer engineering teams during deployment planning
  5. Conduct proof-of-concept implementations to validate designs in real-world scenarios

Skills

Required

  • Designing and implementing large-scale infrastructure solutions
  • Kubernetes and container orchestration technologies
  • AI/ML workloads in production environments
  • Datacenter networking and storage architectures
  • GPU/AI-accelerated computing environments
  • Creating technical documentation and reference architectures
  • Infrastructure automation and orchestration tools
  • Performance optimization for large-scale inference deployments
  • Ray, PyTorch, and HPC-optimized schedulers for Kubernetes-based AI training
  • SLURM or similar HPC schedulers
  • Infrastructure-as-code tools such as Terraform or Ansible
  • Performance tuning for GPU/AI-accelerated workloads
  • Creating automation tools for infrastructure deployment

What the JD emphasized

  • large-scale AI training and inference clusters
  • large-scale inference deployments
  • large-scale infrastructure solutions
  • Kubernetes and container orchestration technologies
  • AI/ML workloads in production environments
  • GPU/AI-accelerated computing environments
  • Performance optimization for large-scale inference deployments

Other signals

  • designing and implementing large-scale infrastructure solutions
  • Kubernetes and container orchestration technologies
  • AI/ML workloads in production environments
  • GPU/AI-accelerated computing environments
  • Performance optimization for large-scale inference deployments