Sr. Fellow Machine Learning Engineer

AMD · Semiconductors · San Jose, CA · Engineering

AMD is seeking a Sr. Fellow Machine Learning Engineer to join its Training At Scale team, which focuses on distributed training of large generative AI models across large numbers of GPUs. The role involves improving the training efficiency, performance, and debuggability of end-to-end training pipelines, and optimizing distributed training software stacks. The candidate will contribute to open source, stay current with training algorithms, and influence the direction of AMD's AI platform.

What you'd actually do

  1. Train large models to convergence on AMD GPUs at scale.
  2. Improve end-to-end training pipeline performance on large-scale GPU clusters.
  3. Improve end-to-end debuggability on large-scale GPU clusters.
  4. Design and optimize the distributed training pipeline and software stack to scale out.
  5. Contribute your changes to open source.

Skills

Required

  • distributed training pipelines
  • distributed training algorithms (Data Parallel, Tensor Parallel, Pipeline Parallel, Expert Parallel)
  • training large models
  • Python
  • C++
  • performance profiling
  • debugging
  • large-scale optimization
  • ML frameworks (PyTorch, JAX, TensorFlow)
  • distributed frameworks (TorchTitan, Megatron-LM)

Nice to have

  • machine learning
  • distributed systems
  • AI infrastructure
  • model and application-level development and optimization
  • LLMs
  • recommendation systems
  • ranking models
  • collaborating across hardware, compiler, and system software layers

What the JD emphasized

  • distributed training
  • large models
  • large scale GPU cluster
  • training efficiency
  • distributed training algorithms
  • training large models
  • distributed training systems
  • large-scale optimization

Other signals

  • distributed training
  • large models
  • GPU cluster
  • training efficiency