DevOps Software Development Engineer

AMD · Semiconductors · Shanghai, China · Engineering

Software Development Engineer on the AI/ML Frameworks team responsible for building and maintaining scalable DevOps infrastructure, CI/CD pipelines, and Kubernetes-based GPU environments to accelerate AI software development. The role involves automating systems using Python, Go, and Ansible, and requires debugging ML framework source code (PyTorch, TensorFlow, ROCm).

What you'd actually do

  1. Build System Expertise & Issue Triaging: Develop deep expertise in build tools and flows (CMake, Bazel, Make, compiler toolchains). Triage complex build failures by understanding the full build pipeline—from source to binary. Identify root causes across infrastructure, toolchain, and code-level issues.
  2. Team Training & Knowledge Sharing: Train and mentor team members on build systems, CI/CD workflows, and debugging techniques. Create documentation, runbooks, and training sessions to ensure the team can effectively triage issues independently. Foster a culture of continuous learning around build infrastructure.
  3. ML Framework Integration & Code Contribution: Understand the architecture and codebase of ML frameworks (PyTorch, TensorFlow, ROCm stack). Review, debug, and contribute code changes as needed to resolve build issues, improve CI reliability, or support new features.
  4. Tooling & Automation Development: Design and develop internal tools, automation scripts, and services primarily in Python and Go. Write well-tested, production-grade code to solve infrastructure and workflow challenges.
  5. CI/CD Pipeline Development: Design, implement, and manage efficient continuous integration and delivery pipelines using Buildkite, GitHub Actions, and Jenkins to enable rapid and reliable software deployment for ML workloads.

Skills

Required

  • DevOps/infrastructure engineering
  • Python
  • Go
  • Ansible
  • Kubernetes
  • CI/CD tools
  • Build systems
  • Toolchains
  • CMake
  • Bazel
  • Compiler toolchains
  • PyTorch
  • TensorFlow
  • ROCm
  • Buildkite
  • GitHub Actions
  • Jenkins
  • Docker
  • Helm
  • MySQL
  • Grafana

Nice to have

  • C++
  • JAX

What the JD emphasized

  • understanding how CMake, Bazel, and compiler toolchains work is critical
  • understand ML framework source code
  • understand the architecture and codebase of ML frameworks

Other signals

  • AI/ML Frameworks team
  • DevOps infrastructure that accelerates AMD’s AI software development
  • Kubernetes‑based GPU environments
  • ML framework source code (PyTorch, TensorFlow, ROCm)
  • ML workloads
  • ML framework developers