Software Engineer II - Michelangelo

Uber · Consumer · Seattle, WA +1 · Engineering

Software Engineer II on Uber's ML Training team, responsible for designing and building core components of large-scale distributed training systems for multi-GPU/TPU environments within the Michelangelo AI platform. The role focuses on ML infrastructure and on enabling efficient, reliable, and scalable model development.

What you'd actually do

  1. Design, build, and maintain components of distributed training systems for multi-GPU/TPU environments.
  2. Implement features and improvements for ML training infrastructure and platform services.
  3. Collaborate with ML engineers and data scientists to support model development and deployment workflows.
  4. Write clean, efficient, and maintainable code with proper testing and documentation.
  5. Debug and resolve issues in distributed systems and ML pipelines with guidance from senior engineers.

Skills

Required

  • Python
  • Java
  • Go
  • C++
  • building software systems or services
  • distributed systems fundamentals
  • ML/DL frameworks (e.g., PyTorch, TensorFlow, JAX)
  • ML workflows

Nice to have

  • distributed systems
  • cloud-based infrastructure
  • ML infrastructure
  • training workflows
  • distributed training technologies (e.g., DDP, FSDP, DeepSpeed)
  • GPU/TPU environments
  • accelerator hardware
  • data processing systems (e.g., Spark, Ray)
  • performance optimization
  • system efficiency
  • debugging
  • problem-solving

What the JD emphasized

  • distributed training systems
  • multi-GPU/TPU environments
  • ML infrastructure
  • distributed systems fundamentals
  • ML/DL frameworks
  • distributed training technologies

Other signals

  • ML Training team
  • large-scale distributed training systems
  • multi-GPU/TPU environments
  • ML infrastructure