Member of Technical Staff, Evaluations Engineering - MAI Superintelligence Team

Microsoft · Mountain View, CA +4 · Software Engineering

This role focuses on building and scaling the evaluation infrastructure for generative AI models on large-scale GPU clusters. It involves developing sophisticated tools and techniques for reliability, performance, and health monitoring, and collaborating with model scientists on evaluation methods and inference strategies. The role also touches on pretraining software development and benchmarking.

What you'd actually do

  1. Develop and tune scalable pretraining software for NVIDIA GB200 NVL72 (ConnectX-8 networking) and AMD MIxxx architectures.
  2. Benchmark GB200 and AMD MIxxx GPU clusters.
  3. Gather data and insights to develop the pretraining compute roadmap.
  4. Care deeply about conversational AI and its deployment.
  5. Actively contribute to the development of AI models that are powering our innovative products.
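The benchmarking duty above (item 2) ultimately rests on disciplined timing methodology: warmup runs to exclude one-time costs, then best-of-N wall-clock measurement to suppress noise. A minimal, hardware-agnostic sketch of such a harness (the function names and the naive matmul stand-in workload are illustrative, not from the posting):

```python
import time

def benchmark(fn, *args, warmup=2, iters=5):
    """Time a workload: run warmup iterations first, then report the best
    wall-clock time over `iters` measured runs (best-of-N reduces noise)."""
    for _ in range(warmup):
        fn(*args)
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def matmul(a, b):
    # Naive pure-Python square matrix multiply, used as a stand-in workload;
    # on a real cluster this would be a GPU kernel or a collective operation.
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n = 64
a = [[1.0] * n for _ in range(n)]
b = [[2.0] * n for _ in range(n)]
secs = benchmark(matmul, a, b)
flops = 2 * n ** 3  # one multiply and one add per inner-loop step
print(f"{n}x{n} matmul: {secs * 1e3:.2f} ms, {flops / secs / 1e6:.1f} MFLOP/s")
```

The same warmup/best-of-N pattern carries over directly when the timed body is a CUDA kernel launch or an NCCL/RCCL collective, with device synchronization added around the timed region.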

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

Nice to have

  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience
  • Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience
  • Experience with generative AI
  • Experience with distributed computing
  • Experience leading technical projects and supporting architectural decisions with data

What the JD emphasized

  • design and build the evaluation infrastructure for generative AI on large-scale GPU clusters
  • develop sophisticated tools and techniques to ensure the reliability, performance, and health of hundreds of nodes across supercomputers with thousands of GPUs
  • collaborate closely with model scientists to implement state-of-the-art and novel evaluation methods, inference strategies, and metrics algorithms
