Member of Technical Staff, Multimodal Infrastructure - MAI Superintelligence Team

Microsoft · Mountain View, CA +4 · Software Engineering

This role focuses on building and maintaining large-scale infrastructure for multimodal generative models, covering the full development cycle from data processing to training, inference, and serving. It involves working with research scientists and product engineers to optimize performance and drive architectural changes for consumer AI products like Copilot.

What you'd actually do

  1. Design, develop and maintain large-scale multimodal data processing pipelines.
  2. Design, develop and maintain large-scale multimodal model pretraining and post-training frameworks.
  3. Design, develop and maintain large-scale multimodal model inference and serving frameworks.
  4. Work with research scientists and product engineers to solve infra-related problems.
  5. Find a path through roadblocks to get your work into the hands of users quickly and iteratively.

Skills

Required

  • Bachelor's Degree in Computer Science or a related technical discipline AND 6+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience

Nice to have

  • Bachelor's Degree in Computer Science or a related technical field AND 10+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR Master's Degree in Computer Science or a related technical field AND 8+ years of such experience; OR equivalent experience
  • Experience in multi-modal data processing
  • Strong proficiency in distributed data processing infrastructure (resource utilization management, fault tolerance, Ray, and Spark) and CPU/GPU batch processing optimizations
  • Experience with state-of-the-art model inference and serving frameworks
  • Experience with image/video/audio data processing
  • Experience with common data formats for efficient I/O
  • Experience in multi-modal pretraining and post-training
  • Strong proficiency in deep learning frameworks such as PyTorch, Megatron, and DeepSpeed
  • Knowledge of auto-regressive and diffusion transformer models
  • Experience with distributed training techniques such as data parallelism, model parallelism, and pipeline parallelism
  • Proven experience in at least one of the following areas: image/video generation and editing; efficient architectures (e.g., MoE, window attention); efficient model design; or reinforcement learning training methods (e.g., RLHF, DPO, GRPO)
  • Experience in multi-modal inference and serving
  • Strong proficiency in serving frameworks such as vLLM, TensorRT-LLM, SGLang, xDiT, Cache-DiT, etc.
  • Knowledge of distillation techniques such as Progressive Distillation, DMD, Self-Forcing, etc.
  • Knowledge of quantization and compression techniques like AWQ, GPTQ, and FP8 for multi-modal pipelines
  • Experience in distributed inference scaling across multi-node clusters using Ray Serve and Triton
  • Experience in leading technical projects and supporting architectural decisions with data
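Of the distributed training techniques named above, data parallelism is the simplest to illustrate. The sketch below is purely illustrative (plain Python, no real framework's API; function names and the toy linear model are invented for the example): each worker computes a gradient on its shard of the batch, and the local gradients are averaged, mimicking an all-reduce, so every replica applies the same update.

```python
# Illustrative data-parallelism sketch (not any framework's API):
# each worker computes gradients on its shard of the batch, then the
# gradients are averaged (an "all-reduce") so replicas stay in sync.

def grad_mse_linear(w, shard):
    """Gradient of mean squared error for y_hat = w * x over one shard."""
    n = len(shard)
    return sum(2 * (w * x - y) * x for x, y in shard) / n

def data_parallel_step(w, batch, num_workers, lr=0.01):
    # Split the batch into equal shards, one per worker.
    size = len(batch) // num_workers
    shards = [batch[i * size:(i + 1) * size] for i in range(num_workers)]
    # Each worker computes a local gradient on its own shard.
    local_grads = [grad_mse_linear(w, s) for s in shards]
    # All-reduce: average the local gradients across workers.
    g = sum(local_grads) / num_workers
    # Every replica applies the identical averaged update.
    return w - lr * g

batch = [(x, 3.0 * x) for x in range(1, 9)]  # targets from w* = 3
w_next = data_parallel_step(1.0, batch, num_workers=4)
```

With equal shard sizes, the averaged gradient equals the full-batch gradient, which is why data parallelism preserves the single-device training trajectory (up to numerics); model and pipeline parallelism instead split the network itself across devices.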

What the JD emphasized

  • multimodal data processing
  • multimodal pretraining and post-training
  • multimodal inference and serving

Other signals

  • building large-scale infrastructure
  • multimodal generative model development
  • full cycle of multimodal generative model development