Member of Technical Staff - Multimodal Understanding

xAI · Palo Alto, CA

xAI is seeking a Member of Technical Staff for its Multimodal Understanding team to advance superhuman multimodal intelligence. The role spans the full stack of multimodal AI, from data curation and pre-training through post-training, inference, evaluation, and end-to-end product experiences. Responsibilities include designing and optimizing large-scale distributed systems, building data pipelines, advancing multimodal capabilities, creating evaluation frameworks, and innovating on algorithms and scaling paradigms. The role requires hands-on experience with multimodal pre-training, post-training, or fine-tuning; proficiency in Python and ML frameworks; and a proven track record of building large-scale distributed ML systems and data pipelines.

What you'd actually do

  1. Design, build, and optimize large-scale distributed systems for multimodal pre-training, post-training, inference, data processing, and tokenization at web/petabyte scale.
  2. Develop high-throughput pipelines for data acquisition, preprocessing, filtering, generation, decoding, loading, crawling, visualization, and management (images, videos, audio + text).
  3. Advance multimodal capabilities including spatial-temporal compression, cross-modal alignment, world modeling, reasoning, emergent abilities, audio/image/video understanding & generation, real-time video processing, and noisy data handling.
  4. Drive data quality work and studies: curation (human and synthetic), filtering techniques, analysis, and scalable pipelines to support trillion-parameter models.
  5. Create evaluation frameworks, internal benchmarks, reward models, and metrics that capture real-world usage, failure modes, interactive dynamics, and human-AI synergy.
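As an illustrative sketch of the filtering work item 2 describes, here is a minimal quality-filtering stage for noisy image-text pairs using only the Python standard library. All names, fields, and heuristics are hypothetical, not xAI's actual pipeline; production versions would shard across machines rather than threads.

```python
# Minimal sketch of a noisy-data filtering stage (all names hypothetical).
from concurrent.futures import ThreadPoolExecutor

def passes_quality_filter(sample: dict) -> bool:
    """Toy heuristic: keep pairs with a non-trivial caption and a usable image."""
    caption = sample.get("caption", "")
    return len(caption.split()) >= 3 and sample.get("width", 0) >= 64

def filter_shard(samples: list[dict]) -> list[dict]:
    """Filter one shard in parallel; real pipelines distribute shards across workers."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        keep = list(pool.map(passes_quality_filter, samples))
    return [s for s, k in zip(samples, keep) if k]

shard = [
    {"caption": "a dog playing fetch in a park", "width": 512},
    {"caption": "img_0042", "width": 512},                     # junk caption
    {"caption": "sunset over the bay at dusk", "width": 32},   # image too small
]
print(len(filter_shard(shard)))  # → 1
```

The point of the sketch is the shape of the work: cheap per-sample heuristics applied in parallel over sharded data, with the keep/drop decision separated from the iteration machinery so filters can be swapped during quality studies.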

Skills

Required

  • Hands-on experience with multimodal pre-training, post-training, or fine-tuning (vision, audio, video, or cross-modal).
  • Expert-level proficiency in Python (core language), with strong experience in at least one of: JAX / PyTorch / XLA.
  • Proven track record building or optimizing large-scale distributed ML systems (training/inference optimization, GPU utilization, multi-GPU/TPU setups, hardware co-design).
  • Deep experience designing and running data pipelines at scale: curation, filtering, generation, quality studies, especially for noisy/real-world multimodal data.
  • Strong fundamentals in evaluation design, benchmarks, reward modeling, or RL techniques (particularly for interactive/agentic behaviors).
  • Proactive self-starter who thrives in high-intensity environments and is passionate about pushing multimodal AI frontiers.
  • Willingness to own end-to-end initiatives and do whatever it takes to deliver breakthrough user experiences.
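To ground the "evaluation design" requirement above, here is a hedged sketch of a tiny benchmark harness: it scores any `predict(prompt) -> answer` callable against labeled cases and reports per-category accuracy, which is how failure modes get localized. Everything here (case schema, categories, the toy model) is hypothetical and illustrative only.

```python
# Hypothetical sketch of a minimal evaluation harness (illustrative only).
from collections import defaultdict

def evaluate(predict, benchmark):
    """Score a predict(prompt) -> answer callable; return accuracy per category."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for case in benchmark:
        totals[case["category"]] += 1
        if predict(case["prompt"]) == case["answer"]:
            hits[case["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

benchmark = [
    {"prompt": "2+2", "answer": "4", "category": "math"},
    {"prompt": "capital of France", "answer": "Paris", "category": "geo"},
]
toy_model = {"2+2": "4", "capital of France": "Lyon"}.get  # wrong on "geo"
print(evaluate(toy_model, benchmark))  # → {'math': 1.0, 'geo': 0.0}
```

Per-category breakdowns, rather than a single aggregate number, are what make a benchmark useful for tracking the failure modes and real-world usage patterns the responsibilities list calls out.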

Nice to have

  • Experience leading major improvements in model capabilities through better data, modeling, algorithms, or scaling.
  • Familiarity with state-of-the-art in multimodal LLMs, scaling laws, tokenizers, compression techniques, reasoning, or agentic systems.
  • Proficiency in Rust and/or C++ for performance-critical components.
  • Hands-on work with large-scale orchestration tools such as Spark, Ray, or Kubernetes.
  • Background building full-stack tooling: performant interfaces, real-time research demos/apps, or end-to-end product ownership.
  • Passion for end-to-end user experience in interactive, real-time multimodal AI systems.

What the JD emphasized

  • web/petabyte scale
  • trillion-parameter models
  • real-time video processing
  • noisy data handling
  • human/synthetic curation
  • scalable pipelines
  • real-world usage
  • failure modes
  • interactive dynamics
  • human-AI synergy
  • state-of-the-art performance
  • user-friendly interfaces
  • full-stack applications
  • rapid iteration
  • reasoning
  • tool calling
  • agentic behaviors
  • orchestration
  • seamless real-time interactions
  • multimodal pre-training
  • post-training
  • fine-tuning
  • vision
  • audio
  • video
  • cross-modal
  • Python
  • JAX
  • PyTorch
  • XLA
  • large-scale distributed ML systems
  • training/inference optimization
  • GPU utilization
  • multi-GPU/TPU setups
  • hardware co-design
  • data pipelines at scale
  • curation
  • filtering
  • generation
  • quality studies
  • noisy/real-world multimodal data
  • evaluation design
  • benchmarks
  • reward modeling
  • RL techniques
  • interactive/agentic behaviors
  • push multimodal AI frontiers
  • end-to-end initiatives
  • breakthrough user experiences
  • major improvements in model capabilities
  • multimodal LLMs
  • scaling laws
  • tokenizers
  • compression techniques
  • agentic systems
  • Rust
  • C++
  • Spark
  • Ray
  • Kubernetes
  • full-stack tooling
  • performant interfaces
  • real-time research demos/apps
  • end-to-end product ownership
  • interactive, real-time multimodal AI systems

Other signals

  • multimodal intelligence
  • large-scale pre-training
  • post-training/alignment
  • inference
  • evaluation
  • tooling/demos
  • end-to-end product experiences
  • frontier capabilities
  • world modeling
  • tool use
  • interactive human-AI collaboration
  • reward models
  • scaling paradigms
  • research tooling