Senior Member of Technical Staff, Multimodal AI

Cohere · AI Frontier · San Francisco, CA · Modeling

Cohere is seeking a Senior Member of Technical Staff to focus on Multimodal AI. This role involves designing and developing cutting-edge multimodal AI systems integrating text, speech, and vision. The candidate will conduct research and experiments on advanced compute infrastructure, exploring novel ideas in multimodal representation learning and transfer learning. The role requires strong software engineering skills, proficiency in Python and deep learning frameworks (JAX, PyTorch, TensorFlow), and knowledge of distributed training strategies for large-scale multimodal models. Experience with autoregressive models for tasks like image/video captioning and speech-to-text is beneficial. The ideal candidate enjoys tuning and optimizing large multimodal models and building evaluations to measure their performance.

What you'd actually do

  1. Design and develop cutting-edge multimodal AI systems, integrating various modalities such as text, speech, and vision.
  2. Conduct research and experiments on our advanced compute infrastructure, exploring novel ideas in multimodal representation learning, transfer learning, and more.
  3. Collaborate closely with our world-class teams, learning from and contributing to their expertise in the field.

Skills

Required

  • Python
  • JAX
  • PyTorch
  • TensorFlow
  • distributed training
  • autoregressive models
  • image captioning
  • video captioning
  • speech-to-text generation
  • ML codebases

Nice to have

  • Publications in top-tier venues
  • CUDA
  • GPU kernels

What the JD emphasized

  • exceptional software engineering skills
  • proven track record of building robust and scalable systems
  • strong command of Python
  • well-versed in popular deep learning frameworks like JAX, PyTorch, and TensorFlow
  • knowledge of distributed training strategies, especially for large-scale multimodal models
  • familiarity with autoregressive models, particularly their application in multimodal tasks such as image or video captioning and speech-to-text generation
  • tuning and optimising large multimodal models
  • experience building evaluations to measure their performance
  • comfortable diving into complex ML codebases to identify and resolve issues
  • thrive in a fast-paced, technically challenging environment
  • history of delivering creative, practical solutions to real-world problems

Other signals

  • training frontier models
  • multimodal AI
  • vision-language model
  • outperforms major models