Member of Technical Staff, AI Training Infrastructure

Fireworks AI · Data AI · New York, NY +1 · Engineering

The role focuses on designing, building, and optimizing infrastructure for large-scale model training, including distributed training pipelines, performance optimization, and data storage solutions for LLMs and multimodal models.

What you'd actually do

  1. Design and implement scalable infrastructure for large-scale model training workloads
  2. Develop and maintain distributed training pipelines for LLMs and multimodal models
  3. Optimize training performance across multiple GPUs, nodes, and data centers
  4. Implement monitoring, logging, and debugging tools for training operations
  5. Architect and maintain data storage solutions for large-scale training datasets

Skills

Required

  • Bachelor's degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience
  • 3+ years of experience with distributed systems and ML infrastructure
  • Experience with PyTorch
  • Proficiency in cloud platforms (AWS, GCP, Azure)
  • Experience with containerization and orchestration (Docker, Kubernetes)
  • Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP)
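The distributed training techniques named above share one core idea in the data-parallel case: each worker computes gradients on its own data shard, then an all-reduce averages them so every worker applies the identical update. A minimal pure-Python sketch of that loop (not Fireworks' stack; real systems such as PyTorch DDP or FSDP run this with NCCL collectives across GPUs):

```python
def local_gradient(w, shard):
    # Least-squares gradient for y = w * x on one worker's shard:
    # d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # Stand-in for a collective all-reduce: average across workers.
    return sum(grads) / len(grads)

def train_step(w, shards, lr=0.02):
    grads = [local_gradient(w, s) for s in shards]  # runs in parallel in practice
    g = all_reduce_mean(grads)                      # synchronize gradients
    return w - lr * g                               # same update on every worker

# Data for y = 3x, split across two "workers"
shards = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]]
w = 0.0
for _ in range(200):
    w = train_step(w, shards)
# w converges toward 3.0
```

FSDP extends this by also sharding the model parameters themselves across workers, gathering them only when needed, which is what makes it relevant for models too large for a single GPU.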

Nice to have

  • Master's or PhD in Computer Science or related field
  • Experience training large language models or multimodal AI systems
  • Experience with ML workflow orchestration tools
  • Background in optimizing high-performance distributed computing systems
  • Familiarity with ML DevOps practices
  • Contributions to open-source ML infrastructure or related projects

What the JD emphasized

  • 3+ years of experience with distributed systems and ML infrastructure
  • Experience training large language models or multimodal AI systems

Other signals

  • large-scale model training infrastructure
  • distributed training pipelines
  • optimize training performance