Research Engineer, Infrastructure

Cognition · Coding AI · San Francisco, CA · Research & Development

This is the Research Engineer, Infrastructure role at Cognition, an applied AI lab building end-to-end software agents like Devin. The role centers on building and owning the core systems that researchers depend on: distributed training infrastructure, experiment orchestration, data pipelines, and tooling that accelerates research velocity. In practice, that means keeping large-scale training jobs fast, reliable, and scalable across thousands of GPUs, with an emphasis on performance optimization and parallelism strategies. The ideal candidate has deep experience with distributed systems, Python and C++, PyTorch, and GPU profiling, plus enough ML knowledge to engage substantively with researchers.

What you'd actually do

  1. Build and own the systems that run large-scale training jobs reliably across GPU clusters. This includes job launchers, checkpointing and recovery, fault tolerance, and the monitoring that keeps researchers informed and unblocked.
  2. Own the infrastructure that runs hundreds of thousands of concurrent coding agent rollouts in VM sandboxes, from high-fidelity environment design to the distributed systems that hold up at our largest RL training scales.
  3. Profile and improve training throughput end to end. Identify bottlenecks across data loading, communication overhead, memory utilization, and compute efficiency. Implement solutions that meaningfully improve step time and MFU at scale.
  4. Design and maintain the systems researchers use to launch, track, and analyze experiments. Reduce friction in the research loop so that more time is spent on ideas and less on waiting.
  5. Build high-throughput, reliable data pipelines for training and evaluation. Ensure data quality, reproducibility, and efficiency at the scale our training runs demand.

Skills

Required

  • Python
  • C++
  • PyTorch
  • distributed systems
  • networking
  • storage
  • GPU performance profiling
  • memory optimization
  • compute efficiency
  • parallelism strategies (data, tensor, pipeline, sequence)
  • debugging complex distributed systems
  • ML knowledge (enough to engage substantively with researchers)

Nice to have

  • experience with RLHF
  • experience with agent orchestration
  • experience with vector databases
  • experience with RAG

What the JD emphasized

  • Deep experience building and operating distributed training systems for large models
  • Strong systems engineering fundamentals
  • Proficiency in Python and C++
  • Hands-on experience with GPU performance profiling, memory optimization, and compute efficiency
  • Experience implementing or optimizing parallelism strategies
  • Track record of building tooling and abstractions that meaningfully accelerate research workflows
  • Strong debugging instincts across complex, distributed systems
  • Enough ML knowledge to engage substantively with researchers

Other signals

  • building distributed training infrastructure
  • scaling agent rollouts
  • performance optimization for training throughput
  • experiment orchestration and tooling
  • data pipeline engineering for training and evaluation
  • implementing and optimizing parallelism strategies
  • scaling infrastructure ahead of research