Sr. System Development Engineer, AGI Infrastructure

Amazon · Big Tech · IN, TN +1 · Systems, Quality, & Security Engineering

The AGI team is seeking engineers to develop and maintain multi-modal and multi-lingual LLMs using scalable training and inference systems. The role involves deeply understanding technology landscapes, evaluating new technologies, and driving operational excellence. Key responsibilities include leading the design and automation of GenAI training compute infrastructure, mentoring engineers, identifying performance bottlenecks, and working with core AWS services, CI/CD pipelines, and Kubernetes.

What you'd actually do

  1. Lead the design and automation of GenAI training compute infrastructure, and improve it continuously.
  2. Guide and mentor other engineers, acting as a force multiplier to deliver results.
  3. Participate in design and code reviews and identify bottlenecks.
  4. Identify performance bottlenecks in compute infrastructure and propose solutions to address them.

Skills

Required

  • systems design
  • software development
  • operations
  • automation
  • process improvement
  • Python
  • Ruby
  • Golang
  • Java
  • C++
  • C#
  • Rust
  • Linux/Unix
  • CI/CD pipelines and build processes
  • Kubernetes

Nice to have

  • AWS services (EC2, Lambda, EKS)
  • AWS CodePipeline
  • GitHub Actions
  • AWS CloudFormation
  • Terraform
  • AWS CDK
  • networking concepts (VPC, subnets, security groups)
  • Load Balancers
  • Route 53
  • distributed systems at scale

What the JD emphasized

  • deeply understand technology landscapes
  • evaluate the use of new technologies
  • drive your team to push for improvements that can scale across other teams, services, and platforms
  • 6+ years of systems design, software development, operations, automation, and process improvement experience

Other signals

  • AGI Infrastructure
  • multi-modal and multi-lingual large language models (LLMs)
  • large model training and inference systems
  • sensory AI foundational models