Software Engineer - Compute Infrastructure (Cloud Native)

ByteDance · Big Tech · Seattle, WA · R&D

Software Engineer role focused on building and optimizing large-scale compute infrastructure (Kubernetes, Serverless) that supports AI/LLM workloads. The role involves improving system performance, developing resource management and scheduling systems, and driving standardization for efficiency and reliability. While the infrastructure supports AI, the core craft is in infrastructure engineering, not direct AI/ML model development.

What you'd actually do

  1. Design and evolve the architecture of large-scale Kubernetes-based infrastructure platforms to ensure performance, scalability, and resilience for diverse workloads, including microservices, big data, and AI/LLM applications.
  2. Improve K8s system performance across the control and data planes, including optimizing pod lifecycle, resource orchestration, and system-level throughput under high load.
  3. Build robust observability and performance analysis frameworks, define K8s system-level SLOs, and lead data-driven tuning and optimization initiatives in production.
  4. Develop intelligent, unified resource management and scheduling systems (at both the node and cluster level) to support a wide range of compute resources in large-scale, cloud-native environments.
  5. Drive the standardization and optimization of container runtime environments to enhance workload isolation, reliability, and resource efficiency across heterogeneous compute environments.

Skills

Required

  • B.S./M.S. degree in Computer Science, Computer Engineering, or a related area, with 3+ years of relevant industry experience
  • Solid understanding of at least one of the following fields: Unix/Linux environments, distributed and parallel systems, high-performance networking systems, or large-scale software system development
  • Familiarity with container and orchestration technologies such as Docker and Kubernetes.
  • Proficiency in at least one major programming language such as Python, Go, C++, Rust, or Java.

Nice to have

  • Knowledge of big data or machine learning workflows in a Kubernetes environment.
  • Experience in developing or contributing to cloud-native open-source projects.
  • Hands-on project experience with containerized applications through internships, coursework, or personal projects.
  • Familiarity with observability tools and frameworks like Prometheus, Grafana, or distributed tracing systems.

What the JD emphasized

  • large-scale
  • AI and LLM workloads
  • optimize our infrastructure for AI & LLM models
  • resource cost efficiency
  • scale and optimize our infrastructure globally