Senior Software Engineer - Compute Infrastructure (Cloud Native)

ByteDance · Big Tech · San Jose, CA · R&D

Senior Software Engineer role focused on building and optimizing large-scale Kubernetes and Serverless compute infrastructure that powers AI/LLM workloads. The role involves designing, improving, and managing resource orchestration, scheduling, and observability for these systems, with a focus on cost efficiency and performance. The team also contributes to open-source projects.

What you'd actually do

  1. Design and evolve the architecture of large-scale Kubernetes-based infrastructure platforms to ensure performance, scalability, and resilience for diverse workloads, including microservices, big data, and AI/LLM applications.
  2. Improve K8s system performance across the control and data planes, including optimizing pod lifecycle, resource orchestration, and system-level throughput under high load.
  3. Build robust observability and performance analysis frameworks, define K8s system-level SLOs, and lead data-driven tuning and optimization initiatives in production.
  4. Develop intelligent, unified resource management and scheduling systems (at node and cluster level) to support a wide range of compute resources in large-scale, cloud-native environments.
  5. Drive the standardization and optimization of container runtime environments to enhance workload isolation, reliability, and resource efficiency across heterogeneous compute environments.

Skills

Required

  • B.S./M.S. degree in Computer Science, Computer Engineering, or a related area with 3+ years of relevant industry experience.
  • Solid understanding of at least one of the following: Unix/Linux environments, distributed and parallel systems, high-performance networking systems, or large-scale software system development.
  • Familiarity with container and orchestration technologies such as Docker and Kubernetes.
  • Proficiency in at least one major programming language such as Python, Go, C++, Rust, or Java.

Nice to have

  • Knowledge of big data or machine learning workflows in a Kubernetes environment.
  • Experience in developing or contributing to cloud-native open-source projects.
  • Hands-on experience with containerized applications through internships, coursework, or personal projects.
  • Familiarity with observability tools and frameworks like Prometheus, Grafana, or distributed tracing systems.

What the JD emphasized

  • AI and LLM workloads
  • optimize our infrastructure for AI & LLM models
  • support the most demanding AI/LLM workloads
  • large-scale Kubernetes-based infrastructure platforms
  • Improve K8s system performance
  • resource orchestration
  • intelligent, unified resource management and scheduling systems
  • large-scale, cloud-native environments

Other signals

  • powering global platforms like TikTok and various AI/ML & LLM initiatives