Senior Software Engineer - Compute Infrastructure (Cloud Native)

ByteDance · Big Tech · Seattle, WA · R&D

Senior Software Engineer focused on building and optimizing large-scale Kubernetes-based compute infrastructure for diverse workloads, including AI and LLM applications. The role involves improving system performance, developing resource management and scheduling systems, and driving standardization for efficiency and reliability in cloud-native environments.

What you'd actually do

  1. Design and evolve the architecture of large-scale Kubernetes-based infrastructure platforms to ensure performance, scalability, and resilience for diverse workloads, including microservices, big data, and AI/LLM applications.
  2. Improve K8s system performance across the control and data planes, including optimizing pod lifecycle, resource orchestration, and system-level throughput under high load.
  3. Build robust observability and performance analysis frameworks, define K8s system-level SLOs, and lead data-driven tuning and optimization initiatives in production.
  4. Develop intelligent, unified resource management and scheduling systems (at both the node and cluster levels) to support a wide range of compute resources in large-scale, cloud-native environments.
  5. Drive the standardization and optimization of container runtime environments to enhance workload isolation, reliability, and resource efficiency across heterogeneous compute environments.

Skills

Required

  • Unix/Linux environments
  • distributed and parallel systems
  • high-performance networking systems
  • developing large-scale software systems
  • Docker
  • Kubernetes
  • Python
  • Go
  • C++
  • Rust
  • Java

Nice to have

  • big data workflows in a Kubernetes environment
  • machine learning workflows in a Kubernetes environment
  • cloud-native open-source projects
  • containerized applications
  • Prometheus
  • Grafana
  • distributed tracing systems

What the JD emphasized

  • large-scale Kubernetes-based infrastructure platforms
  • AI/LLM applications
  • Improve K8s system performance
  • high load
  • large-scale, cloud-native environments
  • heterogeneous compute environments