Software Engineer - Compute Infrastructure (Cloud Native)

ByteDance · Big Tech · San Jose, CA · R&D

This role focuses on building and optimizing large-scale compute infrastructure using Kubernetes and serverless technologies, specifically for AI and LLM workloads. The engineer will improve system performance, develop resource management systems, and enhance observability for these demanding applications.

What you'd actually do

  1. Design and evolve the architecture of large-scale Kubernetes-based infrastructure platforms to ensure performance, scalability, and resilience for diverse workloads, including microservices, big data, and AI/LLM applications.
  2. Improve K8s system performance across the control and data planes, including optimizing pod lifecycle, resource orchestration, and system-level throughput under high load.
  3. Build robust observability and performance analysis frameworks, define K8s system-level SLOs, and lead data-driven tuning and optimization initiatives in production.
  4. Develop intelligent, unified resource management and scheduling systems (at both the node and cluster levels) to support a wide range of compute resources in large-scale, cloud-native environments.
  5. Drive the standardization and optimization of container runtime environments to enhance workload isolation, reliability, and resource efficiency across heterogeneous compute environments.

Skills

Required

  • Kubernetes
  • Docker
  • Python
  • Go
  • C++
  • Rust
  • Java
  • Unix/Linux environments
  • distributed and parallel systems
  • high-performance networking systems
  • developing large-scale software systems

Nice to have

  • big data workflows
  • machine learning workflows
  • cloud-native open-source projects
  • Prometheus
  • Grafana
  • distributed tracing systems

What the JD emphasized

  • large-scale
  • AI and LLM workloads
  • optimize our infrastructure for AI & LLM models
  • scale and optimize our infrastructure globally