Senior Production System Engineer - San Jose

ByteDance · Big Tech · San Jose, CA · Infrastructure

This role focuses on the engineering and operations of large-scale data center infrastructure, including server lifecycle management, automation, monitoring, and disaster recovery. It is a core engineering role within ByteDance's Data Systems Infrastructure team, responsible for the reliability and scalability of the company's global data centers.

What you'd actually do

  1. Operation: As a Production Systems Engineer, your mission is to improve the stability, efficiency, effectiveness, and scalability of our data center and server operations, platforms, and services worldwide.
  2. Lifecycle Enhancement: Participate in and enhance the entire lifecycle of the server fleet - from system design/introduction consultation to launch reviews, deployment, operation, and retirement.
  3. Automation: Develop and deploy tools and solutions to enhance the automation, reliability, scalability, and operability of servers in the data center.
  4. Monitoring: Develop and deploy tools and solutions that improve the availability, latency, and overall health of data center infrastructure, servers, and networks.
  5. Disaster Recovery: Troubleshoot and resolve complex technical issues in a fast-paced environment. Conduct high-level root-cause analysis of service interruptions and establish preventive measures. Practice sustainable incident response and postmortems.

Skills

Required

  • Linux system administration
  • Linux kernels, drivers, and modules
  • Bash scripting
  • Python scripting
  • system configuration
  • performance tuning
  • security management
  • server hardware troubleshooting
  • large-scale data center operations
  • customizing operation and maintenance tools
  • managing software tool lifecycle
  • server performance monitoring
  • resource provisioning
  • fault management
  • repairs
  • managing and coordinating global teams

Nice to have

  • GPU server operation and maintenance
  • full stack software development
  • RESTful APIs
  • Flask
  • JavaScript
  • Node.js
  • SQL
  • database schema design
  • database querying
  • data integrity
  • Redis
  • Ansible configuration management
  • application deployment
  • task execution