Senior Production System Engineer - Ashburn

ByteDance · Big Tech · Ashburn, VA · Infrastructure

This role focuses on the engineering and operations of large-scale data center infrastructure and server fleets, including deployment, monitoring, automation, and lifecycle management. It involves ensuring the stability, efficiency, and scalability of production systems.

What you'd actually do

  1. Operation: As a Production Systems Engineer, you will help improve the stability, efficiency, effectiveness, and scalability of our data center and server operations, platforms, and services worldwide.
  2. Lifecycle Enhancement: Participate in and enhance the entire lifecycle of the server fleet - from system design/introduction consultation to launch reviews, deployment, operation, and retirement.
  3. Automation: Develop and deploy tools and solutions to enhance the automation, reliability, scalability, and operability of servers in the data center.
  4. Monitoring: Develop and deploy tools and solutions that improve the availability, latency, and overall health of data center infrastructure, servers, and networks.
  5. Disaster Recovery: Troubleshoot and resolve complex technical issues in a fast-paced environment. Conduct high-level root-cause analysis of service interruptions and establish preventive measures. Practice sustainable incident response and postmortems.

Skills

Required

  • Linux system administration
  • Linux kernels, drivers, and modules
  • Bash scripting
  • Python scripting
  • system configuration
  • performance tuning
  • security management
  • server hardware troubleshooting/diagnostics
  • large-scale data center operations
  • customizing operation and maintenance tools
  • managing software tool lifecycle
  • monitoring server performance
  • provisioning resources
  • fault management
  • repairs
  • developing and maintaining hardware, network, or service monitoring software for more than 10,000 servers
  • managing and coordinating teams in a global context

Nice to have

  • Full Stack Software Development
  • RESTful APIs
  • Flask
  • JavaScript
  • Node.js
  • SQL
  • Redis
  • Ansible Configuration Management
  • Application Deployment
  • Task Execution
  • GPU server operation and maintenance