Senior Production System Engineer - New York City

ByteDance · Big Tech · New York, NY · Infrastructure

This role focuses on the engineering and operations of large-scale data center infrastructure and server fleets, including deployment, automation, monitoring, and lifecycle management. It is not directly involved in building or researching AI/ML models.

What you'd actually do

  1. Operation: As a Production Systems Engineer, you will improve the stability, efficiency, effectiveness, and scalability of our data center and server operations, platforms, and services worldwide.
  2. Lifecycle Enhancement: Participate in and enhance the entire lifecycle of the server fleet - from system design/introduction consultation to launch reviews, deployment, operation, and retirement.
  3. Automation: Develop and deploy tools and solutions to enhance the automation, reliability, scalability, and operability of servers in the data center.
  4. Monitoring: Develop and deploy tools and solutions to monitor and improve the availability, latency, and overall health of data center infrastructure, servers, and networks.
  5. Disaster Recovery: Troubleshoot and resolve complex technical issues in a fast-paced environment. Conduct high-level root-cause analysis of service interruptions and establish preventive measures. Practice sustainable incident response and postmortems.

Skills

Required

  • Linux system administration
  • Linux kernels, drivers, and modules
  • Bash scripting
  • Python scripting
  • system configuration
  • performance tuning
  • security management
  • server hardware troubleshooting
  • large-scale data center operations
  • customizing operation and maintenance tools
  • monitoring server performance
  • resource provisioning
  • fault management
  • repairs
  • developing and maintaining hardware, network, or service monitoring software for more than 10,000 servers
  • managing and coordinating teams in a global context

Nice to have

  • GPU server operation and maintenance
  • full stack software development
  • RESTful APIs
  • Flask
  • JavaScript
  • Node.js
  • SQL
  • database schema design
  • database queries
  • data integrity
  • Redis
  • Ansible configuration management
  • application deployment
  • task execution