Server Lab Engineer, MLIL

Amazon · Big Tech · IL, Tel Aviv · Operations, IT, & Support Engineering

The Server Lab Engineer will own and operate the labs that power the bring-up and validation of next-generation ML training and inference racks. This role involves building, maintaining, and evolving lab infrastructure for HW, FW, and SW engineers, and delivering working, instrumented setups for R&D teams. Responsibilities include managing the physical lab, configuring hardware, administering Linux systems, managing network services, running sanity tests, writing automation scripts, and procuring/managing lab equipment.

What you'd actually do

  1. Own the MLIL hardware lab in the Tel Aviv office: physical layout, power and cooling budget, network topology, cabling, asset tracking, and day-to-day operations.
  2. Build, configure, and connect new lab setups for HW, FW, and SW engineers (including servers, GPU sleds, PCIe switches, retimers, NICs, and DRAM modules) and deliver them ready for R&D use.
  3. Administer and maintain Linux-based servers and systems, including installation, configuration, and optimization.
  4. Manage and configure network services such as DHCP, PXE, and other critical infrastructure components.
  5. Run sanity tests on every delivered setup — boot, PCIe enumeration, basic DRAM check, network reachability — so R&D teams pick up a known-good baseline and can focus on their work.
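
For illustration only, the kind of sanity check described in item 5 could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the team's actual tooling: the function names, the "Ethernet controller" keyword, and the thresholds are hypothetical, and it assumes a Linux host where `lspci` output and `/proc/meminfo` are available and `ping` is the iputils version. A successful run also implies the boot check, since the script executes on the booted host.

```python
import re
import subprocess


def count_pcie_devices(lspci_output: str, keyword: str) -> int:
    """Count lspci lines whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return sum(1 for line in lspci_output.splitlines() if needle in line.lower())


def mem_total_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo text, converted from kB to GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found in meminfo text")
    return int(match.group(1)) / (1024 ** 2)


def is_reachable(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request; assumes Linux iputils ping (-c, -W)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_setup(lspci_output: str, meminfo_text: str,
                expected_nics: int, min_mem_gib: float,
                nic_keyword: str = "Ethernet controller") -> dict:
    """Compare what the host reports against the expected baseline."""
    return {
        "pcie_enumeration": count_pcie_devices(lspci_output, nic_keyword) >= expected_nics,
        "dram_size": mem_total_gib(meminfo_text) >= min_mem_gib,
    }


def collect_and_check(gateway: str, expected_nics: int, min_mem_gib: float) -> dict:
    """Gather live data on the host under test and run all checks."""
    lspci_output = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    ).stdout
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    results = check_setup(lspci_output, meminfo_text, expected_nics, min_mem_gib)
    results["network_reachability"] = is_reachable(gateway)
    return results
```

The parsing functions take plain text rather than calling the tools directly, so they can be exercised against captured output without access to the hardware. Only `collect_and_check` touches the live system.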

Skills

Required

  • 3+ years of experience as a system administrator, lab engineer, or in a similar role.
  • Knowledge of Linux operating systems and server administration.
  • Solid understanding of networking fundamentals — Ethernet, TCP/IP, link-layer debug, switch / NIC configuration.

Nice to have

  • Proven hands-on experience with lab instrumentation: scopes, logic analyzers, protocol analyzers, bench PSUs, JTAG / BMC debug.
  • B.Sc in Electrical / Electronics / Computer Engineering, or a Practical Engineer diploma (הנדסאי) with hands-on experience.
  • Solid understanding of PCIe — enumeration, link training, lane configuration, error reporting (AER), and common debug flows.
  • Experience with BMC / BIOS / UEFI debug, IPMI, Redfish.
  • Experience with high-speed serial debug — SerDes, equalization, eye diagrams, BER testing.
  • Proficient in Python / Bash automation and willing to write production-grade lab tooling.