Systems Integration Engineer

Agility Robotics · Robotics · Fremont, CA · Hardware

Seeking a Systems Integration Engineer specializing in Software Issue Triage and Root Cause Analysis (RCA) for Agility Robotics' deployed humanoid robots. The role involves remote triage using logs, telemetry, and video to identify software failures, conducting deep-dive RCA on hardware-software interface issues, and architecting diagnostic tools. Responsibilities include classifying failures, dispositioning issues to SW teams, leading investigations, developing automated diagnostic scripts, and creating technical documentation to improve software reliability across the fleet.

What you'd actually do

  1. Serve as a lead voice in the triage process, providing the expertise required to classify complex failures specifically as software, firmware, or system-level regressions.
  2. Effectively disposition identified issues to the software organization, providing clean tickets (logs, video clips, and analysis) that allow developers to act quickly.
  3. Lead end-to-end investigations into novel failures using deep-dive log review, telemetry analysis, and video diagnostics to pinpoint bugs at the software/hardware interface or unexpected system behaviors.
  4. Develop and run scripts and data visualization tools to parse massive log sets and identify intermittent failure trends.
  5. Author and maintain "Gold Standard" RCA reports and troubleshooting guides that improve the technical autonomy of the broader triage team.

Skills

Required

  • 4+ years of experience in Systems Integration, Software-Hardware interface, or R&D with a focus on software for complex mechatronic or autonomous systems.
  • Proven experience using monitoring and observability platforms (e.g., Datadog, Splunk, or New Relic) to track system health and identify performance anomalies across a fleet.
  • Experience interacting with cloud-based storage and databases (e.g., AWS S3, SQL, or NoSQL) to retrieve and manage large-scale telemetry and video datasets.
  • Proven track record of navigating highly ambiguous software-hardware intersections to find definitive root causes.
  • Experience creating technical documentation or bug reports intended for software engineering audiences.
  • Mastery of log parsing via CLI and proficiency in using Python or similar scripting languages for data visualization and failure trend analysis.
  • Familiarity with database environments, specifically regarding data retrieval and log management.
  • Experience correlating video and/or HW symptoms with system telemetry to identify physical manifestations of software bugs.
  • Strong understanding of software stacks in robotics, including communication protocols (e.g., EtherCAT, CAN) and how they manifest in system logs.
  • Ability to tackle ambiguous, unprecedented problems and create reusable, scalable solutions.
  • Capacity to operate independently on initiatives and proactively anticipate what effective, efficient triage and RCA require.
  • Exceptional ability to synthesize complex telemetry and video data into clear, actionable insights for software engineering stakeholders.
  • Bachelor’s or Master’s degree in Computer Science, Robotics, Electrical Engineering, or a related field.

Nice to have

  • Experience with HW/SW integration and design on hardware-in-the-loop (HiL) setups.
  • Experience characterizing or troubleshooting HW/SW interactions involving cameras, encoders, IMUs, or other sensors.

What the JD emphasized

  • software root causes
  • novel failures
  • ambiguous failure modes
  • software reliability
  • software-hardware interface
  • ambiguous software-hardware intersections
  • unprecedented problems