Principal Software Engineer - DPU Integrations

Microsoft · Big Tech · Santa Clara, CA +1 · Software Engineering

Principal Software Engineer focused on designing, building, testing, and deploying automation tools and AI agents for DPU control and data plane software in Azure hardware infrastructure. The role involves technical leadership, debugging field issues, maintaining dashboards, and collaborating cross-functionally to improve product quality and reduce mitigation time for production issues.

What you'd actually do

  1. Design, build, test and deploy innovative integration and diagnostics tools, and AI agents to release quality products and reduce time to mitigate production issues.
  2. Provide technical leadership to teams in scoping the testing and creating a quality plan for DPU-based compute products. In partnership with key stakeholders, create and manage project schedules.
  3. Lead the team by providing technical expertise and oversight, and monitor test plan execution and quality to ensure that testing is efficient and follows the plan.
  4. Act as a Designated Responsible Individual (DRI) and guide other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate.

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

Nice to have

  • 10+ years of experience developing, testing, diagnosing, and troubleshooting networking, storage, or compute cloud platforms as a lead engineer owning releases and mentoring/guiding a team of engineers.
  • Experience with Azure or similar large scale cloud computing infrastructure, control plane, telemetry, monitoring, diagnostics, reporting
  • Experience developing and/or testing embedded software for NICs and/or DPUs/IPUs.
  • Understanding of and hands-on experience with networking (TCP/IP, RoCEv2, routing/switching), Software Defined Networking, and server platform firmware (BMC, BIOS, etc.) testing.
  • Experience with complex debug/troubleshooting in both lab and live site situations.
  • Experience with large-scale data analysis to identify themes and root causes of issues
  • Experience using AI agents for live site tool automation and analysis

What the JD emphasized

  • AI agents
  • debug field issues
  • live site automation
  • automation tools
  • diagnostics tools
