Technical Program Manager, MTIA Software

Meta · Big Tech · Menlo Park, CA

This role is for a Technical Program Manager (TPM) who manages software and hardware programs for AI accelerators and large-scale AI clusters. The TPM works with cross-functional teams to enable AI applications and use cases, covering requirements, middleware, software design, and deployment of AI training and inference workloads. Responsibilities include identifying problems, developing solutions, troubleshooting, creating roadmaps, defining milestones, driving execution, and communicating with stakeholders, particularly software development teams in Infrastructure (e.g., PyTorch) and hardware-specific software teams.

What you'd actually do

  1. Collaborate with Engineering and business owners to define program requirements, set priorities, and establish scope, including the roadmap and long-term strategy of the teams you partner with
  2. Align with application and end-customer focused technical teams on software and system requirements and schedule
  3. Create execution strategies and build plans for full-stack software development
  4. Ensure lower-layer components such as libraries, tooling, provisioning software, and the operating system fully enable applications and datacenter operations
  5. Develop and drive a software benchmarking, analysis and optimization strategy for new hardware platforms

Skills

Required

  • B.S. in Computer Science, Electrical Engineering or a related technical discipline, or equivalent experience
  • 12+ years of software engineering, systems engineering, hardware engineering or technical product/program management experience
  • Experience delivering complex tech programs and/or products from inception to delivery
  • Knowledge of user needs, gathering requirements, and defining scope
  • Experience operating autonomously across multiple teams, demonstrated critical thinking, and experience driving technical strategy or direction
  • Communication experience and experience working with technical management teams to develop systems, solutions, and products
  • Organizational, coordination and multi-tasking experience
  • Analytical and problem-solving experience with large-scale systems
  • Experience establishing work relationships across multi-disciplinary teams and multiple partners in different time zones
  • Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
  • Understanding of AI algorithms and techniques
  • Experience working with hardware-specific software optimization
  • Experience working with capacity planning, migration and turn-up
  • Experience with high-performance or AI training clusters
  • Experience with data center architecture and deployment
  • Understanding of graphics or AI accelerator hardware architecture
  • Experience with system analysis and hardware-software co-design
  • Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
  • Understanding of AI software development life cycle
  • Understanding of scientific or technical computing techniques
  • Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies

Nice to have

  • Experience working with the PyTorch framework team

What the JD emphasized

  • AI Accelerator technologies
  • AI training and inference workloads
  • AI hardware platforms
  • large-scale AI clusters
  • AI applications
  • AI use cases
  • AI software applications
  • AI software development life cycle
  • AI algorithms and techniques
  • AI tools
  • AI technologies
  • AI Accelerator hardware architecture
  • AI training clusters
  • responsible, ethical AI practices
  • bias mitigation
  • quality and accuracy reviews
  • prompt/context engineering
  • agent orchestration
