Manager, Solutions Architecture - Continuous Bringup and Optimization

NVIDIA · Semiconductors · Tokyo, Japan +1 · Remote

Manager of Solutions Architecture who leads a team that consults on, optimizes, and improves the resiliency of customer AI factory infrastructure, including GPU-accelerated systems and AI workloads. The role involves hands-on infrastructure analysis and tuning, and establishing optimization and monitoring methodologies for large-scale AI/HPC systems.

What you'd actually do

  1. Lead a team dedicated to consulting on, optimizing, and improving the resiliency of customer AI factory infrastructures, ensuring high service quality and operational excellence.
  2. Drive hands-on infrastructure analysis and tuning of complex GPU-accelerated systems, AI workloads, and datacenter environments, identifying areas for efficiency gains and operational improvements.
  3. Work closely with internal teams (Engineering, Product, Sales) and customer collaborators to align infrastructure strategies with business goals, enabling smooth, scalable AI deployments.
  4. Act as a technical authority on NVIDIA GPU, CPU, and networking technologies, supporting customer discussions and architecture reviews.
  5. Establish and evolve optimization and monitoring methodologies, using analytics and tooling to detect bottlenecks, reduce downtime, and ensure system health at scale.

Skills

Required

  • Over 4 years leading teams and 8+ years overall in service operations in large data centers, with a focus on infrastructure performance.
  • Bachelor’s, Master’s, or PhD in Computer Science, Engineering, or a related field, with demonstrated technical leadership in data center, server, and network operations.
  • Proficiency in both Japanese and English, demonstrating clear communication of technical topics across multicultural teams and with customers.
  • Deep expertise in data center architecture and operations, including servers, GPUs, NICs, networking topologies, storage systems, and Linux-based environments.
  • Strong analytical, problem-solving, and decision-making skills, with the ability to identify root causes, drive continuous improvement, and deliver resilient technical solutions.
  • Strong communication, time management, and organizational skills, along with experience leading complex projects, guiding technical teams, and meeting key metrics.

Nice to have

  • Deep familiarity with AI infrastructure and workflows, including training/inference pipelines, MLOps/DevOps tools, containerization (Docker, Kubernetes), and large-scale system deployments.
  • Knowledge of data center infrastructure operations, including safety, security, environmental controls, and standard operating procedures.
  • Strong interpersonal and collaboration skills, with the ability to lead discussions, influence outcomes, and build positive relationships with both internal and external collaborators.

What the JD emphasized

  • AI factory infrastructures
  • GPU-accelerated systems
  • AI workloads
  • large-scale projects
  • AI deployments
  • customer-facing engagements
