Sr. Staff/Principal Validation Engineer - AI/ML

AMD · Semiconductors · Hyderabad, India · Engineering

This role focuses on validating AI/ML workloads on GPU infrastructure and leading the adoption of Agentic AI to transform test strategy, planning, and solution development. The engineer will architect and implement AI agents to automate critical workflows across software development, testing, and release management, establishing best practices for applying AI across engineering workflows.

What you'd actually do

  1. Develop novel approaches for validating AI/ML workloads on GPU infrastructure.
  2. Lead the adoption of Agentic AI to transform test strategy, planning, and test solution development, improving validation effectiveness and release confidence.
  3. Drive AI-assisted diagnostics and resolution of complex DevOps challenges across CI/CD, build, deployment, and engineering infrastructure ecosystems.
  4. Architect and implement AI agents that automate critical workflows across software development, testing, and release management, delivering measurable gains in productivity and operational efficiency.
  5. Establish best practices, governance, and scalable patterns for applying AI across engineering workflows while ensuring solution quality, reliability, and maintainability.

Skills

Required

  • Design, develop, and execute comprehensive test plans, test strategies, and test cases for complex system-level features.
  • Define the multi-year vision and roadmap for system software testing infrastructure and methodologies.
  • Perform functional, integration, and system-level testing across different Linux distributions (e.g., Ubuntu, RHEL, SLES).
  • Analyze and debug issues across the software stack using strong knowledge of Linux internals, system services, kernel-level behavior, performance tools, and logs.
  • Ensure test coverage for GPU product features.
  • Review requirements and create associated test cases to ensure traceability.
  • Develop and maintain automated tests using gtest, ctest, and other relevant test frameworks.
  • Collaborate with cross-functional teams to ensure testability and influence design decisions to improve product quality.
  • Write clean, maintainable C/C++ and Python code for test automation, validation tools, and testing infrastructure.
  • Drive continuous improvements in test processes, tooling, and coverage.
  • Investigate failures, perform root-cause analysis, and provide detailed debug information to development teams.
  • Mentor junior engineers and contribute to building a high-quality engineering culture.
  • Must be a self-starter who can independently drive tasks to completion.
  • Proven experience in validating complex systems with a focus on performance, scalability, and reliability across integrated hardware–software ecosystems.
  • Demonstrated ability to leverage Agentic AI to define test strategies, create detailed test plans, and develop effective test solutions that improve quality and release readiness.
  • Proven ability to apply AI-assisted problem solving to diagnose and resolve DevOps-related issues across CI/CD pipelines, build systems, deployment processes, and engineering environments.
  • Strong ability to design and build AI agents that automate workflows, reduce manual effort, and improve productivity across software engineering, testing, and release operations.

Nice to have

  • General Computer Architecture concepts
  • Windows and Linux Operating Systems
  • Cloud, Virtualization and Container environments
  • System level, functional and environmental stress testing
  • Deep Learning, High Performance Computing, or GPU server-based computing a big plus.
  • Knowledge of CUDA GPU Computing Languages a plus.
  • Parallel Computing Skills with MPI Programming experience a plus.
  • Proven record in large-scale data center engineering.
  • Experience with CI/CD tools (Jenkins, GitHub Actions, GitLab CI).
  • Knowledge of container technologies (Docker, Podman, Kubernetes).
  • Experience with performance testing or hardware–software systems.
  • Excellent interpersonal, organizational, analytical, planning, and technical leadership skills

What the JD emphasized

  • Agentic AI
  • AI agents
  • AI-assisted diagnostics
  • validating AI/ML workloads

Other signals

  • AI/ML workloads on GPU infrastructure
  • Agentic AI to transform test strategy
  • AI-assisted diagnostics
  • Architect and implement AI agents