Staff AI/ML Validation Engineer at AMD - QATestingJobs.com
Staff AI/ML Validation Engineer
AMD • Hyderabad, Telangana, India
Onsite • Full-time
Posted Jan 29, 2026 • Apply by Jan 29, 2027
Role & seniority: Staff-level MTS Software System Design Engineer focused on GPU compute/AI validation, debug & performance (lead technical authority in validation and performance initiatives).
Stack/tools: GPU architecture and parallel compute models; GPU drivers/runtimes; Linux & Windows; Python, Groovy, GitHub; CI/CD; GPU profiling/debug tools; hardware debug (JTAG, crash/log analysis); AMD/architecture collaboration; dashboards for data-driven decisions.
Top 3 responsibilities
Own end-to-end validation strategy for GPU compute and AI workloads (HPC, ML, DL) and ensure feature readiness.
Lead post-silicon validation, silicon bring-up, advanced debug, and root-cause analysis across HW/FW/drivers/runtimes/OS.
Lead performance characterization/optimization, identify bottlenecks, drive workload-aware improvements, and validate performance-per-watt and scalability; architect automation and integrate tests into CI/CD.
Must-have skills
8+ years in GPU compute/AI validation, debug, or performance
Deep GPU architecture knowledge and parallel compute models
Experience with AI/HPC workloads; drivers/runtimes; Linux and Windows
Hands-on GPU profiling/debugging; Python, Groovy; CI/CD; test development
Strong technical leadership and communication; mentoring/design reviews
Nice-to-haves
ROCm or similar compute stacks; compiler/runtime optimizations for AI workloads
WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.
MTS SOFTWARE SYSTEM DESIGN ENGINEER
THE ROLE
We are looking for a Staff-level GPU Compute / AI Validation, Debug & Performance Engineer to lead validation, deep-debug, and performance optimization for next-generation GPU compute and AI platforms. This role requires strong expertise in GPU architecture, parallel computing, and AI workloads, along with the ability to drive cross-functional technical initiatives in a global MNC environment.
The ideal candidate will own complex validation areas, act as a technical authority for GPU compute/AI debug and performance, and influence architecture and design decisions through data-driven insights.
KEY RESPONSIBILITIES
GPU Compute / AI Validation Leadership
Own end-to-end validation strategy for GPU compute and AI workloads (HPC, ML, DL).
Define validation scope, coverage, and success metrics for compute pipelines.
Lead post-silicon validation, silicon bring-up, and feature readiness for GPU compute.
Ensure functional correctness across drivers, firmware, runtime, and frameworks.
Advanced Debug & Root Cause Analysis
Act as debug lead for complex GPU compute/AI issues spanning HW, FW, drivers, runtimes, and OS.
Analyze failures using GPU traces, register dumps, crash dumps, JTAG, logs, WinDbg, counters, and AMD's various profiler and debugger tools.
Work directly with architecture, RTL, and design teams to influence fixes and mitigations.
Performance Analysis & Optimization
Lead performance characterization and optimization for AI and compute workloads.
Identify bottlenecks across compute units, memory bandwidth, cache, interconnect, and power.
Drive workload-aware optimizations for training and inference use cases.
Validate performance-per-watt and scalability against product and architectural goals.
Automation, Tools & Infrastructure
Architect and drive automation frameworks for compute/AI validation and performance.
Develop tooling using Python to improve efficiency and coverage.
Integrate tests into CI/CD pipelines and regression systems.
Enable data-driven decision making through dashboards and performance tracking.
Technical Leadership & Cross-Functional Influence
Drive cross-team alignment with architecture, RTL, firmware, driver, compiler, and AI software teams.
Influence architectural decisions through early validation and performance feedback.
Represent the team in global technical forums and design reviews.
REQUIRED QUALIFICATIONS
Technical Expertise
8+ years of experience in GPU compute / AI validation, debug, or performance
Deep understanding of GPU architecture and parallel compute models
Strong experience with AI/ML and HPC workloads
Expertise in GPU drivers, runtimes, and system software (Linux and Windows)
Hands-on experience with GPU profiling and debug tools
Proficiency in Python, Groovy, GitHub, Linux, Windows, CI/CD, test development, and performance analysis
Leadership & Soft Skills
Proven technical leadership at Senior/Staff level
Ability to lead ambiguous, high-impact problem areas
Strong communication skills
Mentoring and design-review experience
PREFERRED EXPERIENCE
Product development or systems engineering background with hardware platforms and their software & firmware ecosystems
Excellent verbal, written, and presentation skills
Excellent interpersonal, organizational, analytical, planning, and technical leadership skills
Proven record of accomplishment in delivering large multi-functional product solutions
Experience working in a fast-paced matrixed technical organization and multi-site environment
Experience with ROCm, or similar compute stacks
Experience with compiler or runtime optimizations for AI workloads
Knowledge of power, thermal, and reliability (RAS) aspects of GPUs
Prior experience in leading GPU or AI accelerator products
ACADEMIC CREDENTIALS
Bachelor’s or Master’s degree in Computer or Electrical Engineering, or equivalent
Benefits offered are described in AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD’s “Responsible AI Policy” is available here.
This posting is for an existing vacancy.