Microsoft • Atlanta, Georgia, United States
Salary: $119,800 - $234,700 / year
Role & seniority: Senior Hardware Quality Engineer (Reliability Engineering IC4 level)
Stack/tools: Large-scale AI/GPU data-center hardware; GPU/CPU/AI hardware, power, storage, platform subsystems; telemetry analysis, diagnostics workflow design, out-of-band telemetry (BMC SEL, PCIe health), diagnostics tooling, FMEA, quality playbooks; coordination with data center ops, firmware, diagnostics, vendors; CPLD firmware/power quality issues.
Lead deep-dive investigations into complex hardware failures (no-boot, GPU throttling, power delivery, BMC issues) and perform root-cause analysis.
Validate remediation via single-node and rack-level testing before fleet re-entry; own fleet-level quality assessments for AI/GPU deployments.
Analyze large telemetry/logs to identify systemic failure patterns; design lightweight diagnostics/telemetry workflows to enable rapid validation and reduce rework; drive preventative quality controls and left-shifting detection.
Advanced degree with requisite practical engineering experience (varies by degree: PhD 2+, MS 4+, BS 5+, or 12+ years total).
Strong experience with hardware reliability, root-cause analysis, and debugging of server architectures (GPU/AI hardware, memory, subsystems).
Proficiency in telemetry analysis, diagnostics workflow design, FMEA, and cross-functional collaboration (firmware, diagnostics, operations, vendors).
Ability to operate with data cen
Overview
Microsoft Silicon, Cloud Hardware, and Infrastructure Engineering (SCHIE) is the team behind Microsoft’s expanding Cloud Infrastructure and is responsible for powering Microsoft’s “Intelligent Cloud” mission. SCHIE delivers the core infrastructure and foundational technologies for Microsoft's over 200 online businesses, including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive, and the Microsoft Azure platform, globally with our server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide, and we are looking for a passionate Senior HW Quality Engineer to help achieve that mission.
As Microsoft's cloud business continues to grow, the ability to deploy new offerings and hardware infrastructure on time, in high volume, with high quality and lowest cost is of paramount importance. To achieve this goal, the Hardware, Infrastructure Management, and Fundamentals Engineering (HIFE) team is instrumental in defining and delivering operational measures of success for hardware manufacturing and in improving the planning process, quality, delivery, scale, and sustainability related to Microsoft cloud hardware.
Responsibilities
This role is responsible for end-to-end hardware quality engineering for large-scale AI and GPU-based data center fleets, with a focus on early failure detection, root cause analysis, and operational quality enablement. The position partners closely with data center operations, firmware, platform engineering, diagnostics, and vendor teams to improve hardware reliability, reduce repeat incidents, and accelerate mean-time-to-resolution (MTTR) across production AI infrastructure. The scope and depth of this role reflect the workload and technical responsibilities of a Senior HW Quality Engineer working in the data center, including debugging, telemetry analysis, diagnostics workflow design, and quality process ownership across GPU, power, storage, and platform subsystems.
Lead deep-dive investigations into complex hardware failures across AI/GPU platforms, including no-boot, GPU throttling, power delivery issues, and BMC-related issues.
Perform single-node and rack-level validation to confirm hardware remediation effectiveness before fleet re-entry.
Analyze large volumes of hardware telemetry, logs, and diagnostics data to identify systemic failure patterns and repeat offenders.
Define and drive lightweight diagnostics and telemetry workflows that allow technicians and operations teams to validate repairs before ticket closure, reducing repeat failures and rework.
Partner with diagnostics and platform teams to enable out-of-band telemetry collection (e.g., BMC SEL, PCIe health indicators) and integrate diagnostics into existing operational tools and workflows.
Own fleet-level quality assessments for AI and GPU deployments, including early deployment phases and ongoing production monitoring.
Drive improvements to failure detection latency, root cause attribution accuracy, and preventative quality controls (left-shifting detection into earlier lifecycle stages).
Sync with firmware and electrical engineering teams on corrective actions (e.g., CPLD firmware changes, power quality investigations).
Collaborate with supply chain and spares teams to ensure hardware availability aligns with failure remediation needs.
Work with data center operations leadership to ensure solutions scale globally and align with operational SLAs.
Produce and maintain technical documentation, failure mode analyses (FMEA), and quality playbooks for sustained operational excellence.
Qualifications
Reliability Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $158,400 - $258,000 per year.
https://careers.microsoft.com/us/en/us-corporate-pay
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.