PC Troubleshooting Guide: Diagnose and Fix Hardware Problems

PC troubleshooting starts with a systematic approach, not guesswork. Whether your machine won't boot, crashes under load, runs hot, or loses data without warning, every hardware problem follows a pattern — and every pattern has a diagnostic path. This guide covers the full spectrum of hardware failures by symptom and by component, gives you the first-line checks that professionals use, and links to every in-depth guide in our complete PC troubleshooting series. After 8 years of repairing Windows machines in Copenhagen — from dusty office desktops to overworked creative workstations — the single biggest mistake I see is people treating symptoms without identifying the root cause.
The Hardware Troubleshooting Mindset
PC hardware troubleshooting is the systematic process of identifying and resolving physical component failures that cause computer problems.
The word "systematic" is doing the heavy lifting in that definition. Random fixes — swapping parts, reinstalling Windows, blowing dust — sometimes work by accident. They almost never teach you what actually failed or why. A systematic approach narrows from the system level down to the component level, ruling out causes one by one until only one explanation remains.
The first decision in any diagnosis is hardware vs. software. A machine that crashes during a Windows update is almost certainly a software problem. A machine that crashes during a GPU stress test, with temperatures climbing toward 95°C, is almost certainly hardware. The overlap is narrower than most people assume. Our guide on distinguishing hardware from software problems covers this in detail — it's the right first read for any new problem.
Once you've established it's a hardware problem, work outward from the most likely candidates. Power and thermal issues are responsible for roughly 70% of hardware failures. Storage comes next. RAM after that. The CPU and motherboard are the least likely to fail spontaneously in a machine that was previously working. Expensive component swaps should always come last, after every diagnostic test has been run.
The First 5 Checks for Any PC Problem
Before opening any diagnostic tool or component guide, run these five checks on every problem machine. They catch the majority of issues in under 10 minutes and prevent wasted time on complex diagnostics.
- Check all physical connections. Reseat RAM sticks, GPU, and any expansion cards. Reconnect SATA and power cables at both ends. A loose cable causes symptoms that look exactly like component failure. In a fleet environment, a surprising share of reported "hardware faults" turn out to be connection problems that a simple reseat resolves.
- Read the temperatures. Download HWiNFO64 and watch the sensors panel under any workload that triggers the symptom. A CPU above 90°C under load, a GPU above 85°C, or an NVMe SSD above 70°C can each directly cause throttling, crashes, and shutdowns. You need this data before doing anything else. See our complete hardware monitoring guide for what all the sensor readings mean.
- Listen and look at boot. Does the machine POST? Do you hear beep codes? Is there anything on the display at all? On many BIOSes, one long beep followed by two short beeps means the GPU failed to initialize (the exact meaning depends on the BIOS vendor). That is a very different problem from a machine that powers on with no display and no beeps. A machine that shows a BIOS splash screen but fails to boot Windows is different again.
- Check Windows Event Viewer. Open Event Viewer (eventvwr.msc), navigate to Windows Logs > System, and filter for Critical and Error events in the timeframe around the problem. Hardware errors leave specific signatures here: disk errors appear as Disk event ID 7 or 11, memory errors as WHEA-Logger event ID 18 or 19, power issues as Kernel-Power event ID 41. Our Event Viewer hardware diagnostics guide explains how to read every relevant event.
- Run the simplest possible isolation test. If the machine crashes during gaming but not during browsing, the GPU is the primary suspect. If it crashes during any intensive task, thermal or power issues are more likely. If it crashes randomly during light use, RAM or storage is often the cause. You need to reproduce the symptom deliberately, under controlled conditions, before you can be confident in any diagnosis.
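The temperature check above reduces to comparing readings against three limits. The following is a minimal sketch in Python, assuming you have already copied peak readings out of a monitoring tool by hand. The sensor labels (`cpu`, `gpu`, `nvme`) and the function name are hypothetical, not HWiNFO64 field names.

```python
# Limits from the temperature check above, in degrees Celsius under load.
# The keys are hypothetical labels chosen for this sketch.
LIMITS_C = {"cpu": 90.0, "gpu": 85.0, "nvme": 70.0}


def over_limit(readings):
    """Return the names of sensors whose reading exceeds its limit, sorted."""
    return sorted(
        name
        for name, temp in readings.items()
        if name in LIMITS_C and temp > LIMITS_C[name]
    )


if __name__ == "__main__":
    sample = {"cpu": 93.5, "gpu": 78.0, "nvme": 71.0}
    print(over_limit(sample))  # ['cpu', 'nvme']
```

A reading exactly at the limit is not flagged here, since the guide says "above". Tighten the comparison if you prefer an earlier warning.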
Diagnosing by Symptom
Symptom-based diagnosis is where most people start — they know what the machine is doing wrong, but not why. Use the sections below to identify the likely cause and the right diagnostic path.
PC Won't Boot or Start
A machine that refuses to boot can fail at several stages, and each tells a different story. No power at all — no fans, no LEDs — points to the PSU, the wall outlet, or the power switch header on the motherboard. The PSU is statistically the most likely culprit: power supplies fail at a rate of 3–7% within the first three years, and partial failures often cause intermittent no-power symptoms before complete failure. Read our PSU failure signs guide to learn the full diagnostic process.
A machine that powers on but doesn't reach the BIOS or Windows splash screen is a POST failure. This points to RAM (by far the most common cause), GPU, or a corrupted BIOS. A machine that reaches the BIOS but won't boot Windows is almost always storage — either the drive has failed, the boot sector is corrupted, or Windows itself is damaged.
If you hear beep codes during POST, those codes are direct hardware error messages from the motherboard. Our computer beep codes guide decodes every common BIOS beep sequence for AMI, Award, and Phoenix BIOS variants.
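The boot stages described in this section can be written as a small decision function. This is a sketch of the reasoning only, assuming three yes/no observations; the function and parameter names are hypothetical.

```python
def boot_suspects(powers_on, reaches_bios, boots_windows):
    """Map the furthest boot stage reached to the suspects named in this guide."""
    if not powers_on:
        # No fans, no LEDs: the fault is in the power path.
        return ["PSU", "wall outlet", "power switch header"]
    if not reaches_bios:
        # Powers on but fails POST.
        return ["RAM", "GPU", "corrupted BIOS"]
    if not boots_windows:
        # BIOS is reachable, so the fault is further down the boot chain.
        return ["failed drive", "corrupted boot sector", "damaged Windows install"]
    return []


if __name__ == "__main__":
    print(boot_suspects(powers_on=True, reaches_bios=False, boots_windows=False))
    # ['RAM', 'GPU', 'corrupted BIOS']
```

The order of the checks matters: each stage is only meaningful once the previous one has passed.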
PC Is Slow or Sluggish
Slow performance is the most common PC complaint and the most frequently misdiagnosed. Most people assume it's software — and sometimes it is. But hardware causes are responsible for a significant share of genuine slowdowns: thermal throttling (the CPU or GPU reducing clock speeds to avoid heat damage), a failing HDD, NVMe SSD throttling due to heat, or RAM operating in single-channel mode when it should be dual-channel.
A thermally-throttled machine can lose 30–73% of its processing performance under sustained load, yet look perfectly fine at idle. The symptom is a machine that feels responsive when starting a task but becomes progressively slower as it continues — rendering a video, compiling code, running a batch export. Our guide on the 10 most common reasons a PC is slow covers both hardware and software causes with a clear diagnostic sequence for each.
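One way to put a number on "progressively slower" is to compare clock speeds at the start of a sustained task with clock speeds at the end. The sketch below assumes you have logged CPU or GPU clock samples in MHz at regular intervals. The function name and the three-sample window are illustrative choices, not part of any tool.

```python
from statistics import mean


def sustained_drop_percent(clock_mhz, window=3):
    """Percent drop between the first and last `window` clock samples."""
    if len(clock_mhz) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    start = mean(clock_mhz[:window])
    end = mean(clock_mhz[-window:])
    return round((start - end) / start * 100, 1)


if __name__ == "__main__":
    samples = [4000, 4000, 4000, 3000, 2000, 2000, 2000]
    print(sustained_drop_percent(samples))  # 50.0
```

A drop near zero at idle and a large drop under sustained load is the pattern described above. Pair the result with temperature readings before concluding that heat is the cause.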
PC Crashes or Freezes
Crashes that happen specifically under load — gaming, video rendering, stress testing — are almost always thermal or power-related. The system hits a hardware threshold (temperature or voltage) and either throttles, crashes, or shuts down as a protective measure. A step-by-step guide to diagnosing crashes under load walks through the exact testing sequence, from temperature monitoring to PSU load testing.
Blue Screen of Death crashes with specific stop codes narrow the cause dramatically. MEMORY_MANAGEMENT and PAGE_FAULT_IN_NONPAGED_AREA point to RAM. CRITICAL_PROCESS_DIED and KERNEL_DATA_INPAGE_ERROR point to storage. WHEA_UNCORRECTABLE_ERROR can indicate CPU, RAM, or motherboard hardware errors. Our BSOD hardware causes guide decodes the most common stop codes and explains the hardware failure behind each.
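The stop-code associations above fit naturally into a lookup table. This sketch covers only the five codes named in this guide; the dictionary and function names are hypothetical, and an empty result means "not covered here", not "not hardware".

```python
# Stop codes and suspects exactly as listed in this guide.
STOP_CODE_SUSPECTS = {
    "MEMORY_MANAGEMENT": ["RAM"],
    "PAGE_FAULT_IN_NONPAGED_AREA": ["RAM"],
    "CRITICAL_PROCESS_DIED": ["storage"],
    "KERNEL_DATA_INPAGE_ERROR": ["storage"],
    "WHEA_UNCORRECTABLE_ERROR": ["CPU", "RAM", "motherboard"],
}


def suspects_for(stop_code):
    """Normalize a stop code as typed by a user and look up its suspects."""
    key = stop_code.strip().upper().replace(" ", "_")
    return STOP_CODE_SUSPECTS.get(key, [])


if __name__ == "__main__":
    print(suspects_for("kernel data inpage error"))  # ['storage']
```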
PC Shuts Down Randomly
Random shutdowns — no blue screen, no warning, just instant power-off — are thermal or PSU problems the vast majority of the time. The machine hits a thermal limit and the hardware protection circuit cuts power before damage occurs. This is actually a sign the protection is working, but it means the underlying thermal problem is serious.
A failing PSU that can no longer deliver stable voltage under load will also cause sudden shutdowns, often without any warning in the logs. The pattern is usually shutdowns that happen more frequently as the PSU ages or as ambient temperature rises. Our guide on why PCs shut down randomly covers the full thermal investigation process, including how to confirm whether it's heat or power at fault.
Screen Problems and Visual Artifacts
Visual artifacts — strange pixels, color corruption, geometry distortion, flickering lines, screen tearing unrelated to frame rate — are almost always GPU failures. The GPU's VRAM or shader processors are degrading, producing incorrect output. Artifacts that only appear under GPU load (gaming, video playback, 3D rendering) strongly confirm hardware failure rather than a display or cable issue.
Artifacts that appear at the BIOS level — before any driver loads — confirm the GPU hardware itself has failed, ruling out driver corruption. Our GPU artifacts guide shows what different artifact types look like, explains what they indicate about which part of the GPU is failing, and covers how to test definitively.
Loud Fans or Abnormal Noise
Fans running at full speed constantly are responding to high temperatures, a failed temperature sensor, or a failed fan that can no longer ramp correctly. The machine can't regulate correctly, so it defaults to maximum airflow as a safety measure. Separately, fan bearings that are mechanically failing produce a grinding or buzzing sound that gets louder over time — this indicates the fan will fail completely within weeks to months.
A clicking sound from a hard drive is an emergency for your data. Repeated clicking usually means the read/write heads are failing to read the platters and resetting, and mechanical failure is imminent. Stop normal use of the drive at once. If it is still readable, copy off critical data first and nothing else; if it is not, power it down and leave recovery to a professional. Our fan diagnosis guide covers both temperature-driven and mechanical fan failures in detail.
Storage and Drive Errors
Slow file access, frequent application hangs, files that become corrupted, and Windows event ID 7/11 disk errors all point to storage failure. SSDs fail differently than HDDs: they tend to develop bad sectors quietly, slow down gradually, or fail suddenly with no prior warning — which is why SMART data monitoring is essential.
SMART (Self-Monitoring, Analysis, and Reporting Technology) is built into every modern drive and reports reallocated sectors, uncorrectable errors, and estimated remaining life. Our guide on reading SMART data to predict SSD failure explains which SMART attributes matter most and what the numbers actually mean.
Blue Screen of Death (BSOD)
The BSOD is Windows acknowledging a hardware or driver failure severe enough that continuing operation would cause data corruption. The stop code is the most important piece of information — it identifies the failure category. Record it exactly as shown, then use our BSOD hardware causes and fixes guide to trace the stop code to its likely hardware source and run the appropriate component test.
Diagnosing by Component
When you already know which component is suspect — or when symptom-based diagnosis has pointed you toward a specific part — the component-level guides go deep on testing methodology.
CPU Problems
CPU failures are rare in machines that were previously working. The CPU is the most thermally managed component in the system: Intel 14th-gen processors have a Tjunction max of 100°C, AMD Ryzen 7000 and 9000 series processors have a maximum of 95°C, and both throttle aggressively before reaching those limits. A CPU that's genuinely failing rather than thermally throttling will typically produce WHEA_UNCORRECTABLE_ERROR BSODs or fail during specific computational tasks.
Thermal paste degradation — which happens over 3–5 years under normal use — causes CPU temperatures to climb steadily over time as the interface between the processor and cooler degrades. Our CPU temperature guide covers normal operating ranges, danger zones, and how to determine whether temperature is the root cause of CPU-related symptoms.
GPU Problems
GPU failures manifest as artifacts, display output failure, crashes during graphically intensive tasks, or overheating. A GPU that overheats has typically lost airflow (fan failure or dust blockage), has degraded thermal paste between the GPU die and heatsink, or is running a sustained workload beyond its thermal design power.
Our GPU overheating signs and prevention guide covers how to identify overheating, how to confirm the symptoms are GPU-related rather than driver-related, and what maintenance actions resolve thermal issues without replacing the card.
RAM Problems
Failing RAM is responsible for a disproportionate share of crash and instability reports. RAM errors can cause BSODs, application crashes, file corruption, boot failures, and even data loss — and the symptoms are often intermittent, making diagnosis frustrating without the right tools.
MemTest86 is the definitive RAM test: it runs outside the operating system, which means it tests the physical hardware rather than the Windows memory management layer. A single error in MemTest86 confirms bad RAM and warrants immediate replacement. Our guide on signs of failing RAM and how to test covers the full diagnostic sequence, including how to identify which specific stick is faulty in a multi-stick configuration.
Storage (SSD and HDD) Problems
Storage diagnostic methodology differs between SSDs and HDDs. For HDDs, listen for clicking or grinding sounds, run CHKDSK /r from an elevated command prompt, and check SMART data with CrystalDiskInfo. Reallocated sectors above zero and pending sectors above zero both indicate physical damage and imminent failure.
For SSDs, SMART data is still the primary diagnostic tool, but the key attributes differ: look at Reallocated Sectors Count, Uncorrectable Sector Count, and remaining life. NVMe drives report remaining life inversely, as the percentage of rated write endurance already consumed, so a higher number means less life left. Our SMART data and SSD failure prediction guide explains each attribute and when replacement is urgent vs. advisory.
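The HDD rule in this section (any reallocated or pending sectors mean physical damage) is simple enough to express directly. This is a sketch that assumes you have typed the two raw counts from CrystalDiskInfo into a dictionary. The key names and verdict strings are hypothetical.

```python
def hdd_verdict(smart):
    """Apply the guide's HDD rule to two raw SMART counts."""
    reallocated = smart.get("reallocated_sectors", 0)
    pending = smart.get("pending_sectors", 0)
    if reallocated > 0 or pending > 0:
        # Either count above zero indicates physical damage.
        return "failing"
    return "no sector damage reported"


if __name__ == "__main__":
    print(hdd_verdict({"reallocated_sectors": 0, "pending_sectors": 8}))  # failing
```

A clean result only means these two counters are zero. It does not rule out other failure modes, which is why the clicking and CHKDSK checks still apply.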
Power Supply Problems
The PSU is responsible for more "mysterious" hardware failures than any other component. A PSU that can no longer maintain stable voltage under load will cause random crashes, BSODs, spontaneous shutdowns, and even damage downstream components — yet the PSU itself often continues to power on the machine in standby. This is why many technicians test everything except the PSU first and miss the actual cause.
High-load PSU failure symptoms include crashes that happen under heavy GPU load (when power draw spikes), shutdowns that correlate with summer temperature increases (failing PSUs become less efficient as temperatures rise), and increasing frequency of instability over weeks. Our PSU failure signs guide covers the full testing methodology, including how to use a PSU tester and what voltage readings indicate marginal vs. failed hardware.
Motherboard Problems
Motherboard diagnosis is complex because the motherboard interfaces with every other component. VRM (Voltage Regulator Module) failures cause CPU throttling and instability that looks identical to a CPU problem — and most monitoring tools don't expose VRM temperatures by default. Our VRM temperature and motherboard overheating guide explains how to check VRM health and what temperatures indicate safe vs. dangerous operation.
Motherboard failures that prevent POST produce beep codes (see above) or specific error codes on boards with a 2-digit POST code display. A board that passes POST but produces instability under specific conditions — particularly memory errors that MemTest86 doesn't reproduce with RAM in different slots — may have faulty memory slots rather than faulty RAM sticks.
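Telling a bad stick from a bad slot comes down to whether the errors follow the stick or stay with the slot. The sketch below assumes you have run MemTest86 on each stick alone in each slot and recorded whether errors appeared. The data layout and function name are hypothetical.

```python
def isolate_memory_fault(results):
    """results maps (stick, slot) -> True if the test reported errors.

    A stick is suspect if it errors in every slot it was tested in.
    A slot is suspect if every stick tested in it errors.
    """
    sticks = {stick for stick, _ in results}
    slots = {slot for _, slot in results}
    bad_sticks = sorted(
        s for s in sticks
        if all(err for (stick, _), err in results.items() if stick == s)
    )
    bad_slots = sorted(
        s for s in slots
        if all(err for (_, slot), err in results.items() if slot == s)
    )
    return {"sticks": bad_sticks, "slots": bad_slots}


if __name__ == "__main__":
    runs = {
        ("A", "1"): False, ("A", "2"): True,
        ("B", "1"): False, ("B", "2"): True,
    }
    print(isolate_memory_fault(runs))  # {'sticks': [], 'slots': ['2']}
```

If every combination fails, both lists fill up and the test cannot separate the two. In that case, test with a known-good stick before drawing a conclusion.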
Diagnostic Tools Every Technician Needs
These tools are free, trusted by professionals, and cover the full hardware stack. Build familiarity with them before you need them — reading unfamiliar output during an active failure investigation wastes time.
| Tool | What It Tests | Key Data to Read |
|---|---|---|
| HWiNFO64 | CPU, GPU, RAM, storage, fans, VRMs, motherboard sensors | All temperatures, voltages, fan speeds in real time |
| CrystalDiskInfo | HDD and SSD health | SMART attributes, reallocated sectors, drive health status |
| MemTest86 | RAM physical hardware | Error count per memory address; any error = bad RAM |
| Prime95 | CPU stability and thermal performance | Temperature under full load; crash = instability |
| FurMark / 3DMark | GPU stability and thermal performance | GPU temperature under sustained full load |
| Windows Event Viewer | System-wide error log | Event ID 41 (power), ID 7/11 (disk), ID 18/19 (memory) |
| CrystalDiskMark | Storage read/write performance | Actual throughput vs. manufacturer spec |
Our Windows Event Viewer hardware diagnostics guide explains how to filter for hardware-specific events, read the error details, and correlate event timestamps with the symptom timeline.
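Once events are exported from Event Viewer, the IDs in the table above can be tallied by category. This sketch assumes a list of (source, event ID) pairs that you have already extracted. The short source names follow this guide; real logs may show longer provider names, so the match is a case-insensitive substring check, which is an assumption to verify on your own systems.

```python
from collections import Counter

# (source fragment, event ID) -> category, as listed in this guide.
SIGNATURES = {
    ("disk", 7): "disk",
    ("disk", 11): "disk",
    ("whea-logger", 18): "memory",
    ("whea-logger", 19): "memory",
    ("kernel-power", 41): "power",
}


def categorize(source, event_id):
    """Return the hardware category for one event, or None if not listed."""
    src = source.lower()
    for (fragment, eid), category in SIGNATURES.items():
        if eid == event_id and fragment in src:
            return category
    return None


def summarize(events):
    """Count listed hardware events per category."""
    counts = Counter()
    for source, event_id in events:
        category = categorize(source, event_id)
        if category:
            counts[category] += 1
    return dict(counts)


if __name__ == "__main__":
    log = [("Disk", 7), ("Microsoft-Windows-Kernel-Power", 41), ("disk", 11)]
    print(summarize(log))  # {'disk': 2, 'power': 1}
```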
For ongoing monitoring across a fleet, where intermittent failures happen when no technician is present, GGFix captures hardware sensor data continuously, 24 hours a day. Intermittent thermal events, voltage fluctuations, and drive errors that appear and resolve between manual spot-checks are logged automatically. A technician doing a monthly site visit cannot observe a machine's thermal behavior at 2 AM on a hot Tuesday in July. Continuous monitoring can. In our telemetry data across hundreds of monitored machines, approximately 23% of workstations over two years old show at least one anomalous reading pattern in the 30 days before a reported hardware failure.
Stress Testing: Confirming and Ruling Out
Stress testing is how professionals confirm a diagnosis rather than guess at it. The principle is simple: reproduce the conditions that trigger the symptom in a controlled, instrumented way, then observe exactly what happens. If a machine crashes during a GPU stress test at 89°C, you have confirmed evidence of a thermal failure — not a suspicion.
Stress tests also verify fixes. Replace the thermal paste, clean the heatsink, and run the same test again. If temperatures drop from 89°C to 72°C and the crash doesn't recur, the fix worked. Without a stress test to confirm, you're releasing a machine back to the user without evidence that the underlying problem is resolved. Our stress testing guide explains how to stress each component safely, what temperatures and behaviors indicate pass vs. fail, and how to structure the test sequence to avoid masking one failure with another.
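The before-and-after comparison described above can be recorded as a simple pass/fail check. In this sketch the function name is hypothetical and the 5°C minimum improvement is an assumed threshold, not a figure from this guide. Adjust it to whatever margin you consider meaningful.

```python
def fix_verified(before_peak_c, after_peak_c, crashed_after, min_drop_c=5.0):
    """True if the repeat test ran without a crash and peak temperature fell.

    min_drop_c is an assumed margin for this sketch.
    """
    if crashed_after:
        return False
    return (before_peak_c - after_peak_c) >= min_drop_c


if __name__ == "__main__":
    # The example from the text: 89°C before, 72°C after, no crash.
    print(fix_verified(89.0, 72.0, crashed_after=False))  # True
```

The comparison is only valid if both runs use the same stress test, duration, and ambient conditions.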
When to Call a Professional
Hardware troubleshooting has a clear DIY/professional threshold. The diagnostic work — reading temperatures, running MemTest86, checking SMART data, reviewing Event Viewer — is well within any competent IT person's capability. The physical repair work is where the threshold matters.
Do it yourself:
- Reseating RAM, GPU, and expansion cards
- Replacing a failed drive (SSD/HDD swap)
- Replacing fans
- Replacing thermal paste on CPU and GPU
- Cleaning dust from heatsinks and case
- Replacing a PSU
Call a professional:
- Motherboard replacement (surface-mount components, BIOS flashing, physical damage to traces)
- Laptop screen repair (high breakage risk, proprietary connectors)
- Data recovery from a drive that has begun clicking (requires clean-room tools)
- GPU reball (BGA solder joint failure — requires hot air rework station)
- Any diagnosis that requires component-level electronic testing
After any professional repair — or after completing your own — run a structured validation sequence before returning the machine to service. Assume nothing is fixed until it passes a controlled test under the same conditions that triggered the original failure. Our post-repair hardware validation guide provides the exact checklist used in professional repair workflows.
All PC Troubleshooting Guides: Complete Index
This pillar page is the starting point. Each link below goes deeper on a specific symptom, component, or diagnostic methodology.
| Guide | What It Covers |
|---|---|
| Is It Hardware or Software? | First-step framework for any new problem |
| 10 Reasons Your PC Is Slow | Hardware and software causes of slow performance |
| BSOD: Hardware Causes and Fixes | Stop code interpretation and component testing |
| PC Crashes Under Load: Diagnostic Guide | Thermal and power failure investigation |
| Why Your PC Shuts Down Randomly | Thermal protection and PSU failure diagnosis |
| Fan Running at Full Speed | Fan control failure and bearing diagnosis |
| How to Read SMART Data | SSD and HDD health monitoring and failure prediction |
| RAM Issues: Signs and Testing | MemTest86 methodology and RAM failure patterns |
| PSU Failure Signs | Power supply diagnosis and voltage testing |
| How to Stress Test Your PC Safely | Controlled component testing methodology |
| Windows Event Viewer for Hardware | Reading hardware error events in Windows logs |
| Post-Repair Validation Guide | Confirming hardware fixes actually worked |
| GPU Artifacts: Causes and Fixes | Visual failure modes and GPU diagnosis |
| Computer Beep Codes Explained | POST error interpretation for all major BIOS variants |
Cross-pillar guides that complement this series:
| Guide | Why It's Relevant |
|---|---|
| CPU Temperature: Normal and Dangerous Ranges | Baseline for every thermal diagnosis |
| GPU Overheating: Signs and Prevention | GPU thermal failure patterns |
| VRM Temperature and Motherboard Overheating | Motherboard component thermal failures |
| Complete Hardware Monitoring Guide | Understanding all sensor data in context |
Frequently Asked Questions
Q: How do I diagnose a hardware problem on my PC?
Start with the five universal checks: verify physical connections, read the temperatures using HWiNFO64, listen for POST beep codes, check Windows Event Viewer for hardware error events, and reproduce the symptom under controlled conditions. This sequence identifies the majority of hardware failures without any component replacement. Tools like GGFix that monitor hardware sensors continuously can surface intermittent failures — like a drive that produces occasional read errors overnight — that manual checks miss entirely.
Q: What causes PC crashes and freezes?
Hardware crashes are most commonly caused by thermal throttling or thermal shutdown (CPU or GPU exceeding safe operating temperatures), PSU voltage instability under load, failing RAM, or storage errors. Software crashes — where you see a BSOD with a specific stop code — often have hardware root causes: MEMORY_MANAGEMENT points to RAM, KERNEL_DATA_INPAGE_ERROR points to storage, and WHEA_UNCORRECTABLE_ERROR points to CPU, RAM, or motherboard hardware failures. Temperature monitoring during the crash is the fastest way to separate thermal causes from others.
Q: How can I tell if my problem is hardware or software?
Hardware failures typically produce consistent symptoms under specific workloads — a machine that crashes every time GPU load exceeds 80% is almost certainly a hardware thermal issue. Software failures tend to be more variable, triggered by specific applications or after specific system events like updates. If the problem occurs before Windows loads (at the BIOS level, or during POST), it is definitively hardware. Our guide on distinguishing hardware from software problems gives a systematic decision framework.
Q: What free tools do IT technicians use to diagnose PC hardware?
The core toolkit is: HWiNFO64 (all sensor data — temperatures, voltages, fan speeds), CrystalDiskInfo (drive SMART data and health status), MemTest86 (RAM physical testing, runs outside Windows), Prime95 (CPU stability and thermal testing), and Windows Event Viewer (system error log, built into Windows). These five tools cover the full hardware stack and are sufficient to confirm or rule out failures in CPU, GPU, RAM, storage, and power supply. For fleet environments, continuous monitoring software supplements these spot-check tools by capturing data between manual sessions.
Q: When should I replace a PC instead of repairing it?
Repair is cost-effective when a single component has failed and the rest of the machine is healthy: a failed drive, failing RAM, or a degraded PSU in a 3-year-old machine is worth repairing. Replacement makes more sense when multiple components are failing simultaneously (typically in machines 5–7 years old), when the motherboard requires replacement in a machine with aging components throughout, or when repair costs exceed 50–60% of the replacement cost. Our hardware lifecycle guide provides a structured cost analysis framework for making this decision for individual machines and entire fleets.
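The cost rule in this answer can be sketched as a short function. The names are hypothetical, and the default threshold uses the lower end of the 50–60% range as an assumption. It ignores the motherboard case, which needs a judgment about the rest of the machine.

```python
def repair_or_replace(repair_cost, replacement_cost, failing_components=1,
                      threshold=0.5):
    """Apply the guide's rule of thumb; threshold is a fraction of replacement cost."""
    if replacement_cost <= 0:
        raise ValueError("replacement_cost must be positive")
    if failing_components > 1:
        # Multiple simultaneous failures favor replacement.
        return "replace"
    if repair_cost / replacement_cost > threshold:
        return "replace"
    return "repair"


if __name__ == "__main__":
    print(repair_or_replace(120, 900))  # repair
    print(repair_or_replace(600, 900))  # replace
```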
Q: How do I know if my PC's temperature is causing problems?
Download HWiNFO64 and run the sensors panel while reproducing the workload that triggers symptoms. CPU temperatures consistently above 90°C under load indicate inadequate cooling. GPU temperatures above 85°C under gaming or rendering loads suggest airflow or thermal paste issues. NVMe SSD temperatures above 70°C indicate throttling that will slow storage performance by 30–50%. Any temperature that climbs continuously during a sustained task (rather than reaching a stable plateau) indicates a thermal management failure. Our CPU temperature guide and GPU overheating guide give full temperature range references for current-generation hardware.
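The plateau test in this answer (temperature that keeps climbing versus one that levels off) can be checked against logged samples. This sketch assumes evenly spaced temperature readings in °C. The window size and the 1°C tolerance are illustrative assumptions, and the function name is hypothetical.

```python
def still_climbing(temps_c, window=5, tolerance_c=1.0):
    """True if temperature rose by more than tolerance_c over the last window."""
    if len(temps_c) < window:
        raise ValueError("need at least `window` samples")
    recent = temps_c[-window:]
    return recent[-1] - recent[0] > tolerance_c


if __name__ == "__main__":
    plateau = [60, 70, 75, 76, 76, 76, 76, 76]
    runaway = [60, 65, 70, 75, 80, 85]
    print(still_climbing(plateau))  # False
    print(still_climbing(runaway))  # True
```

Run the workload long enough for a healthy cooler to reach steady state before applying the check, otherwise the normal warm-up ramp will look like a failure.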
GGFix Technical Team
Writing about hardware monitoring, fleet management, and keeping machines alive. Powered by GGFix.