
Hardware or Software Problem? How to Tell the Difference

GGFix Technical Team
9 April 2025 · 15 min read · 109 views

Your hardware is degrading. The question is whether you find out first.

GGFix monitors 50+ sensors per machine, tracks the top 25 processes every minute, decodes every BSOD into plain English, and alerts you in under 10 seconds — before degradation turns into a failure, a repair bill, or lost work.


Hardware or software problem — the same crash can mean either, and the fix is completely different. Replacing RAM when the culprit is a bad driver wastes money. Reinstalling Windows on a failing SSD solves nothing. This guide covers the exact signals that distinguish hardware failure from software failure, the 6-step diagnostic sequence technicians use to isolate root cause, and why 30 days of sensor history can answer the question in 5 minutes.

This post is part of our hardware troubleshooting and diagnostics series — the systematic approach to diagnosing PC failures before replacing parts. If your machine is also running slowly or freezing, our guide to why PCs slow down covers the full spectrum of causes beyond hardware failure.

Why the Same Symptom Points in Two Directions

Hardware failures and software failures produce nearly identical visible symptoms: BSODs, application crashes, system freezes, random restarts. A failing RAM module causes file corruption and application crashes that look exactly like OS bugs. A degrading PSU triggers GPU driver timeouts that are indistinguishable from a bad driver update. A dying SSD produces corrupted Windows installations that survive reinstall after reinstall.

The result is a diagnostic trap: users reinstall Windows on a failing SSD, replace RAM that was working fine, and update GPU drivers repeatedly for crashes caused by voltage instability. In 8 years of hands-on fleet diagnostics, the most common misdiagnosis we see is a machine that crashes intermittently under load, passes a Safe Mode stability check, gets a fresh Windows install — on the same failing SSD that caused the original problem. Three weeks later the symptoms return.

Understanding which signals point to hardware vs. software prevents this loop before it starts.

The Signal Table: Hardware vs. Software at a Glance

This table provides the first-pass framework. One signal alone is not definitive. Three signals pointing in the same direction usually is.

| Signal | Points to Hardware | Points to Software |
|---|---|---|
| Problem happens in Safe Mode | Yes | No |
| Multiple different BSOD stop codes | Yes | No (same code each time) |
| Problem started after a physical change | Yes | No |
| Reproduces on a fresh Windows install | Yes | No |
| Reproduces on a Linux live boot | Yes | No |
| Problem only in specific applications | No | Yes |
| Appeared immediately after update/install | No | Yes |
| Sensor readings outside normal range | Yes | No |
| Problem is temperature-correlated | Yes | No |
| SMART data shows bad sectors | Yes | No |
| WHEA Event ID 18 in Event Viewer | Yes | No |

The gray zone exists: firmware bugs sit between hardware and software. A bad GPU driver can temporarily corrupt hardware state. The table handles the 90% of cases that fall cleanly into one category.
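The "three signals in the same direction" rule can be sketched as a simple tally. This is a minimal illustration, not a diagnostic tool: the signal names, the yes/no voting scheme, and the lead-of-three threshold are assumptions made for the example.

```python
# Which way a "Yes" answer points for each row of the table above.
# The key names are hypothetical labels chosen for this sketch.
YES_POINTS_TO = {
    "crashes_in_safe_mode": "hardware",
    "multiple_stop_codes": "hardware",
    "started_after_physical_change": "hardware",
    "reproduces_on_fresh_install": "hardware",
    "reproduces_on_linux_live_boot": "hardware",
    "only_specific_applications": "software",
    "started_after_update_or_install": "software",
    "sensor_out_of_range": "hardware",
    "temperature_correlated": "hardware",
    "smart_bad_sectors": "hardware",
    "whea_event_18": "hardware",
}

OTHER = {"hardware": "software", "software": "hardware"}


def first_pass(answers, threshold=3):
    """Tally answered signals; answers maps signal name -> True/False.

    A "Yes" votes for the direction in the table, a "No" votes for the
    other direction. Unanswered signals are simply left out.
    """
    votes = {"hardware": 0, "software": 0}
    for name, answer in answers.items():
        direction = YES_POINTS_TO[name]
        votes[direction if answer else OTHER[direction]] += 1

    lead = votes["hardware"] - votes["software"]
    if lead >= threshold:
        return "likely hardware"
    if lead <= -threshold:
        return "likely software"
    return "inconclusive"


print(first_pass({
    "crashes_in_safe_mode": True,
    "multiple_stop_codes": True,
    "whea_event_18": True,
}))
```

Mixed answers fall through to "inconclusive", which is the cue to run the full diagnostic sequence rather than act on the table alone.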

6 Hardware Signals That Are Almost Never Software

1. Problem persists in Safe Mode

Safe Mode boots Windows with only Microsoft-signed, whitelisted kernel drivers and no third-party services. A crash in Safe Mode therefore rules out virtually all third-party software as the cause.

Important nuance: Safe Mode uses the Microsoft Basic Display Adapter (no GPU driver) and runs at reduced power draw. Problems that only appear under GPU load or combined CPU+GPU power draw may not reproduce. Microsoft's Event ID 41 documentation lists an underpowered or failing power supply as a cause of unexpected restarts, and that kind of instability can be masked by Safe Mode's low-load environment. Safe Mode stability is strong evidence of a software cause, but not proof; Safe Mode instability is near-definitive evidence of hardware.

2. Multiple different BSOD stop codes

Software bugs tend to produce the same stop code each time. A bad driver with a page fault produces 0xD1 DRIVER_IRQL_NOT_LESS_OR_EQUAL consistently. Hardware failures — particularly RAM corruption and PSU voltage instability — corrupt memory contents unpredictably, producing different stop codes across crashes: 0x7F, 0x50, 0x1E, 0x9F appearing in sequence on the same machine. This scatter pattern is the RAM diagnostic clue experienced technicians recognize immediately. For a full breakdown of which stop codes indicate hardware vs. software, see our BSOD hardware causes and fixes guide. If you are seeing more than two different BSOD codes on the same machine over two weeks, run MemTest86 before touching any software.
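The "more than two different codes in two weeks" rule is easy to check against a crash log. The sketch below assumes you have already collected crash timestamps and stop codes by hand (for example from Event Viewer); the function names and the input format are illustrative only.

```python
from datetime import datetime, timedelta


def max_distinct_codes(crashes, window_days=14):
    """Largest number of distinct stop codes seen in any window.

    crashes is a list of (timestamp, stop_code) tuples.
    """
    crashes = sorted(crashes)
    best = 0
    for i, (start, _) in enumerate(crashes):
        end = start + timedelta(days=window_days)
        codes = {code for when, code in crashes[i:] if when <= end}
        best = max(best, len(codes))
    return best


def suggests_memtest(crashes):
    """True when more than two different codes appear within two weeks."""
    return max_distinct_codes(crashes) > 2


scattered = [
    (datetime(2025, 3, 1), 0x7F),
    (datetime(2025, 3, 4), 0x50),
    (datetime(2025, 3, 9), 0x1E),
]
print(suggests_memtest(scattered))
```

A log with the same code repeated, such as three 0xD1 crashes, would not trip the rule, which matches the single-driver pattern described above.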

3. Problem started after a physical change

Typical triggers: new RAM was installed, the machine was moved, a component was upgraded, or a power surge occurred. Hardware problems that begin immediately after a physical change are the most directly traceable cases. A hairline crack in a solder joint that was always marginal can manifest after vibration from transport, so a physical event does not have to be the cause, just the trigger.

4. Failure reproduces on a fresh Windows installation

Reinstalling Windows eliminates essentially all software: OS files, drivers, registry, installed applications. If a machine continues crashing after a clean install — particularly within the first week of normal use — the hardware is almost certainly the cause. The most common scenario: a dying SSD with bad sectors causes filesystem corruption on the fresh install within days or weeks. Our guide to reading SMART data and predicting SSD failure covers how to identify a dying drive before reinstalling onto it.

5. WHEA Logger Event ID 18 in Windows Event Viewer

WHEA (Windows Hardware Error Architecture) is a dedicated kernel subsystem that reads hardware error registers directly from the CPU's Machine Check Architecture. Event ID 18 logged by Microsoft-Windows-WHEA-Logger means the CPU detected a hardware fault in its error registers — not a software exception, not a driver error, but an actual hardware-level fault reported by processor hardware registers. This is one of the highest-confidence hardware indicators available through Windows without additional diagnostic tools.

Common causes: RAM running at XMP/EXPO profiles that exceed the memory controller's validated limits, failing RAM modules, degraded PCIe lanes, or an aging CPU with marginal voltage stability. Disabling XMP/EXPO in BIOS and running at JEDEC-rated memory speeds is the first step before concluding that hardware is defective: an XMP misconfiguration triggers WHEA events but is not a hardware failure. For a complete Event ID reference covering all WHEA severity levels, Kernel-Power ID 41, and disk errors, see our Windows Event Viewer hardware diagnostics guide.

6. Problem is temperature-correlated

If a machine fails after 30-60 minutes of use and recovers after cooling down, and the pattern repeats reliably, the cause is thermal. Software does not become less stable as temperature rises. A CPU hitting its 95°C thermal limit, a GPU junction reaching 83°C, a VRM at 105°C — these produce crashes and instability that look like software failures in their visible symptoms but reveal themselves through their consistent timing and temperature correlation. See our complete guide to diagnosing PC crashes under load for the full thermal troubleshooting methodology.

5 Software Signals That Are Almost Never Hardware

1. Problem appeared immediately after a specific update or installation

Clear temporal correlation between a software change and the onset of problems is among the strongest software indicators. A Windows Update that triggers BSODs the next day, a new GPU driver that causes repeated crashes, an application that corrupts itself during a botched update: these have a traceable cause that can often be reversed. If rolling back the specific driver or update stops the crashes, the software cause is confirmed.

2. Problem affects only specific applications, not the system

Hardware failures are cross-cutting. A failing RAM module corrupts memory for every process that uses the affected address range — all applications are equally at risk. A problem that crashes Premiere Pro but leaves Photoshop, the browser, and Excel running stably is almost never a hardware problem. Hardware failures do not know which application is running.

3. Problem disappears in Safe Mode over extended use

A machine that runs stable through 30-60 minutes of Safe Mode use has significantly narrowed the cause to third-party drivers or services loaded at boot. Follow-up: identify which driver or service is present in full boot mode but absent in Safe Mode. Device Manager, the Services console, and startup applications in Task Manager are the places to look.

4. Windows Event Log shows application errors, not kernel errors

Windows Application Event Log errors in the 1000-1023 range indicate application-level crashes, not kernel or hardware faults. Event ID 1000 with a faulting application name (chrome.exe, premiere.exe, AutoCAD.exe) is application software crashing, usually with no hardware subsystem involved. Compare this with WHEA Event 18 (hardware error register) or Event ID 7 (disk bad block): the source and event ID tell you which layer of the system reported the failure.
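The "source plus event ID tells you the layer" idea can be written as a small lookup. This is a sketch covering only the events named in this article; the provider strings are the names these events normally carry in Event Viewer, and the layer labels are wording chosen for the example.

```python
# (provider, event ID) -> layer that reported the failure.
# Only the events discussed in this article are listed.
LAYER_BY_EVENT = {
    ("Microsoft-Windows-WHEA-Logger", 18): "hardware (CPU-reported error)",
    ("disk", 7): "hardware (storage bad block)",
    ("Microsoft-Windows-Kernel-Power", 41): "unexpected restart (check BugcheckCode)",
    ("Application Error", 1000): "application",
}


def classify_event(provider, event_id):
    """Return the layer for a known event, or 'unclassified'."""
    return LAYER_BY_EVENT.get((provider, event_id), "unclassified")


print(classify_event("Microsoft-Windows-WHEA-Logger", 18))
print(classify_event("Application Error", 1000))
```

Anything that comes back "unclassified" needs a manual look; the table is a starting point, not a full Event ID reference.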

5. A specific file or process name appears in the BSOD

BSODs that name a specific file — particularly files outside ntoskrnl.exe and core Windows system components — often indicate a driver issue. A crash naming nvlddmkm.sys is the NVIDIA graphics driver. igdkmd64.sys is the Intel integrated graphics driver. mwac.sys is a Malwarebytes kernel component. These named files identify the software component that caused the crash, which is the starting point for the fix: update, rollback, or remove that driver.

The 6-Step Diagnostic Sequence

When signals are mixed or ambiguous, run this sequence in order. Stop when you have enough confidence to act.

  1. Check Event Viewer first. Open eventvwr.msc, navigate to Windows Logs → System, and filter for Critical and Error events around the time of the crash. Look for WHEA-Logger Event 18 (hardware), Event ID 41 with a non-zero BugcheckCode (decode the decimal value to hex using Microsoft's Bug Check Code Reference), and Event ID 7 (disk bad block). Any of these in the right timeframe changes the diagnosis immediately. For a complete reference of all hardware-relevant Event IDs, see our Windows Event Viewer hardware diagnostics guide.

  2. Boot into Safe Mode and stress the system. Run the machine in Safe Mode for 30-60 minutes under typical tasks. If it crashes in Safe Mode, hardware is the primary suspect. If it remains stable, the cause is most likely a third-party driver or service, although load-dependent hardware faults (GPU, PSU, thermal) can still be masked by Safe Mode's reduced load.

  3. Check SMART data on every storage device. Run CrystalDiskInfo (free) on all drives. Look specifically for Reallocated Sectors Count (ID 05), Pending Sectors (ID C5), and Uncorrectable Sectors (ID C6). Non-zero values in any of these three attributes indicate physical storage media degradation. Our guide to reading SMART data and predicting SSD failure explains what each attribute means and when to act.

  4. Run MemTest86 for RAM. Boot from a USB drive running MemTest86 (free from PassMark). Run a minimum of 2 passes. Any errors indicate a hardware problem in the memory subsystem. MemTest86's own documentation notes that intermittent faults may require 4+ passes to detect. See our RAM failure signs and testing guide for the complete MemTest86 workflow and how to interpret results.

  5. Monitor sensor data under load. Run a combined CPU and GPU stress test while watching sensor data in HWiNFO64. CPU temperature above 90°C sustained, GPU junction above 83°C, or 12V rail reading below 11.4V under load all point to hardware causes — thermal failure or PSU degradation — that manifest as crashes without naming any specific software component.

  6. Compare against fleet baseline if managing multiple machines. A single crashing machine could be anything. Three machines from the same hardware batch crashing with the same symptom pattern within two weeks is a hardware signal. Fleet pattern matching shortcuts the process significantly.
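Two of the steps above involve small mechanical checks that are easy to get wrong by hand: converting the decimal BugcheckCode from Event ID 41 to a hex stop code (step 1), and checking the three SMART attributes (step 3). The sketch below assumes the values were read manually from Event Viewer and CrystalDiskInfo; the function names and the dictionary input format are illustrative, and the name table only covers stop codes mentioned in this article.

```python
# Stop codes named in this article; not a complete list.
STOP_CODE_NAMES = {
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
    0x124: "WHEA_UNCORRECTABLE_ERROR",
    0x7A: "KERNEL_DATA_INPAGE_ERROR",
    0xC4: "DRIVER_VERIFIER_DETECTED_VIOLATION",
}

# SMART attributes from step 3, keyed by attribute ID.
SMART_MEDIA_ATTRIBUTES = {
    0x05: "Reallocated Sectors Count",
    0xC5: "Pending Sectors",
    0xC6: "Uncorrectable Sectors",
}


def decode_bugcheck(decimal_code):
    """Convert the decimal BugcheckCode from Event ID 41 to hex.

    Returns None for 0, which means no bug check was recorded.
    """
    if decimal_code == 0:
        return None
    name = STOP_CODE_NAMES.get(decimal_code, "(see Bug Check Code Reference)")
    return f"0x{decimal_code:X} {name}"


def smart_warnings(raw_values):
    """Return the names of the three attributes with non-zero raw values.

    raw_values maps SMART attribute ID -> raw value.
    """
    return [
        name
        for attr_id, name in SMART_MEDIA_ATTRIBUTES.items()
        if raw_values.get(attr_id, 0) > 0
    ]


print(decode_bugcheck(209))
print(smart_warnings({0x05: 0, 0xC5: 8, 0xC6: 0}))
```

A BugcheckCode of 0 on Event ID 41 means Windows never got to record a stop code, which is itself a hint toward power loss or a hard hang rather than a named driver fault.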

The PSU Problem: The Most Underdiagnosed Hardware Cause

Most diagnostic guides cover RAM, storage, and GPU as hardware suspects. The PSU is underreported as a root cause even though it is responsible for a significant share of failures that mimic other faults.

A failing PSU with a degraded 12V rail produces:

  • VIDEO_TDR_FAILURE BSODs — looks like a GPU driver crash
  • MEMORY_MANAGEMENT BSODs — looks like RAM failure
  • Sudden restarts under combined CPU+GPU load — indistinguishable from OS crash
  • Random system instability with no consistent pattern — impossible to correlate to a specific component

The PSU is the hardest hardware component to diagnose without equipment. A multimeter measuring the 12V rail under load is the most accessible test — below 11.4V under peak load is the threshold for concern. Swapping with a known-good PSU of equal or higher wattage is the definitive test.

The diagnostic clue: problems that only appear under combined CPU and GPU load but not under single-component load. The PSU is the only component that serves both. Our guide to signs your power supply is failing covers the voltage measurements and failure modes in detail.
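The two PSU clues in this section, failure only under combined load and a 12V rail below 11.4V, can be combined into a quick checklist. This is a rough sketch: the function name and inputs are hypothetical, and the stress-test results and voltage reading are assumed to come from your own testing.

```python
RAIL_12V_MINIMUM = 11.4  # volts, the concern threshold given above


def psu_suspect_reasons(crash_cpu_only, crash_gpu_only, crash_combined,
                        min_12v_under_load=None):
    """Collect the reasons, if any, to suspect the power supply.

    The three booleans record whether the machine crashed under a
    CPU-only, GPU-only, and combined stress test. The voltage is the
    lowest 12V reading observed under load, if one was measured.
    """
    reasons = []
    if crash_combined and not crash_cpu_only and not crash_gpu_only:
        reasons.append("fails only under combined CPU+GPU load")
    if min_12v_under_load is not None and min_12v_under_load < RAIL_12V_MINIMUM:
        reasons.append(
            f"12V rail dropped to {min_12v_under_load:.2f} V under load"
        )
    return reasons


print(psu_suspect_reasons(False, False, True, min_12v_under_load=11.2))
```

An empty list does not clear the PSU; swapping in a known-good unit remains the definitive test.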

Fleet Diagnosis: When Pattern Matching Changes Everything

Individual machine diagnosis is investigative work. Fleet diagnosis is statistical. In our monitoring data across over 500 managed machines, hardware failures show patterns that are invisible when examining a single machine in isolation:

  • Same hardware batch + similar age + similar symptom onset = component lot issue or batch firmware bug
  • Multiple machines at one site + voltage instability = building power quality issue
  • Temperature-correlated failures across multiple machines in summer months = facility cooling degrading before any individual machine fails definitively

GGFix surfaces these fleet patterns automatically — the fleet dashboard identifies which machines share the same hardware components as a machine that just failed, which machines are showing anomalous temperature trends, and which machines have generated hardware-type event logs recently.

For MSPs managing 50+ client machines, this fleet context is the difference between reactive break-fix and predictive maintenance. See our PC hardware monitoring guide for the sensor types and data points that feed this analysis.
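The first fleet pattern, several machines from one hardware batch failing close together, can be sketched as a grouping problem. The code below is an illustration under stated assumptions, not a description of how GGFix implements it: the input format, the three-machine minimum, and the 14-day window are chosen to match the "three machines within two weeks" rule of thumb from the diagnostic sequence.

```python
from collections import defaultdict
from datetime import date, timedelta


def batches_with_cluster(crashes, min_machines=3, window_days=14):
    """Return hardware batches where enough machines crashed close together.

    crashes is a list of (machine_id, batch_id, crash_date) tuples.
    """
    by_batch = defaultdict(list)
    for machine, batch, day in crashes:
        by_batch[batch].append((day, machine))

    flagged = []
    for batch, events in by_batch.items():
        events.sort()
        for i, (start, _) in enumerate(events):
            end = start + timedelta(days=window_days)
            # Count distinct machines, not crashes, inside the window.
            machines = {m for d, m in events[i:] if d <= end}
            if len(machines) >= min_machines:
                flagged.append(batch)
                break
    return sorted(flagged)


reports = [
    ("pc-01", "batch-A", date(2025, 3, 1)),
    ("pc-02", "batch-A", date(2025, 3, 5)),
    ("pc-03", "batch-A", date(2025, 3, 12)),
    ("pc-09", "batch-B", date(2025, 3, 2)),
]
print(batches_with_cluster(reports))
```

Counting distinct machines rather than raw crashes keeps one badly behaved machine from flagging its whole batch.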

How Continuous Monitoring Changes the Diagnosis

The 6-step sequence above is reactive — it starts after a crash has already occurred. Continuous hardware monitoring adds a layer before the crash: 30 days of sensor history that records what the hardware was doing before the failure.

Temperature trends over 30 days answer thermal-correlation questions without requiring a controlled test environment. CPU power draw trends identify machines consuming significantly more power for the same workload than 3 months earlier — a known early indicator of component degradation. Fan RPM trends identify fans whose speed has declined over months before they stop entirely.

When a machine crashes and you have 30 days of sensor history, the hardware-vs-software question often answers itself. If CPU temperatures were climbing steadily over 3 weeks before the crash, thermal failure is the likely cause. If temperatures were flat and normal right up to the failure event, thermal cause is effectively ruled out and software becomes the more likely suspect.
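"Climbing steadily" versus "flat" can be made concrete with a least-squares slope over the daily readings. This is a minimal sketch: the input is assumed to be one peak CPU temperature per day, and the 0.2 °C/day cutoff is an arbitrary illustration, not a validated threshold.

```python
RISING_THRESHOLD_C_PER_DAY = 0.2  # assumed cutoff for this example only


def slope_per_day(daily_max_temps):
    """Least-squares slope of temperature against day index, in °C/day."""
    n = len(daily_max_temps)
    if n < 2:
        raise ValueError("need at least two daily readings")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_max_temps) / n
    numerator = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(daily_max_temps)
    )
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def thermal_trend(daily_max_temps):
    """Label the history as 'rising' or 'flat' using the assumed cutoff."""
    if slope_per_day(daily_max_temps) >= RISING_THRESHOLD_C_PER_DAY:
        return "rising"
    return "flat"


three_weeks_climbing = [70 + 0.5 * day for day in range(21)]
print(thermal_trend(three_weeks_climbing))
```

A slope alone does not show whether the absolute temperatures are near the thermal limit, so it is worth reading alongside the peak values.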

Frequently Asked Questions

Q: Can hardware problems cause software-looking error messages?

Yes — this is the core reason hardware-vs-software diagnosis is difficult. A failing RAM module causes memory corruption that produces application crashes and error messages that look identical to software bugs. A dying SSD causes filesystem corruption that manifests as OS errors and failed Windows updates. A degrading PSU causes GPU driver timeouts and MEMORY_MANAGEMENT errors. The error message names the software component that encountered the corrupted data, not the hardware component that corrupted it.

Q: Will reinstalling Windows fix a hardware problem?

No. Reinstalling Windows eliminates software as a variable, which is useful for diagnosis, but it cannot fix hardware failure. A fresh Windows install on a failing SSD will develop the same filesystem corruption within weeks. A clean install on a machine with failing RAM will exhibit the same application crashes once the affected memory address range is accessed. Reinstalling Windows is a diagnostic step — if problems resume on a fresh install within a few weeks, the hardware is the cause.

Q: How do I tell if a BSOD is hardware or software?

The stop code is the first clue. Codes like 0x124 WHEA_UNCORRECTABLE_ERROR, 0x7A KERNEL_DATA_INPAGE_ERROR (storage failure), or multiple different codes across crashes point to hardware. Codes that consistently name a specific driver file — 0xD1 DRIVER_IRQL_NOT_LESS_OR_EQUAL with nvlddmkm.sys, 0xC4 DRIVER_VERIFIER_DETECTED_VIOLATION — point to software. The key pattern: multiple different stop codes on the same machine across multiple crashes is a strong indicator of RAM or PSU instability, not a software bug.

Q: What does Safe Mode actually prove about hardware vs. software?

A machine that crashes in Safe Mode is almost certainly experiencing hardware failure. A machine that is stable in Safe Mode has narrowed the cause to a third-party driver or service — but this does not definitively rule out hardware. Safe Mode runs at reduced power draw with no dedicated GPU driver. GPU failures under load, PSU failures under combined load, and thermal failures are all masked in Safe Mode's reduced-load environment.

Q: Can overheating cause what looks like software errors?

Yes. When a CPU reaches its thermal limit (95°C for AMD Ryzen 7000/9000, 100°C for Intel 13th/14th gen), it reduces clock speed and may cause system instability that manifests as application crashes, GPU driver timeouts, or BSODs. These symptoms look software-like because they generate Windows error messages naming software components. The thermal correlation — failures after 30-60 minutes of use, recovery after cooling — is the diagnostic signal. Monitoring CPU temperature under load with HWiNFO64 during the failure condition confirms or rules out thermal cause within a single test session.

Q: Is there a Windows tool that diagnoses hardware for free?

Yes, several. Event Viewer (eventvwr.msc) surfaces WHEA hardware errors, disk bad block events, and bug check (stop) codes. Windows Memory Diagnostic (mdsched.exe) performs a basic RAM test on reboot. Reliability Monitor (perfmon /rel) shows a timeline of system failures and changes. CrystalDiskInfo (free third-party) reads SMART data from every storage drive. HWiNFO64 (free third-party) shows real-time sensor data including temperatures, voltages, and fan RPMs. For RAM beyond Windows Memory Diagnostic, MemTest86 (bootable, free from PassMark) is the standard technician tool.

What does ignoring this actually cost?
| Scenario | Typical cost (USD) |
|---|---|
| Emergency repair after hardware failure | $300 – $1,500 |
| Data recovery (worst case) | $500 – $2,500 |
| Lost workday per incident | $150 – $800 |
| Preventive maintenance (if flagged early) | $30 – $130 |
| GGFix monitoring (per machine / month) | $20 |
| GGFix monitoring (per machine / year, 2 months free) | $200 |

Early warning is the cheapest insurance you can buy. GGFix catches problems when the fix is still cheap — and names the exact app, sensor, or BSOD code responsible.
