PC Keeps Crashing Under Load: Diagnostic Guide

GGFix Technical Team

6 April 2026 · 15 min read · 202 views

A PC that runs fine at idle but crashes under load is rarely exhibiting a pure software problem. In most cases it is exhibiting a hardware problem that software cannot hide once the system is pushed hard. The difference between a crash that appears random and a crash with a measurable cause is sensor data. With the right tools, load-based crashes are almost always diagnosable within a single stress test session.

This guide covers the six most common causes of load-based crashes, the sensor readings that identify each one, the exact diagnostic sequence we use when servicing machines in our fleet, and what the same investigation looks like when continuous monitoring has already captured the crash for you. For context on how these relate to the broader troubleshooting landscape, see our complete PC troubleshooting guide.

Why Crashes Happen Under Load (and Not at Idle)

Idle conditions are forgiving. At idle, a CPU draws 10-20W, a GPU draws 5-15W, and total system power is 40-80W. Thermal output is low, voltage demand is minimal, and the PSU is running well within its capacity. Failures hide easily at this power state.

Under sustained load, everything changes. A CPU gaming load draws 65-250W. A GPU under FurMark or a game draws 200-450W. Total system power can reach 600-900W on a high-end build. At this point:

Temperatures approach component limits
PSU voltage rails must deliver peak current continuously
RAM runs at maximum frequency with sustained memory pressure
VRMs regulate power delivery at their thermal ceiling

Every weak point in the hardware chain is exposed under this kind of load. The component that cannot sustain it causes the crash.

The Six Causes — and Their Sensor Signatures

Cause	Crash type	Key sensor to check	Threshold
GPU thermal shutdown	Instant off or TDR error	GPU temperature	Core >95°C, hotspot >110°C
CPU thermal throttle + crash	Freeze or BSOD	CPU package temp	Approaching TjMax (95-105°C)
PSU voltage instability	Instant off, no warning	12V rail reading	Below 11.4V under load
VRM thermal shutdown	Freeze or sudden off	VRM temperature	Above 100-110°C
RAM instability	BSOD or freeze at memory ops	Memory error logs	Any correctable errors
GPU driver TDR timeout	Screen goes black, recovers or BSOD	GPU utilization + driver log	TDR event in Event Viewer
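The numeric rows of the table above can be encoded as data and checked against a single sensor snapshot. The following is a minimal sketch, not GGFix's implementation: the sensor key names (`gpu_core_c`, `rail_12v`, and so on) are hypothetical, and where the table gives a range the lower bound is used as the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    cause: str
    sensor: str            # hypothetical key name for a sensor reading
    threshold: float
    above_is_bad: bool = True


# Thresholds copied from the table; lower bound used where a range is given (assumption).
LIMITS = [
    Limit("GPU thermal shutdown", "gpu_core_c", 95.0),
    Limit("GPU thermal shutdown", "gpu_hotspot_c", 110.0),
    Limit("CPU thermal throttle + crash", "cpu_package_c", 95.0),
    Limit("PSU voltage instability", "rail_12v", 11.4, above_is_bad=False),
    Limit("VRM thermal shutdown", "vrm_c", 100.0),
]


def exceeded(snapshot):
    """Return the causes whose threshold is crossed by the given sensor snapshot."""
    hits = []
    for lim in LIMITS:
        value = snapshot.get(lim.sensor)
        if value is None:
            continue  # sensor not present in this snapshot
        bad = value > lim.threshold if lim.above_is_bad else value < lim.threshold
        if bad and lim.cause not in hits:
            hits.append(lim.cause)
    return hits
```

RAM errors and TDR events are not numeric thresholds, so they are left out of this sketch and would need to come from the Event Log instead.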

1. GPU Thermal Shutdown

This is the most common cause of load crashes in gaming PCs and workstations. The GPU approaches its maximum safe temperature and either triggers an emergency shutdown or causes a display driver timeout (TDR — Timeout Detection and Recovery) when the GPU becomes unresponsive from thermal throttling.

Nvidia RTX 40-series cards have a target throttle temperature of 83°C and begin emergency protection above 95-100°C. AMD RX 7000-series cards have junction temperatures rated to 110°C, but edge temperatures should stay below 90°C under typical loads. When these limits are exceeded, the result is either a sudden system shutdown, a black screen with driver recovery, or a full BSOD referencing the display adapter.

Diagnostic test: Run FurMark for 10-15 minutes while watching GPU temperature in HWiNFO64. If the GPU temperature reaches 90°C+ and the system crashes during or shortly after, GPU thermal shutdown is the cause. Check both edge temperature and hotspot — the edge may look acceptable at 85°C while the hotspot runs 20°C higher at 105°C. Our complete guide to GPU overheating signs covers the hotspot vs edge distinction in full.
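The edge-versus-hotspot check is easy to automate once the temperatures have been logged. This is a rough sketch that assumes the log has already been reduced to `(edge, hotspot)` pairs in °C; the function name and the 110°C default limit are illustrative choices, with the limit taken from the table earlier in this guide.

```python
def hotspot_report(samples, hotspot_limit_c=110.0):
    """Summarise logged (edge_c, hotspot_c) pairs from a GPU stress test."""
    peak_edge = max(edge for edge, _ in samples)
    peak_hotspot = max(hotspot for _, hotspot in samples)
    max_delta = max(hotspot - edge for edge, hotspot in samples)
    return {
        "peak_edge_c": peak_edge,
        "peak_hotspot_c": peak_hotspot,
        "max_delta_c": max_delta,
        # How far the worst hotspot reading stayed below the assumed limit.
        "headroom_c": hotspot_limit_c - peak_hotspot,
    }
```

A small or negative `headroom_c` alongside an unremarkable `peak_edge_c` is the pattern described above: the edge reading looks acceptable while the hotspot is at the limit.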

Common fixes: Clean dust from GPU fans and heatsink fins; ensure case airflow allows GPU exhaust to exit freely; replace GPU thermal paste on cards over 3-4 years old; verify GPU fans spin during load.

2. CPU Thermal Throttle into Crash

Unlike thermal shutdown, where the system cuts power cleanly, a CPU that hits its throttling ceiling during compute-intensive work sometimes causes system instability instead. When the CPU drops from 5.0 GHz to 800 MHz mid-computation, the timing assumptions that running code relies on can fail — especially in applications with real-time components (audio, video, game engines).

The signature: crashes during CPU-intensive tasks (compiling, video encoding, physics simulations) but not during GPU-bound work, with CPU package temperature at or above 90°C just before the crash.

Diagnostic test: Run Prime95 Small FFT (maximum CPU heat) for 15 minutes while logging CPU temperature and core clocks in HWiNFO64. If clocks drop significantly (3.0 GHz or below on a CPU rated for 5.0 GHz) and the system crashes or freezes, thermal throttling is the mechanism. Then work out whether the throttling comes from degraded cooling (replace thermal paste, clean the heatsink) or from a cooler that is undersized for the CPU's TDP.
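To pick the throttled moments out of a long log, a filter like the one below may help. It is a sketch under stated assumptions: the log is a list of `(seconds, clock_ghz, package_temp_c)` tuples, and the 60% clock ratio simply mirrors the "3.0 GHz on a 5.0 GHz CPU" example, while 90°C mirrors the signature temperature given above.

```python
def throttled_samples(log, rated_ghz, ratio=0.6, hot_c=90.0):
    """Return log entries where the clock has collapsed while the CPU is hot.

    log: iterable of (seconds, clock_ghz, package_temp_c) tuples.
    ratio and hot_c are illustrative defaults, not vendor limits.
    """
    floor_ghz = rated_ghz * ratio
    return [
        (t, clock, temp)
        for t, clock, temp in log
        if clock <= floor_ghz and temp >= hot_c
    ]
```

Low clocks at low temperature are excluded on purpose: those usually indicate idle power saving rather than thermal throttling.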

3. PSU Voltage Instability Under Combined Load

A PSU that delivers stable voltage at 300W may fail to maintain ATX tolerances at 700W. The 12V rail must stay within ±5% (11.4V–12.6V). Below 11.4V, the system detects a brownout and cuts power immediately — no throttling, no warning, just instant off.

This cause is particularly common when:

CPU and GPU are both under maximum load simultaneously (gaming is harder on the PSU than CPU-only or GPU-only workloads)
The PSU is undersized for the build's combined power draw
Capacitors have aged (PSUs over 4-5 years old on high-end builds)
The PSU is operating in a hot environment (elevated ambient raises effective power loss)

Diagnostic test: Monitor the CPU 12V reading in HWiNFO64 (found under the Motherboard sensors section) while running a combined CPU + GPU stress test (OCCT's Power stress test is designed specifically for this). A 12V reading that drops below 11.5V under combined load is already close to the 11.4V ATX floor and indicates a PSU problem. Note: motherboard-reported 12V readings are less accurate than direct multimeter measurement, but they capture relative changes reliably.
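The ±5% arithmetic and the 11.5V warning level can be written down as a small check. This sketch works in integer millivolts to avoid floating-point edge cases at the boundary; the function names and the three labels are illustrative.

```python
def rail_limits_mv(nominal_mv=12000, tolerance_pct=5):
    """ATX tolerance window for a rail, in millivolts (12V ±5% -> 11400..12600)."""
    low = nominal_mv * (100 - tolerance_pct) // 100
    high = nominal_mv * (100 + tolerance_pct) // 100
    return low, high


def classify_12v(reading_v, warn_v=11.5):
    """Label a 12V reading taken under load."""
    low_mv, high_mv = rail_limits_mv()
    reading_mv = round(reading_v * 1000)
    if reading_mv < low_mv or reading_mv > high_mv:
        return "out of spec"   # outside the ±5% window
    if reading_mv < round(warn_v * 1000):
        return "sagging"       # in spec, but below the 11.5V warning level
    return "ok"
```

Because motherboard sensors are only approximate, the trend across a stress test matters more than any single label.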

4. VRM Thermal Shutdown

The voltage regulator modules on the motherboard convert 12V power to the lower voltages the CPU requires. Under sustained high CPU power draw, VRMs generate significant heat. Budget and mid-range motherboards paired with high-TDP CPUs are particularly vulnerable: the VRM design cannot sustain the current without overheating.

VRM shutdown typically occurs during CPU-only workloads (rendering, compilation) rather than gaming, because gaming distributes load between CPU and GPU. Our VRM temperature guide covers the exact temperature thresholds — above 100°C is the danger zone for most VRM designs.

Diagnostic test: Run Prime95 Large FFT (maximum CPU power draw, tests VRM more than CPU thermal capacity) while monitoring VRM temperatures in HWiNFO64. Look for MOSFET temperatures labeled as VRM, VDDCR, or similar — nomenclature varies by motherboard. If VRM temperatures exceed 100°C and the system crashes, the VRM cannot sustain the load.

Common fixes: Improve airflow toward the VRM area; add a dedicated small fan blowing across VRM heatsinks on affected builds; if the board lacks VRM heatsinks on a high-TDP build, that is a board incompatibility issue requiring either a motherboard upgrade or CPU TDP reduction via power limits in BIOS.

5. RAM Instability

RAM that appears stable in desktop use can fail at specific memory access patterns or under the sustained bandwidth demand of games and GPU compute tasks. DDR5 running an XMP/EXPO overclocking profile above its JEDEC baseline is the most common source: many systems have XMP enabled, but the specific frequency or timing combination is unstable under real load.

The crash signature: BSOD with MEMORY_MANAGEMENT, IRQL_NOT_LESS_OR_EQUAL, or WHEA_UNCORRECTABLE_ERROR stop codes. Crashes occur inconsistently, sometimes after 30 minutes of gaming, sometimes after 3 hours.

Diagnostic test: Disable XMP/EXPO in BIOS and run the RAM at its JEDEC default frequency (typically DDR5-4800 or DDR5-5600) for a gaming session. If crashes stop, the XMP profile was the issue: try a more conservative XMP profile or loosen timings manually. To check for physical RAM failure, run MemTest86 for 2+ full passes from a bootable USB drive.
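Finding the highest stable memory speed is a search problem, and bisection keeps the number of test sessions low. The sketch below assumes stability is monotonic (if a speed is stable, every lower speed is too) and that the lowest speed in the list has already been confirmed stable. `is_stable` is a hypothetical callback; in practice it stands for a person running a full gaming session at that speed and reporting the result.

```python
def highest_stable(speeds, is_stable):
    """Bisect an ascending list of memory speeds for the highest stable one.

    speeds: ascending list, e.g. DDR5 transfer rates offered by the BIOS.
    is_stable: callable(speed) -> bool, True if no crash at that speed.
    Assumes speeds[0] is already known to be stable.
    """
    lo, hi = 0, len(speeds) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2      # round up so the loop always makes progress
        if is_stable(speeds[mid]):
            lo = mid                  # mid works: the answer is mid or higher
        else:
            hi = mid - 1              # mid fails: the answer is below mid
    return speeds[lo]
```

With five candidate speeds this needs at most three test sessions instead of four, and the saving grows with the number of steps.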

6. GPU Driver TDR (Timeout Detection and Recovery)

TDR is Windows' safety mechanism for GPU hangs. When the GPU stops responding for more than 2 seconds (the Windows default), the display driver attempts recovery. If recovery succeeds, the screen goes black briefly and returns with a notification: "Display driver stopped responding and has recovered." If recovery fails, the system BSODs with VIDEO_TDR_FAILURE.

TDR events can be caused by:

GPU hardware instability (overheating, failing VRAM, damaged power delivery on the card)
Driver bugs (particularly immediately after driver updates)
PCIe power connector issues (6-pin/8-pin connectors not fully seated)
GPU memory overclock instability

Diagnostic test: Check Event Viewer → Windows Logs → System for "nvlddmkm" (Nvidia) or "amdkmdap" (AMD) errors shortly before crash events. If TDR events appear, roll back the GPU driver to the previous stable version and test. If TDR events persist after driver rollback, the GPU hardware is likely failing — test in another PCIe slot, check power connector seating, and monitor GPU voltage and hotspot temperature.
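Once System log entries have been exported, the "shortly before the crash" check can be scripted. This is a sketch only: it assumes each event has already been parsed into a dict with hypothetical `time` and `source` keys, and the five-minute window is an arbitrary reading of "shortly before".

```python
from datetime import datetime, timedelta

# Driver names from the paragraph above, matched case-insensitively.
GPU_DRIVER_SOURCES = {"nvlddmkm", "amdkmdap"}


def tdr_events_before(events, crash_time, window=timedelta(minutes=5)):
    """Return GPU driver events logged within `window` before the crash.

    events: iterable of dicts with 'time' (datetime) and 'source' (str).
    """
    start = crash_time - window
    return [
        event
        for event in events
        if event["source"].lower() in GPU_DRIVER_SOURCES
        and start <= event["time"] <= crash_time
    ]
```

An empty result does not clear the GPU; it only means the driver did not log a timeout in that window.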

The Diagnostic Sequence

When a machine reports crashes under load, follow this sequence:

Step 1: Read the Event Log first. Open Event Viewer → Windows Logs → System. Look for Event ID 41 (instant power-off crashes), BSOD stop codes (recorded in Event ID 1001), and TDR events (nvlddmkm/amdkmdap). The crash type narrows the suspect list before any testing. For the full Event ID lookup table and what each one means in hardware terms, see our Windows Event Viewer hardware diagnostics guide.

Step 2: Isolate the workload. Does the machine crash during GPU stress only, CPU stress only, or combined load? Run FurMark alone (GPU) and Prime95 alone (CPU) before running both simultaneously. The workload that triggers the crash tells you which component to investigate first.

Step 3: Log sensors throughout the test. Enable HWiNFO64 sensor logging before each stress test. After a crash, open the log and examine the 60 seconds before the event. Identify which sensor value was highest relative to its limit at the moment of crash.

Step 4: Check voltages under peak load. During a combined stress test, watch the CPU 12V, CPU core voltage, and VRM temperatures simultaneously. Voltage sag under peak load (12V dropping below 11.5V) points to the PSU. VRM temperatures above 100°C point to VRM thermal shutdown.

Step 5: Test RAM at stock speeds. Disable XMP and test for a full gaming session. If crashes stop, re-enable XMP and reduce the memory frequency by one step. Find the highest stable XMP frequency through bisection testing.

Step 6: Confirm the fix. After any repair or adjustment, run the same stress test combination that previously caused the crash for 30 minutes. No crash + sensor readings within safe ranges = confirmed fix.
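Step 3 asks which sensor was highest relative to its limit. For temperature sensors that comparison is a simple ratio, sketched below. The sensor names and the limits passed in are assumptions supplied by the caller; voltage rails, where lower is worse, need a separate check and are not handled here.

```python
def rank_by_limit_usage(readings, limits):
    """Rank temperature sensors by how close each reading is to its limit.

    readings, limits: dicts keyed by sensor name, values in °C.
    Returns (sensor, reading / limit) pairs, highest ratio first.
    """
    usage = {
        name: readings[name] / limits[name]
        for name in readings
        if name in limits          # skip sensors with no known limit
    }
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)
```

The first entry in the result is the component to investigate first; a ratio close to 1.0 means the reading was essentially at its limit when the log ended.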

If you are in Copenhagen and would rather not work through this sequence yourself, GGFix offers fixed-price crash and blue-screen diagnosis that follows exactly this process.

What This Looks Like With Continuous Monitoring

The diagnostic sequence above assumes you can reproduce the crash and instrument the next one. That is the hard part. Real users almost never report "my PC crashed at 14:32 yesterday during Cyberpunk" with sensor data attached. They report "my PC keeps crashing." Reproducing the conditions can take hours or days, and intermittent crashes (the worst kind) often refuse to fire on demand.

With continuous monitoring already running, the investigation collapses into a single look at history. Here is the same diagnostic done after the fact, on a workstation we monitor:

14:32 - Sensor snapshot at the second of the crash:

GPU edge: 87°C, hotspot: 108°C — above safe range
CPU package: 78°C — fine
CPU 12V rail: 11.92V — fine
VRM: 76°C — fine
Top process by GPU memory: Cyberpunk2077.exe (8.4 GB VRAM)
Process running for: 47 minutes

14:32 - System restart detected (Event ID 41, kernel-power 'unexpected shutdown')

14:32 - Auto-decoded Event Log entry:

Event ID 41, source: Kernel-Power, BugcheckCode: 0x0
Plain-language decode: "system lost power without a clean shutdown — thermal protection or PSU brownout"
Cross-referenced with the GPU hotspot reading: thermal protection trigger, not a PSU event

14:32 - Telegram alert delivered to the user:

⚠️ GGFix: WORKSTATION-04 just crashed during Cyberpunk2077.exe. GPU hotspot hit 108°C (limit 110°C) — thermal shutdown. CPU and PSU were fine. Most likely cause: GPU heatsink dust or failing thermal paste. Open the dashboard for the 60-second sensor history.

The technician now starts at the answer instead of working toward it. The post-crash investigation that would normally take an hour of stress testing and log spelunking takes ninety seconds: open the dashboard, see the captured sensor history and the named process, schedule the cleaning. For environments where multiple users share machines, or fleets where physical access is impractical, this is the difference between solving load-based crashes and being permanently behind on them.

The same flow surfaces the harder cases too. A crash with GPU temperature 72°C, CPU 12V 11.31V under combined load points squarely at the PSU — no stress test required to prove it. A crash with MEMORY_MANAGEMENT BSOD and the top process by RAM growing 2 GB in the prior 5 minutes points at a memory leak in the running app, not a hardware RAM fault. We cover the per-process leak detection mechanism in detail in our memory leak detection on Windows guide.
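The "growing 2 GB in the prior 5 minutes" test is straightforward to express over per-process history. The following is a minimal sketch, not the mechanism GGFix uses: it assumes one process's history is available as `(minute, ram_gb)` pairs in ascending order, and the 2 GB and 5 minute defaults are taken from the example above.

```python
def ram_growth_gb(history, crash_minute, window_min=5):
    """RAM growth of one process over the window that ends at the crash.

    history: ascending list of (minute, ram_gb) samples for a single process.
    """
    recent = [
        gb
        for minute, gb in history
        if crash_minute - window_min <= minute <= crash_minute
    ]
    if len(recent) < 2:
        return 0.0                 # not enough samples to measure growth
    return recent[-1] - recent[0]


def likely_leak(history, crash_minute, threshold_gb=2.0):
    """True if the process grew by at least threshold_gb just before the crash."""
    return ram_growth_gb(history, crash_minute) >= threshold_gb
```

A positive result shifts suspicion from the RAM modules to the application, which is a much cheaper thing to fix.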

Frequently Asked Questions

Q: How do I know if my PC is crashing from GPU overheating or a driver bug?

Check Event Viewer for TDR events (nvlddmkm or amdkmdap errors) and cross-reference with GPU temperature logs. If GPU temperature was above 90°C when the TDR events appear, heat most likely made the GPU unresponsive and triggered the driver timeout: it is a hardware thermal problem dressed as a driver crash. If GPU temperature was normal and TDR events appear after a recent driver update, it is more likely a software issue: roll back the driver.

Q: My PC only crashes when gaming, not during stress tests. Why?

Games combine high GPU load, moderate-to-high CPU load, large memory footprint, and continuous disk I/O simultaneously — often the most demanding combination for the PSU and VRMs. Dedicated stress tests like FurMark and Prime95 run one subsystem at maximum, which may fall below the combined power draw of a modern AAA game. Try OCCT's Power stress test, which is specifically designed to maximize combined CPU and GPU load simultaneously.

Q: What is the difference between a BSOD crash and an instant power-off crash?

A BSOD (Blue Screen of Death) is a controlled crash: Windows detected a fatal error, wrote a dump file, and halted in an orderly way. An instant power-off (no BSOD, just instant black) is an uncontrolled event triggered by hardware protection, usually a PSU brownout, a CPU/GPU thermal emergency shutdown, or a VRM shutdown. BSOD crashes usually point to driver or RAM issues. Instant power-off crashes point to thermal or power delivery failures. Event ID 41 with a BugcheckCode of 0 in Windows Event Viewer is the giveaway for the latter.

Q: How do I know which app caused my PC to crash?

Windows itself does not record which application was running at the moment of crash for unexpected shutdowns (Event ID 41): by definition, the system did not have time to log it. The only reliable way to answer this is per-process history captured continuously in the background. A monitoring agent that records the top processes by CPU, RAM, and GPU memory every minute can tell you that Cyberpunk was using 8.4 GB of VRAM at 14:32 when the GPU hit 108°C and the system died. Without that record, you are guessing from the time of day and what the user remembers having open.

Q: Can a bad PCIe cable cause crashes under GPU load?

Yes. The 12V power connectors (6-pin, 8-pin, or 16-pin) must be fully seated. A partially seated connector has higher contact resistance, which causes voltage sag and heating at the connector under full GPU load. Many RTX 40- and 50-series cards use the 16-pin (12VHPWR) connector. There are documented cases of thermal damage to this connector from improper seating, causing GPU power delivery failures under sustained load. Inspect all GPU power connectors physically before assuming the graphics card itself is faulty.

Q: How long should I run a stress test to confirm stability?

For an initial diagnostic, 15-20 minutes of isolated GPU stress (FurMark) and 15-20 minutes of isolated CPU stress (Prime95 Small FFT) is sufficient to identify thermal failures and obvious instability. For confirming a fix as stable, run at least 30 minutes of combined load. For RAM stability verification after XMP changes, a full gaming session of 2+ hours is more reliable than synthetic tools, since real games hit specific memory access patterns that synthetic tests may miss.

Q: My PC crashes but all temperatures look fine. What am I missing?

Three things software monitoring commonly misses: (1) GPU hotspot temperature — the hotspot runs 15-25°C above the reported core/edge temperature and is the actual failure trigger; check hotspot specifically in HWiNFO64. (2) PSU voltage — CPU and system voltages under load, especially the 12V rail during peak combined draw. (3) Per-core CPU temperatures — a single core spiking to TjMax triggers throttling and potential instability even when the package average looks acceptable.
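Point (3) can be checked with a few lines once per-core temperatures are logged. In this sketch the 100°C TjMax default is an assumption (the guide gives 95-105°C depending on the CPU), and the average across cores is used as a stand-in for the "package average" mentioned above.

```python
def hottest_core(core_temps_c, tjmax_c=100.0):
    """Compare the hottest core with the average across all cores.

    core_temps_c: list of per-core temperatures in °C, indexed by core number.
    tjmax_c: assumed TjMax; check the real value for the CPU in question.
    """
    index = max(range(len(core_temps_c)), key=lambda i: core_temps_c[i])
    average = sum(core_temps_c) / len(core_temps_c)
    return {
        "core": index,
        "temp_c": core_temps_c[index],
        "average_c": average,
        "at_tjmax": core_temps_c[index] >= tjmax_c,
    }
```

A result where `at_tjmax` is true while `average_c` sits far lower is exactly the case that a single averaged reading hides.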

What does ignoring this actually cost?

Scenario	Typical cost (USD)
CPU/GPU replacement after thermal failure	$400 – $2,500
Emergency technician callout	$120 – $350
Lost workday (thermal throttling undetected)	$200 – $600
Thermal paste + cleaning (early warning)	$30 – $100
GGFix monitoring (per machine / month)	$20
GGFix monitoring (per machine / year — 2 months free)	$200

Early warning is the cheapest insurance you can buy. GGFix catches problems when the fix is still cheap — and names the exact app, sensor, or BSOD code responsible.
