PC Keeps Crashing Under Load: Diagnostic Guide

A PC that runs fine at idle but crashes under load usually has a hardware problem, not a software one: a fault that stays hidden until the system is pushed hard. The difference between a crash that appears random and a crash with a measurable cause is sensor data. With the right tools, load-based crashes are almost always diagnosable within a single stress test session.
This guide covers the six most common causes of load-based crashes, the sensor readings that identify each one, the exact diagnostic sequence we use when servicing machines in our fleet, and what the same investigation looks like when continuous monitoring has already captured the crash for you. For context on how these relate to the broader troubleshooting landscape, see our complete PC troubleshooting guide.
Why Crashes Happen Under Load (and Not at Idle)
Idle conditions are forgiving. At idle, a CPU draws 10-20W, a GPU draws 5-15W, and total system power is 40-80W. Thermal output is low, voltage demand is minimal, and the PSU is running well within its capacity. Failures hide easily at this power state.
Under sustained load, everything changes. A CPU gaming load draws 65-250W. A GPU under FurMark or a game draws 200-450W. Total system power can reach 600-900W on a high-end build. At this point:
- Temperatures approach component limits
- PSU voltage rails must deliver peak current continuously
- RAM runs at maximum frequency with sustained memory pressure
- VRMs regulate power delivery at their thermal ceiling
Every weak point in the hardware chain is exposed under this kind of load. The component that cannot sustain it causes the crash.
The Six Causes — and Their Sensor Signatures
| Cause | Crash type | Key sensor to check | Threshold |
|---|---|---|---|
| GPU thermal shutdown | Instant off or TDR error | GPU temperature | Core >95°C, hotspot >110°C |
| CPU thermal throttle + crash | Freeze or BSOD | CPU package temp | Approaching TjMax (95-105°C) |
| PSU voltage instability | Instant off, no warning | 12V rail reading | Below 11.4V under load |
| VRM thermal shutdown | Freeze or sudden off | VRM temperature | Above 100-110°C |
| RAM instability | BSOD or freeze at memory ops | Memory error logs | Any correctable errors |
| GPU driver TDR timeout | Screen goes black, recovers or BSOD | GPU utilization + driver log | TDR event in Event Viewer |
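The table can be read as a set of threshold checks. The sketch below is a minimal illustration of that idea, not a complete diagnostic: the `Snapshot` field names are hypothetical, and the CPU check assumes the lower end of the 95-105°C TjMax range.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snapshot:
    """Sensor readings captured shortly before a crash (hypothetical field names)."""
    gpu_core_c: Optional[float] = None
    gpu_hotspot_c: Optional[float] = None
    cpu_package_c: Optional[float] = None
    rail_12v: Optional[float] = None
    vrm_c: Optional[float] = None
    memory_errors: int = 0
    tdr_event_logged: bool = False


def suspects(s: Snapshot) -> List[str]:
    """Return every cause from the table whose threshold the snapshot crosses."""
    found = []
    core_hot = s.gpu_core_c is not None and s.gpu_core_c > 95
    hotspot_hot = s.gpu_hotspot_c is not None and s.gpu_hotspot_c > 110
    if core_hot or hotspot_hot:
        found.append("GPU thermal shutdown")
    # Assumption: 95 C is the low end of the TjMax range given in the table.
    if s.cpu_package_c is not None and s.cpu_package_c >= 95:
        found.append("CPU thermal throttle + crash")
    if s.rail_12v is not None and s.rail_12v < 11.4:
        found.append("PSU voltage instability")
    if s.vrm_c is not None and s.vrm_c > 100:
        found.append("VRM thermal shutdown")
    if s.memory_errors > 0:
        found.append("RAM instability")
    if s.tdr_event_logged:
        found.append("GPU driver TDR timeout")
    return found
```

More than one cause can match at once, which is why the function returns a list rather than a single verdict.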
1. GPU Thermal Shutdown
This is the most common cause of load crashes in gaming PCs and workstations. The GPU approaches its maximum safe temperature and either triggers an emergency shutdown or causes a display driver timeout (TDR — Timeout Detection and Recovery) when the GPU becomes unresponsive from thermal throttling.
Nvidia RTX 40-series cards have a target throttle temperature of 83°C and begin emergency protection above 95-100°C. AMD RX 7000-series cards have junction temperatures rated to 110°C, but edge temperatures should stay below 90°C under typical loads. When these limits are exceeded, the result is either a sudden system shutdown, a black screen with driver recovery, or a full BSOD referencing the display adapter.
Diagnostic test: Run FurMark for 10-15 minutes while watching GPU temperature in HWiNFO64. If the GPU temperature reaches 90°C+ and the system crashes during or shortly after, GPU thermal shutdown is the cause. Check both edge temperature and hotspot — the edge may look acceptable at 85°C while the hotspot runs 20°C higher at 105°C. Our complete guide to GPU overheating signs covers the hotspot vs edge distinction in full.
Common fixes: Clean dust from GPU fans and heatsink fins; ensure case airflow allows GPU exhaust to exit freely; replace GPU thermal paste on cards over 3-4 years old; verify GPU fans spin during load.
2. CPU Thermal Throttle into Crash
Unlike thermal shutdown, where the system cuts power cleanly, a CPU that hits its throttling ceiling during compute-intensive work sometimes causes system instability instead. When the CPU drops from 5.0 GHz to 800 MHz mid-computation, the timing assumptions that running code relies on can fail — especially in applications with real-time components (audio, video, game engines).
The signature: crashes during CPU-intensive tasks (compiling, video encoding, physics simulations) but not during GPU-bound work. CPU package temperature was at or above 90°C before the crash.
Diagnostic test: Run Prime95 Small FFT (maximum CPU heat) for 15 minutes while logging CPU temperature and core clocks in HWiNFO64. If clocks drop significantly (3.0 GHz or below on a CPU rated for 5.0 GHz) and the system crashes or freezes, thermal throttling is the mechanism. Then determine whether the cooling has degraded (replace thermal paste, clean the heatsink) or the cooler is simply undersized for the CPU's TDP.
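Once the log is exported, the clock-drop check can be automated. The following is a rough sketch that assumes the log has already been parsed into `(temperature_c, clock_mhz)` pairs; the 60% ratio mirrors the 3.0 GHz on 5.0 GHz example, and the 90°C cutoff comes from the crash signature above. The function and parameter names are illustrative.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[float, float]  # (CPU package temperature in C, core clock in MHz)


def throttled_samples(
    samples: Iterable[Sample],
    rated_mhz: float,
    clock_ratio: float = 0.6,
    hot_c: float = 90.0,
) -> List[Sample]:
    """Return samples where the clock fell to clock_ratio of rated while the CPU was hot."""
    limit_mhz = rated_mhz * clock_ratio
    return [
        (temp_c, mhz)
        for temp_c, mhz in samples
        if mhz <= limit_mhz and temp_c >= hot_c
    ]
```

A non-empty result in the minutes before a freeze supports thermal throttling as the mechanism. Low clocks at low temperature point elsewhere, for example at power limits.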
3. PSU Voltage Instability Under Combined Load
A PSU that delivers stable voltage at 300W may fail to maintain ATX tolerances at 700W. The 12V rail must stay within ±5% (11.4V–12.6V). Below 11.4V, the system detects a brownout and cuts power immediately — no throttling, no warning, just instant off.
This cause is particularly common when:
- CPU and GPU are both under maximum load simultaneously (gaming is harder on the PSU than CPU-only or GPU-only workloads)
- The PSU is undersized for the build's combined power draw
- Capacitors have aged (PSUs over 4-5 years old on high-end builds)
- The PSU is operating in a hot environment (high ambient temperature reduces the output it can sustain)
Diagnostic test: Monitor the CPU 12V reading in HWiNFO64 (found under the Motherboard sensors section) while running a combined CPU + GPU stress test (OCCT's Power stress test is designed specifically for this). A 12V reading that drops below 11.5V under combined load, within 0.1V of the 11.4V ATX floor, indicates a PSU problem. Note: motherboard-reported 12V readings are less accurate than direct multimeter measurement, but they capture relative changes reliably.
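The tolerance arithmetic is simple enough to script against a logged run. This sketch derives the ±5% band from the 12V nominal and grades the lowest reading seen during the test. The "marginal" tier uses the 11.5V warning level from the diagnostic test above; the function names are illustrative.

```python
from typing import Sequence, Tuple

NOMINAL_12V = 12.0
ATX_TOLERANCE = 0.05   # +/-5% on the 12V rail
WARN_BELOW_V = 11.5    # early-warning level used in this guide


def rail_limits(nominal: float = NOMINAL_12V, tol: float = ATX_TOLERANCE) -> Tuple[float, float]:
    """Return the (low, high) tolerance band, rounded to millivolts."""
    return (round(nominal * (1 - tol), 3), round(nominal * (1 + tol), 3))


def classify_12v(readings: Sequence[float]) -> str:
    """Grade a stress-test run by its lowest and highest 12V readings."""
    low, high = rail_limits()
    if min(readings) < low or max(readings) > high:
        return "out of spec"
    if min(readings) < WARN_BELOW_V:
        return "marginal"
    return "ok"
```

Because motherboard sensors are imprecise, a "marginal" result is best confirmed with a multimeter before replacing the PSU.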
4. VRM Thermal Shutdown
The voltage regulator modules on the motherboard convert 12V power to the lower voltages the CPU requires. Under sustained high CPU power draw, VRMs generate significant heat. Budget and mid-range motherboards paired with high-TDP CPUs are particularly vulnerable: the VRM design cannot sustain the current without overheating.
VRM shutdown typically occurs during CPU-only workloads (rendering, compilation) rather than gaming, because gaming distributes load between CPU and GPU. Our VRM temperature guide covers the exact temperature thresholds — above 100°C is the danger zone for most VRM designs.
Diagnostic test: Run Prime95 Small FFT (the preset Prime95 itself labels as maximum power and heat; Large FFT mainly stresses the memory controller and RAM) while monitoring VRM temperatures in HWiNFO64. Look for MOSFET temperatures labeled as VRM, VDDCR, or similar, since nomenclature varies by motherboard. If VRM temperatures exceed 100°C and the system crashes, the VRM cannot sustain the load.
Common fixes: Improve airflow toward the VRM area; add a dedicated small fan blowing across VRM heatsinks on affected builds; if the board lacks VRM heatsinks on a high-TDP build, that is a board incompatibility issue requiring either a motherboard upgrade or CPU TDP reduction via power limits in BIOS.
5. RAM Instability
RAM that appears stable during desktop use can fail under specific memory access patterns or under the sustained bandwidth demand of games and GPU compute tasks. DDR5 running an XMP/EXPO profile above its JEDEC baseline is the most common source: many systems have XMP enabled, but the specific frequency and timing combination is unstable under real load.
The crash signature: BSOD with MEMORY_MANAGEMENT, IRQL_NOT_LESS_OR_EQUAL, or WHEA_UNCORRECTABLE_ERROR stop codes. Crashes occur inconsistently, sometimes after 30 minutes of gaming, sometimes after 3 hours.
Diagnostic test: Disable XMP/EXPO in BIOS and run the RAM at its JEDEC frequency (typically DDR5-4800 or DDR5-5600) for a gaming session. If crashes stop, the XMP profile was the issue. Try a more conservative XMP profile or loosen timings manually. To check for physical RAM failure, run MemTest86 for 2+ full passes from a bootable USB drive.
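Finding a more conservative setting can be treated as a search over the available frequency steps. The sketch below uses bisection, assuming stability is monotonic (every step below a stable frequency is also stable), which is a simplification. `is_stable` is a hypothetical callback; in practice it stands for a manual test such as a full gaming session at that setting.

```python
from typing import Callable, Optional, Sequence


def highest_stable(
    freqs: Sequence[int],
    is_stable: Callable[[int], bool],
) -> Optional[int]:
    """Binary-search ascending frequency steps for the highest one that passes."""
    lo, hi = 0, len(freqs) - 1
    best: Optional[int] = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_stable(freqs[mid]):
            best = freqs[mid]   # passes: try something faster
            lo = mid + 1
        else:
            hi = mid - 1        # fails: back off
    return best
```

With five candidate steps this needs at most three test sessions instead of five, which matters when each session takes hours.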
6. GPU Driver TDR (Timeout Detection and Recovery)
TDR is Windows' safety mechanism for GPU hangs. When the GPU stops responding for more than 2 seconds (the default timeout), Windows resets the display driver and attempts recovery. If recovery succeeds, the screen goes black briefly and returns with a notification: "Display driver stopped responding and has recovered." If recovery fails, the system BSODs with VIDEO_TDR_FAILURE.
TDR events can be caused by:
- GPU hardware instability (overheating, failing VRAM, damaged power delivery on the card)
- Driver bugs (particularly immediately after driver updates)
- PCIe power connector issues (6-pin/8-pin connectors not fully seated)
- GPU memory overclock instability
Diagnostic test: Check Event Viewer → Windows Logs → System for "nvlddmkm" (Nvidia) or "amdkmdap" (AMD) errors shortly before crash events. If TDR events appear, roll back the GPU driver to the previous stable version and test. If TDR events persist after driver rollback, the GPU hardware is likely failing — test in another PCIe slot, check power connector seating, and monitor GPU voltage and hotspot temperature.
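If the System log has been exported and parsed, the "shortly before crash" check can be expressed as a time-window filter. This is a sketch under assumptions: each event is a dict with hypothetical `time` and `source` keys, and the five-minute window is an arbitrary choice rather than a value from this guide.

```python
from datetime import datetime, timedelta
from typing import Dict, List

TDR_SOURCES = {"nvlddmkm", "amdkmdap"}  # Nvidia and AMD display driver sources


def tdr_events_before(
    events: List[Dict],
    crash_time: datetime,
    window: timedelta = timedelta(minutes=5),
) -> List[Dict]:
    """Return display-driver events logged within `window` before the crash."""
    hits = []
    for event in events:
        age = crash_time - event["time"]
        if event["source"].lower() in TDR_SOURCES and timedelta(0) <= age <= window:
            hits.append(event)
    return hits
```

An empty result for a crash under GPU load suggests looking at temperatures and power delivery before blaming the driver.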
The Diagnostic Sequence
When a machine reports crashes under load, follow this sequence:
Step 1: Read the Event Log first. Open Event Viewer → Windows Logs → System. Look for Event ID 41 (instant power-off crashes), BSOD stop codes (recorded in Event ID 1001), and TDR events (nvlddmkm/amdkmdap). The crash type narrows the suspect list before any testing. For the full Event ID lookup table and what each one means in hardware terms, see our Windows Event Viewer hardware diagnostics guide.
Step 2: Isolate the workload. Does the machine crash during GPU stress only, CPU stress only, or combined load? Run FurMark alone (GPU) and Prime95 alone (CPU) before running both simultaneously. The workload that triggers the crash tells you which component to investigate first.
Step 3: Log sensors throughout the test. Enable HWiNFO64 sensor logging before each stress test. After a crash, open the log and examine the 60 seconds before the event. Identify which sensor value was highest relative to its limit at the moment of crash.
Step 4: Check voltages under peak load. During a combined stress test, watch the CPU 12V, CPU core voltage, and VRM temperatures simultaneously. Voltage sag under peak load (12V dropping below 11.5V) points to the PSU. VRM temperatures above 100°C point to VRM thermal shutdown.
Step 5: Test RAM at stock speeds. Disable XMP and test for a full gaming session. If crashes stop, re-enable XMP and reduce the memory frequency by one step. Find the highest stable XMP frequency through bisection testing.
Step 6: Confirm the fix. After any repair or adjustment, run the same stress test combination that previously caused the crash for 30 minutes. No crash + sensor readings within safe ranges = confirmed fix.
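Step 3 asks which sensor was highest relative to its limit in the final 60 seconds. One way to compute that from a parsed log is sketched below. The log layout (`(seconds, {sensor: value})` tuples) and the sensor names are assumptions for illustration, and only upper limits such as temperatures are handled. Voltage floors need a separate check.

```python
from typing import Dict, List, Optional, Tuple

LogEntry = Tuple[float, Dict[str, float]]  # (timestamp in seconds, readings)


def closest_to_limit(
    log: List[LogEntry],
    limits: Dict[str, float],
    crash_t: float,
    window_s: float = 60.0,
) -> Optional[Tuple[str, float, float]]:
    """Return (sensor, reading/limit ratio, timestamp) for the worst reading in the window."""
    worst: Optional[Tuple[str, float, float]] = None
    for t, readings in log:
        if not (crash_t - window_s <= t <= crash_t):
            continue  # outside the window before the crash
        for name, value in readings.items():
            if name not in limits:
                continue  # no known limit for this sensor
            ratio = value / limits[name]
            if worst is None or ratio > worst[1]:
                worst = (name, ratio, t)
    return worst
```

Ranking by ratio rather than raw temperature keeps a 76°C VRM with a 100°C limit from outranking a 108°C hotspot with a 110°C limit.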
What This Looks Like With Continuous Monitoring
The diagnostic sequence above assumes you can reproduce the crash and instrument the next one. That is the hard part. Real users almost never report "my PC crashed at 14:32 yesterday during Cyberpunk" with sensor data attached — they report "my PC keeps crashing." Reproducing the conditions can take hours or days, and intermittent crashes (the worst kind) often refuse to fire on demand.
With continuous monitoring already running, the investigation collapses into a single look at history. Here is the same diagnostic done after the fact, on a workstation we monitor:
14:32:07 — Sensor snapshot at the second of the crash:
- GPU edge: 87°C, hotspot: 108°C — above safe range
- CPU package: 78°C — fine
- CPU 12V rail: 11.92V — fine
- VRM: 76°C — fine
- Top process by GPU memory: Cyberpunk2077.exe (8.4 GB VRAM)
- Process running for: 47 minutes
14:32:09 — System restart detected (Event ID 41, kernel-power 'unexpected shutdown')
14:32:11 — Auto-decoded Event Log entry:
- Event ID 41, source: Kernel-Power, BugcheckCode: 0x0
- Plain-language decode: "system lost power without a clean shutdown — thermal protection or PSU brownout"
- Cross-referenced with the GPU hotspot reading: thermal protection trigger, not a PSU event
14:32:14 — Telegram alert delivered to the user:
⚠️ GGFix: WORKSTATION-04 just crashed during Cyberpunk2077.exe. GPU hotspot hit 108°C (limit 110°C) — thermal shutdown. CPU and PSU were fine. Most likely cause: GPU heatsink dust or failing thermal paste. Open the dashboard for the 60-second sensor history.
The technician now starts at the answer instead of working toward it. The post-crash investigation that would normally take an hour of stress testing and log spelunking takes ninety seconds: open the dashboard, see the captured sensor history and the named process, schedule the cleaning. For environments where multiple users share machines, or fleets where physical access is impractical, this is the difference between solving load-based crashes and being permanently behind on them.
The same flow surfaces the harder cases too. A crash with GPU temperature 72°C, CPU 12V 11.31V under combined load points squarely at the PSU — no stress test required to prove it. A crash with MEMORY_MANAGEMENT BSOD and the top process by RAM growing 2 GB in the prior 5 minutes points at a memory leak in the running app, not a hardware RAM fault. We cover the per-process leak detection mechanism in detail in our memory leak detection on Windows guide.
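The leak case reduces to a growth check on one process's memory history. The sketch below assumes per-minute samples as `(minute, megabytes)` pairs in ascending order and uses the 2 GB in 5 minutes figure from the example above as its threshold. The names are illustrative and do not describe any particular product's implementation.

```python
from typing import List, Tuple

MemSample = Tuple[int, int]  # (minute index, resident memory in MB)


def growth_mb(samples: List[MemSample], window_min: int = 5) -> int:
    """Memory growth between the oldest sample in the window and the latest sample."""
    if not samples:
        return 0
    last_minute, last_mb = samples[-1]
    in_window = [mb for minute, mb in samples if minute >= last_minute - window_min]
    return last_mb - in_window[0]


def looks_like_leak(
    samples: List[MemSample],
    threshold_mb: int = 2048,
    window_min: int = 5,
) -> bool:
    """True if the process grew by at least threshold_mb within the window."""
    return growth_mb(samples, window_min) >= threshold_mb
```

Rapid growth before a MEMORY_MANAGEMENT stop code shifts suspicion from the DIMMs to the application, so it is worth checking before running a multi-hour memory test.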
Frequently Asked Questions
Q: How do I know if my PC is crashing from GPU overheating or a driver bug?
Check Event Viewer for TDR events (nvlddmkm or amdkmdap errors) and cross-reference with GPU temperature logs. If GPU temperature was above 90°C and TDR events appear, heat made the GPU unresponsive and triggered the driver timeout: it is a hardware thermal problem dressed as a driver crash. If GPU temperature was normal and TDR events appear after a recent driver update, it is a software issue: roll back the driver.
Q: My PC only crashes when gaming, not during stress tests. Why?
Games combine high GPU load, moderate-to-high CPU load, large memory footprint, and continuous disk I/O simultaneously — often the most demanding combination for the PSU and VRMs. Dedicated stress tests like FurMark and Prime95 run one subsystem at maximum, which may fall below the combined power draw of a modern AAA game. Try OCCT's Power stress test, which is specifically designed to maximize combined CPU and GPU load simultaneously.
Q: What is the difference between a BSOD crash and an instant power-off crash?
A BSOD (Blue Screen of Death) is a controlled crash: Windows detected a fatal error, wrote a dump file, and halted in an orderly way. An instant power-off (no BSOD, just instant black) is an uncontrolled event triggered by hardware protection, usually a PSU brownout, CPU/GPU thermal emergency shutdown, or VRM shutdown. BSOD crashes point to driver or RAM issues. Instant power-off crashes point to thermal or power delivery failures. Event ID 41 with a BugcheckCode of 0 in Windows Event Viewer is the giveaway for the latter.
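The BugcheckCode field on an Event ID 41 record is what separates the two cases, and the distinction can be sketched as a small lookup. The mapping below covers only the stop codes named in this guide, and the wording of the returned strings is illustrative.

```python
# Stop codes mentioned in this guide, keyed by bugcheck value.
STOP_CODES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1A: "MEMORY_MANAGEMENT",
    0x116: "VIDEO_TDR_FAILURE",
    0x124: "WHEA_UNCORRECTABLE_ERROR",
}


def describe_event_41(bugcheck_code: int) -> str:
    """Interpret the BugcheckCode recorded with a Kernel-Power Event ID 41."""
    if bugcheck_code == 0:
        # No bugcheck was recorded: the system lost power without a BSOD.
        return "uncontrolled power loss (thermal or power delivery suspect)"
    name = STOP_CODES.get(bugcheck_code, "unknown stop code")
    return f"BSOD 0x{bugcheck_code:X} {name} (driver or RAM suspect)"
```

Event Viewer may show the BugcheckCode in decimal, so a displayed value of 26 corresponds to 0x1A.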
Q: How do I know which app caused my PC to crash?
Windows itself does not record which application was running at the moment of crash for unexpected shutdowns (Event ID 41) — by definition, the system did not have time to log it. The only reliable way to answer this is per-process history captured continuously in the background. A monitoring agent that records the top processes by CPU, RAM, and GPU memory every minute can tell you that Cyberpunk was using 8.4 GB of VRAM at 14:32 when the GPU hit 108°C and the system died. Without that record, you are guessing from the time of day and what the user remembers having open.
Q: Can a bad PCIe cable cause crashes under GPU load?
Yes. The GPU's 12V power connectors (6-pin, 8-pin, or the 16-pin 12VHPWR connector on RTX 40/50-series cards) must be fully seated. A partially seated connector makes poor contact, which causes voltage sag and heating at the pins under full GPU load. There are documented cases of thermal damage to the 16-pin connector from improper seating, leading to GPU power delivery failures under sustained load. Inspect all GPU power connectors physically before assuming the graphics card itself is faulty.
Q: How long should I run a stress test to confirm stability?
For an initial diagnostic, 15-20 minutes of isolated GPU stress (FurMark) and 15-20 minutes of isolated CPU stress (Prime95 Small FFT) is sufficient to identify thermal failures and obvious instability. For confirming a fix as stable, run at least 30 minutes of combined load. For RAM stability verification after XMP changes, a full gaming session of 2+ hours is more reliable than synthetic tools, since real games hit specific memory access patterns that synthetic tests may miss.
Q: My PC crashes but all temperatures look fine. What am I missing?
Three things software monitoring commonly misses: (1) GPU hotspot temperature — the hotspot runs 15-25°C above the reported core/edge temperature and is the actual failure trigger; check hotspot specifically in HWiNFO64. (2) PSU voltage — CPU and system voltages under load, especially the 12V rail during peak combined draw. (3) Per-core CPU temperatures — a single core spiking to TjMax triggers throttling and potential instability even when the package average looks acceptable.
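Two of these blind spots are easy to check numerically. The sketch below estimates the hotspot range from an edge reading using the 15-25°C offset, and flags a single hot core that an average would hide. The 100°C TjMax default and the 5°C margin are assumptions; use the rated value for the CPU being tested.

```python
from typing import Sequence, Tuple


def hotspot_estimate(edge_c: float) -> Tuple[float, float]:
    """Estimated hotspot range when only the core/edge temperature is reported."""
    return (edge_c + 15, edge_c + 25)


def hidden_hot_core(
    core_temps_c: Sequence[float],
    tjmax_c: float = 100.0,   # assumption: substitute the CPU's rated TjMax
    margin_c: float = 5.0,    # assumption: how close to TjMax counts as "hot"
) -> bool:
    """True if one core is near TjMax while the average across cores looks acceptable."""
    hottest = max(core_temps_c)
    average = sum(core_temps_c) / len(core_temps_c)
    return hottest >= tjmax_c - margin_c and average < tjmax_c - margin_c
```

An edge reading of 85°C gives an estimated hotspot of 100-110°C, which is why an edge temperature that looks acceptable can still sit at the failure threshold.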
GGFix Technical Team
Writing about hardware monitoring, fleet management, and keeping machines alive. Powered by GGFix.