# Understanding GPU Metrics
gpulse surfaces a wide range of GPU metrics across its views. This guide explains what each metric measures, what normal values look like, and when to be concerned.
## Memory

### Used / Total / Free
GPU memory (VRAM) is separate from system RAM. gpulse reports three values:
- Used: bytes currently allocated by all processes on that GPU
- Total: the physical VRAM capacity (e.g., 24 GiB for an RTX 4090)
- Free: total - used; the headroom before the next allocation may fail
Memory is shown both as an absolute value (e.g., 18.4 GiB / 24.0 GiB) and as a percentage bar.
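The free value and the percentage bar follow directly from used and total; a minimal sketch of the arithmetic (function and field names are illustrative, not gpulse's API):

```python
GIB = 1024 ** 3  # bytes per GiB

def memory_stats(used_bytes: int, total_bytes: int) -> dict:
    """Derive the free bytes and used-percentage shown on the bar."""
    return {
        "free_bytes": total_bytes - used_bytes,
        "used_pct": 100.0 * used_bytes / total_bytes,
    }

stats = memory_stats(used_bytes=18 * GIB, total_bytes=24 * GIB)
# stats["used_pct"] == 75.0, stats["free_bytes"] == 6 GiB
```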
### Out-of-Memory (OOM) Risk
When an allocation requests more than the remaining free memory, the CUDA runtime returns `cudaErrorMemoryAllocation`; most applications treat this as fatal, so the calling process typically crashes. There is no automatic swap for GPU memory — the process dies.
gpulse's Predict view forecasts time-to-OOM based on the current growth rate, giving you time to intervene before a crash.
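gpulse's actual forecasting model is not documented here; the simplest version is a linear extrapolation of the recent growth rate, sketched below (names are illustrative):

```python
def time_to_oom_seconds(samples, total_bytes):
    """Estimate seconds until used memory reaches capacity, assuming the
    recent growth rate continues linearly.

    samples: list of (timestamp_s, used_bytes) pairs, oldest first.
    Returns None when memory is flat or shrinking (no OOM on this trend).
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    rate = (u1 - u0) / (t1 - t0)  # bytes per second
    if rate <= 0:
        return None
    return (total_bytes - u1) / rate
```

For example, a process that grew from 10 GiB to 20 GiB over 10 seconds on a 30 GiB card would be forecast to hit OOM in another 10 seconds.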
## GPU Utilization (SM Occupancy)
Measures what fraction of the GPU's Streaming Multiprocessors (SMs) were executing at least one warp during the sampling window. Expressed as 0-100%.
| Range | Meaning |
|---|---|
| 0-5% | Idle or very light load |
| 5-50% | Moderate workload, data-pipeline bottleneck likely |
| 50-95% | Healthy compute-bound workload |
| 95-100% | Fully saturated; expected for training runs |
A process with high memory usage but low utilization is often stalled on I/O or CPU preprocessing, not actively computing.
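The table above amounts to a simple banding rule; a minimal sketch (the thresholds mirror the table, the function name is illustrative):

```python
def classify_utilization(pct: float) -> str:
    """Band an SM-occupancy percentage per the ranges in the table."""
    if pct <= 5:
        return "idle"
    if pct <= 50:
        return "moderate"   # consider checking the data pipeline / CPU side
    if pct <= 95:
        return "healthy"    # compute-bound workload
    return "saturated"      # expected for training runs
```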
## Temperature
GPU core temperature in degrees Celsius, read from the onboard thermal sensor.
| Range | Status | Color in gpulse |
|---|---|---|
| < 70 °C | Normal | Green |
| 70-85 °C | Warm — monitor closely | Yellow |
| > 85 °C | Hot — thermal throttling likely | Red |
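The color bands can be expressed as a small lookup; a sketch (the function name is illustrative, thresholds follow the table):

```python
def temperature_status(celsius: float) -> tuple:
    """Map a core temperature to the (status, color) bands above."""
    if celsius < 70:
        return ("normal", "green")
    if celsius <= 85:
        return ("warm", "yellow")
    return ("hot", "red")
```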
### Thermal Throttling
When the GPU exceeds its thermal threshold (usually 83-87 °C), the driver reduces the core clock to bring temperature down. This appears as a sudden drop in utilization and throughput. If you see unexplained performance degradation alongside a red temperature reading, throttling is the likely cause.
Actions: improve chassis airflow, check that fans are spinning, reduce batch size, or increase scheduling gaps between runs.
## Power

### Power Draw vs. Power Limit
- Draw: current instantaneous power consumption in watts
- Limit: the TDP cap configured for the device (the factory TDP, or a manually set limit via `nvidia-smi -pl`)
When draw approaches the limit, the driver enforces the cap by reducing clocks — called power-limit throttling. This is normal on high-density servers where multiple GPUs share a power budget. gpulse shows both values so you can see headroom before throttling kicks in.
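Headroom is just the gap between draw and limit; a minimal sketch of the derived values (names are illustrative, not gpulse's API):

```python
def power_headroom(draw_w: float, limit_w: float) -> tuple:
    """Return (watts of headroom, draw as a percentage of the limit)."""
    return max(0.0, limit_w - draw_w), 100.0 * draw_w / limit_w
```

For example, a GPU drawing 240 W against a 300 W limit has 60 W of headroom and sits at 80% of its cap.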
## Process Table
The process table (visible in Detail and List views) shows every process holding GPU memory:
| Column | Description |
|---|---|
| PID | Process identifier |
| Name | Executable name |
| Memory | VRAM allocated by this process |
| User | OS user owning the process |
Use this to attribute memory to specific training jobs, inference servers, or notebooks. It is the starting point for leak investigation — see Leak Detection.
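Attribution usually starts with grouping the table by a column — for example, summing VRAM per user. A sketch, assuming rows are dicts (field names are illustrative, not gpulse's schema):

```python
from collections import defaultdict

def memory_by_user(rows):
    """Sum VRAM across process-table rows, keyed by the owning user."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["user"]] += row["memory_bytes"]
    return dict(totals)
```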
## ECC (Error-Correcting Code) Memory
Enterprise and data-centre GPUs (A100, H100, V100, etc.) include ECC hardware. gpulse reports:
- Corrected errors: single-bit errors fixed automatically. A low, steady count is normal. A rising count may indicate aging VRAM.
- Uncorrected errors: double-bit errors ECC could not fix. Any uncorrected error is serious — typically crashes the affected process with a hardware error. Persistent occurrences warrant an RMA.
ECC metrics are only populated on hardware that supports them. Consumer GPUs (RTX series) do not expose ECC counts.
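A triage rule following the guidance above might look like this (thresholds and names are illustrative, not gpulse's internals):

```python
def ecc_health(corrected_delta: int, uncorrected_total: int) -> str:
    """Rough ECC triage: any uncorrected error is serious; a rising
    corrected count is worth watching."""
    if uncorrected_total > 0:
        return "critical"  # consider an RMA if this persists
    if corrected_delta > 0:
        return "watch"     # may indicate aging VRAM
    return "ok"
```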