Understanding GPU Metrics

gpulse surfaces a wide range of GPU metrics across its views. This guide explains what each metric measures, what normal values look like, and when to be concerned.

Memory

Used / Total / Free

GPU memory (VRAM) is separate from system RAM. gpulse reports three values:

  • Used: bytes currently allocated by all processes on that GPU
  • Total: the physical VRAM capacity (e.g., 24 GiB for an RTX 4090)
  • Free: Total minus Used — the headroom remaining before an allocation fails

Memory is shown both as an absolute value (e.g., 18.4 GiB / 24.0 GiB) and as a percentage bar.
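The arithmetic behind that display is simple. Here is a minimal sketch of a formatter producing the same shape of output; `memory_summary` is a hypothetical helper, not part of gpulse:

```python
def memory_summary(used_bytes: int, total_bytes: int) -> str:
    """Format VRAM usage as 'used / total (percent, free)'.

    Illustrative only; gpulse's actual formatting may differ.
    """
    gib = 1024 ** 3
    free_bytes = total_bytes - used_bytes       # Free = Total - Used
    pct = 100.0 * used_bytes / total_bytes
    return (f"{used_bytes / gib:.1f} GiB / {total_bytes / gib:.1f} GiB "
            f"({pct:.0f}% used, {free_bytes / gib:.1f} GiB free)")
```

For example, 18.4 GiB in use on a 24 GiB card renders as `18.4 GiB / 24.0 GiB (77% used, 5.6 GiB free)`.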

Out-of-Memory (OOM) Risk

When an allocation request exceeds the remaining free memory, the CUDA runtime returns cudaErrorMemoryAllocation, which typically crashes the calling process. There is no automatic swap for GPU memory — the process dies.

gpulse's Predict view forecasts time-to-OOM based on the current growth rate, giving you time to intervene before a crash.
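The simplest form of such a forecast is linear extrapolation: divide current free memory by the observed growth rate. This sketch shows the idea only — gpulse's actual forecaster may be more sophisticated, and `seconds_to_oom` is a hypothetical name:

```python
def seconds_to_oom(free_bytes: int, growth_bytes_per_sec: float):
    """Naive linear time-to-OOM estimate.

    Returns None when memory usage is flat or shrinking (no OOM on
    the current trend). Assumes growth stays constant, which real
    workloads often violate.
    """
    if growth_bytes_per_sec <= 0:
        return None
    return free_bytes / growth_bytes_per_sec
```

At 6 GiB free and a steady 64 MiB/s of growth, this predicts OOM in 96 seconds — enough warning to checkpoint or kill the offending job.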

GPU Utilization (SM Occupancy)

GPU utilization measures the fraction of the sampling window during which at least one warp was executing on the GPU's Streaming Multiprocessors (SMs). It is expressed as 0-100%.

Range      Meaning
0-5%       Idle or very light load
5-50%      Moderate workload; data-pipeline bottleneck likely
50-95%     Healthy compute-bound workload
95-100%    Fully saturated; expected for training runs

A process with high memory usage but low utilization is often stalled on I/O or CPU preprocessing, not actively computing.
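The ranges above can be expressed as a small classifier. The thresholds here mirror the table, but the function itself is illustrative, not gpulse's internal logic:

```python
def classify_utilization(pct: float) -> str:
    """Map an SM-utilization percentage onto the guide's ranges."""
    if pct <= 5:
        return "idle or very light load"
    if pct <= 50:
        return "moderate; data-pipeline bottleneck likely"
    if pct <= 95:
        return "healthy compute-bound workload"
    return "fully saturated"
```

Pairing this with memory usage — e.g. high memory but `classify_utilization(util)` returning the moderate band — is a quick way to spot jobs stalled on input pipelines.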

Temperature

GPU core temperature in degrees Celsius, read from the onboard thermal sensor.

Range      Status                             Color in gpulse
< 70 °C    Normal                             Green
70-85 °C   Warm — monitor closely             Yellow
> 85 °C    Hot — thermal throttling likely    Red
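As a sketch, the thresholds above translate to a simple status mapping. The function and its return values are illustrative, not gpulse's implementation:

```python
def temp_status(celsius: float):
    """Return (status, color) per the temperature table's thresholds."""
    if celsius < 70:
        return ("normal", "green")
    if celsius <= 85:
        return ("warm", "yellow")
    return ("hot", "red")
```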

Thermal Throttling

When the GPU exceeds its thermal threshold (usually 83-87 C), the driver reduces the core clock to bring temperature down. This appears as a sudden drop in utilization and throughput. If you see unexplained performance degradation alongside a red temperature reading, throttling is the likely cause.

Actions: improve chassis airflow, check that fans are spinning, reduce batch size, or increase scheduling gaps between runs.

Power

Power Draw vs. Power Limit

  • Draw: current instantaneous power consumption in watts
  • Limit: the TDP cap configured for the device (factory TDP or a manually set limit via nvidia-smi -pl)

When draw approaches the limit, the driver enforces the cap by reducing clocks — called power-limit throttling. This is normal on high-density servers where multiple GPUs share a power budget. gpulse shows both values so you can see headroom before throttling kicks in.
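Headroom is just the difference between limit and draw; a simple "near the cap" check looks like this. The 95% warning fraction is an assumption for illustration — gpulse may use a different cutoff:

```python
def power_headroom(draw_w: float, limit_w: float, warn_frac: float = 0.95):
    """Return (remaining watts, True if draw is close to the limit).

    warn_frac is an assumed threshold for 'power-limit throttling is
    imminent'; tune it to your hardware.
    """
    return (limit_w - draw_w, draw_w >= warn_frac * limit_w)
```

A 340 W draw against a 350 W limit leaves only 10 W of headroom and trips the warning, since 340 W exceeds 95% of the cap.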

Process Table

The process table (visible in Detail and List views) shows every process holding GPU memory:

Column    Description
PID       Process identifier
Name      Executable name
Memory    VRAM allocated by this process
User      OS user owning the process

Use this to attribute memory to specific training jobs, inference servers, or notebooks. It is the starting point for leak investigation — see Leak Detection.
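When attributing memory, ranking processes by VRAM held is usually the first step. A minimal sketch, where the dict keys (`pid`, `name`, `memory_bytes`, `user`) are assumed field names, not gpulse's data model:

```python
def top_memory_consumers(procs: list, n: int = 3) -> list:
    """Return the n processes holding the most VRAM, largest first."""
    return sorted(procs, key=lambda p: p["memory_bytes"], reverse=True)[:n]
```

Feeding this the process table immediately surfaces the job to investigate first when free memory runs low.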

ECC (Error-Correcting Code) Memory

Enterprise and data-centre GPUs (A100, H100, V100, etc.) include ECC hardware. gpulse reports:

  • Corrected errors: single-bit errors fixed automatically. A low, steady count is normal. A rising count may indicate aging VRAM.
  • Uncorrected errors: double-bit errors that ECC could not fix. Any uncorrected error is serious — it typically crashes the affected process with a hardware error. Persistent occurrences warrant an RMA.

ECC metrics are only populated on hardware that supports them. Consumer GPUs (RTX series) do not expose ECC counts.
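The two rules above — any uncorrected error is serious, a rising corrected count deserves attention — can be sketched as a crude health check. Function name, messages, and the periodic-sample input format are all assumptions for illustration:

```python
def ecc_alert(corrected_history: list, uncorrected_total: int) -> str:
    """Crude ECC health check over periodically sampled counters.

    corrected_history: corrected-error counter samples, oldest first.
    uncorrected_total: current uncorrected (double-bit) error count.
    """
    if uncorrected_total > 0:
        return "uncorrected errors: investigate, consider RMA"
    if len(corrected_history) >= 2 and corrected_history[-1] > corrected_history[0]:
        return "corrected errors rising: watch this GPU"
    return "ok"
```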