What signs indicate a GPU is degrading while still “working”?

Degradation usually appears as a gradual decline: under the same workload the temperature rises, the card drops clocks more often, correctable ECC errors appear, and the share of time spent throttling increases. Watch trends and compare GPUs with each other rather than waiting for a full failure.

Why use DCGM if nvidia-smi exists?

DCGM is convenient because it provides a single, comparable set of metrics and quick health checks for a GPU on the server. It shortens diagnosis time by showing temperature, clocks, power, memory errors, reasons for throttling and driver status in one place.

How do health checks differ from monitoring metrics?

Health checks answer “is everything okay right now?” and help quickly catch obvious problems like a missing GPU, critical errors or driver issues. Metrics are used to answer “is it getting worse over time?” so you can spot rising temperatures, accumulating ECC errors and increasing throttling before an incident.

Which GPU metrics should I collect first?

A practical minimal set is: temperature, power draw and power limit, actual clocks under load and the reasons for throttling, plus utilization and basic memory metrics. This is usually enough to distinguish overheating from power limits and to spot hidden performance loss.

How should ECC be interpreted: SBE vs DBE?

SBEs (single-bit errors) are usually corrected, but a steady increase is an early sign of memory or operating-condition problems and should be treated as a warning. DBEs (double-bit errors) are uncorrectable and often lead to crashes, so a DBE is typically a reason to remove the card from critical workloads and plan diagnostics.

How do I tell if throttling is actually affecting performance?

Don't judge by single spikes: check duration and recurrence, and whether effective clocks drop under the same load. If thermal throttling rises with temperature, it points to cooling; if power throttling dominates while temperature is normal, check power limits and profile settings.

What does the DCGM → exporter → Prometheus chain usually look like?

Typical flow: DCGM runs on the node, DCGM-Exporter exposes metrics, Prometheus or a compatible collector scrapes the exporters, then data goes to storage and to dashboards with alerts. It's important that metrics are collected the same way on all nodes, otherwise comparisons and thresholds will be noisy.

How do I choose windows and thresholds so alerts don’t flap?

Always set a time window and a firing delay to filter out short spikes, and separate warning and critical levels. Practically, have fast rules on 5–10 minute windows for incidents and longer windows over hours or days for trends to catch degradation rather than normal load peaks.

Which labels in metrics are essential for quick incident investigation?

Use identifiers that don't change when cards are moved, for example `gpu_uuid`, and add node and physical location context like `hostname` and `pcie_bus_id` or slot. Then an alert immediately shows which card and where it is, so engineers don't have to guess which “GPU 0” is which.

What should I do when an alert for temperature, ECC or throttling fires?

First, confirm the symptom in metrics and compare with neighboring GPUs on the same node. Then check simple causes like airflow, inlet temperature, fans and power limits. If you see DBE, repeated Xid events or persistent throttling with clock drops, move workloads off the card and send it for diagnostics. For 24/7 operations, working with an external system integrator often helps standardize the response.

Monitoring NVIDIA GPUs in the Data Center: DCGM and Alerts

What GPU problems should you catch early

GPU degradation rarely shows up as a sudden failure. More often it’s a slow decline: the card still works but runs hotter, drops clocks more often, and memory errors appear gradually. Catching this early lets you replace the card or move workloads before users and deadlines are affected.

SLA risks include not only hardware failures but quiet performance loss. Overheating triggers protective clock reductions and makes inference or training noticeably slower. Memory errors (ECC) can crash jobs, restart containers, and, less often, corrupt results. Persistent throttling turns planned cluster usage into a queue.

It’s convenient to build GPU monitoring in the data center in three layers:

node level: power, cooling, driver and overall stability;
each GPU: which specific card is starting to behave worse;
application: business outcome — longer runtimes, more retries, lower throughput.

Early signals worth catching before an incident:

rising temperature at the same load and fan speed;
appearance of correctable ECC errors, especially if the counter grows daily;
frequent throttling events and falling effective clocks under load;
increasing power draw at the same performance (hinting at cooling or power issues);
unstable utilization: sawtooths, drops, unexpected idles.

A typical case: tasks in one rack start taking 10–15% longer with no errors reported. With per-GPU temperature, ECC and throttling metrics, the cause is usually found quickly: one GPU has started overheating and dropping clocks more often while the server as a whole looks fine.

DCGM in simple terms: what it gives in a data center

DCGM (Data Center GPU Manager) is NVIDIA’s toolset for operating GPUs in servers. It turns a GPU into a well-defined monitoring object: status, metric history and quick health checks. DCGM often becomes the base layer for dashboards and alerts.

Its strength is standardizing data collection and making metrics comparable between nodes and models. Instead of scattered outputs from different utilities you get one source of metrics and unified checks.

Health checks vs metrics: what’s the difference

Health checks answer: is everything OK right now? These are quick tests and statuses that help spot an obvious problem fast: a missing GPU, critical errors, driver issues.

Trend metrics answer: is it getting worse over time? Here graphs over days and weeks matter: temperature at the same load, rising correctable ECC, clock changes, percentage of time in throttling.

DCGM provides data at multiple levels, which helps narrow diagnosis:

whole GPU: temperature, utilization, clocks, power;
memory: size, errors;
MIG (if enabled): metrics per instance;
NVLink and PCIe: errors and throughput;
power and limits: power cap and reasons for clock reductions.

A common cause of wrong conclusions is missing workload context. High temperature is not always a hardware fault: it can be sustained 100% compute with normal airflow. Conversely, low utilization with high power draw may point to wrong power limits or background processes. So it’s useful to look at several signals together.

Exporters and monitoring architecture: how to tie everything together

In monitoring, the metric itself matters, but so does the path it takes to dashboards and alerts. In a Prometheus-based setup this usually looks like: DCGM on the node reads GPU state, and DCGM-Exporter exposes it as HTTP metrics.

A typical chain is:

node with GPU: NVIDIA driver + DCGM;
exporter: DCGM-Exporter;
collector (Prometheus or compatible agent): scrapes exporters;
metric storage: keeps history;
dashboards and alerts: charts, thresholds, notifications.

To avoid drowning in data, separate metrics by level: per-GPU to catch a specific card’s degradation, per-process to see who uses GPU and takes memory, per-node for context (air inlet temperature, fans, PCIe errors).

Plan labels separately. Good labels save investigation hours. The minimum that usually pays off: hostname, gpu_index and uuid, serial (if available) or an inventory ID, slot/pcie_bus_id, pool/role (training, inference, vdi).

If you deliver clusters turnkey, link these labels to inventory and support. Then an alert shows not just GPU 0 but the specific card in a specific slot and pool.
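The link between labels and inventory can be sketched as a simple lookup. This is a minimal illustration, assuming label names `uuid`, `gpu_index` and `hostname` as listed above; the `InventoryRecord` type, the `INVENTORY` table and its values are hypothetical stand-ins for whatever asset database you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryRecord:
    serial: str
    hostname: str
    slot: str
    pool: str


# Hypothetical inventory keyed by GPU UUID; in practice this would be
# loaded from an asset database rather than hard-coded.
INVENTORY = {
    "GPU-aaaa-1111": InventoryRecord("SN0001", "gpu-node-01", "slot-3", "training"),
}


def describe_gpu(labels: dict[str, str], inventory: dict[str, InventoryRecord]) -> str:
    """Turn metric labels into a human-readable card location for an alert."""
    uuid = labels.get("uuid", "unknown")
    record = inventory.get(uuid)
    if record is None:
        # Fall back to what the metric itself carries.
        index = labels.get("gpu_index", "?")
        host = labels.get("hostname", "?")
        return f"GPU {index} on {host} (uuid {uuid}, not in inventory)"
    return (
        f"GPU {record.serial} in {record.slot} on {record.hostname}, "
        f"pool {record.pool} (uuid {uuid})"
    )
```

Keying on the UUID rather than the index means the description stays correct after a card is moved to another slot, provided the inventory is updated.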

Basic GPU metrics: what to watch first

Start with metrics that most quickly reveal overheating, power limits and hidden performance loss. These signals often appear hours or days before an obvious failure.

Temperature is not a single number. Track GPU core temperature, memory temperature (if available) and hot spot temperature. Sometimes the core looks fine while memory is already overheating due to dust, dried thermal pads or poor airflow in the rack.

Next come power draw and power limits. Comparing power draw against the power limit shows whether the card is throttling because it hit a power cap.

Clocks and the reasons they dropped are more useful than just current frequency. DCGM usually shows why the card lowered clocks: thermal, power or reliability reasons. That quickly narrows the cause.

Utilization is often misleading. High utilization is normal for training or inference. The problem is when, at the same utilization, job runtime grows while clocks drop or throttling increases.

Another early signal is PCIe: throughput and transfer errors (retries, correctable errors). Growing retries on one node at the same load may hint at a bad contact, riser degradation or slot issues.

Keep these in view for a quick start:

temperatures (GPU, hot spot, memory);
power draw vs power limit;
clocks and reasons for throttling;
GPU utilization and memory utilization;
PCIe throughput and errors/retries.

For example, on an S200 Series server you might see memory on one card become 8–10 °C hotter than neighbors at the same load, followed by short thermal throttling events. That is usually a sign to investigate cooling for that position before ECC errors or job failures start.

ECC and memory metrics: how to notice problems before crashes

ECC on GPUs catches memory errors and corrects some of them on the fly. For a data center it is one of the earliest signs of degradation.

The key distinction is between SBE and DBE. SBEs (single-bit errors) are usually corrected and may not immediately affect a job, but their growth almost always indicates something wrong with that card or its operating conditions. DBEs (double-bit errors) are uncorrectable and often end in driver crashes, process failures or required restarts. A DBE almost always counts as an incident.

Besides SBE/DBE counters, monitor retired pages — memory pages taken out of use. Pull driver Xid events from logs as well: a single Xid can be a fluke, but repeated Xid on one GPU together with rising ECC is a clear trend.

How to separate noise from degradation:

look at the increase rate (errors per hour or per day), not only absolute values;
compare GPUs within the same node and across identical nodes: an outlier matters;
correlate spikes with temperature, power and load;
consider history: growth after months of stability is more worrying than occasional errors from day one.

Workload mode affects the picture. Training has higher load and heat, so hidden problems show sooner. During inference errors may accumulate more slowly.

Also consider reboots: some counters may reset after a reboot or driver crash. So focus on growth rate in time series.
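Growth rate with reset handling can be sketched as follows. This is an illustrative helper, not part of DCGM: it assumes you have a list of raw counter samples taken at regular intervals and treats any drop between samples as a reset to zero, which is similar in spirit to how Prometheus `increase()` tolerates counter resets.

```python
def counter_increase(samples: list[float]) -> float:
    """Total growth of a cumulative counter, tolerating resets.

    A drop between consecutive samples is treated as a reset to zero,
    so the new value itself counts as growth since the reset.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


def errors_per_day(samples: list[float], window_hours: float) -> float:
    """Scale the observed increase over the window to a per-day rate."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return counter_increase(samples) * 24.0 / window_hours
```

For example, the samples `[10, 12, 15, 2, 5]` contain one reset; the helper counts 2 + 3 before the reset and 2 + 3 after it, rather than reporting a negative jump.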

Small example

If SBE grows only on GPU-2 and the growth coincides with periods when memory temperature stays higher than usual, and retired pages appear, that's a good candidate for early replacement or workload migration even if DBE hasn't occurred yet.

Throttling and clocks: how to know a GPU is losing performance

Throttling is when the GPU reduces clocks to stay within safe limits. From the outside it looks like everything runs but slower. Utilization may stay high while effective throughput drops.

Common throttling causes in a data center: thermal, power limits, reliability modes, and cases where the device goes into a power-saving mode at low load. In DCGM this is visible via throttling reason counters and current graphics/memory clocks.
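If the throttling reasons arrive as a single bitmask, decoding it makes dashboards and alert text easier to read. The sketch below assumes the mask follows the bit layout of NVML's `nvmlClocksThrottleReason*` constants; verify the values against your driver and DCGM version before relying on them. The short reason names are chosen here for illustration.

```python
# Bit values assumed to match NVML's nvmlClocksThrottleReason* constants.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle_reasons(mask: int) -> list[str]:
    """Return the names of all reason bits set in the mask, lowest bit first."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]
```

A mask with only `gpu_idle` set corresponds to the power-saving case described above, while thermal or power-cap bits under sustained load deserve attention.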

Clock capping and low power limits

Sometimes the issue is not hardware but settings. Hard clock caps and too-low power limits can make a cold GPU slow. If you see frequent power-throttle events at normal temperature, check power limits and the profile. This is especially relevant after driver updates, firmware changes, PSU swaps or moving a server to another rack.

How to distinguish energy-saving behavior from cooling degradation

Normal energy saving looks like low load, low power draw, clocks fluctuating but quickly returning to expected values when load increases.

Cooling degradation looks different: stable load, rising temperature, repeated thermal throttling, and clocks staying below usual for the same workload.

Look at duration and recurrence rather than single spikes. It helps to evaluate the share of time in throttling over 5–15 minutes and compare with a baseline for that GPU model and workload. If the same server starts thermal-throttling more often on identical tasks, it’s an early sign: dust, worsened airflow, failing fans, or dried thermal paste.
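The share of time in throttling can be computed from a cumulative violation-time counter. A minimal sketch, assuming the counter is reported in microseconds (check the unit for your DCGM version) and that a baseline share for the GPU model and workload is already known; the `factor` and `floor` defaults are illustrative, not recommended values.

```python
def throttle_share(counter_start: float, counter_end: float,
                   window_seconds: float, units_per_second: float = 1e6) -> float:
    """Fraction of the window spent throttled.

    units_per_second=1e6 assumes the counter counts microseconds.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    delta = counter_end - counter_start
    if delta < 0:
        # Counter went backwards: treat it as a reset to zero.
        delta = counter_end
    return min(delta / (window_seconds * units_per_second), 1.0)


def exceeds_baseline(share: float, baseline_share: float,
                     factor: float = 2.0, floor: float = 0.01) -> bool:
    """Flag a share well above baseline, ignoring very small shares."""
    return share > max(baseline_share * factor, floor)
```

With a 15-minute window, 45 seconds of accumulated violation time gives a share of 5%, which would be flagged against a 1% baseline.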

Step-by-step: how to set up DCGM monitoring and alerts

Start by asking what exactly you want to catch. There are typically three goals: incidents (a GPU truly fails), degradation (performance drops without an obvious failure) and capacity (are cluster resources enough?).

1) Collect DCGM metrics and agree on labels

Install DCGM and the exporter on every GPU node so metrics are collected consistently. Agree on labels upfront or comparisons will be hard. Commonly useful labels: cluster, site, node, gpu_uuid, gpu_index, model, driver, pool.

Then pick a base metric set: temperature, power, utilization, clocks, ECC, throttling reasons. Better to track fewer metrics consistently across all nodes than many inconsistent ones.

2) Set thresholds and time windows

Thresholds without windows produce noise. Usually three horizons are enough: 5 minutes (fast incident), 30 minutes (sustained degradation), 24 hours (trend). A brief temperature spike is one thing; sustained exceedance is a cooling problem.
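The difference between a spike and a sustained exceedance can be made explicit in a few lines. This sketch assumes samples are `(timestamp_seconds, value)` pairs; the horizon names and the 80% `min_fraction` are illustrative choices, while the window lengths mirror the three horizons above.

```python
# Window lengths in seconds for the three horizons described above.
HORIZONS = {"fast": 5 * 60, "sustained": 30 * 60, "trend": 24 * 60 * 60}


def sustained_exceedance(samples: list[tuple[float, float]], now: float,
                         window_seconds: float, threshold: float,
                         min_fraction: float = 0.8) -> bool:
    """True if at least min_fraction of samples in the window exceed threshold."""
    recent = [value for ts, value in samples if now - window_seconds <= ts <= now]
    if not recent:
        return False
    above = sum(1 for value in recent if value > threshold)
    return above / len(recent) >= min_fraction


def evaluate_horizons(samples: list[tuple[float, float]], now: float,
                      threshold: float) -> dict[str, bool]:
    """Evaluate the same threshold over each horizon."""
    return {
        name: sustained_exceedance(samples, now, window, threshold)
        for name, window in HORIZONS.items()
    }
```

A GPU that has been hot only for the last five minutes trips the fast horizon but not the longer ones, which is the distinction between an incident check and a degradation trend.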

3) Dashboards: node and pool

Create two kinds of dashboards: per-node (all GPUs and their differences) and per-pool (distribution of temperatures, clocks and errors across nodes). This makes it easier to spot an outlier GPU.

4) Alerts without flapping and with clear routing

To make alerts useful instead of annoying:

add a firing delay (10–15 minutes for many degradation rules);
separate levels (warning and critical);
group alerts by node and gpu_uuid to avoid dozens of identical notifications;
send ECC and sudden clock drops to the on-call team, and daily trends to planned work queues;
predefine the owner: data center ops, ML team, vendor/integrator.
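Grouping and routing from the list above can be sketched in plain Python to make the intent concrete. In a Prometheus setup this logic normally lives in Alertmanager configuration; the field names (`node`, `gpu_uuid`, `kind`, `alertname`), the alert kinds and the queue names here are hypothetical.

```python
from collections import defaultdict

# Hypothetical alert kinds that should page the on-call team.
ON_CALL_KINDS = {"ecc", "clock_drop"}


def group_alerts(alerts: list[dict[str, str]]) -> dict[tuple[str, str], list[str]]:
    """Collapse alerts into one group per (node, gpu_uuid)."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for alert in alerts:
        groups[(alert["node"], alert["gpu_uuid"])].append(alert["alertname"])
    return dict(groups)


def route(alert: dict[str, str]) -> str:
    """ECC and sudden clock drops go to on-call, everything else to planned work."""
    return "on-call" if alert["kind"] in ON_CALL_KINDS else "planned-work"
```

Three alerts on the same card then produce one notification group instead of three separate pages.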

Alert examples for temperature, ECC and throttling

Below are example Prometheus alerting rules that assume metrics are collected via DCGM-Exporter. The logic is simple: a warning catches early problems, critical means immediate action is needed.

groups:
- name: gpu-alerts
  rules:
  # Temperature: different windows for warning/critical
  - alert: GPU_Temperature_Warning
    expr: max_over_time(DCGM_FI_DEV_GPU_TEMP[10m]) > 80
    for: 10m
    labels: {severity: warning}
    annotations:
      summary: "GPU temperature is elevated"
      description: "Check node cooling (airflow, filters, fans), load and neighboring GPUs. Compare with the other cards in the server."

  - alert: GPU_Temperature_Critical
    expr: max_over_time(DCGM_FI_DEV_GPU_TEMP[2m]) > 90
    for: 2m
    labels: {severity: critical}
    annotations:
      summary: "GPU is overheating"
      description: "Urgent: reduce load or migrate jobs, check rack airflow and fan condition. Risk of throttling and emergency shutdowns."

  # ECC: DBE fires immediately, SBE on growth over a period
  - alert: GPU_ECC_DBE_Detected
    expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[5m]) > 0
    labels: {severity: critical}
    annotations:
      summary: "ECC DBE error on GPU"
      description: "DBE usually leads to crashes. Record the GPU and server, check driver logs, and plan diagnostics and replacement."

  - alert: GPU_ECC_SBE_Growing
    expr: increase(DCGM_FI_DEV_ECC_SBE_VOL_TOTAL[6h]) > 100
    for: 30m
    labels: {severity: warning}
    annotations:
      summary: "ECC SBE errors are growing"
      description: "An early sign of degrading memory or operating conditions. Compare growth across GPUs, check temperature and power, and schedule a test."

  # Throttling: share of time under limits and a persistent power cap
  # Dividing by 1e6 assumes the violation counters are in microseconds; verify the unit for your DCGM version.
  - alert: GPU_Thermal_Throttling_Ratio
    expr: (rate(DCGM_FI_DEV_THERMAL_VIOLATION[5m]) / 1e6) > 0.05
    for: 10m
    labels: {severity: warning}
    annotations:
      summary: "GPU frequently enters thermal throttling"
      description: "For more than 5% of the last few minutes the GPU has limited clocks because of temperature. Check cooling and placement density."

  - alert: GPU_Power_Cap_Constant
    expr: (rate(DCGM_FI_DEV_POWER_VIOLATION[15m]) / 1e6) > 0.10
    for: 30m
    labels: {severity: warning}
    annotations:
      summary: "GPU frequently hits the power cap"
      description: "If this is not an expected setting, check power limits, power supplies and load distribution across GPUs."

Combined alerts are useful when a single symptom isn’t conclusive. For example, high temperature plus rising thermal throttling almost always means a cooling issue. Rising ECC counts (especially SBE) together with repeated Xid events in driver logs warrant faster removal of the card from critical workloads.

Include not only what happened but the first step in the alert: where to confirm (metrics, driver logs), what to do in 5 minutes (reduce load, check airflow, compare with neighbor GPUs), how to assess scope (single GPU, server, rack) and when to escalate (DBE immediately, repeated Xid, persistent throttling).

Realistic example: finding one GPU’s degradation

On one node, epoch time suddenly rose: training became 12–18% slower even though load and code were unchanged. The dashboard showed one GPU heating more than the others.

The metrics looked like this: the faulty GPU ran at about 83–86 °C while its neighbors stayed at 70–74 °C. Thermal throttling events increased, and SM and memory clocks dropped even though utilization stayed high.

This differs from power cap: with power cap you see power usage hit the power limit, throttling reasons are dominated by PWR, and temperature may remain normal.
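The thermal versus power-cap distinction can be written down as a small decision function. This is a sketch under stated assumptions: the inputs are already aggregated over a window, and the 5% share and 97% of power limit cut-offs are illustrative defaults rather than values taken from DCGM.

```python
def classify_throttling(temp_c: float, normal_temp_max_c: float,
                        power_w: float, power_limit_w: float,
                        thermal_share: float, power_share: float,
                        min_share: float = 0.05,
                        near_limit: float = 0.97) -> str:
    """Return 'none', 'cooling', 'power_cap' or 'mixed' for one GPU."""
    thermal = thermal_share >= min_share
    power = power_share >= min_share
    if not thermal and not power:
        return "none"
    hot = temp_c > normal_temp_max_c
    at_cap = power_w >= near_limit * power_limit_w
    if thermal and hot and not power:
        return "cooling"
    if power and at_cap and not hot and not thermal:
        return "power_cap"
    # Both kinds present, or symptoms that do not line up: investigate manually.
    return "mixed"
```

A card at 85 °C with a 12% thermal share classifies as a cooling problem, while a card at 68 °C drawing 298 W against a 300 W limit with a 20% power share classifies as a power cap.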

A second signal appeared: correctable ECC errors slowly increased on the faulty GPU, and there were isolated uncorrectable events on two runs. Even correctable ECC errors can accompany slowdowns: more read retries, higher latency, occasional driver instability.

Actions taken:

checked fans and airflow (RPM, filters, flow direction, inlet temperature);
stopped jobs and cleaned the node of dust;
checked heatsink contact and thermal interface, replaced thermal paste;
moved GPU to a neighboring slot to exclude local chassis heating;
if ECC continued rising after temperature normalized, planned GPU replacement.

The result was confirmed by before/after trends: temperature dropped to 72–75 °C, the thermal throttling share fell to near zero, clocks stabilized, and epoch time returned to its previous level. ECC counters stopped growing under load. This showed that cooling was the root cause, not memory degradation.

Common mistakes and pitfalls in GPU monitoring

The most common mistake is reducing monitoring to a GPU utilization chart and stopping there. Utilization and power show what the card is doing but tell little about health. Degradation is more often visible in temperature, clocks under load, memory errors and repeated resets.

Second pitfall: identical thresholds for everything. A rack can contain different NVIDIA models, different workload profiles (training, inference, VDI) and different ventilation. Setting a single threshold like 80 °C = critical without model/context either creates many false alerts or misses real problems.
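Per-model thresholds can be kept in a small lookup with a fallback. In this sketch the default values reuse the 80/90 °C figures from the alert examples; the model names, workload names and override values are hypothetical placeholders, not vendor recommendations.

```python
# Defaults reuse the 80/90 °C values from the alert examples.
DEFAULT_THRESHOLDS = {"warning_c": 80.0, "critical_c": 90.0}

# Hypothetical overrides keyed by (model, workload); None matches any workload.
OVERRIDES = {
    ("model-a", "inference"): {"warning_c": 75.0, "critical_c": 85.0},
    ("model-b", None): {"warning_c": 83.0, "critical_c": 92.0},
}


def thresholds_for(model: str, workload: str) -> dict[str, float]:
    """Most specific match wins: (model, workload), then (model, any), then default."""
    for key in ((model, workload), (model, None)):
        if key in OVERRIDES:
            return OVERRIDES[key]
    return DEFAULT_THRESHOLDS
```

Keeping the table in one place also makes it easy to review thresholds when a new GPU model or workload profile is added to a rack.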

Third problem: metrics without hardware identity. If you don’t record uuid/serial and the server-slot-GPU mapping, it’s hard to find which card is degrading during an incident. This is critical when cards are swapped during maintenance.

Another pain is flapping: alerts that appear and disappear. Short throttling at every peak quickly becomes noise and teams stop reacting.

Finally, lack of history. Without 30–90 days of data you won’t see trends: temperature rising 1–2 °C per week, clocks hitting limits more often, ECC errors becoming regular.
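A slow drift of this kind is easy to quantify once history exists. The sketch below fits an ordinary least-squares line to daily mean temperatures and reports the slope per week; it assumes one value per day at a comparable load, which is a simplification of real workloads.

```python
def slope_per_week(daily_means: list[float]) -> float:
    """Least-squares slope of daily values, expressed per 7 days."""
    n = len(daily_means)
    if n < 2:
        raise ValueError("need at least two daily values")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_means) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(daily_means))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var * 7
```

Thirty days of data rising by 0.2 °C per day yields a slope of about 1.4 °C per week, which falls in the 1–2 °C range mentioned above.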

Helpful practices:

separate performance and health alerts; create distinct alerts for temperature, throttling and ECC;
set thresholds by GPU model and workload, consider inlet air temperature;
bind metrics to serial numbers and installation location;
add delays and conditions to reduce flapping;
retain metrics long enough to spot slow degradation and plan replacements.

Short checklist: what to check in 10 minutes

To quickly assess a node, compare the suspect server with a normal neighbor (same rack or at least the same model). Comparison often beats absolute thresholds.

Check five things:

GPU temperature and, if available, memory temperature: is there sustained overheating under load;
throttling: is the share of time in thermal or power limits growing;
ECC and retired pages: no DBE, and SBE not increasing every day;
Xid: no repeated events on the same GPU;
group skew: one card or one node should not deviate from others by temperature, ECC or throttling.

Rule of thumb: if one card is 8–12 °C hotter than neighbors at the same load and SBE and throttling also rise, move the node to a separate investigation queue before it affects production.
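The rule of thumb can be expressed as a comparison against the median of the other cards. This is a sketch with assumed inputs: per-GPU temperatures at comparable load, plus sets of GPU identifiers where SBE and throttling are already known to be rising. The 8 °C default is the lower bound of the range above.

```python
from statistics import median


def hotter_than_neighbors(temps: dict[str, float], delta_c: float = 8.0) -> list[str]:
    """GPUs at least delta_c hotter than the median of the other GPUs."""
    flagged = []
    for gpu, temp in temps.items():
        others = [t for other, t in temps.items() if other != gpu]
        if others and temp - median(others) >= delta_c:
            flagged.append(gpu)
    return flagged


def needs_investigation(temps: dict[str, float], sbe_rising: set[str],
                        throttling_rising: set[str], delta_c: float = 8.0) -> list[str]:
    """Apply the full rule: hotter than neighbors AND rising SBE AND rising throttling."""
    return [
        gpu for gpu in hotter_than_neighbors(temps, delta_c)
        if gpu in sbe_rising and gpu in throttling_rising
    ]
```

Using the median of the other cards rather than the mean keeps one hot outlier from shifting the reference it is compared against.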

Next steps: standardize monitoring and support

Once metrics and alerts run, the main risk is everyone looking at things differently. You need a single standard so teams find degradation quickly, agree on thresholds and don't waste time in incidents.

First define metrics important for your cluster and SLA. For training, power and thermal throttling, ECC errors and clock drops are usually critical. For VDI and graphics, overheating, fan speed and memory stability matter more.

Then formalize the reaction process: an alert must have an owner and a clear resolution path — who gets notified, who confirms the issue, who can remove a GPU from the pool and in what time.

Minimum to document as a standard:

dashboard set: temperature, clocks, power, utilization, memory, ECC, throttling reasons;
alert set: critical (act now) and warnings (observe and schedule work);
a short runbook: what to check in 5 minutes, which logs to collect, when to escalate;
quarantine policy: how to mark a suspicious GPU and prevent new jobs on it;
regular trend reviews to catch slow degradation and plan replacement.

Preventive work often beats fine-tuning thresholds: cleaning, airflow control, fan checks and thermal interface inspection. If trends show rising temperature at the same load or increasing correctable ECC, schedule maintenance in advance.

If you need help unifying monitoring, operational standards and 24/7 support, this is typically the domain of a system integrator. GSE.kz (gse.kz) as a manufacturer and data center integrator can help build a GPU platform, organize support and set a unified monitoring standard for operations.