Why pick specific metrics instead of collecting everything

Trying to monitor every metric at once almost always ends with missing the important signals. There are too many charts, attention gets scattered, and during an outage the team wastes time chasing causes among hundreds of similar signals.

The second problem is alerts. When alerts fire for every small event, the team quickly stops trusting them. False alarms waste time, and real incidents drown in noise. The result is the same: the business sees downtime, users see “everything is slow.”

Useful infrastructure monitoring metrics are those that help make decisions and restore service quickly. Usually they answer simple questions:

Does this metric affect the user right now (speed, availability, errors)?
Can it indicate degradation early, not only full failure?
Is it clear who should act and what to do if the metric worsens?
Does the metric have a baseline so you can distinguish normal from problematic?

Example: a rack of servers (for example GSE S200 class) can expose dozens of hardware sensors. But without disk latency, CPU load, and network error metrics you won't understand why accounting complains about slow 1C even though "the hardware is alive."

There is also an important conceptual difference. Monitoring answers "what failed and when." Observability is broader: it helps understand "why it failed" by adding logs and traces to metrics.

How to decide which metrics you really need

Start not with charts but with what matters to the user: availability, response time and error rate. If a metric doesn't help answer "is the user satisfied or not", it often turns into noise.

A practical approach: separate signals (symptoms) from causes. Alert on symptoms because they reflect the problem "here and now." Causes are useful to have for investigation, but they shouldn't wake someone at night.

Signals that almost always work

A small set of universal signals usually detects degradation before complaints begin:

Latency (service response time, disk latency, network RTT)
Traffic (in/out, requests per second)
Errors (5xx, timeouts, job failures)
Saturation (CPU, memory, disk queue, link utilization)

It's helpful to break the service into layers: infrastructure (servers, network, storage), platform (DB, queues, virtualization) and the application itself.

If users complain about "slowness," put alerts on rising latency and errors at the service level, and use CPU, iowait and storage latency as hints where to look for the root cause.

Servers: a minimal metric set without extras

It’s easy to drown in hundreds of server metrics. It's more practical to keep a short set that answers two questions: is the server coping with the load, and is it hitting resource limits?

For CPU watch not only average utilization but also spikes. Short bursts to 100% can be normal. Persistent spikes along with rising load average often mean there are more tasks than the CPU can handle, or processes are waiting on disk. A healthy sign is stable load that does not consistently exceed the number of cores.
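As a rough sketch of the "sustained load versus core count" idea, the check below flags a host only when every sample in a window exceeds the number of cores, so a short burst does not count. The function names and the `ratio` parameter are hypothetical, chosen for illustration.

```python
def load_per_core(load_avg, cores):
    """Normalize a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def cpu_is_overloaded(load_samples, cores, ratio=1.0):
    """Return True only if every sample in the window exceeds ratio * cores.

    A single burst is ignored: one sample at or below the limit is enough
    to treat the load as not sustained. The ratio of 1.0 is an assumption.
    """
    if not load_samples:
        return False
    return all(load_per_core(sample, cores) > ratio for sample in load_samples)


# On Unix-like systems a live sample could come from os.getloadavg()[0]
# and the core count from os.cpu_count().
```

For example, `cpu_is_overloaded([9.0, 10.5, 12.0], cores=8)` is true, while a window containing one sample of 3.0 is not.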

For memory it's important to distinguish real shortage from cache usage. High used memory alone is not scary if cache grows and the system doesn't go into swap. Bad signs are sustained high swap used, noticeable swap in/out and degraded response.

Disk is a frequent cause of “everything works but is slow.” Monitor utilization, IOPS and, most importantly, latency and queue depth. If latency rises at moderate IOPS, it's a signal of overload, storage array issues, or filesystem problems.

From the OS, keep process-level metrics: open file descriptors, threads, service restarts. These help catch leaks and recurring crashes.

Uptime and reboots are not KPIs, but they provide context. A spike in errors right after reboot often points to misconfiguration or startup problems.

Typical server alerts usually include a minimal set:

high swap and active swap in/out
rising disk latency or queue depth above usual
disk filling with a forecast of "will run out soon"
frequent restarts of critical processes
load average that steadily rises along with service degradation

Network: what to measure to catch degradation

Network rarely "fails completely" but often degrades: pages load slower, calls drop, databases respond with delay. So network monitoring must show not only availability but quality.

Start with interfaces: link up/down, packet loss, error counters. A link being "up" doesn't mean it's usable. 0.5–1% packet loss is already noticeable for voice, VPN and remote desktops.

Latency and jitter are sometimes more important than throughput. A link can be 30% utilized, but with port queuing jitter rises and users feel slowness. It’s useful to measure RTT to key points (gateway, core, upstream provider) and delay stability.
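One simple way to quantify delay stability is the mean absolute difference between consecutive RTT samples. The sketch below uses that simplified definition, which is an assumption made here and not the smoothed estimator from RFC 3550; `jitter_ms` is a hypothetical helper name.

```python
from statistics import mean


def jitter_ms(rtt_samples_ms):
    """Mean absolute difference between consecutive RTT samples, in ms.

    Simplified definition for illustration; real tools may smooth the value.
    """
    if len(rtt_samples_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(rtt_samples_ms, rtt_samples_ms[1:])]
    return mean(diffs)
```

Two links with the same average RTT can differ sharply here: samples of 20, 21, 20, 21 ms give a jitter of 1 ms, while 10, 30, 10, 30 ms give 20 ms.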

For link utilization watch in/out, peaks and the 95th percentile. The average can be misleading: overloads come as short bursts. The 95th percentile shows whether capacity is enough "almost always."

When investigating network degradation, first check interface error counters (including CRC errors), packet drops and queue drops, rising TCP retransmits at border nodes, and asymmetry (inbound fine, outbound congested). Collisions are rare today, but a duplex mismatch can still cause them.

Don’t forget DNS and DHCP. Users perceive issues here as "internet is down" even when the network is up: slow DNS responses, delayed DHCP leases, or exhausted pools. A good early sign is rising DNS response time and an increase in failed queries.

Storage: capacity, latency and health

Storage often appears "normal" until space runs out or write latency increases. Choose storage metrics that answer three questions: will there be enough space tomorrow, is it slow now, and is hardware failing?

The minimal set that actually helps

Start with capacity and its dynamics. Not just "how much used" but the growth rate. Seeing the trend lets you predict when you'll hit limits and avoid nighttime emergencies.
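A forecast of this kind can be sketched with a least-squares trend over daily usage samples. The function name, the daily sampling interval and the choice of a straight-line fit are assumptions for illustration; real growth is often less regular.

```python
def days_until_full(used_gb_by_day, capacity_gb):
    """Estimate days until a volume is full from daily usage samples.

    used_gb_by_day: one sample per day, oldest first.
    Returns None when usage is flat or shrinking.
    """
    n = len(used_gb_by_day)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(used_gb_by_day) / n
    # Least-squares slope: growth in GB per day.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_gb_by_day))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    return (capacity_gb - used_gb_by_day[-1]) / slope
```

With samples of 700, 710, 720, 730 and 740 GB on an 800 GB volume, the trend is 10 GB per day and the estimate is 6 days, which is the kind of number an early-warning alert can compare against a threshold.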

Next comes performance. For users “everything is slow” usually means rising latency rather than falling IOPS. Monitor read/write latency and queue depth: when queues grow, services wait on disk even at moderate load.

Practical minimum for storage:

pool or volume capacity: used, free, trend and forecast to full
read/write latency and average queue depth
IOPS and throughput, but only together with latency
array state: degraded, rebuild, hot spare, disk errors
thin provisioning and risk of pool overcommit

Backups and “hidden” degradations

Monitor backups not only as started/finished events but by their outcome and their impact. Track duration, success/failure and whether they finish within their window.

A typical case: a nightly backup started taking longer and overlapped with the morning peak, causing write latency to grow 2–3×. Users complained about slowness while CPU and network looked healthy. Alerts on rising latency and backup duration would have signaled the problem earlier than user reports.

Services and applications: metrics the user sees

Users don't care about your CPU and memory. They see simple things: did the page load, how fast, and were there errors. So infrastructure metrics should be complemented by service-level metrics that reflect real experience.

Availability is best measured with external checks (synthetic requests), not only by whether a process is alive. A process can be running but return 500 or hang on requests.
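A minimal synthetic check might look like the sketch below, which uses only the standard library. The function names, the 2-second latency limit and the "2xx or 3xx counts as healthy" rule are assumptions for illustration, not a recommendation for any particular service.

```python
import time
import urllib.error
import urllib.request


def is_healthy(status, elapsed_s, max_latency_s=2.0):
    """A response counts as healthy only if it succeeded and was fast enough."""
    return status is not None and 200 <= status < 400 and elapsed_s <= max_latency_s


def probe(url, timeout_s=5.0, max_latency_s=2.0):
    """Send one GET request from outside the service and classify the result."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error such as 500
    except OSError:
        status = None  # connection refused, DNS failure, timeout and similar
    elapsed_s = time.monotonic() - started
    return {
        "status": status,
        "elapsed_s": elapsed_s,
        "healthy": is_healthy(status, elapsed_s, max_latency_s),
    }
```

The point of the design is that a running process is not enough: a 500 response or a reply slower than the limit both count as unhealthy.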

For latency measure p50, p95 and p99. The average often lies: most requests can be fast while the long tail causes complaints and support tickets.
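To illustrate why the average misleads, here is a small nearest-rank percentile sketch. The helper name and the sample data are hypothetical; monitoring systems typically compute percentiles from histograms, so treat this as an illustration of the idea only.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample: 90 fast requests and 10 slow ones, in milliseconds.
latencies_ms = [120] * 90 + [3000] * 10
summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
```

In this sample the mean is 408 ms and p50 is 120 ms, both of which look acceptable, while p95 and p99 are 3000 ms and expose the long tail that users complain about.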

Measure errors separately for 4xx and 5xx, plus timeouts and retries. Sometimes 5xx counts are low but timeouts are high, and the user experience is still poor.

Saturation is visible in queues and limits: task queue length, thread pool usage, active DB connections, exhaustion of limits.

To prevent hidden dependencies, add metrics for databases, queues and external APIs: response time, error rate, processing lag. For example, a service running on S200 servers might slow down due to slow DB queries. This shows up in p95 and rising timeouts even if CPU is mostly idle.

Platform: DB, queues, cache, virtualization and certificates

If you only watch CPU and RAM, you can easily miss platform-level issues. It's useful to see how the database, queues, cache, virtualization layer and even certificates behave.

Database, queues and cache

Databases commonly degrade due to growing connections, locks and slow queries. A spike in active connections and more lock waits often explains why a service is "alive" but responding much slower.

For queues and brokers monitor not only depth but lag and processing rate. A small queue can still have growing latency if consumers process messages slowly.
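The relationship between depth and waiting time can be sketched as depth divided by processing rate. The function name is hypothetical, and the estimate assumes a roughly constant consumer rate with no new arrivals.

```python
def drain_time_s(depth, consumed_per_s):
    """Estimate how long the current backlog takes to process, in seconds.

    Assumes a steady consumer rate; returns infinity if consumers are stalled.
    """
    if depth < 0:
        raise ValueError("depth must not be negative")
    if consumed_per_s <= 0:
        return float("inf")
    return depth / consumed_per_s
```

A queue of 50 messages processed at 0.5 messages per second means a wait of about 100 seconds, while 10,000 messages at 2,000 per second clear in about 5 seconds, so depth alone says little.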

Cache gives fast hints: hit rate, evictions, latency and size. If hit rate drops and evictions rise, users notice slowness even if CPU looks fine.

Virtualization, containers and certificates

For containers and virtualization watch restarts and resource limits. A common story: containers get CPU throttling or memory pressure while host-level charts look normal.

Keep at least these checks to avoid surprises:

TLS certificate expiry (better weeks in advance)
rise in TLS/handshake errors
number of container restarts in a period
queue lag and processing time
DB locks and share of slow queries
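For the certificate expiry check in the list above, a sketch can work from the `notAfter` string that Python's `ssl` module reports for a peer certificate (for example via `SSLSocket.getpeercert()`). The function names and the 30-day and 7-day thresholds are assumptions for illustration.

```python
import ssl


def days_until_expiry(not_after, now_epoch_s):
    """Days left before a certificate expires.

    not_after: the certificate's notAfter string, e.g. "Jan  5 09:34:43 2018 GMT".
    now_epoch_s: current time as seconds since the epoch (e.g. time.time()).
    """
    expires_epoch_s = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch_s - now_epoch_s) / 86400


def expiry_level(days_left, warn_days=30, critical_days=7):
    """Map remaining days to an alert level; thresholds are illustrative."""
    if days_left <= critical_days:
        return "critical"
    if days_left <= warn_days:
        return "warning"
    return "ok"
```

Passing the current time in as a parameter keeps the check easy to test and lets the same function run against a list of stored certificates.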

Example: in an environment with GSE S200 racks a service became slow. The cause was not CPU but DB locks and a drop in cache hit rate after memory limits changed for containers.

Logs and traces: what to add to metrics

Metrics show that a resource is "alive" but don't always explain why the service is slow for the user. Typical case: CPU and memory are fine, network has no loss, disks are green, but a web page takes 15 seconds to load.

The minimum logs that really help

Start with logs that answer "what exactly failed" and "when." The most useful entries are those that can be tied to a request and a timestamp.

A minimal log set usually includes:

errors and exceptions with code and short message
timeouts (to DB, external APIs, queues)
service starts/stops, restarts, crashes
long operations (e.g. requests longer than 1–2 seconds)
config changes and deploys (who and when)

For metrics and logs to converge in one incident you need simple correlation: same timezone, accurate timestamps and a common identifier (request_id, trace_id). Then you can see: at 10:14 latency rose and logs show timeouts to a specific database at 10:14.
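One possible way to get such correlatable entries is a JSON log formatter that always writes a UTC timestamp and a request identifier. The class name and field names below are hypothetical; the `request_id` is assumed to be passed through the standard `extra` argument of the logging call.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format log records as one JSON object per line with UTC timestamps."""

    def format(self, record):
        entry = {
            # UTC avoids timezone mismatches between hosts and dashboards.
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # Set by the caller via extra={"request_id": ...}; None if absent.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("billing")  # hypothetical service name
logger.addHandler(handler)
```

A call such as `logger.error("DB timeout", extra={"request_id": "req-42"})` then produces a line that can be matched to a latency spike by timestamp and to a single request by its identifier.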

When traces are most useful

Traces are especially helpful in microservices and sequences like "service -> DB -> external provider", where you need to know which step takes time. They show the request path across components and help distinguish "slow due to code" from "slow due to dependencies."

How to set up alerts step by step

Alerts work only when tied to meaning, not to every signal. First agree on what is considered a business or user-impacting problem.

A convenient setup in steps:

Map services: what depends on what and what is critical (payments, mail, records in EMIS, employee portal).
Pick 1–2 SLIs per service (for example latency, error rate or availability) and look at 2–4 weeks of history. Set thresholds based on real data, not round numbers.
Split alerts into two types: page (wakes the on-call, needs immediate action) and ticket (can be handled in business hours).
Add an "impact on users" check: a CPU alert is not a page until errors or response time rise.
Configure escalations, on-call and a "quiet window" for planned work to avoid false alarms.

Example: server load rises but users don’t complain. In this case create a ticket with the trend for investigation; keep page alerts only for combined signals (5xx errors + rising p95 latency).
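The routing rule from this example can be sketched as a small function. The name `classify_alert` and the boolean inputs are hypothetical, and the rule follows the combined-signal variant described in the example (errors together with rising p95).

```python
def classify_alert(cpu_high, errors_rising, p95_rising):
    """Route a set of signals to 'page', 'ticket' or 'none'.

    Page only when user-facing symptoms combine; a cause such as high CPU
    on its own becomes a ticket for business hours.
    """
    if errors_rising and p95_rising:
        return "page"
    if cpu_high or errors_rising or p95_rising:
        return "ticket"
    return "none"
```

High CPU without user impact therefore yields `"ticket"`, and the same CPU reading together with rising 5xx errors and rising p95 yields `"page"`.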

Final step: test alerts in drills. Check whether the message is clear, has an owner, and whether you can see what to do within 5 minutes.

How to reduce noise and not miss real problems

Noise appears when one failure generates dozens of notifications and thresholds don’t reflect real life: backups, maintenance windows, seasonal peaks. The goal is simple: one incident — one clear signal, and only when the problem persists.

Quick wins often come from these rules:

Deduplication and grouping: combine alerts by service, host and failure type.
Delay before alerting: fire only if the state lasts N minutes or N consecutive checks.
Separate levels: warning for trends and critical for outages.
Time/event filters: special rules for backup windows, maintenance and night hours.
Dynamic thresholds where static ones hurt: e.g. for network traffic or CPU in bursty services.

A small example: nightly backups fill a disk to 95% and trigger a flood of I/O alerts. Solution: keep a warning for rising latency, but make the critical alert fire only if high latency persists for, say, 10 minutes outside the backup window. This preserves real alerts and stops reacting to expected processes.
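A rule like "critical only if latency stays high for N checks outside the backup window" could be sketched as below. The function names, the one-check-per-minute assumption and the window times are illustrative, not taken from any specific alerting tool.

```python
from datetime import datetime, time


def in_window(moment, start, end):
    """True if a time of day falls in [start, end); handles windows crossing midnight."""
    if start <= end:
        return start <= moment < end
    return moment >= start or moment < end


def should_fire_critical(samples, threshold_ms, hold_checks, backup_start, backup_end):
    """Fire only if the last hold_checks samples are all high and all outside the window.

    samples: list of (datetime, latency_ms), oldest first, one per check interval.
    """
    if hold_checks < 1:
        raise ValueError("hold_checks must be at least 1")
    recent = samples[-hold_checks:]
    if len(recent) < hold_checks:
        return False
    return all(
        latency > threshold_ms and not in_window(ts.time(), backup_start, backup_end)
        for ts, latency in recent
    )
```

With one check per minute and `hold_checks=10`, ten consecutive high samples in the afternoon fire the alert, while the same samples during a 01:00 to 04:00 backup window, or a series with a single normal reading, do not.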

A useful check: if an alert doesn’t say what to do next (where to look and how urgent), it is almost always unnecessary.

Typical mistakes in metrics and alerts

The most common problem is not a lack of data but that the data doesn't answer one question: did things get worse for the user or not? Then alerts either fire constantly or stay silent until a real outage.

Mistake number one — alerting only on CPU, e.g. "80% and above." High utilization alone isn't always bad. More important is whether latency rose, queue length grew, or timeouts started. And vice versa: CPU can be normal while the service is slow due to disk or network waits.

Second mistake — monitoring current capacity but ignoring trends. Storage doesn't run out suddenly. "Suddenly" happens when no one watched the growth rate and no early warning was set.

Third mistake — checks that are too general. Ping and open port exist but the application inside is frozen or returns 500. Monitoring shows "green" while users complain.

Fourth and fifth mistakes often come together: an alert has no owner and no post-incident review. The same problems repeat and alert rules don't improve.

A simple example: an alert "CPU 85%" arrives at night. The on-call restarts a service, load drops, but it repeats an hour later. If response time and queue metrics were present, it would be clear the cause was slow storage, not the CPU.

To fix this without complexity, keep basic rules:

Tie alerts to user impact: latency, errors, queue, timeouts.
Add early warnings on capacity and growth rate, not just "5% left."
Check scenarios, not just "host alive" (a request, transaction or key operation).
Assign an owner and a clear runbook in the alert description.
After an incident adjust thresholds and conditions so it won't repeat.

Short checklist: what to check in 30 minutes

In half an hour you can tell whether monitoring helps or just "draws charts." This checklist ensures basic risks across hardware and services are covered and alerts don't become background noise.

Check five things:

Coverage: are there metrics for at least one key indicator per area — servers (CPU, memory, disk), network (loss, latency), storage (latency and fill) and 3–5 critical services (availability and response time)?
Main alerts: are 5–10 signals identified that definitely require action (e.g., service down, disk nearly full, sudden storage latency spike, high error rate)?
Capacity with buffer: are early warnings configured instead of only "when everything breaks"? An alert "7 days until full" is more useful than "disk 100%."
Reality test: did you run a test incident and see the chain from alert to action? A simple test is temporarily stopping a noncritical service or filling part of a disk.
Instructions: does each alert have a short checklist — where to verify the problem, first steps, when to escalate?

If at least two items fail, spend a week reducing metrics and alerts to a clear minimum. This helps monitoring become an actionable tool, not a noise source.

Practical example: "everything works but is very slow"

Symptom: users report the service "thinks" for 5–10 seconds but doesn't crash. The dashboard is green: few errors, server CPU at 30–40%, memory without swap. It does not look like a hardware problem.

The first signal comes from metrics closest to the user: p95 response time rises while p50 stays almost unchanged. This often means some requests hit a rare but heavy delay. Errors are low because timeouts haven't happened yet.

A quick check along the chain helps:

network: RTT to DB or storage rises, micro packet loss appears
storage: read/write latency increases, disk queue grows
DB: p95 query time grows, number of locks increases
application: hits connection pool limit, request queue length grows

In one case the root cause was storage: every few minutes latency spiked 5–7× and rare requests fell into the long tail. Alerts didn't fire because thresholds used averages over long windows (e.g. "10-minute average") that smoothed spikes.

What changed: latency thresholds moved to p95 (or p99), and a new signal was added: the share of requests slower than the SLA. That alert fires earlier and rarely makes noise.
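The difference between a long-window average and a "share of slow requests" signal can be shown with a small sketch. The helper name, the 1000 ms SLA and the sample window are assumptions for illustration.

```python
from statistics import mean


def slow_share(latencies_ms, sla_ms):
    """Fraction of requests in the window that were slower than the SLA."""
    if not latencies_ms:
        return 0.0
    slow = sum(1 for latency in latencies_ms if latency > sla_ms)
    return slow / len(latencies_ms)


# Hypothetical 10-minute window: 95 normal requests and 5 caught by a latency spike.
window_ms = [200] * 95 + [6000] * 5
window_mean_ms = mean(window_ms)
window_slow_share = slow_share(window_ms, sla_ms=1000)
```

Here the window average is 490 ms, which stays under a 1000 ms threshold and keeps the alert silent, while the slow share is 0.05: five percent of requests broke the SLA, and a rule such as "more than 1% slow" would have fired.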

Next steps: how to roll out and maintain monitoring

It’s better to deploy a clear set of metrics in 1–2 weeks and rely on them than to spend months collecting everything and not trusting the data.

A practical plan:

Choose first-level metrics for each zone (servers, network, storage, services) and define what thresholds mean a problem.
Build simple dashboards by role: admin sees CPU/memory/disks, network engineer sees loss/latency, service owner sees response time and errors.
Review alerts: remove duplicates, disable alerts without actions, keep only those the team actually responds to.
Schedule a failure test: disable a node or throttle a link and check who gets alerted, what they do and how long recovery takes.
Introduce regularity: a short monthly incident review and threshold tuning.

A simple rule: if an alert arrives, the on-call should have a clear next step.

If you need help with hardware and implementation, an external review of your infrastructure can help. GSE.kz (gse.kz), as a manufacturer of S200 servers and a systems integrator, has experience matching platforms to loads and offers 24/7 support — useful when monitoring needs both setup and ongoing maintenance.