Which metrics should I start with if users complain about delays but the broker is "green"?

First, measure **end-to-end latency**: the time from when the source created the event to the actual result in the target system (DB write, updated status, sent notification). Broker metrics are important, but they don't show how long the user is actually waiting.

Why is consumer lag in Kafka not the same as business-facing latency?

Consumer lag shows how far a consumer is behind the latest offset in the partition, but it doesn't guarantee that a specific message produced a result quickly. A message can be stuck in processing, retries, or an external dependency even when lag is low. Always complement lag with processing-time metrics and end-to-end latency.

Which metric best shows that a stream actually "stalled"?

Use the **age of the oldest unprocessed message** (oldest message age). This metric answers "how long someone is waiting" and is usually simpler for thresholds than average latency or total backlog.

Why shouldn't I alert on average latency?

Look at percentiles, at least p95 and p99, because problems usually live in the tail. The average can look fine while a small portion of messages hangs for minutes and generates complaints and incidents.

Where do duplicate messages come from and how do I protect against them?

Duplicates most often come from the producer resending (not sure the broker accepted the message) or from the consumer reprocessing (it performed the action but failed before ack). Basic protection is idempotency on the effect side: update by business key, unique constraints, and an explicit idempotency key to prevent the operation from being applied twice.

How do I set up retries so they help rather than break the stream?

Retries help when they are rare and short, but become dangerous when they turn into a "retry storm" that clogs the queue. Limit the number of attempts and total retry time, classify errors into transient and permanent, and send invalid data to DLQ faster so it doesn't block the whole stream.

Which DLQ signals really matter?

Watch not only the DLQ size, but also the **age of the oldest message** and the ingestion rate. DLQ must be a quarantine with a clear reason, an owner and a handling process; otherwise it quietly accumulates debt and suddenly becomes a major business problem.

What do I need for end-to-end control if I only have broker metrics now?

Add a single correlation_id (for example, orderId or requestId) that flows through producer, broker and consumer and is stored in DLQ. Also record produced_at and processed_at so you can see where time is lost: in the queue, in processing, or in an external system.

Which alerts are useful and don't annoy on-call teams?

Alert on user impact: oldest message age above threshold for several minutes, lag rising and not falling, accelerating DLQ growth, an increased share of errors or attempts per message. Reduce noise with delayed alerting, incident deduplication, and different thresholds for critical and background streams.

What should I check before release, and whom can I contact if I need infrastructure and integration for Kafka/MQ?

Make sure you have clear SLOs for critical streams, a correlation id and timestamps in events, limited retries and a working DLQ handling process. If you plan infrastructure updates or cluster expansion, GSE.kz (gse.kz) can help with server selection and supply, system integration and 24/7 support so operations stay predictable.

Monitoring Integrations and Queues: Metrics and Alerts Without the Noise

The real problem: where messages and time are lost

Business teams rarely see "broker errors" — they see symptoms: an order never arrived, a status updated after 30 minutes, a payment posted twice, or a notification sent to the wrong person. Teams often say "Kafka/MQ broke", but time is usually lost at the edges: the producer didn't send, the consumer didn't process, retries looped messages, or "dead" messages sat unnoticed in the DLQ.

This happens because most dashboards show what is easy to measure (CPU, memory, overall throughput), not what helps during an incident. When users complain about latency, a "messages per second" chart can look perfect. When duplicates occur, broker metrics may miss the problem: it often originates in retry logic, deduplication, or inside the handler.

Observability differs from a set of charts in that it answers four questions: what failed, where, since when, and how many real operations are affected. For that you need end-to-end control across the chain "producer - broker - consumer - external system", not just "the queue is alive".

Working monitoring is not "we have 50 dashboards". It's a few clear signals that quickly point to the cause: latency on key flows, error and retry rates, duplicates and idempotency violations, and DLQ treated as a managed queue, not a graveyard.

If these signals are tied to business operations (e.g., "create order", "process payment", "update status"), root cause search usually takes minutes, not hours.

A short integration model: make metrics tell the same story

To avoid fragmented charts, agree on one simple model. In any queue or topic there are three roles: producer (sends), broker (stores and delivers), consumer (reads and acts). Latency can appear at any step, and metrics should show exactly where.

A practical view of the message path:

System A created an event and attempted to send it.
The broker accepted and stored it.
The consumer read and processed it.
System B observed the effect (DB write, status change, notification sent).

Ack explained simply: the consumer tells the broker "I processed this, consider it delivered." If there's no ack, retries kick in: repeated attempts to deliver or process.

Idempotency means: "if the same message arrives again, the result won't be corrupted." For example, update a status by key instead of creating a duplicate record.
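
A minimal sketch of an idempotent handler, assuming a relational store with unique keys. The table names (`processed_messages`, `order_status`) and the `handle` function are hypothetical and chosen only for illustration. SQLite stands in for the real database. The dedupe record and the effect are written in one transaction, so a redelivered message is recognized and skipped.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical tables for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE order_status (order_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
    return conn


def handle(conn: sqlite3.Connection, message_id: str, order_id: str, status: str) -> bool:
    """Apply the message once. Returns False if the message was already applied."""
    with conn:  # single transaction: dedupe record and effect commit together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_messages (message_id) VALUES (?)",
            (message_id,),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery, the effect is already in place
        # Update by business key instead of appending a new record.
        conn.execute(
            "INSERT OR REPLACE INTO order_status (order_id, status) VALUES (?, ?)",
            (order_id, status),
        )
    return True
```

Calling `handle` twice with the same `message_id` leaves exactly one row in `order_status`, which is the property the definition above asks for.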

Duplicates usually appear in two places: on resend (producer unsure the broker accepted) and on reprocessing (consumer failed after performing the action but before ack). So some metrics must cover delivery, others must cover the effect.

DLQ (dead letter queue) should be a quarantine, not a dump: it should hold messages that failed after a reasonable number of attempts. DLQ entries must include clear reasons, attempt counters and an owner who will investigate and either return the message to processing or fix the data.

End-to-end control starts with one correlation field (for example, orderId or requestId). It links the event in system A with the result in system B and lets you measure the whole path, not just "the message is in the broker".

Latency metrics: what shows real user pain

Queue latency by itself rarely matters to the business. What hurts is: "the order didn't arrive", "payment is stuck", "the doctor doesn't see the result", "the report updated an hour later." So measure not just "what's in the queue", but the message path to the final action.

End-to-end latency: measure from event to result

The most useful metric is end-to-end latency: time from when the source created the event to when the target system confirmed the result (DB write, created document, sent notification). Practically, this uses a timestamp in the message and a "finish" mark in the consumer logic (or a separate response event).

Simple example: service created an invoice at 10:00:00, accounting saw it at 10:07:30. End-to-end latency is 7 minutes 30 seconds, even if the broker was "healthy."
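
The invoice example can be reproduced in a few lines. This is a sketch that assumes both timestamps are ISO 8601 strings with an explicit UTC offset; the date and the function name `end_to_end_latency_seconds` are illustrative and not taken from any particular system.

```python
from datetime import datetime


def end_to_end_latency_seconds(produced_at: str, processed_at: str) -> float:
    """Seconds between event creation at the source and the confirmed result."""
    start = datetime.fromisoformat(produced_at)
    end = datetime.fromisoformat(processed_at)
    return (end - start).total_seconds()


# Invoice created at 10:00:00, seen by accounting at 10:07:30 (hypothetical date).
latency = end_to_end_latency_seconds(
    "2024-01-15T10:00:00+00:00",
    "2024-01-15T10:07:30+00:00",
)
print(latency)  # 450.0
```

In practice the two clocks belong to different hosts, so keep them synchronized or the measurement will include clock skew.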

Consumer lag: useful, but not equal to user latency

Consumer lag shows how far the consumer is behind the latest offset in the partition. It's an indicator of pressure, but it doesn't tell you how long a specific message has been waiting. Lag can be high during a planned bulk load without users noticing. Conversely, lag can be small while a single "old" message is stuck due to a processing error.

To catch actual waiting, age of oldest message is often helpful. It's simpler for thresholds: if the oldest unprocessed message is already 10 minutes old, someone is certainly waiting.
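
A sketch of the oldest-message-age check, assuming you can list the `produced_at` timestamps of messages that have not been processed yet. How you obtain that list depends on the broker and is out of scope here; the function name and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def oldest_message_age(pending_produced_at, now):
    """Age of the oldest unprocessed message; zero if nothing is pending."""
    if not pending_produced_at:
        return timedelta(0)
    return now - min(pending_produced_at)


now = datetime(2024, 1, 15, 10, 12, 0, tzinfo=timezone.utc)
pending = [
    datetime(2024, 1, 15, 10, 11, 30, tzinfo=timezone.utc),
    datetime(2024, 1, 15, 10, 1, 0, tzinfo=timezone.utc),  # one stuck message
    datetime(2024, 1, 15, 10, 11, 55, tzinfo=timezone.utc),
]

age = oldest_message_age(pending, now)
stalled = age > timedelta(minutes=10)  # threshold from the 10-minute example
```

Only three messages are pending here, so backlog looks harmless, yet the age is 11 minutes and the check fires.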

Don't rely on average latency as the main signal. Look at percentiles (p95, p99): issues hide in the tail, where 1–5% of messages slow the system and create a stream of complaints.
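
To see why the average misleads, here is a small sketch with made-up numbers: 98 messages processed in 0.2 s and 2 messages stuck for 120 s. The `percentile` helper uses the nearest-rank method, which is one of several common definitions, so your monitoring system may report slightly different values.

```python
from statistics import mean


def percentile(values, p: int):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: most messages are fast, a small tail hangs for minutes.
latencies = [0.2] * 98 + [120.0] * 2

avg = mean(latencies)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
print(round(avg, 2), p95, p99)
```

The mean is about 2.6 s and p95 is 0.2 s, both of which look acceptable, while p99 shows the two-minute wait that produces the complaints.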

Also compute latency separately by event type and key queues. "Authorization" and "nightly export" have different expectations; an overall chart easily hides real pain.

Reliability metrics: errors, retries, duplicates and DLQ without illusions

Reliability usually fails not in the broker but in handlers: code, dependencies, or data. Keep metrics that answer the simple question: do messages reach the result or get stuck in attempts?

Processing errors and retries

Start with the error rate and how fast it changes. What matters is not only how many errors occurred per hour, but how quickly the curve is rising: a sharp spike often means an external service went down or a new release introduced a bug.

Retries help while they're short and rare. When retries multiply, you get a "retry storm": queues fill with the same events, latency grows, and new messages wait longer.
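
One way to keep retries short and rare is to cap both the attempt count and the total retry time, and to skip retries for permanent errors. The sketch below assumes the handler signals the error class by raising one of two hypothetical exceptions, `TransientError` or `PermanentError`. The return value `"dlq"` only marks the decision; actually publishing to a DLQ is left to the caller.

```python
import time


class TransientError(Exception):
    """Temporary failure (timeout, dependency unavailable): worth retrying."""


class PermanentError(Exception):
    """Invalid data or schema mismatch: retrying will not help."""


def process_with_retries(handler, message, max_attempts=3, max_total_seconds=30.0,
                         base_delay=0.5, sleep=time.sleep):
    """Return ("ok", attempts) on success or ("dlq", attempts) when giving up."""
    started = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return ("ok", attempt)
        except PermanentError:
            return ("dlq", attempt)  # bad data goes to quarantine immediately
        except TransientError:
            delay = base_delay * 2 ** (attempt - 1)  # exponential backoff
            elapsed = time.monotonic() - started
            if attempt == max_attempts or elapsed + delay > max_total_seconds:
                return ("dlq", attempt)  # attempt or time budget exhausted
            sleep(delay)
    return ("dlq", 0)  # only reached when max_attempts < 1
```

The attempt number returned here is also the raw material for the "attempts per message" metric.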

Practical metrics here:

percentage of messages that finished with an error (broken down by error type);
average and p95 number of attempts per message;
time between first attempt and successful processing;
number of messages that have been in retries longer than N minutes.

DLQ, poison messages and duplicates

DLQ shows where the system admitted defeat. Track not only DLQ size but also the age of the oldest message: a full DLQ is less scary than a DLQ that nobody inspects for weeks.

A poison message is visible by the same key/ID repeatedly landing in DLQ or causing the same error. If one message causes 30% of errors, that's likely bad data, not an "unstable system."

Duplicates are dangerous where money, limits, statuses and balances matter. Count them by idempotency key or business key (e.g., payment number or order id) within a time window.
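
Counting duplicates by business key within a time window can be sketched as follows. The input format, a list of `(business_key, unix_timestamp)` pairs, and the function name are assumptions made for the example; in a real system the pairs would come from consumer logs or an audit table.

```python
def count_duplicates(events, window_seconds):
    """Count events whose business key was already seen within window_seconds."""
    last_seen = {}
    duplicates = 0
    for key, ts in sorted(events, key=lambda e: e[1]):
        previous = last_seen.get(key)
        if previous is not None and ts - previous <= window_seconds:
            duplicates += 1
        last_seen[key] = ts
    return duplicates


# Hypothetical payment events: pay-1 is delivered twice within 5 seconds.
events = [("pay-1", 0), ("pay-1", 5), ("pay-2", 10), ("pay-1", 1000)]
dupes = count_duplicates(events, window_seconds=60)
share = dupes / len(events)
```

The third `pay-1` event falls outside the 60-second window, so it counts as a new operation, not a duplicate. Choose the window to match how long redelivery can realistically take.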

Alerts that tend to help without much noise:

DLQ ingestion rate rises above baseline;
oldest message in DLQ exceeds threshold;
share of messages with attempts > X grows for several intervals in a row;
share of duplicates by business key exceeds normal;
the same error repeats for one key more than N times.

Throughput and resources: don't confuse cause and effect

When latency starts to grow, it's easy to misdiagnose: it looks as if the broker can't cope, when in fact consumers are slow or there's a network bottleneck. So watch not only "how many are in the queue" but also the flow balance: how many messages arrive versus how many are processed.

Compare input and output streams on the same timescale: messages/sec and bytes/sec. If input consistently exceeds output, accumulation is inevitable. If input fell but backlog grew, the problem is often in processing or acknowledgements (ack/commit).

Backlog (or consumer lag in Kafka) is useful only with its rate of change. "10,000 messages" alone says little. What matters is whether the number is growing and how fast you can "eat the tail" after a spike.
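
The arithmetic behind "how fast you can eat the tail" is simple: backlog divided by the difference between output and input rates. A sketch with hypothetical rates follows; the function name is illustrative.

```python
def time_to_drain_seconds(backlog, input_rate, output_rate):
    """Seconds to clear the backlog at current rates, or None if it won't shrink."""
    if backlog <= 0:
        return 0.0
    net_rate = output_rate - input_rate  # messages/sec removed from the backlog
    if net_rate <= 0:
        return None  # input matches or exceeds output: backlog holds or grows
    return backlog / net_rate


# 10,000 messages queued, 400 msg/s arriving, 500 msg/s processed.
print(time_to_drain_seconds(10_000, 400, 500))  # 100.0
print(time_to_drain_seconds(10_000, 500, 500))  # None
```

The same 10,000 messages mean a 100-second recovery in the first case and no recovery at all in the second, which is why the number alone says little.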

Common throughput limits

In Kafka the ceiling is often set by the number of partitions and the structure of the consumer group. Within a group each partition is read sequentially by a single consumer, so adding consumers doesn't help if partitions are few or keys are poorly distributed.

To find the bottleneck quickly, check:

input vs output stream in the same interval;
backlog growth rate and time to zero at current speed;
lag distribution across partitions (one "hot" partition can break the whole flow);
processing time on the consumer side (mean and tail p95/p99);
commit/ack errors and retries due to timeouts.
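
For the lag-distribution check in the list above, one simple heuristic is to compare each partition's lag with the median across partitions. This is a sketch: the `factor` and `min_lag` defaults are arbitrary assumptions to tune per stream, and the lag values would come from whatever tool you already use to read consumer group offsets.

```python
from statistics import median


def hot_partitions(lag_by_partition, factor=5.0, min_lag=1000):
    """Partitions whose lag is far above the median lag of the topic."""
    if not lag_by_partition:
        return []
    typical = max(median(lag_by_partition.values()), 1)
    return sorted(
        partition
        for partition, lag in lag_by_partition.items()
        if lag >= min_lag and lag > factor * typical
    )


# Hypothetical lag snapshot: partition 2 holds almost all of the backlog.
lags = {0: 120, 1: 90, 2: 48_000, 3: 150}
print(hot_partitions(lags))  # [2]
```

A total lag of about 48,000 suggests the whole topic is behind, while the per-partition view shows one hot partition, which usually points at key distribution or a single slow consumer.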

Look at resources broadly: CPU, memory, disk and network. A common pattern: CPU looks fine but GC pauses or slow disk cause long stops. Consumers "hiccup" and fail to ack in time.

Peaks are not the same as a normal day

Mark peak windows separately: month-end, reporting periods, bulk exports. For example, the system is stable during the day but at 18:00 input doubles for 30 minutes and backlog grows for hours. For these windows use separate thresholds and alerts on "recovery time" (how long to return to normal). Otherwise you'll either stay silent during a real peak or get noise all month.

Alerts that don't annoy: rules, thresholds and priorities

A good alert answers one question: what breaks for the user if nothing is done. Start from SLOs: how many minutes of processing delay are acceptable, what percent of messages can be in retry, and when the stream is considered "stalled."

Teams often collect dozens of charts, but alerts should fire on risk, not on every change. The most useful signals are usually about time and unprocessed tails: oldest message age, accelerating consumer lag, accelerating DLQ growth, and duplicate share when it affects money, inventory or reports.

Threshold alerts work where there's a clear pain boundary. Example: "message age > 5 minutes for longer than 10 minutes" for a critical process. Trend alerts are better when load fluctuates: not "lag 20k", but "lag rising for 15 minutes and not falling."
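
Both rule types can be expressed as small checks over a time series. The sketch below assumes samples arrive as `(unix_timestamp, value)` pairs in time order; the function names are hypothetical, and most monitoring systems offer equivalent built-in conditions that are preferable in production.

```python
def sustained_above(samples, threshold, min_duration):
    """True if the value has stayed above threshold for the last min_duration seconds."""
    if not samples:
        return False
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None  # any recovery resets the timer
    return breach_start is not None and samples[-1][0] - breach_start >= min_duration


def rising_without_recovery(samples, min_duration):
    """True if the value never decreased over the last min_duration seconds and ended higher."""
    if not samples or samples[-1][0] - samples[0][0] < min_duration:
        return False  # not enough history to judge the trend
    now = samples[-1][0]
    values = [v for ts, v in samples if ts >= now - min_duration]
    never_fell = all(b >= a for a, b in zip(values, values[1:]))
    return never_fell and values[-1] > values[0]


# "message age > 5 minutes for longer than 10 minutes", sampled once a minute
ages = [(t * 60, 320) for t in range(11)]
page = sustained_above(ages, threshold=300, min_duration=600)

# "lag rising for 15 minutes and not falling"
lag = [(t * 60, 1000 + 500 * t) for t in range(16)]
warn = rising_without_recovery(lag, min_duration=900)
```

A single dip below the threshold resets the first rule, which is what filters out short spikes.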

To reduce noise, follow a few simple rules:

use delayed triggers (e.g., 5–10 minutes) to avoid short spikes;
group alerts by stream or topic, not by each partition;
deduplicate: one incident — one notification, then only updates;
tie thresholds to business windows (e.g., different night thresholds for non-critical flows).

Set priorities by cost of downtime, not by chart aesthetics. A convenient three-level scheme:

P1: business-critical flows (payments, accounting, access) and DLQ growth that blocks recovery;
P2: degradation without immediate loss (retries rising, latency 2–3x normal);
P3: early signs (lag trending up, processing time increase) for daytime investigation.

End-to-end control: correlate across steps, not just inside the broker

If you only look at the broker you can see "all green" while users wait for minutes. End-to-end control links request, publication, queue, processing and the final result in one chain.

A correlation identifier (correlation_id) is a shared identifier that travels through all steps. Usually it's taken from the incoming request or created in the first service, passed in message headers and written to logs. Store it at the producer, at the consumer and in the DLQ; otherwise investigation breaks exactly where it hurts most.

A consistent log format helps: the same set of fields and consistent names. Then a search by correlation_id yields an ordered sequence, not an unrelated set of lines.

Minimal event/log fields for diagnosis:

correlation_id;
message_id (unique for the message);
event_type and version;
produced_at and received_at/processed_at;
source_system and consumer_service.
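
The field list above could be captured in a small envelope type so that every producer and consumer writes the same names. This is a sketch: the class `EventEnvelope`, the `log_line` helper, and the extra `step` and `logged_at` fields are assumptions added for illustration, not a standard.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def utc_now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class EventEnvelope:
    correlation_id: str   # e.g. orderId or requestId
    event_type: str
    version: int
    source_system: str
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    produced_at: str = field(default_factory=utc_now_iso)


def log_line(event: EventEnvelope, consumer_service: str, step: str) -> str:
    """One JSON log record with the same field names at every step."""
    record = asdict(event)
    record.update(consumer_service=consumer_service, step=step, logged_at=utc_now_iso())
    return json.dumps(record, sort_keys=True)


event = EventEnvelope(
    correlation_id="order-42",
    event_type="order_created",
    version=1,
    source_system="shop",
)
print(log_line(event, consumer_service="billing", step="processed"))
```

Because the names never change between services, a search by `correlation_id` returns records that can be ordered by `logged_at` without per-service parsing.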

Tracing must connect "request -> event -> processing." Even without advanced tools, start with a simple rule: write the same correlation_id and a timestamp at each step, and include produced_at in the message.

To quickly see which system is slow, use a "flow map": for each step show median and 95th percentile time (in queue and in processing). If queue delay grows while processing is stable — look at broker throughput or consumer limits. If processing time grows — look at the service or its external dependency.
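
A flow map can start as a plain aggregation over the timestamps you already record. The sketch assumes each record carries `produced_at`, `received_at` and `processed_at` as Unix timestamps in seconds; the record layout and function names are hypothetical, and the 95th percentile uses the nearest-rank method.

```python
from statistics import median


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[max(rank, 1) - 1]


def flow_map(records):
    """Median and p95 of time spent in the queue and in processing."""
    in_queue = [r["received_at"] - r["produced_at"] for r in records]
    in_processing = [r["processed_at"] - r["received_at"] for r in records]
    return {
        "queue": {"median": median(in_queue), "p95": p95(in_queue)},
        "processing": {"median": median(in_processing), "p95": p95(in_processing)},
    }


# Hypothetical records: processing takes a steady 2 s, one message waited 30 s in the queue.
records = [
    {"produced_at": 0, "received_at": 1, "processed_at": 3},
    {"produced_at": 10, "received_at": 11, "processed_at": 13},
    {"produced_at": 20, "received_at": 21, "processed_at": 23},
    {"produced_at": 30, "received_at": 31, "processed_at": 33},
    {"produced_at": 40, "received_at": 70, "processed_at": 72},
]
print(flow_map(records))
```

Here processing is stable while the queue tail is long, which by the rule above points at broker throughput or consumer limits, not at the handler.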

Step-by-step: how to implement MQ and Kafka monitoring in a reasonable time

Good monitoring starts with agreements. Trying to cover all topics and queues at once leads to noise and fatigue. Pick a few streams that truly affect customers and processes, and define "normal" as clearly as you define an SLA.

A 2–3 week plan that usually works

Start with 3–5 critical flows: for example, "order created", "delivery status updated", "invoice issued." For each define expected behavior: maximum latency, allowed retries, what counts as an error and where async processing is acceptable.

Then follow a simple chain:

Add a timestamp (when the event occurred) and a correlation id to events. Without this end-to-end control collapses.
Collect basic broker and client metrics: consumer lag, processing rate, error rate, queue depth, DLQ size, retry share.
Build one on-call dashboard: 5–8 key charts and a "top problematic streams" list by latency and DLQ.
Set 5–10 alerts for symptoms: latency above N minutes, lag rising for X minutes, DLQ non-empty for Y minutes, processing error above Z%.
Run tabletop incidents: artificially slow a consumer, disconnect an external dependency, trigger mass retries. Check that alerts are clear and the on-call knows next steps.

Also agree on DLQ and retry rules: how many attempts, which errors are retried, when a message goes to DLQ, who inspects it and how results are recorded (resend, compensate, mark as impossible). This turns observation into a manageable process.

Common traps: metrics exist, but control doesn't

Often it looks like: "we enabled metrics and a few alerts," but users still report incidents. The reason: metrics exist, but there are no answers to "where is it slow" and "what to do."

Typical mistakes:

alerting on averages. Average latency can look fine while a small share of messages hangs for minutes. Watch the tail (p95/p99) and alert on it.
confusing consumer lag with business latency. Lag shows unprocessed messages but not how long a client waits for the result. You can have low lag and long processing time (external service, DB, locks). Or high lag and no user pain if the queue is non-critical.

Three more silent killers:

unlimited retries hide failures: the system "seems to work" while debt grows and you face an avalanche on recovery;
DLQ without a handling process becomes a dump: it grows for weeks and surfaces at the worst moment;
no idempotency makes duplicates dangerous: a redelivered message can double-charge or create duplicate orders.

One more mistake: identical thresholds for all queues. A two-minute delay is an incident for payments, but normal for a nightly sync. Set thresholds and priorities by criticality and expected processing time.

Quick checklist: what to check before release and during an incident

Before release, ensure you control message passage, not just broker health.

Before release check:

Do you have oldest message age per critical flow, not only total backlog?
Are DLQ alerts configured for growth and "stuck" messages (oldest age)?
Has duplicate handling been tested: the same event arrives twice and the system doesn't corrupt data (idempotency, dedupe keys, unique constraints)?
Are retry limits and policy recorded: how many attempts, what backoff, where recovery ends and damage begins?
Is there a DLQ owner and a documented procedure: root cause analysis, data fix, reprocessing and reporting?

During an incident, gather the picture in one place. The on-call needs one dashboard: age, backlog, errors/retries, DLQ. It's faster than jumping between screens and arguing "where exactly it's slow."

During an incident:

Compare oldest message age and backlog: if backlog is small but age is large, the issue is often a partition key or a single slow consumer.
Check retries: is the same error spinning and creating an avalanche across streams?
Inspect DLQ: is it growing and is oldest age increasing (meaning nobody is investigating)?
Assess duplicates: are repeat charges/orders appearing because of redelivery and lack of protection?
Record a temporary mitigation and an owner: what we do now (pause, throttle, manual handling) and what we fix later.

A real-world example: latency and duplicates in a critical flow

In one critical flow system A published "application confirmed" and system B updated a customer's card status. For the business it's simple: the user clicks a button and should see the new status in a few seconds.

The issue appeared not as an error but as "sticking": statuses updated after 10–40 minutes, and for some customers statuses flickered. Tickets described instability even though the broker looked "green."

The first helpful change was to look at time. We checked message age and consumer lag: new events accumulated faster than system B could process. Processing errors rose and DLQ started filling.

Next we needed to distinguish overload from bad data. With overload lag grows smoothly while error rate stays roughly stable. With a poison message the pattern is different: lag jumps, the same error repeats, and DLQ contains events with the same cause (unexpected field value or schema mismatch).

Steps taken:

Limited retries by count and time so the same event wouldn't be retried for hours.
Temporarily isolated the problematic event type into a separate stream so it wouldn't block others.
Processed the DLQ: identified one root cause, prepared a data fix and reprocessed.
Added idempotency on side B (operation key) so redelivery wouldn't change the status twice.

To prevent recurrence we enforced two rules: an alert on "message age > X minutes" with priority higher than a generic error alert, and a separate alert for DLQ growth by specific cause. We also updated the event schema: made mandatory fields explicit and added a version so consumers handle changes properly.

Next steps: embed the process and prepare infrastructure

For monitoring to really help, make it part of operations, not just charts. Start with an inventory: which integrations matter for money, patients, students, public services or deadlines, and who owns the business result.

Agree on SLOs with process owners in simple terms: how many minutes of delay are acceptable, how many messages can be lost (usually zero), how fast you must recover. This gives meaning to thresholds and priorities.

Separate metrics by level. At the application level track user-facing facts: time from event creation to processing, retry share, duplicate count, percent of messages in DLQ. At the broker level track transport health: lag, consumption rate, producer/consumer errors, disk usage, partition and replication status.

A practical 2–4 week plan:

List critical streams and assign owners;
Fix 3–5 mandatory app metrics and 3–5 broker metrics;
Run a load test and one tabletop incident (e.g., stop a consumer for 10 minutes) and verify alerts trigger clear actions;
Write a runbook: who responds, what to check first, where to inspect DLQ and how to safely reprocess.

Prepare infrastructure ahead of time: reliable servers for brokers, redundancy, disk planning for retention, separate storage for metrics and logs, and access control with auditing. For organizations with higher requirements, make sure procurement and support are transparent.

If you plan cluster updates or expansion, GSE.kz (gse.kz) can assist with server selection and supply, system integration and 24/7 support to keep queues and monitoring predictable.