Monitoring LLMs in production: metrics and alerts without surprises
LLM monitoring in production: which metrics to collect for queues, tokens, errors and quality, and how to configure alerts to detect failures before users do.

What problems in LLM production should you catch before users complain
An LLM in production looks like a normal API from the outside, but internally it has three specifics: cost depends on tokens, responses are not always predictable, and quality and availability often rely on an external provider and its limits. So monitoring here answers not just “is the service alive”, but several questions at once: has it gotten worse, more expensive, or more risky.
User complaints almost always lag. First the queue grows, then latency increases, later timeouts appear, and only then someone writes to support: “the bot froze”. The story is similar for quality: the model drifts, mixes up facts or becomes too vague, and this becomes noticeable after dozens of dialogs.
It helps to agree in advance what “working” means for you. Usually this is a combination of several criteria: speed (responses arrive on time even at peak), stability (no spikes in errors or disconnects), quality (the answer solves the task and doesn’t just sound plausible), cost (tokens and expenses are predictable) and safety (no data leaks or dangerous prompts in logs).
Another common surprise: the problem may not be on your side. The provider changes a model, adds new limits, network packet loss increases somewhere, and you get degradation without any release on your side.
To catch this before complaints, define roles and reactions ahead of time. Product owns quality and conversion to a solved task. Support watches rising repeat tickets and “unclear” dialogs. Security monitors personal data and prompt injections. Infrastructure owns queues, latency and errors. In large organizations (for example, banks or government institutions) the alignment of these roles is what makes alerts useful rather than noisy.
Signal map: what to measure and how to link data
So monitoring doesn’t become a set of disconnected charts, structure observability by layers. Each problem leaves traces in several places: hardware and network cause latency, the application shows errors and timeouts, the model shows unexpected text and higher token usage, and the business sees falling conversion or rising support requests.
The key to linking these traces is a single request_id. It should appear at entry (frontend, API gateway, bot platform) and travel through to the model response and back. Then for a single request you can quickly see whether it was queued, how long the provider call took, which model answered and exactly what was returned.
For each request record at minimum: request_id and timestamps (ingress, start/end of LLM call, egress), session_id and channel (web, messenger, internal), model/version and parameters (temperature, max_tokens, tools), outcome (success/error, reason, retries, truncation). Prompt and response texts are also useful, but store them with personal data masking.
It’s convenient to store this data in three places: metrics for numbers and alerts (latency, errors, tokens), logs for details (texts, reasons, context), and traces for call chains (frontend -> backend -> LLM -> external services). It’s important that the same request_id exists everywhere.
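As an illustration of the per-request record described above, here is a minimal sketch in Python. The class name `LLMRequestRecord`, the exact field names and the JSON log format are assumptions made for this example, not a required schema.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class LLMRequestRecord:
    # Hypothetical schema: fields follow the list in the text above.
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    session_id: str = ""
    channel: str = "web"            # web, messenger, internal
    model: str = ""
    temperature: float = 0.0
    max_tokens: int = 0
    ingress_ts: float = field(default_factory=time.time)
    llm_start_ts: Optional[float] = None
    llm_end_ts: Optional[float] = None
    egress_ts: Optional[float] = None
    outcome: str = "pending"        # success / error
    error_reason: Optional[str] = None
    retries: int = 0
    truncated: bool = False

    def to_log_line(self) -> str:
        # One JSON object per line, so logs can be joined with
        # metrics and traces on request_id.
        return json.dumps(asdict(self), sort_keys=True)


record = LLMRequestRecord(session_id="s-1", channel="messenger",
                          model="example-model", temperature=0.2,
                          max_tokens=512)
record.llm_start_ts = record.ingress_ts + 0.05
record.llm_end_ts = record.llm_start_ts + 1.2
record.egress_ts = record.llm_end_ts + 0.01
record.outcome = "success"
print(record.to_log_line())
```

The same `request_id` value would then be attached as a label or attribute to the corresponding trace span and to any error counters.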
A simple example: users say “the bot got dumber”. With request_id you might see the issue is not the model but timeouts: some responses are truncated because the queue grew and the application cuts max_tokens to meet time limits.
Queues and latency: metrics that show overload
Overload almost always manifests the same way: slow responses, timeouts and a “jittery” response time. If you wait for complaints you’ll be firefighting. So first and foremost keep queue and latency metrics: they show earlier than others that the system can’t keep up.
Start with the queue: length, average and p95 wait time, processing rate (how many requests per second are actually handled). Look not only at “how many are waiting” but also “why”: the queue grows because of traffic spikes, reduced model throughput or provider limits.
To find the bottleneck faster, break down latency by stages and compute p50/p95/p99 for each: pre-processing (validation, context preparation, RAG), model call (time-to-response and generation time), post-processing (filters, formatting, DB writes), and network delays. Treat timeouts and retries as separate events.
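The per-stage breakdown can be sketched as follows. This is a minimal example that assumes stage timings are already collected in milliseconds. The stage names, the sample values and the nearest-rank percentile method are illustrative choices, and a real metrics backend would normally compute percentiles for you.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(timings_by_stage):
    """Return p50/p95/p99 for every stage."""
    return {
        stage: {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
        for stage, values in timings_by_stage.items()
    }


# Hypothetical sample: one slow outlier in preprocessing and in the model call.
stage_timings_ms = {
    "preprocessing": [20, 22, 25, 21, 30, 24, 23, 26, 22, 120],
    "model_call": [800, 950, 900, 1200, 870, 910, 990, 860, 4000, 930],
    "postprocessing": [10, 12, 11, 15, 9, 13, 10, 11, 12, 14],
}

summary = summarize(stage_timings_ms)
for stage, stats in summary.items():
    print(stage, stats)
```

In this sample the median of the model call stays under one second while p95 jumps to four seconds, which is exactly the kind of tail a mean would hide.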
If you run your own infrastructure (for example, on-prem in a data center), add concurrency and resource utilization: CPU/GPU, memory, disk, network. A common scenario: p50 stays normal while p99 rises because GPUs hit memory limits and some requests fall into the tail.
Don’t forget provider limits: rate limit, throttling, remaining quota. Count these as separate error types, otherwise they’ll look like “random timeouts”.
For streaming you need separate signals: time to first token and tokens-per-second. Users often tolerate a longer total response time if the first token arrives quickly and the stream proceeds smoothly.
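Both streaming signals can be derived from token arrival timestamps. The sketch below assumes you record the request start time and one timestamp per received token (or chunk). The function name `stream_metrics` is hypothetical.

```python
def stream_metrics(request_start, token_timestamps):
    """Compute time to first token and steady-state tokens per second.

    request_start and token_timestamps are in seconds on the same clock.
    """
    if not token_timestamps:
        return {"ttft_s": None, "tokens_per_second": None}
    ttft = token_timestamps[0] - request_start
    duration = token_timestamps[-1] - token_timestamps[0]
    # Rate is measured between the first and last token, so a slow
    # start does not distort the generation speed.
    intervals = len(token_timestamps) - 1
    tps = intervals / duration if duration > 0 else None
    return {"ttft_s": ttft, "tokens_per_second": tps}


metrics = stream_metrics(0.0, [0.5, 1.0, 1.5, 2.0, 2.5])
print(metrics)
```

Measuring the rate separately from the first-token delay lets you tell a slow start (queueing, long prompt processing) from slow generation.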
Tokens and cost: how not to miss rising spend
If the service is stable but bills grow, the cause is almost always token usage. Monitoring should show not only “how many requests” but “how much text you’re sending through the model” — by user, scenario and time.
Token metrics to collect
For each request record input_tokens, output_tokens and tokens_per_request. Then look at distributions: median, 90th and 99th percentiles. The mean often deceives: rising costs usually come from rare, very long requests.
Useful slices: scenarios (support, knowledge search, letter generation), user types (internal staff, external customers), context source (chat history, documents, system prompt), model/version (length often changes after switching).
The next layer is money. If billing is token-based, compute the estimated cost per request and the total cost per scenario per hour and per day. That makes it easy to see that, for example, 10% of traffic consumes 60% of the budget.
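A rough sketch of that calculation is shown below. The prices per 1,000 tokens are placeholders, not real provider rates, and the record fields (`scenario`, `input_tokens`, `output_tokens`) are assumed to come from your request logs.

```python
from collections import defaultdict

# Hypothetical prices per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K_INPUT = 0.5
PRICE_PER_1K_OUTPUT = 1.5


def request_cost(input_tokens, output_tokens):
    """Estimated cost of a single request."""
    return (input_tokens / 1000 * PRICE_PER_1K_INPUT
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT)


def cost_share_by_scenario(requests):
    """Fraction of total estimated spend consumed by each scenario."""
    totals = defaultdict(float)
    for r in requests:
        totals[r["scenario"]] += request_cost(r["input_tokens"],
                                              r["output_tokens"])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {scenario: cost / grand_total for scenario, cost in totals.items()}


requests = [
    {"scenario": "support", "input_tokens": 1000, "output_tokens": 1000},
    {"scenario": "letter_generation", "input_tokens": 4000, "output_tokens": 4000},
]
shares = cost_share_by_scenario(requests)
print(shares)
```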
What usually inflates token usage
Most often it’s long context and growing chat history. Measure the share of requests with long prompts (for example, above the weekly 90th percentile) and record reasons: more documents were added, logs are attached in full, the history isn’t truncated.
Count how many requests hit or approach the maximum context length. That affects both cost and quality: either the model loses useful context or you have to truncate data blindly.
If you use caching (responses, embeddings, retrieved documents), track cache hit rate and estimate “tokens saved”. In a support chatbot, repeated questions quickly generate real savings: tokens per request drop while request count stays the same.
Errors and reliability: what to log and count
Reliability of an LLM service starts with logging discipline, not dashboards. You must be able to answer a simple question: what exactly failed — the model, your code or a dependency?
Collect errors in two dimensions: provider responses (for example, 4xx and 5xx) and application errors (input validation, parsing failure, prompt assembly error). Track timeouts separately: they often look like “mysterious” outages but are actually overload or hung dependencies.
Minimum useful data per request: request_id, user_id/tenant (if any), scenario (chat, search, summarization), result code (ok, provider_4xx, provider_5xx, timeout, app_error), streaming status (completed/aborted/partial), retry count and retry reason, which dependencies were called (search, vector DB, document store, auth) and their statuses.
Treat retries as their own metric rather than assuming “it was retried, so all is good”. Track the share of requests with retries, the average number of retries per request and the latency they add. A rise in retries often precedes a rise in outright 5xx errors.
There are silent failures too: empty responses, stream aborted mid-way, partial answers without a final section. These are easy to miss if you only look at HTTP codes. Add counters for empty_output and stream_aborted.
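One way to make silent failures countable is to classify every request into a single outcome label and increment a counter for it. The sketch below is illustrative. It folds `empty_output` and `stream_aborted` into the same label set as the result codes mentioned earlier, which is a simplification, and the function signature is an assumption.

```python
from collections import Counter


def classify_result(status_code, timed_out, output_text, stream_completed):
    """Map one request to a single outcome label."""
    if timed_out:
        return "timeout"
    if status_code is not None and 400 <= status_code < 500:
        return "provider_4xx"
    if status_code is not None and status_code >= 500:
        return "provider_5xx"
    # From here on the HTTP code looks fine, so check for silent failures.
    if not stream_completed:
        return "stream_aborted"
    if not output_text.strip():
        return "empty_output"
    return "ok"


# Hypothetical observations: (status_code, timed_out, text, stream_completed)
observed = [
    (200, False, "Restart the router.", True),
    (200, False, "", True),
    (200, False, "First, open the", False),
    (503, False, "", True),
    (None, True, "", False),
]
counters = Counter(classify_result(*row) for row in observed)
print(dict(counters))
```

Note that two of the five sample requests returned HTTP 200 and would have looked healthy on a status-code chart.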
If you use circuit breakers and fallbacks (for example, no RAG, simplified prompt or backup provider), measure the share of requests hitting fallback mode. That reveals a degrading dependency before complaints spike: the vector DB times out more often, the system uses fallback more, and users get more generic answers.
Answer quality: simple metrics without complex science
It’s easier to start measuring quality by clear outcomes rather than “intelligence”. Add counters for response types: successful (helped and resolved the task), refusal (model didn’t answer or cited policy), and caution (answer with caveats, incomplete, asks for clarification). This already shows where users get stuck.
Then add simple auto-checks from logs that often correlate with problems: answer length (too short or a wall of text), repetitiveness (loops and repeated phrases), format adherence (asked for 3 steps or JSON and got plain text), language (the user writes in Russian but the model suddenly replies in another language), toxicity and rudeness.
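A few of these checks fit in a handful of lines. The sketch below covers length, repetitiveness and JSON format only. The thresholds and the trigram-based repetition measure are assumptions to tune on your own data, and language and toxicity checks would need separate tooling.

```python
import json


def repetition_ratio(text, n=3):
    """Share of repeated word n-grams: 0.0 means no repetition."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(set(ngrams)) / len(ngrams)


def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def auto_checks(answer, expect_json=False, min_chars=20,
                max_chars=4000, max_repetition=0.3):
    """Return a list of flags; an empty list means nothing suspicious."""
    flags = []
    if len(answer) < min_chars:
        flags.append("too_short")
    if len(answer) > max_chars:
        flags.append("too_long")
    if repetition_ratio(answer) > max_repetition:
        flags.append("repetitive")
    if expect_json and not is_valid_json(answer):
        flags.append("bad_format")
    return flags


print(auto_checks("ok"))
print(auto_checks("please restart the router " * 10))
print(auto_checks("Here are three steps you should follow now.", expect_json=True))
```

Flags like these are cheap to compute on every response and can be exported as counters per scenario.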
Watch business signals too. For an internal IT assistant, alarming signs are immediate repeat questions after an answer, increased escalations to support, drop-off during an operation (user starts filling a request and abandons), and a falling share of resolved requests per session.
Automation doesn’t replace humans, so add manual review on a sample. Choose 20–50 dialogs weekly and rate them on 3–5 fixed criteria: “answered on point”, “no hallucinated facts”, “gave the next step”, “tone OK”. All reviewers must apply the criteria the same way, otherwise week-to-week scores are not comparable.
To spot regressions after prompt or model updates, keep a small test set (golden set) of typical tasks. Run it regularly and compare to the baseline by the same criteria and automatic signals.
Degradations and drift: how to notice worsening in time
Degradation in LLMs rarely looks like “everything broke”. More often it’s gradual: rising refusals, falling usefulness or more hallucinations. Latency and error charts can remain green while users are already unhappy. So monitoring must include quality signals and changes in input data.
What to treat as drift and how to measure it
Compare “today” to a baseline week when things worked well. Drift often shows up in changing request patterns: topics and intents (purchases, support, security), languages and code-switching share, prompt and response length, share of complex cases (files, tables, niche terminology), and share of attempts to bypass rules and sensitive topics.
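One simple way to put a number on such a shift is to compare the share of each intent (or language, or length bucket) between the baseline week and today. The sketch below uses total variation distance as the measure. That choice, the intent labels and the sample counts are assumptions for illustration, and other distance measures would work too.

```python
from collections import Counter


def shares(labels):
    """Convert a list of labels into a {label: share} distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(baseline, current):
    """0.0 means identical distributions, 1.0 means no overlap."""
    keys = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0))
                     for k in keys)


# Hypothetical intent labels attached to requests by a classifier or rules.
baseline_week = ["support"] * 60 + ["purchases"] * 30 + ["security"] * 10
today = ["support"] * 40 + ["purchases"] * 30 + ["security"] * 30

drift = total_variation_distance(shares(baseline_week), shares(today))
print(round(drift, 3))
```

A rising value does not say quality dropped. It says the input mix changed enough that quality metrics should be re-checked per segment.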
If the share of complex cases grows, quality can drop even without model changes. That’s normal, but it should trigger a different mode: more context, adjusted prompt or routing to a human.
Canary releases and version comparison
Any change (model version, prompt tweak, new content source, new audience) should be released as a canary: 1–5% of traffic to the new version, then gradually expand.
Build alerts like “new version is worse than old” on a 30–60 minute window. Example triggers: refusals (policy/refusal) increased by X%, user low-rating share rose by X points, share of answers where the model failed to find needed content and “made things up” increased, or security incidents rose (PII, sensitive data, bypass attempts).
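A trigger of the form “refusals increased by X%” can be sketched as a comparison of rates between the old and the new version over the same window. The default of 20% and the minimum sample size below are placeholders, and the function name `canary_is_worse` is hypothetical.

```python
def canary_is_worse(baseline_refusals, baseline_total,
                    canary_refusals, canary_total,
                    max_relative_increase=0.2, min_samples=200):
    """True if the canary refusal rate exceeds the baseline rate
    by more than max_relative_increase (0.2 means +20%)."""
    # With too few requests the comparison is mostly noise, so stay quiet.
    if baseline_total < min_samples or canary_total < min_samples:
        return False
    baseline_rate = baseline_refusals / baseline_total
    canary_rate = canary_refusals / canary_total
    if baseline_rate == 0:
        return canary_rate > 0
    relative_increase = (canary_rate - baseline_rate) / baseline_rate
    return relative_increase > max_relative_increase


# Baseline: 5% refusals. Canary: 8% refusals over the same window.
print(canary_is_worse(50, 1000, 20, 250))
```

Because a canary receives only 1–5% of traffic, the minimum sample size matters: on a short window the new version may simply not have enough requests for a meaningful comparison.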
Example: after a prompt update for an internal support chatbot the share of “make an exception to rules” requests grows while usefulness drops. That’s not “tune temperature” — it’s a signal to check whether you weakened constraints or changed instruction tone.
How to set up monitoring step by step: from zero to quality control
To make monitoring truly helpful, start with a simple agreement with the team: what normal looks like and when to wake the on-call.
1) Define SLOs, not “everything”
Pick 2–3 indicators that reflect user experience. For example: p95 response time, error share (5xx, timeouts) and one quality metric (for instance, the share of answers “accepted without edits” or an operator rating).
Then layer in additional work:
- Map the full request path and set timing budgets for stages: queue, preprocessing, model call, post-processing, storage.
- Normalize logs: model name, prompt version, parameters (temperature, max_tokens), input and output tokens, status, error reason.
- Build dashboards for real scenarios, not “general”: support chat, document search (RAG), report generation.
- Add regular runs of the test set and a short quality report (at least daily).
- Tie each metric to an action: what does the on-call do if a threshold is exceeded.
2) Make quality measurable without complex science
Simple approach: the same set of 50–200 typical queries and expected response signs (found the right document, didn’t hallucinate, kept the format). If only one scenario degrades, it often points to a concrete cause: a new prompt, index change, or model swap.
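The comparison against a baseline can be as simple as a pass rate per scenario. In the sketch below each test-set result is assumed to be already judged as passed or failed against its expected response signs. The scenario names, the result format and the 5-point tolerance are illustrative.

```python
def pass_rate_by_scenario(results):
    """results: list of {"scenario": str, "passed": bool}."""
    totals, passed = {}, {}
    for r in results:
        scenario = r["scenario"]
        totals[scenario] = totals.get(scenario, 0) + 1
        passed[scenario] = passed.get(scenario, 0) + (1 if r["passed"] else 0)
    return {s: passed[s] / totals[s] for s in totals}


def regressions(baseline_rates, current_rates, tolerance=0.05):
    """Scenarios whose pass rate dropped by more than the tolerance."""
    return sorted(
        s for s in baseline_rates
        if current_rates.get(s, 0.0) < baseline_rates[s] - tolerance
    )


# Hypothetical runs of the same test set before and after a prompt change.
baseline_run = ([{"scenario": "rag_search", "passed": True}] * 9
                + [{"scenario": "rag_search", "passed": False}] * 1
                + [{"scenario": "support_chat", "passed": True}] * 10)
current_run = ([{"scenario": "rag_search", "passed": True}] * 6
               + [{"scenario": "rag_search", "passed": False}] * 4
               + [{"scenario": "support_chat", "passed": True}] * 10)

degraded = regressions(pass_rate_by_scenario(baseline_run),
                       pass_rate_by_scenario(current_run))
print(degraded)
```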
Example: users complain that answers take too long. With stage timing and token logs you’ll quickly see where the growth is: queue time rose, or the model started generating 30% more tokens. These require different actions: scaling workers in the first case, limiting max_tokens and editing the prompt in the second.
Alerts that work: thresholds, windows and clear actions
A good alert answers two questions: what broke and what to do next. If a page requires 30 minutes of hunting to find the root cause, people will ignore alerts or get annoyed.
Keep alerts in three groups. The first group catches symptoms visible to users: p95 latency, rising queue, spike in 5xx, increased retries. The second group catches causes that may not yet affect users: provider rate limits, nearing token or budget quotas, dependency degradation (vector store, DB, network). The third monitors quality: golden set regressions, rising refusal share, increasing escalations to operators.
Prevent noisy alerts by using windows as well as thresholds. For example, “p95 > 4s for 10 minutes” is usually better than “p95 > 4s for 1 minute”. Two-stage rules help too: a warning for soft thresholds, critical for hard ones, plus deduplication so one incident doesn’t create dozens of notifications.
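The window-plus-threshold rule with two severity levels can be sketched like this. The sketch assumes one p95 value per minute. The 4-second warning threshold comes from the example above, while the 8-second critical threshold is an assumption. Deduplication is left to the alerting system.

```python
def alert_level(p95_per_minute, warn_threshold=4.0,
                crit_threshold=8.0, window=10):
    """Return "ok", "warning" or "critical".

    An alert fires only if every value in the last `window` minutes
    is above the threshold, so a single spike stays quiet.
    """
    recent = p95_per_minute[-window:]
    if len(recent) < window:
        return "ok"  # not enough data to judge a full window
    if all(v > crit_threshold for v in recent):
        return "critical"
    if all(v > warn_threshold for v in recent):
        return "warning"
    return "ok"


print(alert_level([3.0] * 5 + [5.0] * 10))   # sustained breach
print(alert_level([3.0] * 9 + [20.0]))       # one-minute spike
```

Most monitoring systems express the same idea declaratively (a condition plus a duration), so code like this is mainly useful for reasoning about thresholds or for testing them against historical data.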
Example: a support chatbot’s queue grows and p95 climbs. The first alert (symptom) says “overload”, the second (cause) shows “we hit the provider’s request limit”, the third (quality) remains quiet. Then the immediate action is simple: throttle load (limit parallelism or enable degraded responses) and then investigate limits and capacity.
Each alert needs a short runbook, otherwise responders will act randomly. Five short items are enough:
- Where to look for confirmation: charts, logs, traces.
- What to check in the first minute: queues, 5xx, limits, dependency health.
- How to quickly reduce damage: feature flag, lower temperature, disable heavy tools, revert to cache.
- How to roll back or switch to fallback mode.
- When and whom to escalate (with clear timing and criteria).
This is especially important when an LLM service runs 24/7 and depends on infrastructure and support, as in large integration projects and data centers.
Common monitoring mistakes for LLMs and how to avoid them
The first trap is trusting one pretty number and thinking everything is fine. Often that number is average response time. The mean hides painful tails: users care not about “2 seconds on average” but “sometimes 30 seconds”. Keep p95 and p99 alongside the mean, and separately track queue time and model execution time.
The second mistake is mixing different scenarios in one metric. Support, knowledge search and letter generation follow different rules: different prompts, token limits and latency tolerance. Split metrics by task type, channel (chat, API), language and critical customers so you see local failures rather than one blended average that hides them.
Third: not recording what was deployed. Without model version, prompt version, parameters (temperature, max_tokens) and context (data source, enabled tools) you can’t fairly compare before/after or roll back quickly.
Another frequent failure is alerts without an owner or clear action. If a notification doesn’t answer “who does what in the first 10 minutes”, it will be ignored.
A set of habits that usually helps:
- Build dashboards and alerts by scenario and segment.
- Keep versioning: model, prompt, parameters, features and release date.
- Write alerts with a short verification step and an owner.
- Monitor token usage per request as well as errors.
- Check p95/p99, not just the mean.
On cost: rising tokens look like a “small thing” until the bill doubles. Set limits and alerts on tokens/request and tokens/user, and separately on “cost per 1k requests” per scenario.
Quick checklist and next steps
If monitoring exists, ask weekly: do you learn about problems from an alert or from a user complaint?
Minimum daily signals to keep handy:
- Queue and latency: queue length, p95 (and preferably p99) latency, time until generation starts.
- Errors and retries: share of 5xx, timeouts, cancellations, percent of retried requests, provider limit errors.
- Tokens and cost: input and output tokens per request, cost per 1k requests, top expensive prompts.
- Quality by sample: regular review of responses (for example, 50–100 real dialogs per week) and a simple OK/not OK rating with reason.
Keep dashboards simple: one for “service health” (traffic, p95, errors), one for “costs” (tokens, spend, most expensive scenarios), and one for “quality” (share of bad responses by sample and main causes).
When an alert fires, prioritize stabilizing the service and avoid making the situation worse with releases:
- Freeze releases and roll back recent changes if they coincide with the incident.
- Enable fallbacks (templated answers, knowledge-search, simpler model) for critical scenarios.
- Limit context: trim dialog history, attachments and overly long prompts.
- Reduce load: rate limit, throttle queues, temporarily disable nonessential features.
Then plan improvements based on root causes: weekly review the most expensive requests and typical quality failures, set actions and expected impact.
If you deploy LLMs on your own infrastructure (including on-prem and data centers), embed observability, SLOs and incident processes from the start. In such projects GSE.kz (gse.kz) often participates as a system integrator: from hardware and servers to AI infrastructure and 24/7 support, so operations don’t rely on manual guesswork.
FAQ
Which LLM service metrics most often show a problem first?
Start with queue metrics and latency percentiles (p95/p99), the share of timeouts and 5xx responses, and retries. These signals usually rise before users complain and clearly indicate overloads or provider issues.
Why is a unified request_id needed and what does it give in practice?
A single `request_id` ties metrics, logs and traces for one request together: how long it waited in queue, how long the model call took, which prompt and model version were used, where a timeout happened, and whether the response was truncated. Without it you only see aggregate averages and can’t pinpoint causes for specific failures.
How to pick SLOs for an LLM without drowning in metrics?
Fix 2–3 SLOs that reflect user experience: for example, p95 response time, error/timeout rate, and one simple quality indicator. Add other metrics as diagnostics; otherwise dashboards become noisy and unhelpful during incidents.
How to quickly notice rising token costs and find the cause?
Log `input_tokens`, `output_tokens` and `tokens_per_request` for each request and look at distributions rather than the mean. Also compute estimated cost per request and per scenario so you can quickly spot which request types consume the budget and whether context or answer length grew.
How to tell provider LLM problems from issues in your application?
Treat provider-limit errors separately: rate limits, throttling, quota exhaustion, and provider-side waits. If you mix them with ordinary timeouts you’ll waste time chasing the wrong causes.
Which metrics are needed when responses are streamed?
For streaming, track two metrics: time to first token and tokens-per-second. Users tolerate longer total times if the stream starts quickly and proceeds steadily, so these signals correlate better with the feeling “the bot is stuck” than overall response time.
How to measure answer quality without complex evaluator models?
Start from simple outcomes: helped/not helped, refusal, asks for clarification, plus technical signals like very short replies, repeated phrases and broken format. Add regular human review of a small sample of dialogs (using consistent criteria) to catch regressions that latency and 5xx won’t show.
What is a golden set and how does it help catch regressions after updates?
A golden set is a small, fixed set of typical queries that you run regularly, comparing the current version to a baseline. If you also release changes as a canary on 1–5% of traffic and watch differences in refusals, cost and quality on a short window, regressions are visible before broad user impact.
How to configure alerts so they are useful, not noisy?
A good alert says what worsened, over which period, and what to check in the first minute. Use windows (for example, 10 minutes), separate symptom alerts (p95, queue, 5xx) from cause alerts (limits, quotas, dependency degradation) and keep a short runbook so responders know the immediate steps.
How to log prompts and responses without creating personal data risks?
Store prompt and response texts only when they are really needed for investigations and always apply masking of personal data. Logs should keep identifiers, prompt versions, generation parameters and technical quality signals so incidents can be investigated without leaking sensitive data.
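As a minimal illustration of masking before logging, the sketch below replaces e-mail addresses and phone-like digit sequences with placeholders. The two regular expressions are deliberately simple assumptions. They will miss many kinds of personal data (names, addresses, document numbers), so treat this as a starting point rather than a complete solution.

```python
import re

# Simplified patterns, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{8,}\d")


def mask_pii(text):
    """Replace e-mails and phone-like sequences before writing to logs."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


masked = mask_pii("Write to ivan@example.com or call +7 701 123 45 67.")
print(masked)
```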