SLA for an internal chatbot: measurable metrics and exceptions
SLA for an internal chatbot: how to set availability, response times and request limits, and how to handle maintenance windows, exceptions and governance.

Why an internal chatbot needs an SLA and what it should clarify
An SLA for an internal chatbot is not just a formality. It tells employees what to expect from the service day to day, and it tells IT and support what commitments they are actually taking on. Without an SLA the bot quickly becomes a source of disputes: some expect instant answers at any time, while others accept that the service runs on a best-effort basis whenever more urgent tasks appear.
The main value of an SLA is turning expectations into numbers and simple rules. When the document contains measurable indicators, ambiguity disappears: a promise can be verified. This is especially important if the bot is used widely and affects several departments.
Typically an internal chatbot handles repetitive questions and accepts requests. Most often this is HR (certificates, time off, onboarding), IT (password resets, access, software installs, incident status), facility requests (office supplies, passes, repairs), finance (approval statuses, routine questions) and training (course access, schedules, answers about policies and procedures).
The SLA should be understandable to a non-technical person. After reading it, anyone should know when the bot is available, how quickly it responds, what happens during request spikes, where to go if the bot doesn’t help, and which situations are not considered violations (for example, planned maintenance or problems at external providers).
A simple example: an employee in a branch tries to restore access at night. If the SLA already describes availability, target response time and an alternative support channel, the person has a plan, and IT has clear boundaries of responsibility.
Service boundaries: what the SLA covers and what it doesn’t
To avoid turning the SLA into a blame game, first agree on boundaries. The phrase “the bot must work well” doesn’t help. You need a clear map: what exactly is considered the service, where it operates and who is responsible for what.
“Service” usually means not only the bot logic but the whole path to the user. If the bot is available in multiple channels, these are different failure points and different metrics. The bot can be healthy while a channel (Teams or Telegram) is temporarily unavailable. Or an integration with AD, an HR system, or the knowledge base may fail.
Practical framework:
- What the SLA covers: the bot (scenarios), communication channels, key integrations and their interfaces.
- What the SLA does not cover: outages at external providers, problems on employees’ devices, errors in source data of third-party systems if not controlled by the bot owner.
Next, clarify the operating mode. “24/7” and “business hours” are not just words: they change how availability is calculated and the expected reaction to incidents. It’s often sensible to split responsibilities: the bot answers 24/7, while restoration and investigation of complex issues follow an on-call schedule.
Agree separately on scenario criticality. Password resets, access to corporate systems and IT requests are usually more critical than reference answers. Different classes of requests can have different response time targets.
And assign a service owner. It should be clear who approves changes, who leads incidents, who communicates with users and who is responsible for integrations (the source system owner or the bot team). In practice this prevents most conflicts before the first outage.
Availability: how to define and measure uptime
In an SLA it’s important to agree what “availability” actually means. The bot can open in a messenger and accept messages but not provide useful answers due to a failed integration with the knowledge base, HR system or ticketing service. That’s why it’s better to separate two levels: interface availability (you can send a message to the bot) and key-function availability (you can get an answer or perform an action).
Availability percentage is specified together with the calculation period. Most often it’s a month (easier to calculate and discuss), less often a quarter (smooths one-off incidents but hides recurring problems). The wording must be unambiguous: “availability 99.5% in a calendar month”, plus an exact definition of downtime (for example, the bot does not accept messages or does not respond for more than N seconds).
Planned downtime should be described in a separate clause and explicitly excluded from calculations, otherwise uptime will look worse because of agreed updates. Specify rules: when maintenance can be performed, how to notify in advance and what maximum maintenance window is acceptable.
Minimum set of levels:
- Basic: an uptime goal for the bot shell (access, sending messages), calculated monthly.
- Enhanced: a separate target for critical scenarios (for example, creating a support ticket), with separate accounting of dependencies.
Example: the bot as a whole targets 99.5% availability per month, while the “emergency request” function has its own 99.9% target and separate integration monitoring. This way the metric reflects real service quality, not just whether the chat opens.
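The arithmetic behind such targets is easy to sketch. The snippet below is a minimal illustration, not a prescribed method: the function names, the 30-day month and the sample downtime figures are assumptions, and it follows one common convention in which announced maintenance is removed from the measured period.

```python
from datetime import timedelta


def availability_percent(period: timedelta,
                         unplanned_downtime: timedelta,
                         planned_downtime: timedelta = timedelta(0)) -> float:
    """Availability over a period, with announced maintenance excluded.

    Assumption: planned downtime is subtracted from the measured period,
    so it neither counts as downtime nor inflates the base.
    """
    measured = period - planned_downtime
    if measured <= timedelta(0):
        raise ValueError("planned downtime covers the whole period")
    return 100.0 * (1 - unplanned_downtime / measured)


def downtime_budget(period: timedelta, target_percent: float) -> timedelta:
    """Maximum unplanned downtime that still meets the target."""
    return period * (1 - target_percent / 100.0)


month = timedelta(days=30)  # illustrative 30-day month
print(downtime_budget(month, 99.5))
print(downtime_budget(month, 99.9))
# Hypothetical month: 2 hours of incidents, 8 hours of announced maintenance.
print(availability_percent(month, timedelta(hours=2), timedelta(hours=8)))
```

Under these assumptions a 99.5% monthly target leaves about 3 hours 36 minutes of unplanned downtime in a 30-day month, and 99.9% leaves about 43 minutes. Seeing the budget in hours often makes the choice of target easier to discuss with the business.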
Response time and time to resolution: which metrics actually work
If the SLA only states “response time”, it will quickly become clear that everyone interprets this differently. It’s better to split metrics into two layers: reaction speed and time to result. Then agreements are verifiable.
Usually a few indicators are enough, but it is important to clearly describe from which event to which event the measurement runs:
- First response time: from the user sending a message to the first substantive reply from the bot (not just “received”, but a useful message).
- Time to resolution: from the user sending a message until the user has received the answer or the action is completed (for example, a ticket is created and a number is issued).
- Handoff time: from the bot’s decision “human/Service Desk required” to the actual handoff (ticket created, fields filled, sent to the queue).
- Operator wait time: from the handoff to the operator’s first reply; a separate metric if humans are involved in the process.
If the bot hands a request to an operator or Service Desk, immediately define a “stop point” rule. For example, the bot’s time to resolution is counted until the ticket is created, and further time belongs to the Service Desk SLA. The bot is then responsible for correct classification, data filling and quick transfer, but not for another team’s queue.
It’s better to separate targets by request types; otherwise one value will be either too strict or meaningless:
- FAQ and reference: fast first response and short time to resolution are important.
- Transactions (password reset, access): time to resolution depends on integrations.
- Incidents: the priority is guaranteed handoff and a clear status.
Be honest about peaks and queues. For example: targets are measured at p95 during working hours, and during peak windows the bot is allowed to degrade to answering with a queue number and expected wait time. If an external system is unavailable, record this as a separate status and exclude it from time-to-resolution calculations, but not from first-response time: the bot must quickly explain what’s happening and what to do next.
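A p95 target with this exclusion rule can be sketched as follows. This is an illustration under stated assumptions: the `Request` record, its field names and the nearest-rank percentile method are choices made for the example, not something the SLA text prescribes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    first_response_s: float        # message sent -> first substantive reply
    resolution_s: float            # message sent -> answer or created ticket
    external_outage: bool = False  # an external system was down at the time


def p95(values: list[float]) -> float:
    """Nearest-rank p95: smallest sample with at least 95% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def sla_percentiles(requests: list[Request]) -> dict[str, float]:
    # First response is always counted: the bot must explain what is happening.
    first = [r.first_response_s for r in requests]
    # Resolution time skips requests hit by a confirmed external outage.
    resolved = [r.resolution_s for r in requests if not r.external_outage]
    return {"first_response_p95": p95(first), "resolution_p95": p95(resolved)}


# Hypothetical sample: 20 normal requests plus one hit by an external outage.
sample = [Request(float(i), 2.0 * i) for i in range(1, 21)]
sample.append(Request(1.0, 900.0, external_outage=True))
print(sla_percentiles(sample))
```

Whichever percentile method is chosen, it is worth writing it into the SLA, because different methods give slightly different values on small samples.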
Request limits: quotas, queues and clear degradation
Limits are needed not to “save resources” but to protect the service. They prevent spam, accidental integration loops (when a system pings the bot hundreds of times), and overload during peak hours. Without limits you will face rare but painful incidents that are hard to justify to the business.
Specify limits with simple measurable numbers and immediately explain how they’re counted (sliding window of 1 minute or 1 hour, calendar time or only business hours). Usually it’s sufficient to fix a per-user limit, a per-division limit (so one department doesn’t consume all resources), a global service limit and separate constraints for heavy operations (for example, deep knowledge-base searches or long answer generation).
Describe in advance what happens when a limit is exceeded. “Error 429” is honest, but the user needs a clear scenario. Minimum rules:
- a clear message: what happened and when to retry;
- for safe (repeatable) operations, 1–2 automatic retry attempts with a pause;
- placement in a queue if the task is not urgent (with an estimated wait);
- degradation: a simplified reply instead of a full one (for example, without attachments or deep search).
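A sliding-window limit with a "retry after" hint can be sketched like this. It is a minimal in-memory illustration: the class name `SlidingWindowLimiter`, the key format and the sample limit are assumptions, and a production service would typically keep counters in shared storage rather than in one process.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds for each key."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits: dict[str, deque] = {}

    def check(self, key: str, now: float) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for a request at time `now`."""
        hits = self._hits.setdefault(key, deque())
        # Drop timestamps that have already left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) < self.limit:
            hits.append(now)
            return True, 0.0
        # The oldest counted request leaves the window at hits[0] + window_s.
        return False, hits[0] + self.window_s - now


# Hypothetical per-user limit of 3 requests per minute, to keep the demo short.
limiter = SlidingWindowLimiter(limit=3, window_s=60.0)
for t in (0.0, 1.0, 2.0, 3.0):
    allowed, retry_after = limiter.check("user:alice", now=t)
    print(t, allowed, retry_after)
```

The `retry_after` value is what makes the user-facing message concrete: instead of a bare error, the bot can say when to try again. The same class can be instantiated per user, per division and globally with different limits.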
Separate mass mailings and automated scenarios. If HR runs a survey for 2,000 employees or monitoring sends notifications, these should go through a service channel with agreed quotas and schedules. Otherwise one useful automation can easily push the service out of norms for everyone else.
Maintenance windows: how to plan downtime and communications
A maintenance window prevents planned downtime from looking like a sudden failure. In the SLA this is a separate block: when the service can be taken down, for how long, and how users are notified.
How to describe a maintenance window
Start simple: frequency, duration and notification rule. Most teams are fine with one regular window (for example, once a week at night) and a separate procedure for urgent patches.
Specify:
- when maintenance is performed (days of the week and local time);
- maximum window duration and allowable number of windows per month;
- how far in advance you notify (for example, 24–72 hours);
- how you notify (bot message, email, corporate channel);
- what the user will see during maintenance (clear text and expected recovery time).
Separate bot releases from integration updates. A bot release may not cause downtime, while an update to the CRM, AD or the knowledge base can temporarily disable some functions: authentication, ticket search, or knowledge-base answers. In the SLA it’s useful to describe which dependencies may degrade separately from the bot and how that appears to the user.
Rollback and counting downtime
Planned work stays routine only until the first failed release. Agree in advance on rollback: who decides, by what criteria and within what time. A simple rule: if error rates or response time exceed thresholds after an update, the team rolls back to the previous version.
Also define what counts as planned downtime and how it affects uptime. Typically planned windows are excluded from availability calculations only if they were announced in advance and stayed within agreed limits. If a window is extended, that becomes an incident: it should be included in reports and analyzed.
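The exclusion rule can be made mechanical. The sketch below is one possible reading, with assumptions labeled: the `MaintenanceWindow` record and `split_downtime` function are hypothetical names, the 48-hour notice and 2-hour limit are sample values, and only the overrun (not the whole window) is counted when a window runs long.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MaintenanceWindow:
    announced_at: datetime
    start: datetime
    planned_end: datetime
    actual_end: datetime


def split_downtime(w: MaintenanceWindow,
                   min_notice: timedelta = timedelta(hours=48),
                   max_duration: timedelta = timedelta(hours=2)):
    """Return (excluded, counted) downtime for one maintenance window."""
    total = w.actual_end - w.start
    # A window announced too late is treated entirely as unplanned downtime.
    if w.start - w.announced_at < min_notice:
        return timedelta(0), total
    # Only the part inside the agreed limits is excluded from availability.
    allowed_end = min(w.planned_end, w.start + max_duration)
    excluded = max(min(w.actual_end, allowed_end) - w.start, timedelta(0))
    return excluded, total - excluded


# Hypothetical window: announced 72 hours ahead, overran by 30 minutes.
window = MaintenanceWindow(
    announced_at=datetime(2024, 5, 9, 2, 0),
    start=datetime(2024, 5, 12, 2, 0),
    planned_end=datetime(2024, 5, 12, 4, 0),
    actual_end=datetime(2024, 5, 12, 4, 30),
)
print(split_downtime(window))
```

In this example two hours are excluded and the 30-minute overrun is counted as downtime and reported as an incident. A stricter SLA could count the whole window once it overruns; the point is to pick one rule and write it down.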
Exceptions: force majeure, external dependencies and security
Exceptions exist not to absolve the team of responsibility but to agree in advance which events are not considered violations and how the team acts in such cases. The main rule: each exception must be measurable and verifiable, otherwise it will become a convenient excuse.
External dependencies and force majeure
A common cause of outages is something you don’t directly control: the network, corporate SSO, a cloud provider, external APIs, the mail gateway. In the SLA, treat these as exceptions only when there is evidence (a provider incident, logs of unavailability, confirmation from the network team).
Common categories include:
- unavailability of external APIs/SSO needed for authentication or responses;
- incidents in the corporate network or at a provider, confirmed by a ticket or notification;
- client device failures (browser, workstation) outside the team’s responsibility;
- mass power or connectivity outages in offices;
- breaking changes by a third-party vendor.
Define force majeure narrowly: not “any unforeseen situation” but concrete event types and indicators by which they are recognized.
Security-related exceptions
Security often overrides metrics. If the bot is under attack or there’s a risk of data leakage, temporary blocks and emergency changes are acceptable. Even here, set boundaries: what counts as an attack, who decides, and how start and end times of protective mode are recorded.
To prevent exceptions from cancelling commitments, add two conditions: (1) the exception applies only to the affected function (for example, external integrations are down but basic answers still work), and (2) there is an obligation to communicate and report.
Example: if corporate SSO is down, the bot may work in a limited mode (answer FAQs but not perform actions that require authentication). In the SLA this should be described as service degradation, not full unavailability.
How to control the SLA: monitoring, reports and evidence
To keep the SLA from being just paperwork, agree in advance how you measure indicators and who accepts the results. Otherwise after the first outage the argument will be about “whose numbers are right”.
What data to collect
The minimal set usually comes from four sources: logs, metrics, traces and incident records. Logs explain what happened (errors, timeouts, upstream responses). Metrics show the numbers (availability, latencies, error rates, queue lengths). Tracing breaks response time down by steps (bot, knowledge base, integration, LLM). Incidents record the fact and the remediation timeline: when it started, when it was detected, actions taken, when restored.
Immediately define what counts as “response time”: from pressing Enter to the first bot message, or to the full answer. For evidence it’s useful to store both values and mark escalations to humans separately.
Reports and unified time rules
Reports should be understandable to non-engineers. Usually it is enough for a service owner from the business and the on-call IT team to receive and accept them. A possible publication schedule:
- an automatic short daily status on key metrics;
- a weekly SLA report and incident list;
- a monthly summary with trends and root causes.
Confirm incidents from a single “source of truth”. Most often this is monitoring (availability checks and synthetic dialogues) plus a unified time rule: one time zone, NTP sync, consistent rounding (for example, to minutes).
For alerts, set thresholds and actions in advance to avoid waking people for minor issues:
- critical: bot unavailable or 5xx above threshold for N minutes — wake on-call;
- important: increased latency or queue — open a ticket, check load;
- observation: isolated errors — log and investigate during the week.
This turns control into a clear process rather than an argument about whether obligations were met.
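The three alert levels can be expressed as a small classification function. This is a sketch only: the `Thresholds` values, the field names and the decision order are assumptions for illustration, and real thresholds should come from measured baselines.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "wake on-call"
    IMPORTANT = "open a ticket, check load"
    OBSERVATION = "log and investigate during the week"
    OK = "no action"


@dataclass(frozen=True)
class Thresholds:
    # All values are illustrative assumptions, not recommendations.
    max_5xx_rate: float = 0.05      # share of responses that are 5xx
    sustained_minutes: int = 5      # the "N minutes" from the critical rule
    max_p95_latency_s: float = 5.0
    max_queue_len: int = 50


def classify(bot_up: bool, rate_5xx: float, minutes_in_state: int,
             p95_latency_s: float, queue_len: int, errors_seen: int,
             t: Thresholds = Thresholds()) -> Severity:
    failing = (not bot_up) or rate_5xx > t.max_5xx_rate
    # Critical only when the failure has lasted long enough to rule out a blip.
    if failing and minutes_in_state >= t.sustained_minutes:
        return Severity.CRITICAL
    if p95_latency_s > t.max_p95_latency_s or queue_len > t.max_queue_len:
        return Severity.IMPORTANT
    if failing or errors_seen > 0:
        return Severity.OBSERVATION
    return Severity.OK


print(classify(bot_up=False, rate_5xx=0.0, minutes_in_state=10,
               p95_latency_s=1.0, queue_len=0, errors_seen=0).value)
```

Keeping the rule in code (or in the monitoring system's equivalent configuration) means the action for each level is agreed once, not renegotiated during an incident.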
Step-by-step: how to write an SLA and agree it with the business
A working SLA starts not with numbers but with expectations. The business usually wants “always fast responses”, while IT thinks about dependencies, risks and resources. The task is to translate expectations into measurable rules.
Practical sequence
- Gather real scenarios: find a procedure, check ticket status, IT help, knowledge base access. Classify them by criticality: critical (stops a process), important (reduces quality), convenient (nice to have).
- Fix metric definitions. What counts as availability: health check, successful API response, or the user getting an answer in chat? What is “chatbot response time”: to the first message or to the full answer? Record units (seconds, minutes, percentages) and measurement points.
- Agree on targets by level. Stricter availability and time goals for critical scenarios, softer targets for “convenient”. Honest targets with clear degradation are better than promises that can’t be proven.
- Add maintenance windows and exceptions. Specify when maintenance is possible, how you notify users, and what is excluded from calculations (for example, external provider failures, security blocks, major network incidents).
- Describe escalation and reporting. Who accepts an incident, when 2nd line is involved, what communication channel, and the monthly report format (uptime, response time percentiles, rejection rate, causes).
Example: if the bot handles internal requests, the business may require “24/7 like a support desk”. Then the SLA explicitly states support levels and reaction times and what counts as unavailability so disputes don’t appear after the fact.
Common mistakes when writing a chatbot SLA
SLAs are often written so that they sound strict but are unhelpful in practice. As a result, business, IT and users have different expectations and frustration grows.
Typical mistakes:
- Confusing “response time” with “time to resolution.” The bot may reply in 3 seconds but the real task (access to a system) is resolved in 2 hours — then a dispute starts about what “failed”.
- Specifying 99.9% availability but not defining what counts as downtime: interface unavailable, a single scenario failing, integrations down, or rising latency.
- Making commitments on behalf of external dependencies. If the bot depends on HR systems, Service Desk or user directories with their own schedules and outages, you can’t promise a single SLA for everything.
- Not describing behavior under load. During peaks the bot may go silent, duplicate answers or accept requests without results, causing support to be flooded with tickets.
- Making exceptions too broad. Phrases like “any provider problems” or “any maintenance” effectively void the SLA.
A good test: imagine Monday at 09:00, when employees flood the bot with requests about access and tickets. If the SLA doesn’t say what happens when queues form (for example, show expected wait time, limit some functions, switch to ticket intake mode), you’ve left room for chaos.
Short checklist before approving an SLA
Before signing the SLA, check that the document reads like an instruction: what exactly is the service, how it is measured and what to do if something goes wrong.
Quick checks:
- The service is described concretely: which channels are supported, who owns the service, where to write or call in and out of working hours.
- Metrics are measurable and have units: availability (percentage for a period and how downtime is counted), first response time (e.g. p95), time to resolution (what counts as resolution), request limits (requests per minute, queue behavior, overload behavior).
- Maintenance windows are formalized: when planned downtime is allowed, how users are notified, what happens if maintenance overruns.
- Exceptions aren’t a list of excuses: force majeure is narrow, external dependencies are listed (SSO, knowledge base, mail, LLM provider), and security causes have a clear procedure.
- Control and escalation are thought through: monitoring, report format and frequency, who accepts an incident, when the next support tier joins, and how SLA compliance is recorded.
Test with a real scenario: an employee creates an access request at 09:05 via the bot, the bot responds at 09:07 but the ticketing integration is unavailable. The SLA should clearly state whether this is a violation of time-to-resolution, who notifies the user, how long degradation can continue, and when the on-call engineer is involved.
If any point causes disagreement, clarify the wording before approval rather than after the first incident.
Example SLA for a simple case and next steps
A company of 2,000 employees launches an internal chatbot: it answers HR questions (certificates, leave, travel) and helps create IT tickets, sometimes handing them over to Service Desk.
To keep the SLA clear, split metrics by task type. For FAQs, speed and availability matter; for tickets, predictable handoff to support is key.
Example metrics:
- Bot availability: 99.5% per month, measured by successful responses to test queries every 1–5 minutes.
- FAQ: response time p95 no more than 5 seconds (median around 2 seconds).
- IT ticket: confirmation of ticket creation (number and status) p95 within 30 seconds.
- Handoff to operator: during business hours 09:00–18:00 operator’s first response p95 within 10 minutes.
- Limits: up to 20 requests per minute per user and up to 200 per minute for the company; under overload the bot reports a queue and estimated wait time.
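These example targets can be kept as a small machine-readable record and checked against measured values. The sketch below is hypothetical throughout: the `SlaTargets` class, the dictionary keys and the measured numbers are assumptions for illustration, while the target values are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaTargets:
    availability_pct: float = 99.5   # per month
    faq_p95_s: float = 5.0           # FAQ response time, p95
    ticket_p95_s: float = 30.0       # ticket confirmation, p95
    operator_p95_min: float = 10.0   # operator first response, business hours
    per_user_rpm: int = 20           # requests per minute per user
    global_rpm: int = 200            # requests per minute for the company


def probe_availability(successes: int, total: int) -> float:
    """Availability as the share of successful synthetic test queries."""
    if total <= 0:
        raise ValueError("no probes recorded")
    return 100.0 * successes / total


def violations(measured: dict, targets: SlaTargets = SlaTargets()) -> list:
    """Names of the targets that the measured values fail to meet."""
    found = []
    if measured["availability_pct"] < targets.availability_pct:
        found.append("availability_pct")
    for name in ("faq_p95_s", "ticket_p95_s", "operator_p95_min"):
        if measured[name] > getattr(targets, name):
            found.append(name)
    return found


# Hypothetical month: one probe every 5 minutes for 30 days, 40 of them failed.
measured = {
    "availability_pct": probe_availability(8600, 8640),
    "faq_p95_s": 4.2,
    "ticket_p95_s": 25.0,
    "operator_p95_min": 12.0,
}
print(violations(measured))
```

A record like this gives the monthly report a fixed shape: the same fields are measured every period, and a violation is a named item rather than an impression.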
Plan maintenance windows to avoid morning peaks. For example: scheduled work on Sundays 02:00–04:00 with notification at least 48 hours in advance. For emergency fixes, set a rule: do not perform changes on weekdays 09:00–11:00 unless there is a security incident.
Exceptions should be measurable. For example, Service Desk unavailability (API or ticket database), corporate network outages or SSO failures are considered external causes: the bot continues answering FAQs, but ticket creation is temporarily unavailable. Also specify that blocking suspicious requests for security reasons is not an SLA violation.
Next steps are simple: run a 4–6 week pilot, collect actual metrics (availability, p95 response times, queues, peaks), then adjust the numbers and exceptions.
If you need infrastructure for internal services with clear SLAs (for a bot, Service Desk and integrations), GSE.kz as a system integrator can help with design and support: from selecting servers and workstations to arranging operations and 24/7 technical support within the corporate perimeter.
FAQ
Why do you need an SLA for an internal chatbot at all?
An SLA records clear expectations: when the bot is available, how fast it responds, and what to do if it doesn't help. This reduces disputes between users, support and integration owners because promises become verifiable with numbers and rules.
What exactly should be considered the “chatbot service” in an SLA?
Describe the service as the path from the user to the result: communication channel, the bot itself and the key integrations without which scenarios don’t work. Separately list what is not the responsibility of the bot team so you don’t promise things that depend on other systems or employees’ devices.
How do you define chatbot “availability” to avoid disputes about outages?
Separate interface availability and function availability: sometimes the chat opens but requests can’t be created due to an integration. Define downtime unambiguously, for example as no response longer than a specified time or failed synthetic checks on critical scenarios.
How do you choose the uptime percentage and the evaluation period (month or quarter)?
Start with a realistic goal you can support and prove with measurements, usually calculated monthly. If there are critical scenarios, give them separate targets and monitoring; otherwise overall uptime can hide problems in important functions.
What is the difference between “first response” and “time to resolution”, and which is more important?
The “first response” is reaction speed, while “time to resolution” is the time until the result is achieved (for example, information delivered or a ticket created with a number). It’s better to record both in the SLA because a fast reply without a result is perceived as a service failure.
How do you split responsibility between the bot and the Service Desk when a request goes to a human?
Usually the bot is responsible for correct transfer and filling of data up to the moment a ticket is created or escalated; after that, the Service Desk SLA applies. This rule, often called the “stop point”, prevents the bot team from being held accountable for another team’s queue while keeping them responsible for proper routing.
How do you account for request peaks and queues so the SLA remains fair?
Set targets not only by averages but by percentiles, typically p95, to account for peaks. Also describe graceful degradation: the bot should quickly inform about a queue or an external-system problem and give a clear next step, even if the request cannot be resolved immediately.
Should an SLA include request rate limits, and what should happen under overload?
Limits protect the service from overload and integration loops, so specify numbers and how they are calculated. Also describe behavior when limits are exceeded so the user sees a clear message and knows when to retry the request.
How do you describe maintenance windows so planned work isn’t counted as an outage?
Treat maintenance windows as a separate rule with frequency, maximum duration and notification lead time so they aren’t mistaken for incidents. Also explain what the user will see during maintenance and set a rollback rule if the update goes wrong.
Which exceptions are appropriate in an SLA (external dependencies and security), and how do you keep them from becoming loopholes?
Exceptions must be narrow and verifiable, for example a confirmed SSO outage or an unavailable external API, not broad phrases like “any provider problems”. For security incidents, allow temporary restricted modes but define who decides, how start and end times are logged, and how users are informed.