24/7 Support for Critical Systems: L1–L3 and Runbooks
24/7 support for critical systems: how to assign L1–L3, write runbooks, and configure alerts and escalations so incidents don’t hang.

Why incidents “hang” in 24/7 support
A “stuck” incident is one where a problem is visible but doesn’t move forward. There’s no clear owner, no defined response time, and the status stays “in progress” for months. This is especially noticeable at night and on weekends: fewer people are available, and the cost of mistakes is higher.
Most often 24/7 support breaks down not because of “bad engineers,” but because of organizational gaps. The on‑call sees an alert but isn’t sure if it’s their area. Contacts are outdated, the right person doesn't answer, and no backup is assigned. Or there are so many alerts that important ones get lost in the noise and the team gets used to “checking later.”
The cost of delays is clear: a small degradation turns into service downtime, lost transactions or data, and cascading failures in adjacent systems. For organizations with availability obligations (government services, banks, clinics, educational platforms) this is also an SLA and compliance risk.
Almost everything that saves the night is prepared in advance, not solved ad hoc. You need: one person responsible for the incident at any moment and a rule to assign them; clear L1/L2/L3 role separation with responsibility boundaries; reaction and escalation timings; up‑to‑date contacts and backup channels; closure criteria and mandatory post‑incident notes.
A simple example: at night part of the virtual infrastructure on a host fails. If L1 doesn't know where to check metrics or who to call, they spend 30 minutes “searching for the owner.” With clear roles and instructions those 30 minutes become 3: confirm the symptom, perform first steps, escalate on a timer, and keep the owner informed until recovery.
Criticality, priorities and clear time objectives
If support has no common answer to “how urgent is this?”, incidents start to hang between shifts. First agree on the list of critical services and their components: application, network, servers, storage, integrations, access (AD/SSO), backups. Record not only service names, but what counts as “working/not working.”
Separate these terms:
Incident — a service failure or degradation that prevents users from working or risks data loss.
Request — “do/add/change” without an outage.
Task — planned change (upgrade, migration, configuration).
When this is written in one place, L1 won’t spend the night on “fix the printer,” and L3 won’t be distracted by folder access issues.
Simple priority rules (P1–P4)
Priority is better defined by impact and urgency, not by who’s loudest in chat:
- P1: service unavailable for many users or threat to data/security.
- P2: severe degradation, there’s a workaround but business suffers.
- P3: partial failure for a small group, low risk.
- P4: single issue, consultation, “not urgent.”
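The four levels above can be written down as a small lookup so that every shift classifies the same way. This is a minimal sketch: the `Impact` categories and the `classify_priority` helper are illustrative names, not part of any standard or tool.

```python
from enum import Enum


class Impact(Enum):
    """Impact categories mirroring the P1-P4 list (names are illustrative)."""
    MANY_USERS_DOWN = 1      # service unavailable for many users
    SEVERE_DEGRADATION = 2   # workaround exists, but the business suffers
    SMALL_GROUP = 3          # partial failure, low risk
    SINGLE_ISSUE = 4         # single issue, consultation, not urgent


def classify_priority(impact: Impact, data_or_security_threat: bool = False) -> str:
    """Map impact to P1-P4; a data or security threat is always P1."""
    if data_or_security_threat:
        return "P1"
    return f"P{impact.value}"


print(classify_priority(Impact.SMALL_GROUP))
print(classify_priority(Impact.SINGLE_ISSUE, data_or_security_threat=True))
```

The point of the explicit flag is that a threat to data or security overrides the number of affected users.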
To keep support manageable, set goals not only for “recover,” but also for “keep informed.”
Time objectives: response, recovery, status
A set that can realistically be maintained in a shift:
- time to first response (taken into work);
- time to escalate if there’s no progress;
- target recovery time;
- status update frequency for P1/P2;
- time to close and a short note (what was done).
Who approves and changes these rules: the service owner (business or IT) sets criticality and SLA, the head of support owns the process, changes go through a short approval (for example, quarterly or after major incidents). This way priorities don’t drift and L1/L2/L3 decisions look the same at 02:00 and at midday.
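One way to keep these objectives in a single place is a small table in code or configuration. The sketch below is only an illustration: every number in it is a placeholder assumption, and the real values are set by the service owner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeObjectives:
    """Per-priority targets in minutes (all values below are placeholders)."""
    first_response: int
    escalate_without_progress: int
    target_recovery: int
    status_update_every: int


# Hypothetical numbers for illustration only; agree on real ones with the service owner.
OBJECTIVES = {
    "P1": TimeObjectives(first_response=10, escalate_without_progress=10,
                         target_recovery=60, status_update_every=15),
    "P2": TimeObjectives(first_response=20, escalate_without_progress=20,
                         target_recovery=240, status_update_every=30),
}


def deadlines_passed(priority: str, elapsed_minutes: int) -> list:
    """Return the objectives whose deadline has already passed since the incident started."""
    o = OBJECTIVES[priority]
    deadlines = {
        "first_response": o.first_response,
        "escalate_without_progress": o.escalate_without_progress,
        "target_recovery": o.target_recovery,
    }
    return [name for name, limit in deadlines.items() if elapsed_minutes > limit]


print(deadlines_passed("P1", 15))
```

The helper only reports which deadlines have elapsed; whether the milestone was actually met is for the incident record to say.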
How to split L1, L2 and L3 without conflicts
To prevent on‑call from turning into endless chats, split roles by three things: depth of diagnosis, permission to change, and responsibility for communications. Then each level has a clear control area and the incident always has an owner.
L1 — the entry point. Accepts an alert or report, checks basic symptoms, gathers facts (time, affected services, recent changes), runs predefined steps from the runbook, and keeps the business informed: what’s happening, current status, and when the next update will be. L1 shouldn’t spend hours hunting for a root cause. Their job is to quickly classify and pass the case along with quality data.
L2 gets involved when deeper diagnosis and configuration changes are needed: restarting components, switching to a backup, rolling back parameters, working with infrastructure and adjacent teams. L2 often resolves the issue without development work, but owns the technical recovery plan.
L3 — developers and architects. Needed when the cause is complex: a code bug, design flaw, or degradation from rare conditions. L3 provides fixes, assesses risks, and plans releases or temporary mitigations.
To avoid disputes over responsibilities, fix boundaries:
- L1 owns the incident to closure: timeline, statuses, and protocol.
- L2 owns recovery when configuration or infrastructure changes are required.
- L3 owns the root cause and long‑term fix.
- Any production change has a clear executor and approver.
- Escalation is timer‑driven: if a step yields no result, the level is raised.
A small RACI matrix for common tasks is useful: who is accountable and approves (A), who is responsible for execution (R), who is consulted (C), who is informed (I). For example, status to leadership: A = L1, C = L2, I = L3; configuration change: A/R = L2, C = L3, I = L1.
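The two examples above fit into a plain dictionary that can live next to the runbooks. A minimal sketch: the dictionary layout and the `roles_of` helper are assumptions made for illustration.

```python
# RACI entries taken from the two examples in the text; the layout is illustrative.
RACI = {
    "status to leadership": {"A": ["L1"], "C": ["L2"], "I": ["L3"]},
    "configuration change": {"A": ["L2"], "R": ["L2"], "C": ["L3"], "I": ["L1"]},
}


def roles_of(level: str, task: str) -> list:
    """Return the RACI letters a support level holds for a given task."""
    return [letter for letter, levels in RACI[task].items() if level in levels]


print(roles_of("L2", "configuration change"))
print(roles_of("L3", "status to leadership"))
```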
Duty shifts: schedules, handover and backups
24/7 coverage can be organized in different ways, and the choice affects reaction speed and team fatigue. For truly critical systems shifts often run (for example) 2/2 or 12/12, with a person always at the console. On‑call is cheaper but riskier where every minute of downtime is costly. A common compromise: L1 on shift, L2/L3 on‑call for rare but complex cases.
The on‑call needs the rights, tools and contacts from minute one. Minimum: permissions sufficient for diagnostics and safe actions; monitoring, logs, remote access and message templates; list of L2/L3 on‑call and the service owner; a backup channel if the corporate messenger is down; and clear authority boundaries.
Handover is a short ritual that saves hours. Say aloud and record: open incidents and status, temporary mitigations, risky changes in the next hours, recent alerts (what was already checked and what was done). If a hospital’s servers show delayed responses at night, the incoming shift should get not only the fact but also hypotheses, logs and the next step.
Backups must be formal: a deputy for every shift, a clear response time (e.g. 5–10 minutes) and a procedure for replacement. Decide in advance who confirms the swap and where it’s recorded so there’s no “I thought you were on call.”
Keep a simple metric set to keep shifts manageable: incidents per shift and average time to response; proportion of false positives and main causes; number of escalations and how many were late; load by day/hour; number of replacements and missed shifts.
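Several of these metrics can be computed directly from incident records. The sketch below assumes a hypothetical record format (the field names `response_min`, `false_positive`, `escalated` and `escalated_late` are invented for the example) and covers only part of the metric set.

```python
def shift_metrics(incidents: list) -> dict:
    """Summarize one shift from a list of incident dicts (field names are assumptions)."""
    total = len(incidents)
    if total == 0:
        return {"incidents": 0, "avg_response_min": None, "false_positive_share": None,
                "escalations": 0, "late_escalations": 0}
    return {
        "incidents": total,
        "avg_response_min": sum(i["response_min"] for i in incidents) / total,
        "false_positive_share": sum(i["false_positive"] for i in incidents) / total,
        "escalations": sum(i["escalated"] for i in incidents),
        "late_escalations": sum(i["escalated_late"] for i in incidents),
    }


night_shift = [
    {"response_min": 4, "false_positive": False, "escalated": True, "escalated_late": False},
    {"response_min": 8, "false_positive": True, "escalated": False, "escalated_late": False},
    {"response_min": 6, "false_positive": False, "escalated": True, "escalated_late": True},
]
print(shift_metrics(night_shift))
```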
Escalations: rules, timings and channels
Escalation exists not to “punish” but to keep an incident from stalling. Simple rule: if the current level cannot advance the plan, it must involve the next level according to a preset timer.
Ladder and timers: when to raise the level
Typical ladder: L1 contains and stabilizes, L2 dives deeper and makes changes within the playbook, L3 is brought in when code, architecture or a vendor is required. Predefine signs that move the incident up: no progress, need for access, growing data risk, or multiple systems affected.
To avoid “five more minutes” debates, set timers by priority. Example:
- P1 (service down): escalate L1→L2 after 10 minutes without a clear plan; L2→L3 after 20 minutes without recovery.
- P2 (severe degradation): L1→L2 after 20 minutes; L2→L3 after 40 minutes.
- Any priority: escalate immediately if there are security incident signs.
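The example timers above translate into a short decision function. This is a sketch under assumptions: the minutes are counted from the moment the current level took the incident, P3/P4 have no timer because the example does not give one, and all names are illustrative.

```python
from typing import Optional

# Minutes without progress before raising the level, taken from the example above.
ESCALATION_MINUTES = {
    "P1": {"L1": 10, "L2": 20},
    "P2": {"L1": 20, "L2": 40},
}
NEXT_LEVEL = {"L1": "L2", "L2": "L3"}


def escalation_target(priority: str, level: str, minutes_without_progress: int,
                      security_signs: bool = False) -> Optional[str]:
    """Return the level to involve now, or None if the current level keeps working."""
    if level not in NEXT_LEVEL:
        return None  # L3 is the top of this ladder
    if security_signs:
        return NEXT_LEVEL[level]  # any priority: escalate immediately
    limit = ESCALATION_MINUTES.get(priority, {}).get(level)
    if limit is not None and minutes_without_progress >= limit:
        return NEXT_LEVEL[level]
    return None


print(escalation_target("P1", "L1", 10))
print(escalation_target("P2", "L1", 5))
```

Because the decision depends only on priority, level and elapsed time, it gives the same answer at 02:00 and at midday.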
Channels and message template
Define primary and backup channels. For example: primary — chat and a call to the on‑call, backup — SMS or a second messenger. Also describe what to do if a channel is unavailable: who calls, how many attempts, and after how many minutes the shift lead is involved.
Escalation messages should save time. Five points are enough: what is broken and since when (symptoms, affected services); priority and business impact; what’s been done and the result; what is needed from the recipient; where you are available.
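The five points can be turned into a fixed template so that nobody composes the message from scratch at night. A minimal sketch; the function name, parameter names and sample values are illustrative.

```python
def escalation_message(what_broke: str, since: str, priority: str, impact: str,
                       done_so_far: str, need_from_you: str, reach_me: str) -> str:
    """Build the five-point escalation brief as plain text."""
    points = [
        f"1. What/since: {what_broke} (since {since})",
        f"2. Priority/impact: {priority}, {impact}",
        f"3. Done so far: {done_so_far}",
        f"4. Needed from you: {need_from_you}",
        f"5. Reach me: {reach_me}",
    ]
    return "\n".join(points)


print(escalation_message(
    what_broke="Billing API returns 5xx",
    since="02:13",
    priority="P1",
    impact="users cannot complete payments",
    done_so_far="safe restart, no improvement",
    need_from_you="check recent config changes",
    reach_me="incident chat and phone",
))
```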
Separate escalations by direction: infrastructure (network, servers, virtualization), security (suspicious activity, accounts), business (stop operations, switch to manual mode). If a hospital or bank service fails at night, L1 simultaneously raises L2 for infrastructure and notifies the business owner about potential operational stops without waiting for a “perfect diagnosis.”
Runbook: how to write instructions that actually work at night
A runbook is useful only if someone can act by it at 03:00 without guessing. It’s insurance against stuck incidents and varying interpretations of “what to do next.”
A good runbook starts with recognizable symptoms and quick checks: “application unreachable alert,” “disk almost full,” “user authentication errors.” Then provide short safe steps for L1 and clear escalation conditions.
Minimal template that works
Keep a consistent structure so you don’t search for the needed piece at night:
- symptoms and what counts as an incident (1–2 lines)
- quick checks (5–10 minutes) and expected results
- remediation steps: one by one, with success criteria
- rollback: how to revert if it gets worse
- escalation: whom to wake and what to attach (logs, metrics, start time)
If appropriate, add specifics: commands, log paths, an example “normal” message in logs. Avoid steps that require guessing like “check that everything is fine.”
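If runbooks are stored in a structured form, the template can be checked automatically. The sketch below assumes each runbook is a dictionary keyed by section name; the key names are invented for the example.

```python
# Section keys are illustrative names for the five template items above.
REQUIRED_SECTIONS = ("symptoms", "quick_checks", "remediation", "rollback", "escalation")


def missing_sections(runbook: dict) -> list:
    """Return required sections that are absent or empty in a runbook."""
    return [name for name in REQUIRED_SECTIONS if not runbook.get(name)]


draft = {
    "symptoms": "Application unreachable alert",
    "quick_checks": ["open the dashboard", "try a typical operation"],
    "remediation": [],          # present but empty, so it counts as missing
    "escalation": "L2 on-call, attach logs and start time",
}
print(missing_sections(draft))
```

A check like this catches a runbook without a rollback section before someone needs that section at 03:00.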
Responsibility boundaries and safety
Runbooks should answer “what can be done without approval.” Clearly separate: allowed actions (e.g. restarting a service with a button or script, collecting logs, switching to a backup by the runbook) and forbidden actions without L2/L3 (config edits, manual DB changes, disabling protections, deleting data). Record any action: time, what was done, and how symptoms changed.
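The allowed/forbidden split and the action record can be sketched together. The action names below are hypothetical labels for the examples in the text, and logging refused attempts as well is a design choice of this sketch, not a requirement.

```python
from datetime import datetime, timezone

# Hypothetical action labels derived from the examples in the text.
ALLOWED_FOR_L1 = {"restart_service", "collect_logs", "switch_to_backup"}
NEEDS_L2_L3 = {"edit_config", "manual_db_change", "disable_protection", "delete_data"}


def record_action(log: list, level: str, action: str, symptom_change: str) -> bool:
    """Log the action with a timestamp and return whether this level may perform it."""
    allowed = action in ALLOWED_FOR_L1 or (level in ("L2", "L3") and action in NEEDS_L2_L3)
    log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "action": action,
        "allowed": allowed,
        "symptom_change": symptom_change,
    })
    return allowed


action_log = []
print(record_action(action_log, "L1", "restart_service", "errors dropped for 2 minutes"))
print(record_action(action_log, "L1", "edit_config", "not performed"))
```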
Store runbooks where on‑call can access them: a single location, searchable, with read rights. Each document should have an owner, version, date and a review plan (e.g. quarterly or after major incidents). If you support a fleet of workstations and servers, update instructions after hardware, firmware or backup scheme changes, otherwise night procedures will quickly become obsolete.
Alerts: how to set them so people actually read them
In 24/7 support an alert should help you decide what to do in a minute, not create another alarm stream. The rule is simple: every signal either leads to an action or shouldn’t exist.
First define what to alert on. Usually five zones are enough: user impact (unavailability, spike in errors, degraded response); application errors (repeats, spikes, critical exceptions); resources (CPU, memory, disk, but only when they truly risk outage); dependency availability (DB, external APIs, load balancers, network); queues and background jobs (queue growth, lag, stall).
Then cut the noise, otherwise alerts will be ignored. Deduplication (one incident — one thread), suppression of similar events and maintenance windows for planned work help. A good indicator: if a signal is often closed as “self‑resolved,” thresholds or logic are wrong.
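Deduplication and maintenance windows can be illustrated with a small filter. This is a sketch, not how any particular monitoring product works: the alert fields, the 300-second window and the function name are assumptions.

```python
def filter_alerts(alerts, maintenance, dedup_window=300):
    """Drop alerts inside maintenance windows and repeats of the same (service, check).

    alerts: list of dicts with 'ts' (seconds), 'service', 'check', sorted by 'ts'.
    maintenance: dict mapping service -> list of (start_ts, end_ts) windows.
    """
    last_seen = {}
    kept = []
    for alert in alerts:
        windows = maintenance.get(alert["service"], [])
        if any(start <= alert["ts"] <= end for start, end in windows):
            continue  # planned work: suppress
        key = (alert["service"], alert["check"])
        previous = last_seen.get(key)
        last_seen[key] = alert["ts"]
        if previous is not None and alert["ts"] - previous < dedup_window:
            continue  # repeat of a signal that already has a thread
        kept.append(alert)
    return kept


alerts = [
    {"ts": 0, "service": "billing", "check": "5xx"},
    {"ts": 60, "service": "billing", "check": "5xx"},
    {"ts": 120, "service": "auth", "check": "latency"},
    {"ts": 500, "service": "billing", "check": "5xx"},
]
print(filter_alerts(alerts, maintenance={"auth": [(100, 200)]}))
```

The window slides with each repeat, so a signal that keeps firing stays in one thread instead of reopening every five minutes.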
Routing must be predictable: by service (owner), by priority (P1, P2) and by level (L1 checks basics, L2 dives deeper, L3 is involved for code/architecture changes). Decide in advance which alerts wake people at night and which can wait until morning.
Don’t forget the “alert silence”: if monitoring is broken and everything is green, that’s also an incident. Check that metrics arrive, checks run, and notification channels work.
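A simple way to catch silent monitoring is a freshness check on the monitoring sources themselves. In this sketch the source names and the 300-second threshold are assumptions.

```python
def stale_sources(last_seen: dict, now: float, max_age_seconds: float = 300) -> list:
    """Return monitoring sources that have not reported within max_age_seconds."""
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age_seconds)


# Hypothetical last-report timestamps (seconds) for three monitoring components.
last_reports = {"metrics": 1000, "synthetic_check": 400, "notifier": 990}
print(stale_sources(last_reports, now=1000))
```

A non-empty result is itself a signal worth routing, because a green dashboard fed by a stale source proves nothing.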
Alert text should answer: what happened and how urgent; where exactly (service, environment, node); a quick way to validate the hypothesis (1–2 steps); what to do now; where to find the runbook and which step.
Example phrasing: “P1: Billing API 5xx > 10% for 5min, prod, pod‑3. Checks: DB status, recent deploys. Action: shift traffic to backup, open runbook Billing‑API step 2.”
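Generating the text from fixed fields keeps every alert in the same shape. A minimal sketch that reproduces the example phrasing; the function and parameter names are illustrative.

```python
def alert_text(priority: str, what: str, where: str, checks: list,
               action: str, runbook: str, step: int) -> str:
    """Compose an alert line: urgency, what and where, quick checks, action, runbook step."""
    return (f"{priority}: {what}, {where}. "
            f"Checks: {', '.join(checks)}. "
            f"Action: {action}, open runbook {runbook} step {step}.")


print(alert_text(
    priority="P1",
    what="Billing API 5xx > 10% for 5min",
    where="prod, pod-3",
    checks=["DB status", "recent deploys"],
    action="shift traffic to backup",
    runbook="Billing-API",
    step=2,
))
```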
Step‑by‑step process from alert to incident closure
To prevent incidents from hanging, the most important thing is a single route for each case. An incident should have a single entry point and a single owner.
First assign a single channel where alerts and reports land (for example, a service desk or incident chat) and assign the shift owner. They accept the event, create a record, give it a number, check for duplicates, and immediately assign a primary executor. Also set a first‑response time rule (e.g. 5–10 minutes for P1) so it’s clear when to raise the next link in the chain.
The executor follows the runbook: collects facts and performs safe first steps. Agree in advance which data are mandatory: exact start time and symptom; impact on users and services; metrics and logs that confirm the problem; recent changes (deploy, config, updates); what’s already been tried and the result.
If the problem isn’t resolved in the allotted window, escalation is triggered by timer, not emotion. Along with the escalation send a short brief: facts, hypothesis, what was done, and what’s needed.
While working, update status on schedule (e.g. every 15–30 minutes for P1). Close the incident only with a summary: what happened, what was done, how recovery was confirmed, and the cause (even if preliminary).
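Both rules, the status schedule and the closing summary, are easy to check mechanically. In this sketch the field names are assumptions and the 15-minute default is the lower bound of the P1 example above.

```python
from datetime import datetime, timedelta

# Field names are illustrative labels for the four summary items above.
REQUIRED_SUMMARY = ("what_happened", "what_was_done", "recovery_confirmed_by", "cause")


def status_update_overdue(last_update: datetime, now: datetime,
                          interval_minutes: int = 15) -> bool:
    """True if more than interval_minutes passed since the last status update."""
    return now - last_update > timedelta(minutes=interval_minutes)


def missing_for_closure(summary: dict) -> list:
    """Return summary fields that must be filled before the incident can be closed."""
    return [field for field in REQUIRED_SUMMARY if not summary.get(field)]


print(status_update_overdue(datetime(2024, 1, 1, 2, 13), datetime(2024, 1, 1, 2, 30)))
print(missing_for_closure({"what_happened": "5xx spike", "what_was_done": "config rollback",
                           "cause": "preliminary: config change"}))
```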
Within 1–2 days after the incident, record follow‑ups: what to change in alerts (thresholds, noise, routing), what to add to the runbook, and what tasks to schedule to prevent recurrence. If the failure happened in a server rack, you often discover that monitoring didn’t show disk degradation in advance — a direct candidate for improvement.
Common mistakes in 24/7 support and how to avoid them
Even with a strong team the process breaks on small things. It’s usually not about people, but about rules not being fixed and not tested under real night conditions.
One of the most costly mistakes is an on‑call without the access they need. At 02:15 they start “asking in chat” for log, console, monitoring or restart permissions instead of acting. Time is lost, the SLA clock keeps running, and people argue over blame afterwards. Practical fix: predefine minimal L1 rights and a fast, safe escalation path with confirmation for risky actions (e.g. node reboot).
The second common issue is the lack of an incident owner. Many people participate, many messages fly, but no one manages the timeline, records decisions or closes loose ends. Assign an Incident Owner as soon as the incident is confirmed (often L1) and keep that role until closure.
Escalations are also often set “by feeling.” Escalating too early exhausts L2/L3 and creates conflicts. Escalating too late causes downtime. The rule: escalate by time and conditions (no progress, no data, rising risks), not by emotional intensity.
Typical failures and remedies:
- no access for the on‑call — grant roles in advance and test them in a drill;
- no incident owner — record Incident Owner and contact channel from minute one;
- escalations without timers — set thresholds (e.g. 10–15 minutes without progress) and handover criteria;
- runbooks that are “for show” — run through them monthly and update after changes;
- alerts without context — add “what this means” and first 2–3 actions in the notification.
Without post‑incident work, outages repeat. After closure, answer three questions: what broke, why we didn’t detect it earlier, and what we will change (alerts, runbook, access).
Example scenario: a night outage and L1–L3 working by the rules
02:13. A critical operations processing service fails: users see errors and cannot complete actions. Monitoring raises an alert “5xx errors above threshold for 3 minutes” and calls the on‑call.
L1 covers the first 10–15 minutes. They confirm the alert isn't false: check the dashboard, attempt a typical operation, and see whether one region is affected or everything. Then they open an incident, record the start time, save key facts and run the runbook’s first steps: a safe component restart, check dependency availability (DB or queue), and compare with the last known “normal” state.
After 10 minutes it’s clear the quick steps didn’t help, and L1 escalates to L2 per the rule: “if no improvement in 10 minutes, raise second level.” L2 checks recent changes: deploys, configs, certificates, limits. They find that a planned config change increased a dependency’s response time. L2 rolls back the config to the last stable version and restores the dependency. Errors drop and the service recovers.
L3 is involved not to fight the fire, but to find the cause. They investigate why the change passed unnoticed, what checks were skipped, and how to prevent recurrence. They produce a plan: an additional test, throttling, code fix, and a change to deployment policy.
A short timeline is recorded:
- 02:13 alert and confirmation
- 02:18 incident opened, runbook steps started
- 02:23 escalation to L2
- 02:31 rollback and service recovery
- 02:45 service stable and monitoring normal
Post‑incident review covers what failed, why, detection time, duration, what worked and what didn’t. Update alerts (for example, add a signal for dependency degradation before 5xx) and the runbook (explicit step “check recent changes and rollback by template”) so L1 won’t waste time guessing next time.
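The durations for the review follow directly from the recorded timeline. A small sketch that computes minutes since the alert for each entry, assuming all timestamps fall on the same calendar day.

```python
from datetime import datetime

# Timeline entries copied from the scenario above.
TIMELINE = [
    ("02:13", "alert and confirmation"),
    ("02:18", "incident opened, runbook steps started"),
    ("02:23", "escalation to L2"),
    ("02:31", "rollback and service recovery"),
    ("02:45", "service stable and monitoring normal"),
]


def minutes_since_alert(timeline: list) -> dict:
    """Return {event: minutes elapsed since the first entry} (same-day timestamps assumed)."""
    start = datetime.strptime(timeline[0][0], "%H:%M")
    return {
        event: int((datetime.strptime(stamp, "%H:%M") - start).total_seconds() // 60)
        for stamp, event in timeline
    }


for event, minutes in minutes_since_alert(TIMELINE).items():
    print(f"+{minutes:>2} min  {event}")
```

Here escalation happened 10 minutes after the alert, recovery after 18 minutes, and the service was confirmed stable after 32 minutes.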
Short checklist and next steps for implementation
If incidents are hanging, the cause is usually holes in the process: no owner, unclear priority, no escalation timer, lack of night context. Check the basics you can do in 30 minutes.
Make sure the team acts the same for every alert:
- an incident owner is assigned (one person owns to closure, even if tasks go to L2/L3);
- priority and time goals are recorded (when work should start and when escalation must fire);
- escalation timer starts immediately on registration;
- primary communication channel is clear and a backup is defined;
- on‑call and deputy are preassigned, handover is short and consistent.
Next, review runbooks for critical services. The night on‑call should open the document and understand within a minute: what broke, how to confirm the symptom, how to safely roll back, and who to call. Minimum for each service: the first 3–5 diagnostic steps, the criterion for “this is L2/L3,” the owner contact and a list of typical risks (for example, what not to restart without approval).
For alerts the logic is simple: every signal must have a meaning and a route. If notifications duplicate or arrive "just in case," they get ignored. Start with the 10–20 most important signals and make them "see — understand — act."
The next step is a short audit of critical systems: list services, allowable downtime, single points of failure, owners, current monitoring and communication channels. Then adapt the process to your infrastructure and test it in a drill.
If you need help organizing 24/7 support and integrating infrastructure, GSE.kz can provide not only support practice but also system integration experience for organizations with high availability requirements (including servers and workstations).
FAQ
Why do incidents “hang” in 24/7 support and what should I do first?
By default assign an **Incident Owner** immediately after confirming the incident (usually the on‑call L1) and record this in the incident entry. The owner doesn't have to "fix everything" personally, but they must:
- keep the timeline and statuses;
- start escalation timers;
- gather facts and hand them to L2/L3;
- close the incident with a summary.
How to quickly and consistently determine priorities P1–P4?
The simplest rule: priority is determined by **impact** and **urgency**, not by who shouts the loudest. A practical starting point:
- **P1**: service unavailable / threat to data or security;
- **P2**: severe degradation, there is a workaround but the business suffers;
- **P3**: partial failure affecting a small group;
- **P4**: single request / consultation.
Also immediately document what “working/not working” means for each critical service.
Which SLAs/timings are needed so the process doesn't fall apart at night?
Minimum set you can realistically keep during a shift:
- time to **first response** (taken into work);
- time **to escalate** if there is no progress;
- target **recovery time**;
- frequency of **status updates** (especially for P1/P2);
- time to **close** and write a short note.
If you pick one thing, start with an escalation timer: it’s the best protection against incidents "hanging."
How to split L1/L2/L3 so nobody says “that's not my area”?
Separate by three things: depth of diagnosis, permission to change, and responsibility for communications. Practical scheme:
- **L1**: confirms symptom, collects facts, performs safe runbook steps, maintains status;
- **L2**: makes config/infrastructure changes, owns recovery plan;
- **L3**: investigates root cause (code/architecture) and provides long‑term fixes.
Document boundaries: what L1 can do alone and what requires L2/L3.
When to escalate and how to avoid "just five more minutes"?
By default escalate **by timer** if there is no clear progress. Example working timers:
- **P1**: L1→L2 after 10 minutes without a plan; L2→L3 after 20 minutes without recovery.
- **P2**: L1→L2 after 20 minutes; L2→L3 after 40 minutes.
Escalate immediately if there are signs of a security incident or risk of data loss.
What exactly to include in an escalation message to speed up help?
Keep a short 5‑item template:
1) what broke and since when (symptoms);
2) priority and business/user impact;
3) what was done already and the result;
4) what you need from the recipient (access, action, decision);
5) where you are available and when the next update will be.
This lets L2/L3 start working immediately instead of extracting details in chat.
What kind of runbook actually helps rather than lying around "for show"?
A runbook must allow someone to act at 03:00 without guessing. Minimal template:
- symptoms and the criterion that this is an incident;
- quick checks for 5–10 minutes with expected outcomes;
- step‑by‑step mitigation actions with success criteria;
- rollback: how to return to the previous state if things get worse;
- escalation: who to wake and which logs/metrics to attach.
Also list what L1 may do without approval and what is forbidden.
How to tune alerts so the on‑call doesn't ignore them?
Simple rule: **every alert must lead to an action**. To reduce noise:
- deduplicate (one incident, one thread);
- suppress repeated events and group similar ones;
- use maintenance windows for planned work;
- revisit thresholds if a signal is often closed as "self‑resolved."
Start with 10–20 most important signals and make them "see, understand, act."
Which duty model (shifts or on‑call) usually works for critical services?
For critical systems a common compromise works best:
- **L1 on shift** (person at the console);
- **L2/L3 on‑call** for complex, rare cases.
Minimum at shift start:
- access for diagnostics and safe actions;
- monitoring, logs and remote access;
- up‑to‑date contacts for on‑call personnel and service owners;
- a backup communication channel;
- clear boundaries of authority.
If the on‑call lacks access, the incident is almost guaranteed to drag on.
What to do during handover and after incident closure?
Handover should be short and consistent every time. Record:
- open incidents and current status;
- what steps have been taken and the results;
- temporary workarounds;
- risky changes planned for the next hours;
- the "next step" and who is assigned to it.
After closing an incident leave the minimum: what broke, what was done, how recovery was confirmed, and what will be changed (alerts/runbook/access) to prevent recurrence.