What “zero downtime” means in practice

“Zero downtime” rarely means that not a single interruption occurs. More often it means that end users experience no noticeable service interruption and that the impact of the upgrade stays within pre-agreed limits.

When upgrading Cisco Catalyst devices, some short events are technically outages but are usually accepted because they are practically unnoticeable: brief flaps of individual ports for a few seconds, a short pause while LACP re-establishes, routing table updates, STP root re-election, or a few lost packets during failover.

To avoid arguments after the maintenance window, agree with the business on measurable success criteria in advance:

no more than one brief disconnect per client, whether on Wi‑Fi or wired
no more than X seconds of degradation for critical applications
zero manual actions required at user workplaces
verification of key services after the work
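
The criteria above can be written down as data so that the result of the window is checked mechanically rather than debated. The following is a minimal sketch in Python; the class names, field names and the 5-second default (standing in for the "X seconds" agreed with the business) are illustrative assumptions, not part of any Cisco tooling.

```python
from dataclasses import dataclass


@dataclass
class Criteria:
    # Limits agreed with the business before the window (values are examples).
    max_disconnects_per_client: int = 1
    max_degradation_s: float = 5.0  # placeholder for the agreed "X seconds"
    max_manual_actions: int = 0


@dataclass
class Measured:
    # What was actually observed during and after the window.
    disconnects_per_client: int
    degradation_s: float
    manual_actions: int
    key_services_ok: bool


def evaluate(criteria, measured):
    """Return the list of criteria that were violated (empty list = success)."""
    failures = []
    if measured.disconnects_per_client > criteria.max_disconnects_per_client:
        failures.append("disconnects")
    if measured.degradation_s > criteria.max_degradation_s:
        failures.append("degradation")
    if measured.manual_actions > criteria.max_manual_actions:
        failures.append("manual_actions")
    if not measured.key_services_ok:
        failures.append("key_services")
    return failures
```

An empty list from `evaluate` means the window met every agreed limit; anything else names exactly which expectation was missed.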

Even with ISSU, “micro-pauses” can occur. They are more noticeable on services without buffering and with continuous streams: IP telephony and call centers, payment terminals and POS, video surveillance, VDI, VPN sessions.

A simple example: an office may not notice a 2-second port reconvergence, while a call center agent may hear a click and lose a phrase. So “zero downtime” is about expectations, metrics, and the list of services that must survive the upgrade without perceptible loss.

Is your Catalyst suitable for ISSU and a non-disruptive upgrade?

Seamless upgrades are possible only where real redundancy exists. At the access layer this often means that a brief outage on one device is not critical (users remain on Wi‑Fi or move to a backup uplink). In distribution and core layers, without paired devices or chassis with a redundant supervisor, you’ll usually get only a “minimal outage,” not zero.

First, identify the device role: access, distribution, or core. The closer to the core, the higher the cost of even a second of routing loss and the stricter the redundancy requirements (two devices, dual power, dual uplinks, diverse paths).

Next, check whether the platform and configuration support seamless transition mechanisms: SSO (stateful switchover) and NSF (non-stop forwarding). If these are not supported or not enabled, ISSU can easily become a normal reload with a noticeable outage.

Practical minimum checks to confirm before scheduling a window:

exact model and deployment mode: standalone switch, stack, chassis, StackWise Virtual
redundancy in place (pairing for LACP/ECMP, HSRP/VRRP, two uplinks to different devices)
whether your IOS XE branch supports ISSU in your specific setup (features, licenses, stack type)
current and target versions defined, and whether there’s a direct upgrade path
what counts as the single point of failure: one switch, the entire stack, or the supervisor

To avoid guessing, collect facts with these commands and save the output to a file (useful for rollback too):

show version
show redundancy
show platform
show switch (for a stack)

Example: if you have a single Catalyst in a small office and all workstations depend on it, ISSU won’t help, because any upgrade puts the only forwarding path at risk. But if you have a pair of distribution switches with FHRP and duplicated uplinks, ISSU with SSO/NSF gives you a chance to survive the upgrade without session loss, provided the model, version and feature sets match.

Preparing images, compatibility and device space

Start not with commands but with checks: does the chosen IOS XE version fit your hardware and current boot mode?

Match the target IOS XE to the switch model, supervisor type (if present), memory size and current licenses. Some features and license levels (encryption, Network Advantage, DNA) may require specific licenses or restrict upgrades to certain releases.

Also verify ROMMON requirements. For some families and branches of IOS XE a ROMMON upgrade is mandatory and may require a reload. If the vendor recommends an intermediate release (upgrade to X first, then to Y), plan that rather than discovering it during the maintenance window.

Before copying the image, determine the device boot mode:

Install mode (packages.conf and a set of packages) is usually preferred for ISSU and predictable upgrades.
Bundle mode (booting directly from a .bin) is simpler but often limits seamless upgrade capabilities.

Ensure there is enough free flash not only for the new image but also for a backup. At a minimum you need space for the old .bin (or packages), the current packages.conf, and room for temporary installation files.
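
The space requirement can be turned into simple arithmetic. The sketch below assumes that the files needed for rollback are already on flash (so they are not counted again) and that installation temporarily needs about as much space as the image itself; the `expand_factor` of 2.0 and the 100 MB margin are assumptions to adjust for your platform, not vendor figures.

```python
def required_free_bytes(image_bytes, expand_factor=2.0, margin_bytes=100 * 1024**2):
    """Free flash needed: the image, temporary install files, and a margin.

    expand_factor and margin_bytes are illustrative assumptions.
    """
    return int(image_bytes * expand_factor) + margin_bytes


def flash_ok(free_bytes, image_bytes, **kwargs):
    """True if the free space reported by 'dir bootflash:' covers the estimate."""
    return free_bytes >= required_free_bytes(image_bytes, **kwargs)
```

Feed it the "bytes free" figure from the `dir` output and the size of the image file; a `False` result is the signal to plan a cleanup before the window rather than during it.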

After copying the image, verify its checksum. Typical commands:

show version
show boot
dir bootflash:
verify /md5 bootflash:<image>.bin
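
It also helps to check the file on the engineer’s laptop before it is copied, so a corrupted download is caught early. A minimal sketch using Python’s standard `hashlib`; the helper names are illustrative, and the published MD5 value is assumed to come from the vendor download page.

```python
import hashlib


def file_chunks(path, chunk_size=1024 * 1024):
    """Yield the file in chunks so large images are not loaded into memory."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def md5_of_chunks(chunks):
    """Return the hex MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_md5):
    """Compare the local file's MD5 with the value published for the image."""
    return md5_of_chunks(file_chunks(path)) == published_md5.strip().lower()
```

The same digest should then be reported by `verify /md5` on the switch after the copy; a mismatch at either point means the file must be transferred again.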

If flash is low, don’t delete files at random. First confirm what the switch currently boots from and which files are definitely needed for rollback. On Catalyst in install mode, deleting the wrong package can prevent the device from booting into the expected release even after installation.

Pre-checks before the maintenance window

The main risk is often not the upgrade command itself, but that the device was already “on the edge.” Before the window, capture a snapshot of the state so you can quickly tell whether post-change issues were caused by the upgrade or existed earlier.

Record the current version and the device role in the network (core, access, aggregation). Save a configuration backup and the outputs of key commands in a separate file. A snapshot for comparison after the window is useful: list of uplinks, port-channels, neighbors and active gateways.

Hardware and device health

Check power and cooling: presence of both PSUs (if available), fan status, temperature, and power-related logs. Port errors are important to see in advance: rising CRC/input errors on an uplink can turn a “no-downtime” plan into short interruptions when protocols restart.

Assess load: CPU and memory, free flash space, presence of crashinfo. If crashinfo already exists, investigate it before the upgrade so you’re not troubleshooting two events at once.

Redundancy and critical tables

Ensure redundancy actually works, not just “is configured.” Verify that no critical path depends on a single uplink, that EtherChannel is up and error-free, and that HSRP/VRRP roles are as expected (active/standby) with no flaps.

Before starting, capture tables that help with post-incident analysis: ARP, MAC, routes (or default route), STP state (root/port roles). If users later report loss of access to a segment, comparing MAC/ARP/STP before and after quickly shows whether traffic moved to another uplink or the root bridge changed.
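
The before/after comparison can be sketched as a set difference. The example assumes the MAC table has already been parsed into a dictionary of MAC address to (VLAN, port); the parsing step and the sample values in the usage note are hypothetical.

```python
def diff_tables(before, after):
    """Compare two {mac: (vlan, port)} snapshots.

    Returns (moved, missing, new):
      moved   - entries present in both but on a different VLAN or port
      missing - MACs seen before the window but not after
      new     - MACs seen only after the window
    """
    common = before.keys() & after.keys()
    moved = {mac: (before[mac], after[mac]) for mac in common if before[mac] != after[mac]}
    missing = sorted(before.keys() - after.keys())
    new = sorted(after.keys() - before.keys())
    return moved, missing, new
```

If a host that used to sit behind `Po1` now appears behind `Po2`, it shows up in `moved`, which points to traffic having shifted to another uplink rather than to a host failure.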

By the end of pre-checks you should have a clear set of facts: the device is stable, redundancy is alive, and you have baselines for comparison.

Quick checklist before starting the window

Before the window, verify you can manage the device even in an emergency. This takes 3–5 minutes but often saves long recovery time.

Check management access: is there a console (on-site or via a console server) and a backup path? A good minimum is a separate OOB channel or a second independent route to the management interface. Ensure you have the correct credentials and privileges, and that the device clock is synchronized.

Confirm where the image lives and how you’ll quickly transfer it in a contingency. If network transfer fails, have a fallback (alternate server or a local copy on an engineer’s laptop).

Also agree on decision points: who calls “go/no-go” and based on which signals. Typically 2–3 stop triggers are enough: loss of management, neighbor adjacency not coming up, critical errors in logs after reload.

Access: console + backup management channel (OOB/second path), verified logins and rights.
Image: location and checksum confirmed, plan for fast delivery (main and backup).
Decisions: responsible person assigned for go/no-go, stop criteria and rollback triggers recorded.
Monitoring: agreed enhanced observation after work (usually 15–30 minutes) and what to watch.
Contacts: application owner, on-call engineer and provider/operator (if involved) available.

Also agree on the length of that intensified monitoring period: how many minutes after completion the team stays online and refrains from unrelated network changes. This reduces the chance that a rare issue appears after everyone has left.

Step-by-step: standalone Catalyst with IOS XE and ISSU

On a standalone switch the key condition is simple: the platform and current configuration must support ISSU, and the device should be in install mode (IOS XE running from packages rather than a single monolithic image).

Upgrade steps (install mode and ISSU)

Start by preparing packages and the image. If the device is not in install mode, convert it in a separate maintenance window, since the conversion may require a reload.

Then proceed in sequence:

Copy the new image to the device and verify the checksum (if provided by the vendor).
Run install add and wait for completion, then verify that packages were added correctly.
Switch over to the new package set with install activate (or the ISSU variant of the command, if it is available for your platform and version).
During activation, where the platform has a standby (redundant supervisor or peer), monitor that SSO remains functional and NSF does not drop unexpectedly.
After the new version runs successfully, run install commit to finalize the upgrade.
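
The sequence above can be captured as an ordered plan so that nobody skips a checkpoint under time pressure. This is a sketch under assumptions: `build_plan` is a hypothetical helper, the image path is a placeholder, and the exact command syntax (including whether `install activate issu` is accepted) depends on your platform and release, so confirm it in the documentation for your version.

```python
def build_plan(image_path, use_issu=False):
    """Return the upgrade as an ordered list of (phase, command) pairs.

    Commands are typical IOS XE install-mode forms; verify them for your release.
    """
    activate = "install activate issu" if use_issu else "install activate"
    return [
        ("verify", f"verify /md5 {image_path}"),
        ("add", f"install add file {image_path}"),
        ("check", "show install summary"),   # confirm packages were added
        ("activate", activate),
        ("observe", "show redundancy"),      # watch state before finalizing
        ("commit", "install commit"),        # point of no return
    ]
```

Printing the plan into the change record before the window gives the team a shared script; the "observe" phase deliberately sits between activation and commit.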

What to monitor during the transition

During activation watch system logs and protocol states. In a small network a brief STP recalculation or neighbor reinitialization may look like a micro-loss but should not turn into a long outage.

Confirm success criteria immediately: ports in expected state, no recurring errors in logs, key VLANs and uplinks operational, CPU and memory not stuck at high utilization, and basic checks (ping to gateway, access to critical services) show no degradation.
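
To put a number on "no degradation," the results of a continuous ping can be reduced to the longest run of consecutive losses. A minimal sketch, assuming probes are sent at a fixed interval and recorded as True (reply) or False (loss); how the probes are collected is left to your existing tools.

```python
def longest_gap(probes, interval_s):
    """Return the longest continuous loss, in seconds.

    probes     - iterable of booleans in send order (True = reply received)
    interval_s - seconds between probes
    """
    longest = current = 0
    for ok in probes:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest * interval_s
```

With probes every 0.5 seconds, two consecutive losses give a gap of 1.0 second, which can be compared directly with the limit agreed for critical applications.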

Step-by-step: upgrading a Catalyst stack

Stacks are usually upgraded with a rolling upgrade: members reboot one by one while the network continues to operate using active/standby roles. Practically this is perceived as “no downtime” if you have uplink redundancy and a healthy stack.

First record the stack state: who is active, who is standby, member numbers and priorities. Unexpected role changes can cause interruptions even if the firmware installs correctly. A member that is “half-dead” before the window is a common surprise discovered only during the upgrade.

Before starting the upgrade

Check the items that most often save the window:

all members visible in the stack, no “half-dead” members
StackWise health and cables (errors, flapping, poor connectors)
boot mode and IOS XE version consistent across members
image available for all members and sufficient space
configuration saved, roles and priorities documented

Reiterate expectations: during a rolling upgrade some ports may blink briefly and some access devices will briefly reconnect if they lack redundancy.

Upgrade (typical order)

Align software: same image and install mode on all members.
Start installation/activation and ensure members reboot sequentially, not simultaneously.
Verify standby is ready to take active role (SSO) before rebooting the next member.
Wait until all members fully rejoin the stack before finalizing.

Example: in a 4-member stack with critical uplinks on members 1 and 2, ensure the standby on member 2 is truly ready before rebooting member 1.
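
The readiness rule from this example can be expressed as a small check run before each member reload. It is a sketch: the dictionary layout and the role and state strings ("active", "standby", "Ready") are assumptions modelled on what an engineer would read from the stack status, not a parser for real output.

```python
def can_reload(member, stack):
    """Decide whether one stack member may be reloaded now.

    stack maps member number -> {"role": ..., "state": ...}.
    Rule: every other member must be Ready, and if the member is the
    active switch, a standby must exist to take over.
    """
    others = {m: info for m, info in stack.items() if m != member}
    if any(info["state"] != "Ready" for info in others.values()):
        return False
    if stack[member]["role"] == "active":
        return any(info["role"] == "standby" for info in others.values())
    return True
```

For the 4-member stack above, `can_reload(1, stack)` stays False until member 2 reports as a ready standby, which is exactly the moment to wait for.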

After completion verify versions are consistent, the stack is healthy with no errors, and roles and priorities are preserved.

Rollback plan: when and how to go back

You need a rollback plan not because you expect failure, but so you can quickly restore service if something goes wrong. The key moment is the “point of no return”: commit. While the upgrade is not committed, returning to the previous version is usually easier and faster.

When is it still quick to revert?

With ISSU (or install flows) the device may hold both the old and new package sets. Until you commit, you are effectively testing the new version in production. After commit the system treats the new version as primary and rollback often becomes a full reload and manual boot back to the previous image.

Prepare a “plan B” in advance: the previous image must be available on flash and the boot configuration clear.

Rollback options and differences

Rollback (install/ISSU rollback): fastest option if you haven’t committed and the installation method supports rollback.
Install remove (cleaning inactive packages): this is not a rollback. It helps clean up but doesn’t restore the old version.
Manual boot: explicitly set the boot to the previous packages.conf or image and reload. This is the heaviest method but necessary if automated rollback isn’t available.

Tie the rollback decision to symptoms, not feelings:

STP loops, widespread topology change, control-plane CPU spikes
loss of adjacencies (HSRP/OSPF/BGP), uplink flapping
sharp increase in interface errors, PoE failures on critical ports
user complaints confirmed by metrics (latency, disconnects, application degradation)
situation not stabilizing after simple actions (config rollback, disabling a problematic feature)

Communication is part of the plan. Define who orders the rollback, who is notified, and what is logged:

start time of rollback and reason (1–2 sentences)
rollback method and exact steps performed
state of adjacencies and key services before and after
who was notified (NOC, service owner, security)
outcome: returned to previous state or continued upgrade

Example: if after switching to the new version massive STP changes occur and uplinks start failing across stack members, stop before commit, document symptoms, perform a rollback and investigate outside the window.

Post-checks after the window

After the upgrade it’s important not only to see that the switch booted, but to confirm the network behaves the same as (or better than) before. Some issues only surface after 10–30 minutes under load.

Start with quick facts: the software version matches the plan, the device(s) are in the expected boot mode, uptime is consistent with the maintenance time, and there are no recurring warnings in the logs. For stacks, verify that all members have returned, active/standby roles are correct, and there are no signs of desynchronization.

Then check the main risk points:

Interfaces: critical ports are up, speed/duplex correct, error counters and drops not increasing.
PoE: APs/phones receive power, no persistent power-deny.
L2/L3: STP stable, LACP port-channels intact, routing (OSPF/BGP if used) in full/established state.
User perspective: logins, telephony, Wi‑Fi, file and application access work from several segments.
System health: CPU/memory normal, temperatures OK, no new alarm/critical messages.

To gather confirmations quickly, run a standard set of commands (names may vary by model/version):

show version
show logging | last 50
show interfaces status
show interfaces counters errors
show power inline
show spanning-tree summary
show etherchannel summary
show ip ospf neighbor
show ip bgp summary
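
For the "error counters not increasing" check, two samples of the counters can be compared automatically. The sketch assumes the counters are already parsed into nested dictionaries; the interface and counter names are illustrative. Because a reload clears interface counters, it is usually more meaningful to compare two samples taken after the device is back up, some minutes apart.

```python
def rising_counters(before, after):
    """Return {interface: {counter: delta}} for counters that increased.

    before / after map interface -> {counter name: value}.
    Interfaces or counters absent from 'before' are treated as starting at 0.
    """
    rising = {}
    for intf, counters in after.items():
        base = before.get(intf, {})
        deltas = {
            name: value - base.get(name, 0)
            for name, value in counters.items()
            if value > base.get(name, 0)
        }
        if deltas:
            rising[intf] = deltas
    return rising
```

An empty result after the observation period supports the claim that the upgrade did not introduce physical-layer problems; a non-empty one names the interfaces to inspect first.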

Finally save the configuration, capture a new baseline “state snapshot” (versions, stack roles, key counters and adjacencies) and attach it to the change record. This greatly speeds post-event analysis if someone complains “it got worse after the upgrade.”

Common mistakes and pitfalls when upgrading Catalyst

Most failed windows come from upgrading “from memory” without checking platform constraints. In seamless scenarios this is especially visible: you assume ISSU will handle everything, but small details can break the plan.

Compatibility: ROMMON and boot mode

ISSU or even a normal reload can fail if ROMMON and boot mode aren’t checked. A common case: you load a new IOS XE but after reload the device goes to ROMMON due to an old boot parameter or incompatible ROMMON, and you’ve lost remote access.

Before the window check basics:

ROMMON version fits the target release and platform
boot mode matches the plan (install/bundle)
flash has enough space plus margin for packages and logs
image file is exactly for your platform and license

“Redundancy” only on paper

ISSU won’t save you if redundancy is only theoretical. One uplink, one power supply, one path to the management plane: any of these single points can break the plan. A frequent case: after a supervisor switchover the MAC address or port roles change, the uplink blips for a second, and it turns out the second uplink was never actually configured or was administratively down.

Stack: mismatched priorities and “dead” standby

A stack appears as a single device, but it is multiple switches. Mismatched priorities, different versions, or an unready standby can cause an unexpected election of the active switch and extra reboots. It is even worse if a member was already missing from the stack before the window and this is noticed only during the upgrade.

Committing too early and lack of emergency access

Commit is often run immediately, leaving no observation time. It’s safer to let the network run under load, check uplinks, adjacencies and logs, then commit.

And always plan access: if you lose network management access, without a console and a clear recovery plan rollback can become a long outage even if it’s technically possible.

Example scenario: an organization upgrade with no service interruption

Imagine a mid-size organization: a 2-switch Catalyst stack in the server room (core/aggregation), and two access switches in another building connecting workstations and telephony. The requirement: no noticeable interruption for users, except for switchover events lasting a few seconds.

The day before, the team performs a “dry run” and completes the tasks that are safe to run in the background: model and version verification, power and redundancy checks, flash space and checksum verification, collection of baselines (CPU, port errors, HSRP/VRRP or stack states), and the communication plan. Images are copied to the devices overnight and configurations are saved, so the window isn’t spent transferring files.

The 60–90 minute window is role-based: one engineer drives the console and runs commands, a second monitors services (telephony, ERP, server access), a third coordinates with on-site staff. The key metric is “is traffic alive,” not just “did the software boot”: pings to gateways, access to test services, and no increase in errors or drops on uplinks.

Rollback decisions are fact-based. For example, if a stack member’s SSO doesn’t come up, uplink errors increase, or a critical VLAN is down longer than the agreed threshold (e.g., 3–5 minutes), trigger return to the previous image using pre-agreed steps.
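
A fact-based decision like this can be reduced to a function that returns the reasons for rollback, so the go/no-go owner sees facts rather than opinions. A minimal sketch; the function name, its inputs and the 180-second default (the lower end of the 3–5 minute example) are assumptions to replace with your agreed thresholds.

```python
def rollback_reasons(sso_ready, uplink_error_delta, vlan_down_s, vlan_threshold_s=180):
    """Return the list of triggered rollback conditions (empty = keep going).

    sso_ready          - True if the standby reached its ready state
    uplink_error_delta - growth of uplink error counters since the baseline
    vlan_down_s        - how long a critical VLAN has been unavailable
    vlan_threshold_s   - agreed limit in seconds (example value)
    """
    reasons = []
    if not sso_ready:
        reasons.append("SSO not ready")
    if uplink_error_delta > 0:
        reasons.append("uplink errors increasing")
    if vlan_down_s > vlan_threshold_s:
        reasons.append("critical VLAN down beyond threshold")
    return reasons
```

Any non-empty result goes straight into the rollback log as the stated reason, together with the time it was observed.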

After the window save artifacts for the report and future upgrades: version outputs and stack/role states before and after, event logs with timestamps, error counter snapshots on critical interfaces, confirmation of availability for 3–5 reference services, final configuration file and list of changes.

Next steps: institutionalize the process and reduce risk

Seamless upgrades are not a one-time success but a repeatable process. A “standard upgrade” helps: consistent versions, clear device roles and the same checks every time.

A good practice is to choose a “golden” IOS XE release for your Catalyst model and update toward it on a schedule (quarterly or semi-annually) rather than only when forced. This reduces the chance of jumping multiple releases and facing compatibility issues.

A working standard typically includes: pre-check/post-check templates (commands, expected values, who confirms), rules for storing images and backups, stop-signal criteria for the window, an update calendar and a minimal rollback plan that has been tested in a lab at least once.

Observability is essential. If you can’t see symptoms early, even a successful window can lead to complaints later. Basic monitoring is enough: system and stack events (reboots, split-brain, changes of the active switch), port errors (CRC, input/output errors, flaps), channel quality (loss, latency, jitter on key paths), power and environment (PSU, PoE budget, temperature).

Also check hardware limits: a single PSU instead of two, an overloaded PoE budget, no uplink redundancy, ports at capacity, or a stack relying on a single cable. Often it is hardware constraints, not the firmware, that break seamless behavior.

If you need a turnkey plan (preparation, test, window and support), consider engaging a systems integrator. For example, GSE.kz (gse.kz) provides system integration and 24/7 technical infrastructure support, and can also help bring servers and workstations into compliance if you want a uniform standard across the perimeter.