Why do I need a pre-upgrade checklist if upgrades usually follow "the instructions"?

It ensures the upgrade is predictable and controlled. The checklist proactively covers three things: what to check before the maintenance window, how to recover quickly in case of failure, and by what criteria the upgrade should be considered successful.

What device data should I collect first before an IOS-XE/NX-OS upgrade?

At minimum: device model and role, current software version and mode (stack/vPC/dual-sup), availability of OOB/console access, free space in flash/bootflash and the boot path. Also take "before" snapshots: configs, versions, licenses, neighbor states and key interfaces.

What is most important to check in the compatibility matrix and release notes?

First check basic release requirements: ROMMON/bootrom, and for NX-OS also EPLD, plus memory and storage limits. Then ensure the upgrade path is allowed (sometimes an intermediate release is required) and there are no known issues affecting your features.

How do I avoid losing licensed features after the upgrade?

Record the current license state and which features are actually used in the config. Before the window, make sure the device can validate entitlements after reboot (account/serial binding/subscriptions), and agree in advance what to do if validation fails.

Which hardware and resource checks are truly critical before an upgrade?

The most common problems are running out of space for the new image or a backup, and RAM that is barely sufficient. Another frequent risk is module compatibility: a device may boot, but some ports or functions will be unavailable with the target release.

How do I check feature compatibility if the network has "lived its own life" for years?

Verify what is actually enabled and running: routing protocols, VRFs, QoS, ACLs, HA mechanisms, telemetry and AAA. Then check whether defaults or command syntax change in the target release, because those silent changes often cause degradations without explicit errors.

Which "before" snapshots should I save to quickly find what broke later?

Back up running/startup-config and save key command outputs for comparison: version, inventory, licenses, neighbor states, routing table, port and aggregate status. Also confirm console and OOB access work, otherwise you may lose management when problems occur.

How do I run a pilot upgrade so it actually helps rather than being a checkbox?

Pick devices that are similar to production but low-risk, and define success criteria beyond "the device came up." In the pilot measure real timings (copy, install, reboot, protocol convergence) and perform the rollback once to ensure it is fast and repeatable.

Which rollback triggers should be fixed in advance to avoid arguments during an incident?

Define what rollback means for your environment: booting the old image, switching boot variables, reverting install packages, or restoring a pre-upgrade config. Triggers must be measurable: management access not restored, HA role not recovered, critical protocols not up within the agreed time, or a required licensed feature unavailable.

In what order should devices in a vPC/stack network be upgraded and why does it matter?

Start with less critical nodes and don’t mix many changes at once so you can quickly isolate the cause. In HA pairs and stacks follow the exact order and verify after each step; otherwise mismatched versions or roles can cause prolonged outages.

Pre-upgrade checklist for Cisco IOS-XE and NX-OS: compatibility

Why a pre-upgrade checklist matters and what it covers

Upgrading Cisco IOS-XE or NX-OS often looks simple: copy the image, reboot, verify. In reality, failures usually stem not from the firmware itself but from surrounding details that weren’t checked in advance.

The same risks recur: service downtime from an unexpected reboot or a long boot, loss of features (protocols, ciphers, telemetry), and hardware or module incompatibility. Licenses are a separate pain point: a device may boot, but some capabilities may be unavailable or require reactivation.

A pre-upgrade checklist serves three goals: predictability, speed of recovery, and clear success criteria. It helps you verify in advance the hardware and resources (memory, space for images, module condition), the release version and its limitations, the actual configuration and the features in use, and the licenses and dependencies (VRF, EVPN, FEX, ISSU, encryption, TACACS/RADIUS, etc.). It also prompts you to prepare rollback steps.

Deliverables should include:

a list of devices and their role in the service (what is critical, what can be touched first)
a work plan with a maintenance window and responsible people
backups of configurations and “before” snapshots for comparison
a rollback plan and triggers for when to run it

In system-integration projects (for example, networks serving a hospital or bank) such a checklist often saves hours of downtime: decisions are made ahead of time, not during an incident.

Inventory: what you need to know about each device

Before opening the maintenance window, collect the inventory. This is the basis for upgrading IOS-XE and NX-OS: without accurate data it's easy to pick the wrong image, license, or even reboot order.

Create a short card for each device — only what affects the upgrade and downtime risk:

model, serial number, role (core, access, edge), site and service criticality
current software version, ROMMON/BIOS version (if applicable), uptime and last reboot date
operation mode: standalone, stack/StackWise, vPC/VSS, FEX, dual-sup, ISSU (if used)
where images are stored and free space in bootflash/sup-bootflash
who is responsible and how you will access the device if the network partially fails (OOB, console, terminal server)
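
The device card above can be kept as structured data so that gaps are caught before the window. This is a minimal sketch, not a prescribed schema: the class `DeviceCard`, its field names, the `gaps` helper and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class DeviceCard:
    """Per-device inventory card; all field names are illustrative."""
    hostname: str
    model: str
    serial: str
    role: str                      # core, access, edge
    site: str
    criticality: str               # e.g. "high", "medium", "low"
    sw_version: str
    rommon_version: Optional[str]  # None where not applicable
    uptime_days: int
    mode: str                      # standalone, stack, vpc, dual-sup, ...
    image_location: str            # e.g. "bootflash:"
    free_bytes: int
    owner: str
    emergency_access: List[str] = field(default_factory=list)  # OOB, console

    def gaps(self) -> List[str]:
        """Return names of fields left empty (empty string or empty list)."""
        return [name for name, value in asdict(self).items()
                if value == "" or value == []]


# Placeholder values only; nothing here describes a real device.
card = DeviceCard(
    hostname="core-sw-1", model="MODEL-PLACEHOLDER", serial="SERIAL-PLACEHOLDER",
    role="core", site="HQ", criticality="high", sw_version="current-release",
    rommon_version=None, uptime_days=412, mode="vpc",
    image_location="bootflash:", free_bytes=3_000_000_000, owner="",
)
print(card.gaps())
```

A card that still reports gaps (here the owner and the emergency access path) is a signal that the inventory step is not finished.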

Then capture the configuration and record what’s hard to reconstruct from memory: interface list and purpose, VRF, routing (OSPF/BGP/EIGRP), QoS, ACLs, NAT, DHCP, NTP, SNMP, TACACS/RADIUS.

Call out non-standard items separately: EEM applets, custom scripts, auto-provisioning, monitoring integrations, compliance requirements. These details are often the first to break after an upgrade.

Example: if you have a pair of Nexus switches in vPC, note the critical VLANs and port-channels, which side is primary, and whether peer-gateway is enabled. That helps avoid confusing the operation order and quickly spot changed behavior after reboot.

Version compatibility matrix: what to look for in the documentation

A compatibility matrix helps you understand in advance what will change when moving to the target release and where the upgrade will hit limitations. For the checklist it’s useful to consolidate: platform, current version, target version, requirements, risks and the allowed upgrade path.

Minimum requirements and the release “plumbing”

Start with basic platform requirements rather than new features. Release notes and matrices usually list:

minimum bootrom/ROMMON versions, and for NX-OS also EPLD
recommended versions (not only “supported” but “recommended”)
memory, storage and media-type limits
specifics for certain models and modules (line cards, supervisor modules)

Check whether the low-level components meet the requirements. Otherwise you risk partial boots or problems after reboot.

What changes in behavior and how to upgrade

Next look for changes that are not obvious: deprecations, new default values, command syntax changes, updates to cipher or authentication requirements. Also confirm the upgrade path: whether you can jump directly or need intermediate releases.

To enable quick decisions, prepare a short stoplist per platform:

ROMMON/EPLD upgrade required, but there’s no window for it
deprecated commands are in use
the target release does not support your module/transceiver/feature
jumps across versions are forbidden and an intermediate step is required
a known critical bug exists for your scenario (EVPN, HSRP, QoS, etc.)
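
One way to make the stoplist usable under time pressure is to encode it as a small check that returns the blocking items. The sketch below assumes the findings have already been collected by hand from release notes and device outputs. `ReleaseFindings` and `stoplist_hits` are hypothetical names, and treating an intermediate release as a blocker only when it is not planned is an interpretation, not something the stoplist states.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReleaseFindings:
    """Facts gathered manually from release notes and the device (illustrative)."""
    firmware_upgrade_required: bool   # ROMMON/EPLD below the stated minimum
    window_covers_firmware: bool      # time for that upgrade is in the window
    deprecated_commands: List[str] = field(default_factory=list)
    unsupported_items: List[str] = field(default_factory=list)  # modules, optics, features
    intermediate_release_required: bool = False
    intermediate_release_planned: bool = False
    critical_bugs: List[str] = field(default_factory=list)


def stoplist_hits(f: ReleaseFindings) -> List[str]:
    """Return the stoplist items that currently block the upgrade."""
    hits = []
    if f.firmware_upgrade_required and not f.window_covers_firmware:
        hits.append("ROMMON/EPLD upgrade required, but no window for it")
    if f.deprecated_commands:
        hits.append("deprecated commands in use: " + ", ".join(f.deprecated_commands))
    if f.unsupported_items:
        hits.append("not supported by target release: " + ", ".join(f.unsupported_items))
    if f.intermediate_release_required and not f.intermediate_release_planned:
        hits.append("intermediate release required but not planned")
    if f.critical_bugs:
        hits.append("known critical bugs: " + ", ".join(f.critical_bugs))
    return hits


findings = ReleaseFindings(
    firmware_upgrade_required=True,
    window_covers_firmware=False,
    intermediate_release_required=True,
    intermediate_release_planned=True,
)
print(stoplist_hits(findings))
```

An empty result means no stoplist item applies; any non-empty result is a reason to pause and decide before the window, not during it.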

Licenses: how not to lose features after the upgrade

An upgrade of IOS-XE or NX-OS can change not only fixed bugs and available features but also how the device validates entitlements. Check licenses as strictly as the image itself; otherwise capabilities may be unexpectedly downgraded after reboot.

First capture the current state: which packages and features are active and what is actually used in the configuration. For IOS-XE check Smart Licensing and SKUs (Network Essentials/Advantage), for NX-OS check installed feature sets and their status. Note the truly critical items: AAA and encryption, routing, segmentation, VXLAN/EVPN, telemetry.

Quick pre-window checks:

save license command outputs and store them with the config backup
verify binding to serial number, Smart Account/Virtual Account and subscription expiry
ensure the target release does not change the licensing model for your platform
check for grace periods and what will be disabled when they expire

Example: after an upgrade a switch boots but VXLAN fails to start because the device couldn't reach the licensing service. Pre-agree the actions: who confirms purchase or transfer, who has account access, and what temporary measures are acceptable (move the window, rollback, temporary license, route service to a backup path).

Hardware and resources: modules, memory, image storage

Before upgrading, verify not only the software version but also the hardware: which modules are installed, how much free space is in flash/bootflash, whether RAM is sufficient, and whether there are hidden power or cooling issues. Such problems tend to surface first after a reboot.

Start by inventorying modules and their compatibility with the target release. For IOS-XE this is especially important on chassis with mixed line cards and supervisor modules; for NX-OS it matters on platforms with separate modules. If a module is unsupported, the device may boot without some ports or functions.

Check resources and resilience margins:

free space in flash/bootflash for the new image and a backup of the old one
sufficient RAM for the chosen image (not "barely enough")
CPU load and general stability before the window
PSU and fan health
correctness of boot variables and boot order

Commands depend on the platform, but typically collect this minimum:

show inventory
show version
dir flash: or dir bootflash: (depending on the platform)
show platform resources (IOS-XE) or show system resources (NX-OS)
show environment
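
The free-space item from the list above reduces to simple arithmetic once the numbers are read from the directory listing. A minimal sketch: the function name `space_shortfall` and the 10% safety margin are assumptions, and the byte counts are invented.

```python
def space_shortfall(free_bytes: int, new_image_bytes: int,
                    old_image_backup_bytes: int = 0,
                    margin_percent: int = 10) -> int:
    """Return how many bytes are missing; 0 means the files fit.

    margin_percent is an assumed safety margin on top of the file sizes.
    """
    total = new_image_bytes + old_image_backup_bytes
    needed = total + total * margin_percent // 100
    return max(0, needed - free_bytes)


# Invented sizes: two images of 900 MB each.
print(space_shortfall(2_000_000_000, 900_000_000, 900_000_000))
print(space_shortfall(1_500_000_000, 900_000_000, 900_000_000))
```

Returning the shortfall rather than a yes/no answer tells you how much to clean up before the window.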

For stacks, clusters and pairs (e.g. vPC) ensure device and module revisions don't conflict and that the “hardware + target release” combination is supported. Practical example: in a stack one switch with a different memory revision may boot slower or fail to accept the image, and the stack may not fully come up.

Feature compatibility: compare real config with the target release

A common mistake is to check "how it should be" rather than "how it actually runs." Start with facts: the current configuration, enabled features, neighbors, policies. This is the feature-level compatibility check.

Build a list of features in use

Extract features from command outputs and configs rather than from memory. On NX-OS, the show feature output lists what is enabled; on IOS-XE, search running-config for routing, security, QoS and monitoring sections. Also record the operation mode: L2/L3, VRF, protocols, stack/vPC, active/standby.

Common useful commands:

show running-config (and interface, routing, security blocks separately)
show feature or equivalent
show ip route, show ip bgp summary, show ip ospf neighbor
show vpc, show switch (StackWise), show redundancy
show policy-map interface, show access-lists

Compare with target release and catch “silent” changes

For each feature found, check two things: whether it is supported in the target version and whether default behavior changes. The latter is sneakier: traffic can degrade without explicit errors.
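
The two checks can be expressed as set operations once both lists exist. This is only a sketch: `compare_features` is a hypothetical helper, and the feature names and their support status are invented for illustration, not taken from any release note.

```python
from typing import Dict, List, Set


def compare_features(in_use: Set[str], supported_in_target: Set[str],
                     changed_defaults: Set[str]) -> Dict[str, List[str]]:
    """Split the features in use into blockers and items needing a closer look."""
    return {
        # in use today but absent from the target release
        "unsupported": sorted(in_use - supported_in_target),
        # in use and listed with changed default behavior
        "review_defaults": sorted(in_use & changed_defaults),
    }


# Invented sample data for illustration only.
in_use = {"bgp", "ospf", "vpc", "hsrp", "qos", "netflow"}
supported = {"bgp", "ospf", "vpc", "hsrp", "qos"}
changed = {"qos", "hsrp", "lacp"}

report = compare_features(in_use, supported, changed)
print(report)
```

Items under `review_defaults` are the candidates for the "must-test during the window" list, since they are the ones that can degrade traffic without raising an error.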

Specifically review scenarios that often break on incompatibilities:

HA: SSO (stateful switchover), NSF, ISSU (if planned)
vPC/Stack and FHRP (HSRP/VRRP) at borders
BGP/OSPF and policy objects (route-map, prefix-list)
QoS (classes, queues, trust boundaries)
ACLs, NetFlow/telemetry

Example: a pair of switches in vPC handle an HSRP gateway and have QoS on an uplink. After an upgrade the QoS application order or default timers might change. You may not see a drop, but latency increases and traffic skews to one link. Mark such items as must-test during the window.

"Before" snapshots: backups, checkpoints and emergency access

Before the upgrade take a state snapshot so you can quickly see what changed and revert if needed. The common issue is not the image but the lack of backups, console access, or anything to compare before/after.

Save the configuration and record which images and files reside in flash/bootflash.

show running-config
show startup-config
dir flash:
dir bootflash:
show version

Then capture brief verification outputs for post-upgrade comparison. Don’t collect hundreds of pages: enough to confirm network and service health.

neighbors: CDP/LLDP, routing protocol status
route tables and ARP/ND
interface and Port-Channel state
power, modules, interface errors
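
Saved outputs are only useful if they are easy to compare afterwards. A minimal sketch using Python's standard `difflib`, assuming each output was saved as plain text; `snapshot_diff` is a hypothetical helper and the sample lines are simplified placeholders, not real CLI output.

```python
import difflib
from typing import List


def snapshot_diff(before: str, after: str, label: str) -> List[str]:
    """Return a unified diff of two saved outputs; an empty list means no change."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile=f"{label} (before)", tofile=f"{label} (after)",
        lineterm="",
    ))


# Simplified placeholder text standing in for saved command output.
before = "Po10 up\nPo20 up"
after = "Po10 down\nPo20 up"

for line in snapshot_diff(before, after, "port-channels"):
    print(line)
```

Counters and timestamps change on every run, so this works best on outputs that are stable by nature (neighbors, port-channel state, module list).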

Before copying the new image, verify its checksum (MD5/SHA) against the value published for the release, and confirm there is space for a second image if you want to keep a standby file nearby.
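
On the staging server the checksum comparison can be done with Python's standard `hashlib`. A minimal sketch: `file_digest` and `matches_published` are hypothetical helper names, the default algorithm is an assumption, and the file path in the usage comment is a placeholder.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha512",
                chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large images do not have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str,
                      algorithm: str = "sha512") -> bool:
    """Compare the computed digest with the published one, ignoring case and whitespace."""
    return file_digest(path, algorithm) == published_hex.strip().lower()


# Usage (placeholder path and value):
# ok = matches_published("/srv/images/new-image.bin", "<published sha512>")
```

Use the same algorithm as the published value; a mismatch means the file should be downloaded again rather than copied to the device.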

Finally, confirm emergency access. Ensure that the console (local or via a console server) and out-of-band access (mgmt VRF, dedicated port) both work, and confirm credentials in advance.

Pilot and testing: how to validate the upgrade with low risk

A pilot proves the checklist in reality, not on assumptions. First upgrade a small part of the network that mirrors production but won’t break business if something goes wrong.

Start with clear success criteria. "Device booted" is not enough. Predefine what must match before and after:

neighbors and routing: BGP/OSPF/EIGRP sessions are up, routing tables are populated and comparable to the baseline
switching: VLANs, trunks, STP with no unexpected recalculations or blocks
access: AAA/TACACS/RADIUS, SSH, SNMP, NTP, syslog function as before
performance and errors: CPU/memory normal, no increase in drops, CRCs, or flaps
critical features: VXLAN, vPC, HSRP, QoS, NetFlow (or whatever is critical for you)

Select pilot devices by the “low risk but similar config” principle. A good candidate is a pair of access switches and one distribution node that share the same features as production but have alternate paths.

Measure timings: copying the image, checksum verification, install, reboot, protocol convergence and sync (for stacks/clusters). These numbers form the real estimate for the window.
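
Turning the measured timings into a window estimate can be as simple as taking the worst case per step and adding a buffer. The sketch below assumes timings were recorded in minutes per pilot run; `window_estimate`, the step names, the 30% buffer and the sample numbers are all hypothetical.

```python
import math
from typing import Dict, List, Tuple


def window_estimate(runs: List[Dict[str, int]],
                    buffer_ratio: float = 0.3) -> Tuple[Dict[str, int], int]:
    """Return worst-case minutes per step and a total with an assumed buffer."""
    worst: Dict[str, int] = {}
    for run in runs:
        for step, minutes in run.items():
            worst[step] = max(worst.get(step, 0), minutes)
    total = sum(worst.values())
    return worst, math.ceil(total * (1 + buffer_ratio))


# Invented pilot measurements, in minutes.
pilot_runs = [
    {"copy": 12, "install": 20, "reboot": 9, "convergence": 4},
    {"copy": 15, "install": 18, "reboot": 11, "convergence": 6},
]
worst, minutes_needed = window_estimate(pilot_runs)
print(worst, minutes_needed)
```

Using the worst observed value per step, not the average, keeps the estimate honest when one device in the pilot was noticeably slower.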

Importantly, perform the rollback manually at least once. If rollback in the pilot is not fast and predictable, it will not be in production either.

Rollback plan: exactly what to roll back and based on which triggers

Plan rollback before the upgrade because there is no time for calm decisions during failure. Define what "return" means: boot from the previous image, revert packages in install mode (add/remove/activate/commit), enter rescue, or switch boot variables to the old file.

A good rollback plan is a short step-by-step guide with commands, expected outcomes and decision points. Example: "if the device boots but critical protocols don't come up, do X; if that doesn't resolve within N minutes, do Y and boot the old image." Predefine who decides to roll back and how service restoration is confirmed.

Rollback triggers

Predefine symptoms you consider critical and how long you will wait:

device fails to reach operational state or drops to ROMMON/loader
management access (SSH/console via OOB) not restored within the agreed time
failure of the agreed validation checks (routing, HSRP/VRRP, BGP/OSPF, vPC/Port-Channel, L2/L3 gateways)
recurring crashes, high CPU or module errors after boot
loss of a licensed feature required for service
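
Triggers are easier to apply during an incident when each has an explicit threshold. The sketch below shows one possible encoding: the class `PostUpgradeState`, the function `rollback_reasons`, and every deadline value are assumptions to be replaced with the times you actually agree on.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostUpgradeState:
    """Observed state after reload; fields are illustrative."""
    minutes_since_reload: int
    operational: bool      # False if stuck in ROMMON/loader or still booting
    mgmt_access_ok: bool   # SSH or console via OOB works
    failed_checks: List[str] = field(default_factory=list)
    crash_count: int = 0
    missing_licensed_features: List[str] = field(default_factory=list)


def rollback_reasons(s: PostUpgradeState, boot_deadline: int = 20,
                     mgmt_deadline: int = 25, checks_deadline: int = 40) -> List[str]:
    """Return the triggers that have fired; deadlines are assumed values in minutes."""
    reasons = []
    if not s.operational and s.minutes_since_reload >= boot_deadline:
        reasons.append("device not operational")
    if not s.mgmt_access_ok and s.minutes_since_reload >= mgmt_deadline:
        reasons.append("management access not restored")
    if s.failed_checks and s.minutes_since_reload >= checks_deadline:
        reasons.append("validation checks failed: " + ", ".join(s.failed_checks))
    if s.crash_count >= 2:
        reasons.append("recurring crashes")
    if s.missing_licensed_features:
        reasons.append("licensed feature missing: "
                       + ", ".join(s.missing_licensed_features))
    return reasons


early = PostUpgradeState(10, operational=True, mgmt_access_ok=True, failed_checks=["bgp"])
late = PostUpgradeState(45, operational=True, mgmt_access_ok=True, failed_checks=["bgp"])
print(rollback_reasons(early), rollback_reasons(late))
```

The same failed check produces no trigger at 10 minutes and a trigger at 45, which is the point of agreeing on waiting times in advance.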

Before the window ensure the old image is actually available on the device (flash/bootflash), intact (checksum) and configured as the fallback. Also evaluate config compatibility on rollback: new commands introduced after the upgrade may not be supported by the old release. Practical approach: keep a "pre-upgrade" config backup and be ready to restore it immediately after booting the old release.

Example: on a pair of NX-OS switches with vPC decide in advance that if the vPC peer-link is down for more than 10 minutes you first roll back the secondary switch, verify vPC recovery, and only then decide about the primary.

Step-by-step work plan: from the maintenance window to service confirmation

Describe the maintenance window as a timed scenario with checkpoints, not "we’ll upgrade and see." This helps keep pace, stop on time, and avoid extending downtime.

Window timeline

Estimate durations for each step and add buffers:

T-30 min: final confirmation of the window, access and console, verify images and checksums
T0: start, freeze changes, record baseline metrics (CPU, memory, uptime, interface errors)
T+X: perform the upgrade as planned, monitor boot and network return
T+Y: run agreed service checks (routing, VLAN/VRF, access to critical systems)
T+Z: buffer for fixes or rollback decision
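
With durations filled in, the timeline above becomes a list of clock times that everyone on the bridge can follow. A minimal sketch using the standard `datetime` module; `build_timeline`, the start time and all durations are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def build_timeline(t0: datetime,
                   phases: List[Tuple[str, int]]) -> List[Tuple[str, datetime]]:
    """Return (label, start time) checkpoints; phases are (label, minutes)."""
    timeline = [("T-30 min: final confirmation", t0 - timedelta(minutes=30))]
    cursor = t0
    for label, minutes in phases:
        timeline.append((label, cursor))
        cursor += timedelta(minutes=minutes)
    timeline.append(("window ends", cursor))
    return timeline


# Invented durations in minutes.
phases = [
    ("T0: freeze changes, record baseline", 10),
    ("T+X: upgrade, boot, network return", 68),
    ("T+Y: service checks", 20),
    ("T+Z: buffer for fixes or rollback decision", 30),
]
timeline = build_timeline(datetime(2025, 1, 11, 22, 0), phases)
for label, when in timeline:
    print(when.strftime("%H:%M"), label)
```

If the computed end time falls outside the approved window, the plan needs a shorter scope or a longer window before the work starts.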

Roles, communications and emergencies

Assign roles to individuals, not departments. One person should not both implement and accept the work.

implementer: makes changes and keeps a time-stamped action log
service verifier: checks applications/telephony/access and gives “OK”
communications: sends business status updates using a template (start, interim, finished/rolled back)
vendor/integrator on-call: available for escalation (e.g. NOC)
decision owner for "rollback/continue": decides based on pre-agreed triggers

For emergencies decide ahead: who opens a bridge, where is backup access (OOB/console), whether a spare device/module exists, and where the "point of no return" is after which the window is moved.

Common mistakes: what most often breaks an upgrade

The most common reason for failure is upgrading on the hope that it will work out. The image is copied, the window is short, and "we did it this way before." Later you find that features are missing because of licenses, and the only access was SSH, which doesn’t come up after reboot. Minimum prerequisites: verified licenses and working console access (or another out-of-band path).

The second common issue is the lack of a baseline. Without a record of the config, versions, modules and key counters from before the work, it is unclear after the upgrade whether a problem is new or pre-existing, and it is hard to prove what changed.

The third group of errors involves HA, stacks and vPC. Even when one switch upgrades successfully, the pair can still break due to mismatched images, different module revisions, or mixed-mode limitations. In those setups confirm that both sides support the chosen release and the same installation method.

And one more: "copying the image" is not the same as "switching to it." For IOS-XE and NX-OS the command order, activation confirmation and reboot are critical.

Quick checklist: before start and immediately after upgrade

When the window is open you don’t have time to hunt "where the image is." Below are items you can realistically check in 10 minutes.

Before start:

correct image and checksum (verified on the file on the device or the server)
fresh backup of running-config and key show outputs (versions, modules, licenses, neighbors)
confirmed OOB or console access and working credentials (not only TACACS but a local fallback)
free storage in bootflash and a clear boot path (what will load after reboot)
rollback plan: which image to restore, where it is, and who gives the command "rollback"

Immediately after upgrade don’t try to test everything. First confirm basic operability and resiliency:

ports and aggregates: up/down, errors, speed/duplex, port-channel status
routing and neighbors: OSPF/BGP/ISIS, HSRP/VRRP, routing table and ARP/ND
HA: stack/vPC/VSS/SSO state, roles, sync, no split-brain
service: a few simple tests (ping across VLANs, gateway access, check a key application)
logs: critical messages, crashinfo/core, licensing events

Stop and roll back if the device doesn’t reach the required HA role, key features are missing (e.g. encryption or routing), or interface error counts rise. If the issue is isolated (one neighbor, one VLAN), it may be better to continue troubleshooting with a timer and a clear threshold for returning to the previous release.

Practical example: a straightforward upgrade window without surprises

A typical office: two core switches run as a pair (vPC and HSRP gateway), and a border router peers with the provider via BGP and enforces inbound ACLs. The goal is to raise IOS-XE/NX-OS versions without losing connectivity and control.

One week before the window the team assembled the real config and show outputs and compared them with the target release. Critical checks were simple but precise: vPC state (role, peer-link, consistency), active HSRP, BGP neighbor count and routes, ACL drop counters, and that monitoring sees key metrics (interfaces, CPU, logs).

The order was "less critical to more critical." The team first upgraded the switch that was not currently the active HSRP gateway, confirmed that vPC came up consistently, and then upgraded the second switch. The border router was left for last to avoid mixing possible causes.

At the end they prepared a short report:

what was upgraded: models, old and new versions, image checksums
what was checked: vPC/HSRP, BGP, ACL, monitoring, OOB/console access
what changed: log warnings, config diffs, license status
outcome: downtime (if any), deviations from plan, rollback decision (not needed)

Next steps: consolidate the result and prepare for the next release

After the upgrade collect all facts into a single package and note what worked and what required manual actions. Then the next release becomes repeatable.

Prepare the package so any on-call engineer can open it in six months and quickly understand the context:

versions before and after, commands run and their outputs
exact image names, storage locations and hashes
final checklist with marks of who did what
short pilot report and found quirks
rollback plan and triggers: what causes rollback

Update internal docs and templates. If a special module, license or feature dependency showed up, add it to the pre-upgrade template.

If you use a system integrator for upgrades, ask not only for "the device is up" but for a checklist of tested scenarios and acceptance criteria. At GSE.kz (gse.kz) as a system integrator this is usually delivered as a separate test list, plus a pre-agreed rollback plan and access details (OOB/console) for emergencies.