Where 25/100GbE problems usually start

Troubleshooting 25/100GbE almost always begins the same way: you see an unstable link, odd packet loss, or throughput that never gets close to the expected rate. Don’t swap modules or cables “at random.” First document the symptoms and collect basic data. That saves hours and often prevents unnecessary replacements.

Common symptoms include:

a flapping link (up/down) or long time to come up after replugging
rising error counters while traffic can still look “normal” at times
packet loss under load
throughput drops, retransmits, and application latency complaints
an intermittent problem that disappears after rerouting

On 25/100G, small physical defects quickly turn into noticeable errors. The reason is simple: there is less signal margin, and the requirements for optics and fiber are stricter. Dust on a connector, a tight bend, the wrong fiber type, poor connector contact, or an aging module may not cause a hard outage at all. Instead, errors (CRC/FEC) rise and the fault then looks like a “network problem.”

Four things are often confused: a faulty transceiver, a bad patchcord, contamination (connectors and adapters), and a port or configuration issue. For example, a module may be fine but installed on an incompatible fiber type, or the port may be configured with a speed or FEC mode that does not match the far end.

Before doing anything, collect at least these facts (put them in a note or ticket):

switch model and port number showing the issue
transceiver type and model (SFP28/QSFP28, SR/LR/CR, vendor)
link length and media type (DAC, MMF/SMF, connector types)
which side shows symptoms (one end or both)
when it started and what was changed last (patchcord, cross-connect, module, settings)

Simple example: a 100G link comes up but errors grow under load. If you immediately change the module, you might “fix” it by chance simply by disturbing a dirty connector. Better to record measurements first and only then touch the physical layer.

What CRC and FEC mean in simple terms

CRC (usually seen as CRC errors or input errors) means frames arrive damaged and fail checksum verification. In Ethernet this is almost always a physical-layer issue: optics, copper, connectors, contamination, bad patchcord, a kink, an incompatible module, or sometimes a faulty port.

FEC (Forward Error Correction) is an in-line “safety net.” A link can look up while FEC quietly corrects symbol errors. So FEC counters may grow even without visible traffic loss or application complaints. This is especially typical for 25/100GbE where signal margin is smaller and optics/fiber quality requirements are higher.

How to interpret CRC and FEC together

The raw numbers alone mean little. What matters is which counter grows and how fast:

FEC grows while CRC stays stable: the link is on the edge but corrections are coping. Often due to low signal level, dirty connector, or wrong fiber type.
CRCs grow (even if FEC also rises): correction is no longer enough and frames are failing. Usually a physical defect or bad mating.
CRCs appear in spikes under load or when the cable is moved: suggestive of a bad patchcord, connector, or mechanical contact.
Counters don’t grow but there are complaints: look higher in the stack for congestion, queue drops, or MTU issues rather than the physical layer.
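The first, second, and fourth cases above can be written down as a small decision helper. This is a minimal Python sketch, not a vendor tool: the function name classify_window and its inputs (counter growth over one observation window) are assumptions made for illustration. Spike detection for the third case would need a time series rather than two totals, so it is left out.

```python
def classify_window(crc_delta: int, fec_corrected_delta: int,
                    complaints: bool = False) -> str:
    """Map counter growth over one observation window to a first hypothesis.

    crc_delta and fec_corrected_delta are increases, not absolute totals.
    """
    if crc_delta > 0:
        # Frames are failing even if FEC is also busy correcting.
        return "physical defect or bad mating"
    if fec_corrected_delta > 0:
        # FEC is coping, but the link is running on the edge.
        return "marginal link"
    if complaints:
        # Counters are flat, so look above the physical layer.
        return "look higher in the stack"
    return "no evidence in this window"


print(classify_window(crc_delta=0, fec_corrected_delta=12000))  # marginal link
```

The order of the checks matters: CRC growth wins over FEC growth because it means correction is no longer enough.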

Why dynamics matter more than a single value

For 25/100GbE it’s more useful to watch growth rate: errors per minute or per hour. Seeing 100 CRCs once doesn’t prove a current problem — it might have happened during a past replug.

Example: after replacing a patchcord FEC continues to grow slowly but CRCs stop increasing. That usually means the situation improved but signal margin is still small; check connector cleanliness and the optics type.
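Growth rate is easy to compute from two readings of the same counters. The sketch below assumes you copy the values by hand from the switch into a small record. The Snapshot class, its field names, and the sample numbers are hypothetical and do not come from any Cisco output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    seconds: float          # time of the reading, e.g. seconds since the first one
    crc: int
    fec_corrected: int
    fec_uncorrectable: int


def growth_per_minute(first: Snapshot, second: Snapshot) -> dict:
    """Return the per-minute growth of each counter between two readings."""
    minutes = (second.seconds - first.seconds) / 60
    if minutes <= 0:
        raise ValueError("second snapshot must be later than the first")
    return {
        "crc": (second.crc - first.crc) / minutes,
        "fec_corrected": (second.fec_corrected - first.fec_corrected) / minutes,
        "fec_uncorrectable": (second.fec_uncorrectable - first.fec_uncorrectable) / minutes,
    }


# Made-up values: CRC is old history, FEC corrected is still growing.
before = Snapshot(seconds=0, crc=100, fec_corrected=50_000, fec_uncorrectable=0)
after = Snapshot(seconds=300, crc=100, fec_corrected=53_000, fec_uncorrectable=0)
print(growth_per_minute(before, after))
# {'crc': 0.0, 'fec_corrected': 600.0, 'fec_uncorrectable': 0.0}
```

Here the 100 CRCs are the same in both readings, so they say nothing about the current state of the link.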

Optical power: how to read Tx/Rx levels

Tx optical power is how much light the module sends into the link. Rx optical power is how much light the module actually receives from the other side. Both values matter: one end can transmit fine while receiving almost nothing, so the issue is not necessarily where you’re looking.

Optical power is usually shown in dBm, a logarithmic scale referenced to 1 mW. The sign matters: -2 dBm is stronger than -8 dBm. For the negative values typical of receivers, readings closer to zero mean more optical power, and every 3 dB is roughly a factor of two.
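To get a feel for the scale, it can help to convert a few readings to milliwatts. The following is a small sketch using the standard definition of dBm (power relative to 1 mW). The function names are chosen here for illustration.

```python
import math


def dbm_to_mw(dbm: float) -> float:
    """Convert optical power from dBm to milliwatts."""
    return 10 ** (dbm / 10)


def mw_to_dbm(mw: float) -> float:
    """Convert optical power from milliwatts to dBm (mw must be positive)."""
    return 10 * math.log10(mw)


strong = dbm_to_mw(-2.0)   # about 0.63 mW
weak = dbm_to_mw(-8.0)     # about 0.16 mW
print(round(strong / weak, 2))  # 3.98, i.e. a 6 dB gap is roughly 4x the power
```

A 6 dB difference between two Rx readings therefore means one receiver sees only about a quarter of the light the other does.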

Normal ranges: where to get them

Do not guess “normal” from memory. Correct limits depend on the optic type (SR, LR, ER, AOC) and speed (25G, 100G). Start with the module’s own thresholds: DOM/DDM output often includes High/Low Alarm and High/Low Warn values for Rx and sometimes Tx. If thresholds aren’t shown, check the transceiver’s datasheet. Passive DAC cables carry no optical signal, so they typically report no Tx/Rx power at all.

For 25/100GbE this is critical: two modules may look the same but have different allowed ranges.
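Once you have the thresholds for the specific module, checking a reading against them is mechanical. The sketch below shows one way to do it. The RxThresholds class is hypothetical, and the limit values in the example are placeholders, not figures from any datasheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RxThresholds:
    low_alarm: float
    low_warn: float
    high_warn: float
    high_alarm: float


def rx_status(rx_dbm: float, th: RxThresholds) -> str:
    """Place an Rx reading relative to the module's own warn/alarm limits."""
    if rx_dbm <= th.low_alarm:
        return "low alarm"
    if rx_dbm <= th.low_warn:
        return "low warning"
    if rx_dbm >= th.high_alarm:
        return "high alarm"
    if rx_dbm >= th.high_warn:
        return "high warning"
    return "ok"


# Placeholder limits for illustration only; read the real ones from DOM or the datasheet.
example = RxThresholds(low_alarm=-14.0, low_warn=-11.0, high_warn=2.0, high_alarm=3.0)
print(rx_status(-12.5, example))  # low warning
```

The same reading could be "ok" for a module with wider limits, which is why the thresholds travel with the module rather than with the port.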

What mismatches in readings indicate

If Tx is normal but Rx is well below threshold, the link is likely the issue: dirty connector, strong bend, swapped fibers, damaged patchcord, bad splice, or extra adapters.

If Rx “jumps” (sometimes higher, sometimes lower), that looks like an unstable contact: loose latch, a micro-crack in the fiber, or cable tension in a tray.

If Tx and Rx are both marginal (or in warnings) on both ends, it’s usually a budget/compatibility issue: the distance is too long for that optic type or the module class is wrong.

Practical example: on a 100G link Tx was stable but Rx on one side was 6–7 dB lower than expected. After cleaning connectors and reseating the patchcord, Rx returned to normal and CRC errors disappeared without swapping modules.

Short rule before replacing a transceiver:

Compare Rx on both ends and see if they differ significantly.
Check values against the module’s warn/alarm thresholds.
If Rx spikes, start with the connector and cable mechanics.
If levels are marginal, verify the optics type for the distance.
Record Tx/Rx before and after any action to see real changes.
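The comparison of both ends can be reduced to one number per direction: what the far side transmits minus what the near side receives. Below is a minimal sketch with made-up readings. The function names are illustrative, and on multi-lane modules the same arithmetic would be applied lane by lane.

```python
def path_loss_db(tx_far_dbm: float, rx_near_dbm: float) -> float:
    """Loss in one direction: what the far end sends minus what the near end receives."""
    return tx_far_dbm - rx_near_dbm


def compare_directions(tx_a: float, rx_a: float, tx_b: float, rx_b: float) -> dict:
    """Compute loss A->B and B->A and their difference, all in dB."""
    a_to_b = path_loss_db(tx_a, rx_b)
    b_to_a = path_loss_db(tx_b, rx_a)
    return {
        "a_to_b": round(a_to_b, 2),
        "b_to_a": round(b_to_a, 2),
        "asymmetry": round(abs(a_to_b - b_to_a), 2),
    }


# Made-up readings: side B receives far less than side A.
print(compare_directions(tx_a=-1.0, rx_a=-3.0, tx_b=-1.2, rx_b=-9.5))
# {'a_to_b': 8.5, 'b_to_a': 1.8, 'asymmetry': 6.7}
```

Both transmitters look similar here, yet the A-to-B direction loses several dB more, which points at the fiber or connectors feeding side B rather than at either module's transmitter.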

Commands and counters to check on Cisco

Start with one interface and capture a quick snapshot. It’s important not only to see errors but to know whether they are growing now. If counters are stable, the event might have been transient (e.g., during replugging).

Basic commands for a quick picture

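Usually this set is enough. Command spelling varies slightly by platform (NX-OS generally uses show interface and transceiver details, while IOS XE uses show interfaces and transceiver detail), but the meaning is the same: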

show interface <int>: speed/duplex, up/down, flapping, input/output errors, CRC, drops.
show interface <int> counters errors (or equivalent): easier to see CRC, symbol errors, and input errors on one line per port.
show interface <int> transceiver details: module type, serial number, DOM sensors (temperature, voltage, current, Tx/Rx power).
show logging (filter by interface manually): messages about link flap, incompatibilities, module errors.
show controllers <int> (if available): low-level physical counters, useful in contentious cases.

For FEC, check whether it’s enabled and if there are corrected/uncorrectable events. Rapid growth of uncorrectable FEC means the link may stay up but traffic will experience losses.

What to record before swapping

To separate “happened” from “ongoing,” save values twice: immediately and after 3–5 minutes under load.

measurement time and interface name (on both ends)
CRC/FCS and input/output errors: absolute count and growth
FEC corrected and uncorrectable: absolute count and growth
DOM: Tx/Rx power, temperature, voltage (compare with expected for your link)
log events: when the flap happened and what the log says about the module

Example: on a 100G link CRCs rise on one end while the other end shows almost zero. At the same time Rx power on the failing side is noticeably lower and FEC uncorrectable counts grow. This often points to a physical problem (connector, contamination, bend, cable) rather than a “bad” switch. In integration projects (including server racks and network gear) documenting this before any action saves hours and prevents blind module swaps.

Step-by-step diagnostic flow without unnecessary replacements

When errors start on 25/100GbE, most time is wasted on chaotic swapping of modules and cables. Use short cycles: gather a baseline, compare both ends, then change only one item at a time.

First record port state: is the link up, what speed and interface type, is there flapping, when was the last up/down? On Cisco it’s convenient to start with show interface <int> and show interface <int> transceiver details.

Always check the line from both ends. What one end transmits (Tx) should be seen as receive (Rx) on the other. If Rx at one end is marginal while the other side looks fine, that’s a clue: the problem may be fiber, connector or the patchcord near the “bad” end.

A practical workflow that usually yields an answer in 10–20 minutes:

Record the baseline: speed, duplex, CRC counters, FEC (corrected/uncorrectable), uptime, Tx/Rx levels.
Compare both ends: error growth should correlate and Tx of one side should explain Rx of the other.
Clear counters and observe a short window of 5–15 minutes: clear counters interface <int> then re-take stats. Don’t mix old and new errors.
Perform minimal swaps one at a time: patchcord first, then module, then port (or move the link to a neighbor).
Confirm results: optics stable, CRC does not grow, uncorrected FEC stays absent, link remains stable.

Example: if replacing a patchcord raises Rx by 2–3 dB and CRC growth stops, the module was almost certainly fine. If Rx is normal but uncorrected FEC appears right after clearing counters, suspect the module or port more than the cable.

In integration and supply scenarios (including support from GSE.kz) this order avoids needless replacements and helps gather evidence for warranty or support.

How to tell a faulty module from a bad cable

When CRC/FEC rise and the link is intermittent, the culprit is usually one of three: patchcord/fiber, dirty connectors, or the transceiver. Less often it’s the specific switch port. The diagnostic idea is simple: see which element “moves” with the problem when you swap it.

Quick signs from Tx/Rx and counters

A bad cable/patchcord usually reveals itself in Rx: the receive level is low and may jump, and touching or bending the cable increases errors. A classic sign is that replacing a short patch between patch panel and port removes the problem even if the module remains.

Dirty or damaged connectors are trickier. Replugging can make Rx worse, and both ends might show rising FEC/CRC and reduced Rx. If cleaning and careful reseating fixes it, the module is likely innocent.

A faulty module more often shows odd DOM behavior: Tx is too low or unstable without external cause, the module runs hot, and error counters grow even with a known-good cable. The key sign is that the problem moves with the module when you place it in another port.

Another case is the port itself. If identical modules and patchcords work everywhere except on one port, suspect the port or its optical front end.

A/B test that gives a clear result

Change only one element at a time and record what you changed:

Use a known-good reference module and patchcord (same type/lot if possible).
Swap patchcords while leaving modules in place. If the problem moves, the patch or connector is guilty.
Restore patchcords and swap modules crosswise. If the issue moves with the module, it’s the module.
If it moves with the port after those tests, it’s the port.
After each step verify not only that the link came up but also the error dynamics and Rx/Tx over the same time interval.
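The logic of the A/B test fits in a few lines. The sketch below encodes the order of the steps above. The function and argument names are invented for illustration, and each flag is simply what you observed after the corresponding swap.

```python
def ab_verdict(moves_with_patchcord: bool, moves_with_module: bool,
               stays_with_port: bool) -> str:
    """Turn A/B swap observations into a suspect, checked in the order of the steps."""
    if moves_with_patchcord:
        return "patchcord or connector"
    if moves_with_module:
        return "module"
    if stays_with_port:
        return "port"
    return "inconclusive: repeat with the reference parts"


print(ab_verdict(moves_with_patchcord=False, moves_with_module=True,
                 stays_with_port=False))  # module
```

An "inconclusive" result usually means more than one thing was changed between observations, or the observation window was too short to see the errors.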

Example: on a 100G link CRCs rose only at night. Rx was marginal and reacted to touching the patch in the rack. Moving modules did nothing, but replacing a short patch between panel and switch stopped errors. Classic case where the mechanical connection, not electronics, was at fault.

Optics and fiber: compatibility and frequent physical causes

In optical links 25/100GbE errors often originate from module/line incompatibility or simple physical issues. If FEC grows and CRCs appear intermittently, start with basic optics and fiber checks — it’s the cheapest diagnostic step.

SR/LR/ER describe medium and reach, not speed. SR is typically multi-mode (MM) at 850 nm and LR is single-mode (SM) around 1310 nm. ER covers longer single-mode distances (around 1310 nm for 25G/100G ER optics, 1550 nm for 10G ER) and can be sensitive to receiver overload on short links. Installing SR on single-mode or vice versa may result in a link that won’t come up or is unstable: Rx will be too low and FEC will try to hide errors until the link eventually flaps.

Polarity is another common issue. On duplex fiber with swapped Tx/Rx you’ll usually see a simple picture: one end transmits while the other receives almost nothing. Sometimes patchcords or adapters already introduce a crossover, and one extra crossover added when a short patch is swapped makes the link misbehave right after the change.

With 100G MPO/parallel optics, typical problems relate to cassettes and adapters. Wrong type (Type A/B/C), incorrect keying, or wrong pinning (pinned/unpinned) means some lanes don’t receive light. The link might come up but FEC will grow quickly because one or two lanes degrade.
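When DOM reports Rx power per lane, a degraded lane stands out against its neighbors. The sketch below flags lanes that sit well below the strongest one. The 3 dB default gap is an arbitrary assumption for illustration, not a value from a standard or datasheet, and the readings are made up.

```python
def weak_lanes(rx_dbm_per_lane: list[float], max_gap_db: float = 3.0) -> list[int]:
    """Return indexes of lanes more than max_gap_db below the strongest lane."""
    if not rx_dbm_per_lane:
        return []
    best = max(rx_dbm_per_lane)
    return [i for i, rx in enumerate(rx_dbm_per_lane) if best - rx > max_gap_db]


# Made-up readings for a four-lane module: lanes 1 and 3 are far below the rest.
print(weak_lanes([-2.1, -7.9, -2.4, -8.3]))  # [1, 3]
```

A pattern like this on a parallel-optics link is a reason to check cassette type, keying, and ferrule cleanliness before suspecting the module.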

Common physical culprits: tight bends, cable tension, dirt on the ferrule, scratches, or poor connector seating. Even one sharp bend near a rack can cause error spikes when the cabinet vibrates or is opened.

What to check before replacing modules:

Module and fiber type: SR with MM, LR/ER with SM, matching wavelength and distance.
Polarity: are Tx/Rx swapped or is there an unexpected crossover?
For MPO: cassette type, orientation, pinned/unpinned, ferrule cleanliness.
Cable routing: bend radius, tension, kinks at organizers.
Connector cleanliness: even a new patch should be wiped and inspected.
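The first item of this checklist is simple enough to automate against an inventory list. The sketch below uses only the pairing stated above (SR with multi-mode, LR/ER with single-mode). The function name and the short type codes it accepts are assumptions, and other optic families are deliberately not covered.

```python
from typing import Optional

# From the checklist above: SR pairs with multi-mode, LR and ER with single-mode.
EXPECTED_FIBER = {"SR": "MMF", "LR": "SMF", "ER": "SMF"}


def fiber_mismatch(optic_type: str, fiber: str) -> Optional[str]:
    """Return a warning string if the optic class and fiber type disagree."""
    reach = optic_type.strip().upper()[:2]      # "SR4" -> "SR", "lr" -> "LR"
    expected = EXPECTED_FIBER.get(reach)
    if expected is None:
        return None                              # not covered here (e.g. DAC/CR)
    if fiber.strip().upper() != expected:
        return f"{optic_type} expects {expected}, link uses {fiber}"
    return None


print(fiber_mismatch("SR4", "SMF"))  # SR4 expects MMF, link uses SMF
```

A result of None means either a match or a type this table does not know about, so it is a quick filter rather than a full compatibility check.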

Example: a 100G link came up but after an hour FEC grew and packet loss began. An MPO cassette was Type A while the run expected Type B. Visually normal, but two lanes operated at the edge and any vibration caused errors.

Typical mistakes and traps in diagnostics

The most common mistake is treating rising FEC as “link broken” and immediately swapping modules, patchcords, and ports. FEC indicates the line is noisy but not where the noise comes from. On 25/100GbE FEC can increase due to a dirty connector, a fiber bend, unstable module power, overheating, or incompatible optics.

Another trap is looking at optical power without referencing the specific transceiver’s spec. One transceiver may be fine at -10 dBm Rx while another is already marginal. Don’t compare “like the neighbor port” before checking the allowed ranges for that transceiver type.

A third trap is replacing both ends at once. If you change modules on both sides or swap patch and module simultaneously, you lose the experiment’s control. Change one item at a time and record the result.

Temperature is another source of intermittent errors. Cisco exposes transceiver temperature and sometimes voltage. If the module is hot and rack airflow is poor, errors may appear only under load or in hot hours.

To avoid counter confusion follow this observation rule:

Clear counters once at the start of each observation window and record the time
Observe a fixed interval (e.g., 10–15 minutes) under similar load
Compare growth rates of CRC and FEC, not absolute totals
Record Tx/Rx power and temperature at start and end
Change only one factor per step

Example: CRC increasing while FEC is near zero often indicates a hard physical fault (contact, port, cable). FEC growing with low CRC suggests the link is marginal — check optics, cleanliness, bends, and temperature first.

Quick checklist before replacing equipment

When 25/100GbE errors rise and the link is intermittent, the costliest response is to replace everything. This checklist narrows causes fast and preserves baseline data for comparison.

First, capture a snapshot on both ends, including error dynamics: are CRCs growing, is FEC correcting, and what are Tx/Rx levels. On Cisco these commands are usually enough:

show interface <int> counters errors
show interface <int> transceiver details
show interface <int>

Then act in short steps, changing only one thing at a time:

Record current CRC/FEC and Tx/Rx on both devices, plus time and current load (e.g., backup or replication).
Inspect physical setup: module latch, seating, cable bends, tension, and connector cleanliness. Even slight dust on an LC can cause Rx and FEC spikes.
Replace one link element at a time: patchcord (or DAC/AOC) first, then module, then port. If changing a module, move it to another port and compare behavior.
After each change clear counters and watch 5–15 minutes under the same conditions. Without clearing you can mistake old errors for new ones.
Finally, ensure the problem doesn’t return under load: run normal traffic and confirm errors do not accumulate.

A practical guideline: if replacing a patch stabilizes Rx and stops FEC growth, the connector/patch was guilty. If errors move with the module to another port, the transceiver is likely bad. This approach speeds 25/100GbE diagnostics and reduces unnecessary swaps.

A real-world example: finding the cause on a 100G link

A 100G link between two Cisco switches in the same rack caused occasional packet loss and app stalls. Counters showed FEC constantly correcting (corrected steadily rising) and CRC spikes every few minutes.

First they compared Tx/Rx on both ends. Tx looked normal and similar on both modules, but Rx differed: side A read about -3 dBm (comfortable) while side B’s Rx fluctuated and sometimes dropped near the low alarm. That narrowed the cause: if transmission is stable but reception is marginal, the path (patch, contamination, connector) is usually at fault.

They ran a series of swaps to separate cable and module issues:

swapped patchcords between A and B and rechecked Rx
swapped modules crosswise and compared Rx and error growth
restored originals and inserted a known-good patchcord

Result: the low Rx followed the patchcord, not the module. With a known-good patchcord both sides’ Rx aligned, corrected FEC growth slowed, and the CRC spikes disappeared.

To confirm they left the link under load for hours and checked three things: Rx levels no longer jumped, CRC did not increase, and FEC corrected grew slowly or not at all depending on the line and FEC type. This approach avoids swapping expensive QSFP28s blindly and quickly identifies mechanical issues.

Next steps: lock the fix and prevent recurrence

Once the issue is solved and the CRC/FEC counters have stopped growing, it’s important to lock in the fix. On 25/100GbE, problems often return not because something failed again but because someone swapped a patch, inserted a different module, or reconnected a dirty connector.

When to escalate

Escalate if errors persist after replacing both patchcord and transceiver, or if symptoms stay with the port (e.g., any module on that port shows errors). Also escalate when you suspect the trunk cabling: unstable Rx power, strong asymmetry across lanes, or issues that appear only with certain cable routing.

Before contacting support gather data to avoid guesswork:

exact device model and software version, port number, and speed (25G/100G)
transceiver model and serial number, plus remote side info if available
DOM readings (Tx/Rx power, temperature, voltage) from both sides at the time of error
error counters (CRC, FEC corrected/uncorrected, symbol/align) over the same interval, e.g., 10–15 minutes under load
what was already swapped or moved (module, patchcord, port, adjacent slot)

How to reduce recurrence risk

Practical discipline helps: standardize patchcords (type, length, vendor), label “port-to-port” clearly, and add a routine for cleaning optical connectors before connecting. If frequent moves occur, store patchcords and modules with dust caps and avoid leaving ferrules exposed.

For a comprehensive approach (line testing, SFP28/QSFP28 compatibility, bench testing and documented integration) GSE.kz can provide vendor and system integrator services so the fix is repeatable and documented.