Jul 21, 2025·7 min

Network for a GPU Cluster: When 25/100GbE Is Enough

Network for a GPU cluster: how to choose 25/100GbE without overspending, where latency matters, and how to validate the choice with load tests before procurement.

## Why bother sizing the network instead of just buying “extra”

When the network is chosen “by eye,” one of two things usually happens: you either overpay for bandwidth that won’t be used, or you skimp where it later causes GPU idle time and lost development time.

In a GPU cluster what matters is not only gigabits on a spec sheet but also stability. If the network sometimes “flies” and sometimes drops, training slows down in waves: iterations speed up, then suddenly stretch out. In practice this looks like mysterious performance dips that are hard to explain by compute alone.

“Without overspending” for the network means buying exactly what gives a noticeable improvement for your workloads and not paying for capacity where the network does not limit training time. The expensive part is not just a 100GbE port, but everything around it: switches of the required class and their licenses, optics and cables (including spares), power and cooling, rack space, higher operational expertise, plus compatibility risks and long troubleshooting.

A simple example: a team builds 8 GPU servers and installs the “fastest” connectivity everywhere. After a month they find that most time is spent reading the dataset and occasionally writing checkpoints, while inter-node exchange is rare. Money went into bandwidth that barely speeds up training, while the bottleneck remains storage or software configuration.

On the other hand, the reverse happens too: they buy a cheaper network, then switch to distributed training and GPUs start waiting for gradient exchange. In such cases calculations and short pre-purchase tests help you understand which network is actually needed and where it’s safe to save.

At datacenter scale this is even more noticeable: as a systems integrator, GSE.kz often sees that the same budget can buy either “expensive ports” or a predictable cluster with a clear growth path.

## What workloads in a GPU cluster generate network traffic

The network in a GPU cluster is not busy “all the time” but at specific moments and for different reasons. So first it’s useful to understand the profile: training or inference.

During inference, traffic is most often inbound and outbound to the service: requests, responses, and occasional small lookups. Here stability and predictability matter more than record inter-node bandwidth. If inference is distributed across nodes, the network participates, but usually less aggressively than during training.

In distributed training the main network consumer is data exchange between GPUs on different servers. These are collective operations (for example, all-reduce) where gradients or model parameters are constantly synchronized. The more GPUs and the larger the model, the heavier and more frequent these exchanges. If exchanges are slow, GPUs wait for each other and training time increases even with powerful accelerators.

A separate story is the data pipeline. Even with well-tuned training, the network can be limited by dataset delivery and preprocessing results: reading from network storage, writing checkpoints and logs, distributing data to workers (data loaders), interaction with caches and queues, and surrounding services like metrics and experiment tracking.

Example: you run training on 8 GPUs across two servers while the dataset sits on network storage. At epoch start workers read data intensively, then the network shifts to gradient synchronization while checkpoints are written in parallel. These peaks reveal the real bottleneck.

If you plan a cluster and its integration, it’s important to separate flows in advance: GPU-GPU exchange, storage access and service traffic. Then it becomes clear where 25/100GbE is enough and where the network truly determines training speed.

## Bandwidth and latency: what really affects training time

Training time on GPUs often isn’t limited by the “port speed on a switch” but by how quickly and predictably data gets between nodes. You need to understand what limits your workloads: gigabits (bandwidth) or microseconds (latency and its variation).

When a model actively exchanges gradients between multiple GPUs on different servers, you can hit bandwidth limits. A sign is idle GPUs waiting for the network while link utilization stays near the cap for extended periods. Wider channels, lower oversubscription and correct wiring help here.

Latency and jitter matter more when exchanges occur in short bursts and frequently: many small messages, synchronizations and barriers. Then even high bandwidth won’t save you if packets arrive unevenly and processes constantly wait for each other. Graphs often show a normal average latency but rare spikes that hurt iteration time.

Don’t confuse “25/100GbE on a port” with application-level throughput. The latter is affected by protocol overheads, MTU settings, switch queues, competing flows and how the training framework packs exchanges.

Packet loss is usually worse than just “slower.” Even fractions of a percent of loss cause retransmits, latency spikes and unstable iteration times. For distributed training this is painful: one lagging node slows everyone.

A practical way to distinguish causes:

- Bandwidth-limited: sustained high link utilization and only modest speedups when adding GPUs.
- Latency/jitter-limited: moderate link utilization but fluctuating iteration times and degraded synchronization performance.
- Losses: rare but big performance drops and latency spikes.

If you plan a cluster, measure not only gigabits but also latency stability under load. This often explains why identical “100GbE” specs deliver different training times in practice.

## When 25GbE is enough and when you need 100GbE

Choose between 25GbE and 100GbE by asking where the network is losing time: between GPUs, between nodes, or on the way to storage. For many pilot and medium clusters 25GbE provides predictable operation without extra cost, as long as workloads are not bottlenecked on gradient exchange or other collective operations.

25GbE is often sufficient when:

- training runs on 1–2 GPUs per node or across a small number of nodes and scaling doesn’t break when adding a couple more nodes;
- models and batches are moderate and iteration time hardly changes as GPUs increase;
- there are few concurrent jobs and no constant contention for the same uplink;
- storage traffic follows a separate path (or at least doesn’t share the same bottleneck as GPU-GPU traffic).

If you see GPUs idle while link utilization is close to the limit during synchronizations, 100GbE can deliver a quick benefit even without reworking the whole cluster.

100GbE is usually justified when:

- you plan to grow to tens of GPUs and beyond and want node additions to provide real speedup;
- distributed training across many nodes is used with frequent all-reduce and other collective operations;
- many nodes hang off a single switch and uplinks become the main constraint;
- you need to separate training traffic from data traffic without creating many separate networks.

Plan growth over a 12–18 month horizon: how many nodes, GPUs per node, and concurrent jobs. A useful approach is to calculate a “target configuration” and choose the network so it handles not only average load but also synchronization peaks.

Also consider storage traffic. If data loaders read large datasets and checkpoints are written concurrently, the uplink can be saturated even if GPU-GPU exchange by itself is moderate. Solutions include dedicating a network path to storage or upgrading to 100GbE for nodes that both train and read data intensively.

## Where low latency is truly critical

Low latency isn’t always required. If exchanges are large blocks and the network is evenly loaded, bandwidth is often the limit. But there are scenarios where latency and its variation directly increase training time.

The most common case is distributed training with synchronization between nodes. Collective ops like all-reduce force all GPUs to wait for each other. If one node receives data slightly later, the whole group stalls. You may see GPUs not fully utilized even when there’s headroom in gigabits.

Latency hits hardest when there are many small messages. This occurs with some parallelism schemes (many stages of gradient exchange), parameter-server setups, and systems with frequent synchronization barriers. For large messages, extra microseconds are negligible; for small messages they become a noticeable part of step time.

Mixed loads complicate things further. When training and, for example, inference or data loading share the same links, queues on ports compete. Even with a normal average latency, tail values (p95–p99) may spike and training steps will jitter.

Finally, a single slow link can throttle the whole group. It isn’t always a failure: a misconfigured port speed, an overloaded uplink or unfortunate cabling can do it.

Typical signs that latency is critical:

- step time is unstable though GPU and CPU utilization looks normal;
- efficiency drops more than expected when you add nodes;
- monitoring shows queues or drops on one or two ports;
- a slight slowdown on one server slows the entire pool.

In multi-node setups this is especially evident: the network determines how well GPUs “stick together” as a unit.

## Topology and oversubscription: how not to lose performance in the architecture

The wiring of a GPU cluster matters as much as the port speed. Two clusters with identical 25/100GbE can behave differently if one network is carefully built and the other has frequent hotspots.

### Common layouts and how they differ

The most common pattern in datacenters is spine-leaf: servers connect to ToR/leaf switches, which converge to spine switches. This fabric scales well and makes capacity forecasting easier.

Sometimes people try to save money with rings or cascades where one switch connects to the next. Under real load this often becomes a lottery: one overloaded link slows whole groups and debugging drags on.

Oversubscription means the sum of downstream ports to servers exceeds upstream capacity to the fabric. For example, 16 servers at 25GbE give 400Gbps downstream while uplinks provide only 200Gbps, a 2:1 ratio. If traffic stays local inside the ToR you may not notice. But when training or inference begins heavy inter-node exchanges, queues, losses and latency spikes appear.

### Where you need a non-blocking fabric and where you can simplify

A non-blocking (or near non-blocking) fabric is required when you expect frequent simultaneous exchanges among many GPU nodes (typical for distributed training). In simpler scenarios moderate oversubscription is acceptable if most traffic stays within a rack or a pair of racks, inter-node exchange happens in short phases, and critical services are separated into a different VLAN or fabric.

Buffers and switch queues play a key role under overload. GPU traffic often comes in microbursts: everyone sends at once for a brief moment and the bottleneck fills immediately. If buffers are small or queues misconfigured, you don’t just get lower throughput, you get instability: rising latency and retransmits.

A practical guideline: estimate expected east-west traffic between nodes, then choose topology and oversubscription for that profile, not for an abstract “safety margin.”

## Step by step: how to define network requirements before procurement

Start simple: you don’t need the “fastest” network but one that doesn’t slow your real workloads and doesn’t waste money on unused capacity.

### 1) Describe scenarios and their network profile

Collect 3–4 typical scenarios and note what happens to data. For example: multi-node training (heavy GPU-GPU exchange), inference (stability and predictability priority), ETL and dataset prep (storage-heavy), loading and checkpoints (traffic peaks).

Make this easy to agree on as a short list:

- which jobs you run most often and how long they last;
- how many GPUs per job and across how many nodes;
- where data is read from (local, NAS, object storage);
- how often checkpoints are written and their sizes;
- what’s more important: faster training or staying within budget.

### 2) Gather basic metrics and set a target

If you already have hardware, collect simple KPIs: epoch time, network port utilization, dataset read time, checkpoint duration. If you don’t have hardware, take a pilot server and run a small bench to get order-of-magnitude numbers.

Then fix the target scale for 6–12 months: how many nodes and GPUs you expect and what “acceptable” training time is. Example: now 4 nodes, in a year 12, and training time should not increase by more than 15–20% when scaled.

### 3) Choose link speeds and growth plan

After that it’s easier to decide on 25GbE or 100GbE: choose bandwidth to cover specific peaks (inter-node exchange, checkpoints, dataset reads), not speed in the abstract. Build in growth reserve, but plan an upgrade path without full replacement: ports that can be upgraded, spare optics and cables, and a clear spine-leaf expansion path.

If buying through an integrator, agree upfront how expansion will work. In integrated projects this is usually fixed in the scaling diagram so adding racks doesn’t become a network overhaul.

## Load tests before purchase: how to validate the network in practice

Buying a network “by eye” is risky: you may overspend on 100GbE where 25GbE would do, or hit a bottleneck and lose training time. Load tests turn discussion into numbers: how much traffic actually flows, how latency behaves and what happens under concurrent streams.

### What to measure

Not only peaks matter but also behavior under loads similar to expected production:

- throughput: single-stream and multi-stream (aggregate);
- latency: average and tails (rare large values);
- losses and retransmits: even small percentages can destroy stability;
- jitter: latency variation, especially under mixed loads;
- repeatability: consistent results across runs without drops.

### How to build a test bench

Often 2–8 nodes are enough even if the target cluster will be larger. The key is to reproduce traffic patterns: “all-to-all” for inter-node exchange and a separate stream to storage.

Practical test set: iperf3 for throughput (several parallel streams), ping for basic latency, and, if RDMA is available, simple RDMA tests to see CPU behavior and tail latency differences.

A useful test is a combined run: start exchanges between nodes (e.g., 4–8 parallel pairs) and simultaneously stress the storage path. A common situation: each test alone looks fine, but together they cause losses and latency spikes.

Record results in a simple table “expected vs actual”: target Gb/s, acceptable latency, acceptable loss (usually 0), actual values and test conditions (number of streams, duration, packet sizes). That makes it easier to compare 25GbE and 100GbE before buying.

## How to interpret test results and avoid wrong conclusions

Iperf-like numbers alone don’t answer the main question: will training run well on this network? So look not only at peak speed but at repeatability and behavior under load.

Compare a test on a “clean” network with one containing background tasks (dataset copies, monitoring, backups). If the clean test looks good but background traffic causes latency spikes or throughput drops, your network or settings may not handle competing streams well. For GPU clusters this is often more important than absolute peak.

### Settings that actually change the picture

Before blaming hardware, check a few settings: MTU, port queues and NIC driver options.

- MTU: if larger MTU increases stable throughput and lowers CPU load, that’s a good sign. If it causes losses or “jagged” latency, revert and debug step by step.
- Queues and RSS: if one flow is pinned to a single CPU core, a test may show a “bad network” while the issue is on the host.
- Drivers and firmware: updates can change latency stability even at the same average throughput.

### Warning signs

Repeated symptoms are more worrying than one-off peaks:

- repeated throughput drops of 20–30% or more under the same conditions;
- unstable latency (orders-of-magnitude spikes), especially with background traffic;
- packet loss, TCP retransmits or UDP “holes”;
- strong asymmetry between directions (one way OK, return path bad).

If tests “fail,” don’t rush to blame the network. Example: copying data from NVMe may be limited by CPU encryption or disk throughput. Check CPU load, disk speeds and whether the application makes extra memory copies.

It’s practical to record results in two modes: “network as-is” and “network as it will be in production” (with parallel tasks). That clarifies whether 25/100GbE is enough or whether problems stem from stability and configuration rather than raw bandwidth.

## Common mistakes when choosing a network for a GPU cluster

The priciest mistake is buying the network “at max” without checking where time actually goes. The budget burns on ports while training barely speeds up.

Typical situations:

- Buying 100GbE “everywhere” though the constraint is storage, CPU or software stack. Example: nodes with 8 GPUs pull data over 10GbE from storage, so speeding inter-node links barely changes epoch time.
- Allowing too high oversubscription and getting random bottlenecks. On paper aggregate bandwidth looks fine, but in peaks several nodes hit the same bottleneck and training alternates between fast and slow.
- Looking only at average latency and missing jitter and losses under load. Even rare micro-losses or latency spikes create long tails in gradient sync and show up as strange GPU pauses.
- Not planning growth. Adding 4–8 nodes can change east-west traffic and make the network unpredictable even if port counts and gigabits seem sufficient.
- Testing synthetic scenarios instead of your real workload. A single-stream iperf or a short overnight test does not reveal behavior for your batch size, GPU count and real I/O pattern.

A useful rule: if you cannot verbally describe your heaviest mode (how many nodes, training type, where data comes from, peak profile), choosing hardware is guesswork.

Solve this by discipline before purchase: fix 1–2 representative training jobs, run them on a test bench and evaluate not only speed but stability. If you build the cluster with an integrator, include such live acceptance tests in the contract rather than relying only on switch datasheet numbers.

## Short checklist and next steps

When selecting a network for a GPU cluster, the most important thing is to specify real flows: GPU-GPU, storage and service streams (logging, metrics, control).

A short checklist to quickly filter out excess and remember the critical items:

- Flows: what share of traffic is east-west (between nodes) vs north-south (to storage).
- 25GbE or 100GbE: how many concurrent streams will peak. One heavy job with many GPUs can saturate a 25GbE uplink faster than several small tasks.
- Latency: which tasks are sensitive to micro-latencies. For distributed training watch not only average but tail metrics (p95/p99).
- Topology and oversubscription: what uplink ratio you are willing to accept.
- Load testing: run throughput, latency and loss tests under parallel load (not one iperf stream, but multiple streams plus storage-like background).

To make requirements usable in a tender and for acceptance, describe them as measurable criteria: target throughput between nodes with N parallel streams, acceptable p95/p99 latencies for message size X, zero packet loss under given load, and a validated design (number of leaf/spine, port speeds, oversubscription ratio).

The next step is to run a pilot with your scenarios: execute representative jobs and synthetic tests on the chosen topology and speeds. It’s convenient to do this with an integrator, for example GSE.kz (gse.kz), so servers, switches and settings are chosen for the target load rather than market averages.

FAQ

Why not just buy a network with extra headroom and skip the math?

It’s better to calculate because “extra headroom” is often bought in places that don’t speed up training. You can spend the budget on 100GbE while the real bottleneck is storage, data-loader settings or checkpointing. A calculation and a short pilot help buy exactly what gives a noticeable speedup for your scenarios.

What usually puts the heaviest load on the network in a GPU cluster?

Main sources are inter-node exchange during distributed training (gradient synchronization), data access (reading datasets, feeding batches to workers), writing checkpoints and logs, and service traffic like metrics and experiment tracking. Different phases of training spike different flows, so it’s important to separate them in the design.

How do I tell if I’m limited by bandwidth or by latency?

If GPUs sit idle while the link utilization is near maximum during synchronizations, you’re hitting bandwidth. If network load is moderate but iteration times fluctuate and worsen at barriers, latency and jitter are likely the cause. Packet losses usually appear as rare but severe performance drops.
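
The rule of thumb above can be written down as a small triage helper. This is only a sketch: the function name, the input metrics and both thresholds (`util_high`, `cv_high`) are illustrative assumptions, not values from the article, and should be tuned against your own monitoring data.

```python
def classify_network_limit(link_util, step_time_cv, loss_rate,
                           util_high=0.9, cv_high=0.10):
    """Roughly label what limits training, following the heuristics above.

    link_util     -- peak link utilization during synchronization, 0..1
    step_time_cv  -- coefficient of variation of iteration time (stddev / mean)
    loss_rate     -- fraction of packets lost or retransmitted under load
    util_high, cv_high -- assumed thresholds, adjust for your cluster
    """
    if loss_rate > 0:
        # Any loss is treated as the first thing to fix.
        return "loss"
    if link_util >= util_high:
        # Link sits near its cap while GPUs wait.
        return "bandwidth"
    if step_time_cv >= cv_high:
        # Link has headroom, but step times fluctuate.
        return "latency/jitter"
    return "not network-limited"


print(classify_network_limit(link_util=0.95, step_time_cv=0.02, loss_rate=0.0))
print(classify_network_limit(link_util=0.40, step_time_cv=0.25, loss_rate=0.0))
```

Loss is checked first on purpose: as the answer above notes, it tends to show up as rare but severe drops, which can mask the other two signals.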

In which cases is 25GbE really sufficient?

25GbE is typically enough for small setups and workloads where inter-node exchange doesn’t dominate iteration time, and data isn’t continuously streamed from network storage. This is common for pilots, modest models and a small number of nodes—as long as storage traffic does not congest the same uplink used for synchronizations.

When does moving to 100GbE produce a noticeable effect?

100GbE pays off when you plan multi-node distributed training with frequent collective operations and want node additions to yield real speedup. It also helps when nodes simultaneously read datasets and exchange gradients and you observe link saturation at peaks. Keep in mind the cost grows not only for ports but also for switches, optics, power, cooling and maintenance.
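
For a first order-of-magnitude comparison, you can estimate how long one gradient synchronization would take at each link speed. The sketch below assumes a ring all-reduce, in which each node sends about 2·(N−1)/N times the payload, and an assumed 80% usable link efficiency. It ignores latency, overlap with compute and in-node transfers. The payload size and node count are hypothetical.

```python
def ring_allreduce_seconds(payload_bytes, nodes, link_gbps, efficiency=0.8):
    """Bandwidth-only estimate of one ring all-reduce across `nodes` servers.

    efficiency is an assumed fraction of line rate usable by the application.
    """
    if nodes < 2:
        return 0.0  # nothing to exchange on a single node
    # Each node sends (and receives) 2 * (N - 1) / N of the payload in a ring.
    bytes_on_wire = 2 * (nodes - 1) / nodes * payload_bytes
    usable_bytes_per_s = link_gbps * 1e9 / 8 * efficiency
    return bytes_on_wire / usable_bytes_per_s


# Hypothetical job: 1 GB of gradients synchronized across 8 nodes.
for speed in (25, 100):
    t = ring_allreduce_seconds(1e9, nodes=8, link_gbps=speed)
    print(f"{speed} GbE: {t:.3f} s per synchronization")
```

With these assumed inputs the estimate is about 0.70 s per synchronization at 25GbE and about 0.18 s at 100GbE. Whether that difference matters depends on how it compares with the compute time of one step, which is what the pilot should measure.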

Where is low latency truly more important than raw gigabits?

Latency matters when there are many short synchronizations and small messages, and the whole group must wait for the slowest node. Even with normal average bandwidth, tail latencies (p95/p99) can stretch training steps. This is especially visible under mixed loads, when storage or service traffic shares the same links.
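
A small illustration of why the average hides the problem: the sketch below computes nearest-rank percentiles over a set of latency samples. The samples are made up for the example (94 fast round trips, 5 slow ones and one very slow one, in microseconds).

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical round-trip latencies in microseconds.
latencies_us = [50] * 94 + [400] * 5 + [900]

mean_us = sum(latencies_us) / len(latencies_us)
print("mean:", mean_us)
print("p50:", percentile(latencies_us, 50))
print("p95:", percentile(latencies_us, 95))
print("p99:", percentile(latencies_us, 99))
```

Here the mean (76 µs) looks close to the typical value of 50 µs, while p95 and p99 are both 400 µs. In a synchronous step every participant waits for the slowest exchange, so the tail is what stretches iteration time.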

Why can topology and oversubscription "eat" all the performance?

Topology determines how often traffic hits bottlenecks and how queues form on ports. With high oversubscription you can have plenty of downstream bandwidth but too little upstream capacity, so inter-node exchanges create queues, jitter and sometimes packet loss. A tidy spine-leaf fabric makes cluster behavior more predictable and easier to troubleshoot.
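
The oversubscription ratio is simple arithmetic and worth computing for every leaf switch in the design. The sketch below uses the article's example of 16 servers at 25GbE with 200 Gbps of uplink capacity. Splitting that uplink into 2 × 100GbE ports is an assumption made for illustration.

```python
def oversubscription_ratio(server_ports, server_gbps, uplink_ports, uplink_gbps):
    """Downstream capacity to servers divided by upstream capacity to the fabric."""
    downstream = server_ports * server_gbps
    upstream = uplink_ports * uplink_gbps
    if upstream <= 0:
        raise ValueError("uplink capacity must be positive")
    return downstream / upstream


# 16 servers x 25GbE = 400 Gbps down, 2 x 100GbE = 200 Gbps up (assumed split).
ratio = oversubscription_ratio(16, 25, 2, 100)
print(f"oversubscription {ratio:.1f}:1")

# Doubling the uplinks to 4 x 100GbE makes the same leaf non-blocking (1:1).
print(f"oversubscription {oversubscription_ratio(16, 25, 4, 100):.1f}:1")
```

A ratio of 1.0 corresponds to a non-blocking leaf. Values above that are acceptable only to the extent that traffic stays local to the switch.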

What tests should I run before buying the network?

At minimum, measure throughput with one and many parallel streams, latency and its tails, and packet loss and retransmits under load. Run tests not only on a “clean” network but also with background traffic resembling reality: simultaneous dataset reads and checkpoint writes. Repeatability matters: if results vary between runs, you’ll see the same variability in production.
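
Repeatability can be reduced to a couple of numbers per test series. The sketch below takes aggregate throughput from several identical runs and reports the spread and the worst drop relative to the mean. The 5% variation threshold and the sample values are assumptions for illustration, not recommendations from the article.

```python
from statistics import mean, pstdev


def repeatability(throughputs_gbps, max_cv=0.05):
    """Summarize run-to-run stability of throughput measurements.

    max_cv is an assumed acceptance threshold for the coefficient of variation.
    """
    avg = mean(throughputs_gbps)
    cv = pstdev(throughputs_gbps) / avg
    worst_drop = 1 - min(throughputs_gbps) / avg
    return {
        "mean_gbps": avg,
        "cv": cv,
        "worst_drop": worst_drop,  # fraction below the mean in the worst run
        "repeatable": cv <= max_cv,
    }


# Hypothetical results of four identical multi-stream runs on a 25GbE link.
clean = repeatability([23.1, 23.4, 23.2, 23.3])
with_background = repeatability([23.0, 23.2, 16.0, 23.1])
print(clean["repeatable"], with_background["repeatable"])
```

In the second series a single run falls about 25% below the mean, which is the kind of result worth investigating before purchase rather than averaging away.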

How to avoid mistakes when interpreting iperf or similar tests?

Don’t judge by a single synthetic peak — stability under parallel streams is more important. If latency spikes or throughput drops when background traffic is present, check MTU, queues, NIC settings, CPU load and disk subsystem limits. Often what looks like a bad network is actually a host, driver or I/O bottleneck.

What are the most frequent errors when choosing a network for a GPU cluster?

Common mistakes: buying maximum speed everywhere while the real bottleneck is storage or software; allowing too much oversubscription so random hot spots appear; paying attention only to average latency while ignoring jitter and packet loss under load. When integrating a cluster, it’s useful to lock measurable acceptance criteria and pilot scenarios so the system scales predictably without redesign.

How do I write requirements that hold up in a tender and in acceptance tests?

Describe measurable criteria: target throughput between nodes with N parallel streams, acceptable p95/p99 latencies for message size X, zero packet loss under specified load, and a confirmed topology (number of leaf/spine, port speeds, oversubscription ratio). Then run a pilot with your real workloads and synthetic tests to validate the chosen topology and speeds. Working with an integrator like GSE.kz (gse.kz) helps align servers, switches and settings to the actual load.
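
Criteria written this way can be checked mechanically during acceptance. The following is a minimal sketch: the `Criteria` fields, the keys of the `measured` dictionary and all numbers are hypothetical and would come from your own tender document and test runs.

```python
from dataclasses import dataclass


@dataclass
class Criteria:
    """Hypothetical acceptance thresholds for one test scenario."""
    min_throughput_gbps: float
    max_p99_latency_us: float
    max_loss_rate: float = 0.0  # zero loss under the specified load


def check_acceptance(criteria, measured):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    if measured["throughput_gbps"] < criteria.min_throughput_gbps:
        failures.append("throughput below target")
    if measured["p99_latency_us"] > criteria.max_p99_latency_us:
        failures.append("p99 latency above limit")
    if measured["loss_rate"] > criteria.max_loss_rate:
        failures.append("packet loss above limit")
    return failures


criteria = Criteria(min_throughput_gbps=90.0, max_p99_latency_us=200.0)
run = {"throughput_gbps": 93.5, "p99_latency_us": 450.0, "loss_rate": 0.0}
print(check_acceptance(criteria, run))
```

Keeping the test conditions (number of streams, message size, background load) next to each result makes the same check reusable for both the pilot and the final acceptance.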