Why does vSAN sometimes get slower after adding nodes instead of faster?

Adding nodes changes replication traffic and starts resync and rebalance, so background load rises at the very moment you expect relief. A weak link (network, cache, capacity disks or storage policy) can then become critical even when the UI looks “green”. A plan made before each growth step keeps these background tasks from worsening latency.

What should I start with when planning ThinkAgile VX before purchase?

Start with the workload profile: which applications, expected peaks, growth over 12–24 months, and RPO/RTO requirements. Then agree on priorities (latency, IOPS, throughput, or capacity) and choose nodes, disk groups and network to match those priorities.

What early signs show a cluster is starting to degrade?

Watch the 95th-percentile read/write latency, duration and frequency of resync/rebalance, and response stability during peak hours. If latency rises when new VMs are added or after small changes, the network, disks or storage policies are already close to their limits.

If nodes are “vSAN Ready”, why can they behave differently?

vSAN Ready means compatibility, not identical performance across all scenarios. Performance varies with SSD type and class, number and size of disk groups, cache-disk endurance, and actual network stability between racks and switches.

When should I choose all-flash and when hybrid in vSAN?

Hybrid (SSD cache + HDD capacity) makes sense when cheap terabytes matter and the workload is mainly sequential. For VDI, databases and most active virtualized services, all-flash is usually better: lower latency, more predictable response, and easier growth planning.

Why is the vSAN cache disk more important than it seems, and how do I tell if it’s insufficient?

Cache in vSAN is more about stable low latency and endurance than raw size, because it constantly receives writes. If cache is tight or some nodes have slower cache, tail latencies (p95/p99) worsen and users notice delays even if averages look normal.

What network is required for vSAN and when is 10GbE no longer enough?

vSAN needs not only bandwidth but also low latency, no packet loss and no oversubscription on uplinks or inter-switch links. 10GbE can work for small clusters, but as the node count and resync activity grow, the network often becomes the first bottleneck. Upgrading to 25GbE usually brings a predictable improvement.

Can I expand a cluster with nodes that have different firmware and drivers?

Define a single target combination of vSphere/vSAN versions, drivers and firmware, and add new nodes only in that profile. Mixed versions often produce odd symptoms: unstable performance, update problems, or unexpected network/disk degradation.

Which metrics must be measured before and after cluster expansion?

Capture a baseline and compare after expansion: read/write latencies, IOPS and throughput, vSAN network load and packet loss, CPU Ready, and presence/duration of resync. If latency rises while resync is long and network/disks are near saturation, fix the bottleneck first.

How do I add nodes safely and avoid outages and a resync avalanche?

Make changes one at a time and let the cluster finish resync and stabilize before the next step. The safest approach is identical nodes and disk groups, agreed storage policies and clear stop conditions for expansion. If needed, involve an integrator such as GSE.kz (gse.kz) to keep design and support consistent.

Deploying Lenovo ThinkAgile VX: nodes, network and disks for vSAN

Why you need a plan before purchase and scaling the cluster

A common story: a cluster runs well on 3–4 nodes. Then someone adds 2–4 more nodes while the number of virtual machines grows. The expectation is simple: more resources mean better performance. In practice you can get the opposite: latency rises, users complain, and the admin sees that everything is “green”.

vSAN is sensitive to balance. After expansion, traffic patterns change (replication, resync and data rebalance), background load grows, and a weak spot stands out quickly. So a plan is required not only before buying hardware but before every growth step.

Another important point: the vSAN Ready mark (including Lenovo ThinkAgile VX) means compatibility, not identical speed in all profiles. Two nodes that look similar can behave differently due to disk types, disk-group sizes, storage policy settings and even how the network between racks is organized.

Most often one of four areas fails first:

network (lack of bandwidth, latency spikes, packet loss)
cache (too small or quickly saturated on writes)
capacity disks (insufficient IOPS, growing latency, inadequate endurance)
storage policy (too “heavy” for current resources, e.g. high redundancy on a small cluster)

Catch early signs of degradation before the issue becomes widespread. Typical signals are growing write latency, spikes in resync/rebalance, unstable response times during peak hours, and notable slowdowns after powering on new VMs or adding nodes.

A good ThinkAgile VX deployment plan documents baseline metrics, sets clear expansion rules, and checks in advance that network, disks and policies will “survive” growth. This saves weeks of investigating “why it got worse after the upgrade”.

Gather workload and growth requirements for 12–24 months

Planning starts not with hardware but with what you will actually run in the cluster. For Lenovo ThinkAgile VX this is especially important: vSAN scales well, but only if workload profile and growth are considered beforehand.

Classify services by type. VDI usually produces many small reads/writes and is latency-sensitive. Databases often require stable IOPS and low write latency. File services and backup repositories tend to care about capacity and throughput. Microservices can generate unpredictable spikes and mixed profiles.

Then set priorities honestly: lowest latency, maximum IOPS, high throughput or capacity. These are not abstract words but future trade-offs when choosing disks, disk-group sizing and network. Without fixed priorities, you can build a cluster that “works overall” but degrades after the first expansions.

To align with system owners, a short checklist is enough:

which applications will run in the cluster and how load is distributed
expected peaks (VDI mornings, month-end processing, nightly backups)
how many months of growth you plan and the target node count
requirements for fault tolerance and maintenance without downtime
target RPO/RTO and what will be considered downtime

Example: today you have 4 nodes, 400 virtual desktops and two small databases. In 18 months you expect 700 VDI and databases doubling. You need headroom for performance and capacity, and a decision in advance whether the cluster can survive the loss of one node without noticeable degradation of critical services and still meet RPO/RTO during maintenance and failures.
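
The arithmetic behind this example can be sketched roughly as follows. The function name, the assumption that load scales linearly with the desktop count, and the single spare node are illustrative choices, not a sizing method from Lenovo or VMware.

```python
import math

def nodes_for_growth(current_nodes, current_units, target_units, spare_nodes=1):
    """Estimate the node count for a target workload at today's per-node density.

    Assumption (hypothetical): load grows linearly with the number of units
    (for example desktops), and today's density is the one you want to keep.
    """
    units_per_node = current_units / current_nodes
    working_nodes = math.ceil(target_units / units_per_node)
    return working_nodes + spare_nodes

# 4 nodes carry 400 desktops today; the 18-month target is 700 desktops.
print(nodes_for_growth(4, 400, 700))  # 7 working nodes + 1 spare = 8
```

Treat the result as a starting point for discussion: the databases in the example grow on a different curve and need their own estimate.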

Architecture and compatibility: versions, policies, failure domains

A frequent cause of problems after expanding a ThinkAgile VX cluster is not “weak hardware” but mismatched versions and different expectations for fault tolerance. Before purchase and before adding nodes, lock down the baseline architecture: supported vSphere and vSAN versions, standard storage policies and the boundaries of failure domains.

Start with compatibility. Ensure vSphere, vSAN, NIC and controller drivers and firmware are in a supported combination. In mixed clusters (some nodes on new firmware, others on old) you risk instability, degraded network or disk behavior, and sometimes inability to update without downtime. A simple practice: one target version, one agreed set of drivers and firmware, and a rule that new nodes join only in that profile.

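Decide failure domains in advance: are you protecting against the failure of a disk, a node, a rack or an entire zone? If you have two power feeds and two racks, spread nodes across racks and set failure domains by rack; otherwise a single incident can take out half the data components. Note that vSAN mirroring with FTT=1 needs at least three fault domains, so a two-rack layout also needs a third domain (or a stretched cluster with a witness) to survive the loss of a whole rack.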
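
A layout can be sanity-checked on paper before any hardware is racked. The sketch below assumes RAID-1 mirroring, where vSAN needs 2*FTT+1 fault domains; the host and rack names are hypothetical, and the "evenly spread" rule (a difference of at most one host) is an illustrative choice.

```python
from collections import Counter

def check_fault_domains(host_to_domain, ftt=1):
    """Return a list of problems with a host-to-fault-domain layout."""
    problems = []
    counts = Counter(host_to_domain.values())
    required = 2 * ftt + 1  # RAID-1 mirroring needs 2*FTT+1 fault domains
    if len(counts) < required:
        problems.append(
            f"need {required} fault domains for FTT={ftt}, have {len(counts)}"
        )
    if counts and max(counts.values()) - min(counts.values()) > 1:
        problems.append("hosts are unevenly spread across fault domains")
    return problems

# Hypothetical two-rack layout with four hosts.
layout = {"esx01": "rack-a", "esx02": "rack-a", "esx03": "rack-b", "esx04": "rack-b"}
print(check_fault_domains(layout))
# ['need 3 fault domains for FTT=1, have 2']
```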

Make default storage policies clear and consistent for teams. For example:

critical VMs: FTT=1 on RAID-1 (more predictable write latency, but twice the capacity consumption)
capacity-heavy workloads: FTT=1 on RAID-5 (more economical, but watch latency and resource headroom)
compression and deduplication: enable only if CPU and disks can handle the extra load
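
The capacity cost of these policies is easy to estimate up front. The sketch below uses the raw-capacity multipliers of the vSAN Original Storage Architecture (RAID-1 mirroring at 2x, RAID-5 as 3+1 at 1.33x, RAID-6 as 4+2 at 1.5x). It is an assumption that your cluster uses this architecture, and the calculation deliberately leaves out free-space headroom and any savings from deduplication or compression.

```python
# Raw-capacity multipliers, assuming vSAN Original Storage Architecture layouts.
POLICY_MULTIPLIER = {
    ("RAID-1", 1): 2.0,    # two full copies
    ("RAID-5", 1): 4 / 3,  # 3 data + 1 parity
    ("RAID-6", 2): 1.5,    # 4 data + 2 parity
}

def raw_capacity_tb(usable_tb, method, ftt):
    """Raw capacity consumed by `usable_tb` of VM data under a given policy."""
    return usable_tb * POLICY_MULTIPLIER[(method, ftt)]

# Hypothetical 30 TB of VM data under each policy.
for method, ftt in POLICY_MULTIPLIER:
    raw = raw_capacity_tb(30, method, ftt)
    print(f"{method} FTT={ftt}: {raw:.1f} TB raw")
```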

Example: a 4-node cluster with FTT=1 and RAID-5 may work normally, but during aggressive resync after adding nodes 5–6, latency can “drift” if you didn’t provision network and disk headroom or limit background operations.

An update/maintenance plan is also part of architecture. Decide in advance how you will do rolling upgrades, how you replace disks without triggering heavy rebuilds, and the minimum free capacity to keep so expansions and repairs don’t turn into multi-day resyncs.

How to choose ThinkAgile VX nodes for your workload profile

Choosing nodes for vSAN starts with what your VMs actually do: how many vCPUs, how much memory, the I/O profile (many small operations or occasional large writes), and how important predictable latency is.

For CPU and RAM, size according to current utilization plus headroom. A cluster almost always needs spare capacity for node failure and growth. Otherwise on failure or maintenance you may hit memory or CPU limits, not storage. A practical rule: sum vCPU and RAM of all VMs, then add headroom for growth and an N+1 scenario (survive one node loss without emergency measures).
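
This rule can be sketched as a small calculation. The vCPU-to-core ratio, the growth factor and the host specification below are hypothetical inputs, and the sketch ignores the memory that ESXi and vSAN themselves consume.

```python
import math

def hosts_needed(total_vcpu, total_ram_gb, cores_per_host, ram_per_host_gb,
                 vcpu_per_core=4.0, growth=1.3, spare_hosts=1):
    """Hosts required for CPU and RAM with growth headroom plus N+1."""
    cpu_hosts = math.ceil(total_vcpu * growth / (cores_per_host * vcpu_per_core))
    ram_hosts = math.ceil(total_ram_gb * growth / ram_per_host_gb)
    # The tighter resource decides; spare hosts cover node loss or maintenance.
    return max(cpu_hosts, ram_hosts) + spare_hosts

# Hypothetical input: 120 VMs averaging 3 vCPU and 12 GB RAM,
# on hosts with 32 cores and 384 GB RAM.
print(hosts_needed(120 * 3, 120 * 12, cores_per_host=32, ram_per_host_gb=384))  # 6
```

In this input RAM, not CPU, sets the host count (5 hosts for RAM against 4 for CPU, plus one spare), which is the kind of imbalance worth catching before purchase.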

Avoid imbalance between compute and storage. If you have lots of capacity but little CPU, you’ll end up with a storage-heavy cluster that processes requests slowly. If you have plenty of CPU but disks and network don’t keep up, VMs will wait on I/O. Ensure each node has resources not only for VMs but for vSAN background tasks (resync, rebalance).

Keep node generations as uniform as possible to avoid a “zoo.” Different batches are acceptable, but then set limits: weaker nodes become a performance ceiling and may restrict storage policies.

The minimum vSAN base is usually 3 nodes (2-node setups require a witness). As you grow to 6–8+ nodes, background data movements increase, node imbalance becomes more visible, differences in hardware generations matter more, and headroom for network and CPU during resync becomes more valuable.

Example: you have 120 VMs, most with 2–4 vCPU and 8–16 GB RAM, plus some heavy reporting servers. If you pick nodes sized only for the average VM, later growth will hit memory limits and fast disks won’t help. It’s better to provision enough RAM per socket now and plan to add identical nodes in batches.

Disks and disk groups: cache, capacity and endurance

In vSAN, disks matter more than you might expect. Even with suitable nodes and network, bottlenecks often emerge in disk groups. Before deploying ThinkAgile VX, agree on acceptable latency and the headroom you will leave for growth.

All-flash or hybrid: what is more practical

Hybrid (SSD for cache + HDD for capacity) is a common choice when cost per terabyte is the priority and workloads are largely sequential: archives, backups, cold files. For VDI, databases and active virtualization, all-flash usually wins: lower latency, more predictable response and easier growth planning. Hybrid often fails not at start but later, when the cluster grows and background operations (rebuild, rebalance) compete with production workloads.

How to pick cache and capacity

The cache device in vSAN constantly accepts writes and serves hot data. More important than raw size are consistent low latency and high endurance. Capacity disks can be chosen by capacity, but they must also support the needed I/O profile.

Before buying, check vendor specs and, if possible, vendor tests for:

endurance (DWPD or total write resource) for cache devices and projected writes for 12–24 months
NAND type and SSD purpose: read-intensive models are often unsuitable for cache
latency under load, not just “ideal” figures: stability without rare large spikes is crucial
behavior when a disk fills: some models slow dramatically beyond a certain fill level
uniformity of disks within a disk group (especially cache) to avoid uneven performance
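
The endurance check in the first item comes down to simple arithmetic: DWPD is the daily write volume divided by drive capacity, and total writes over a period are DWPD x capacity x 365 x years. The sketch below assumes you already have a projected daily write volume per cache device; the 1.5x headroom factor and the drive size are hypothetical.

```python
def required_dwpd(daily_writes_tb, drive_capacity_tb, headroom=1.5):
    """Drive writes per day needed to absorb the projected write volume."""
    return daily_writes_tb * headroom / drive_capacity_tb

def total_writes_tb(dwpd, drive_capacity_tb, years):
    """Total data written over the period at a given DWPD rating."""
    return dwpd * drive_capacity_tb * 365 * years

# Hypothetical cache device: 0.8 TB, receiving 1.6 TB of writes per day.
need = required_dwpd(daily_writes_tb=1.6, drive_capacity_tb=0.8)
print(f"required rating: about {need:.1f} DWPD")
print(f"writes over 5 years at 3 DWPD: {total_writes_tb(3, 0.8, 5):.0f} TB")
```

Compare the result with the DWPD or total-write figure in the vendor datasheet for the same warranty period.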

Mixing different SSD models almost always yields unstable response: one disk replies fast, another pauses occasionally, and the whole group suffers. A typical scenario: after expansion faster SSDs were added, some data moved, and users experienced micro-freezes due to inconsistent latency.

Practical rule: prefer fewer SKUs and clear endurance headroom rather than “assembling from leftovers” and then hunting for the cause of unstable response.

Network for vSAN: bandwidth, latency and stability

vSAN lives and dies by the network. Disks can be fast, but if inter-node traffic hits an uplink limit or suffers packet loss, you will see higher latency and performance drops. So plan network headroom not only for today but also for the cluster after node additions.

How much bandwidth is needed and when to move to 25/40/100GbE

Minimum configurations work only for small clusters and moderate workloads. As the VM count grows, the write share increases or replication traffic intensifies (FTT=1/2, encryption, active snapshots), the network becomes a bottleneck. A simple guideline: if on 10GbE you regularly see high vSAN background traffic and uplink utilization near the limit during peak hours, further expansion will probably worsen latency.

Moving to 25GbE is usually justified when you plan node growth, more disk groups, larger rebuild/resync volumes, or need to get through peak events without business impact. 40/100GbE is for very large environments, dense consolidation and strict latency requirements.

Practical principles that help

Reduce variability to zero: same VLAN, MTU and reservation logic on all hosts. Otherwise some errors will appear only under load after expansion.

Before launch and before adding nodes, verify:

a dedicated VLAN for vSAN and predictable switch rules
consistent MTU across the path (host, vSwitch, uplink, switch); MTU mismatches often show up as seemingly random packet loss, because only large frames are dropped
two uplinks per host and control of oversubscription on uplinks and inter-switch links
identical NIC models (or strictly compatible ones), identical drivers and firmware
consistent offload settings so one host does not behave differently
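
The MTU item in this list is usually verified with `vmkping` from each host, sending a packet that exactly fills the frame with the don't-fragment bit set. The helper below only builds the command line; the VMkernel interface name and the target address are hypothetical placeholders.

```python
def vmkping_payload(mtu):
    """ICMP payload size that exactly fills one IPv4 packet of the given MTU."""
    return mtu - 20 - 8  # 20-byte IPv4 header + 8-byte ICMP header

def vmkping_command(vmk, target_ip, mtu=9000):
    """Build a vmkping call that fails if any hop has a smaller MTU."""
    # -d sets "don't fragment", so an undersized hop drops the packet
    # instead of silently fragmenting it.
    return f"vmkping -I {vmk} -d -s {vmkping_payload(mtu)} {target_ip}"

# Hypothetical vSAN VMkernel port and neighbour address.
print(vmkping_command("vmk2", "192.168.50.12"))
# vmkping -I vmk2 -d -s 8972 192.168.50.12
```

Run the resulting command between every pair of hosts, not just one, because a single misconfigured switch port is enough to cause the problem.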

A common scenario: a 4-node cluster on 10GbE runs fine, but after growing to 8 nodes prolonged resyncs start and write latency “jumps.” The cause is often an inter-switch link or uplink sized for current traffic without headroom. During rebuild/resync the network fills up, queues grow and packet loss appears.
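
A quick way to see this effect in advance is the oversubscription ratio of host-facing bandwidth to inter-switch bandwidth. The sketch is a simplification that assumes, as a worst case, that all vSAN traffic from the hosts on one switch crosses the inter-switch link; the port counts and speeds are hypothetical.

```python
def oversubscription_ratio(hosts, host_uplink_gbps, isl_links, isl_gbps):
    """Host-facing vSAN bandwidth divided by inter-switch link bandwidth."""
    return (hosts * host_uplink_gbps) / (isl_links * isl_gbps)

# Hypothetical switch with a 2 x 10GbE inter-switch link.
print(oversubscription_ratio(hosts=4, host_uplink_gbps=10, isl_links=2, isl_gbps=10))  # 2.0
print(oversubscription_ratio(hosts=8, host_uplink_gbps=10, isl_links=2, isl_gbps=10))  # 4.0
```

Doubling the hosts without touching the inter-switch link doubles the ratio, which is where the queues and drops during resync come from.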

Checks before launch: what to measure and record

Before moving workloads to production (and especially before cluster expansion), capture a baseline: how the system behaves now and how it should behave after. Without this, hidden overloads are easy to miss and you’ll end up “looking for culprits” when the cluster grows and slows down.

Capture baseline metrics for later comparison

Record metrics during normal business hours and during peak windows. Capture not only averages but also the 95th percentile, so that rare latency spikes are visible.

CPU Ready and overall CPU load per host
storage latencies (read/write) at VM and datastore level
IOPS and throughput for key VMs (read/write profile)
network latency and packet loss on vSAN paths
queues and signs of congestion on hosts and disks
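
Why the percentile matters is easiest to see on numbers. The sketch below uses the nearest-rank definition of a percentile (one of several common definitions) on a hypothetical set of latency samples in milliseconds.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]

# Hypothetical write latencies: mostly about 1 ms, with two spikes.
latencies_ms = [1.2, 1.1, 1.3, 1.2, 1.4, 1.1, 1.2, 1.3, 1.2, 9.8,
                1.1, 1.2, 1.3, 1.2, 1.1, 1.4, 1.2, 1.3, 12.5, 1.2]

print("median:", percentile(latencies_ms, 50))  # 1.2
print("p95:   ", percentile(latencies_ms, 95))  # 9.8
```

The median says the datastore is healthy; the p95 shows what users actually notice.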

Then check the vSAN state itself: Health shows no failed critical checks, there are no prolonged resyncs, disk groups are evenly loaded, and there are no persistent warnings on network or controllers. If resync takes hours after small changes, this often turns into chronic degradation after expansion.

Load testing: don’t “measure emptiness”

Synthetic tests are useful only if they resemble real workload. For VDI, stable latencies and many small ops matter; for backups, throughput matters. Run tests on a warmed cache and with enough threads, otherwise you’ll get impressive numbers that don’t reflect reality.

Finally, create a "cluster passport" to compare before/after when adding nodes:

ESXi/vCenter and vSAN versions, enabled features and storage policies
BIOS, HBA/RAID, NIC firmware and drivers (and their sources)
node composition and disk groups (cache/capacity, wear levels)
vSAN network diagram (port speeds, MTU, VLANs, redundancy)
thresholds: at which latency/CPU Ready/resync durations this becomes an incident

A simple guideline: 5–10 minutes of resync after planned tasks is usually acceptable. If resync is constant and grows after adding 1–2 VMs, stop and remove the root cause before moving to production.
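
The thresholds from the passport can be kept in a machine-readable form so that every check uses the same limits. This is a minimal sketch: the field names and the latency and CPU Ready limits are hypothetical, and only the 10-minute resync limit follows the guideline above.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    write_latency_p95_ms: float
    cpu_ready_pct: float
    resync_minutes: float

def incidents(measured, limits):
    """Return the names of measured metrics that exceed their threshold."""
    return [name for name, limit in vars(limits).items()
            if measured.get(name, 0) > limit]

# Hypothetical limits and one measurement taken after a planned task.
limits = Thresholds(write_latency_p95_ms=5.0, cpu_ready_pct=5.0, resync_minutes=10.0)
measured = {"write_latency_p95_ms": 3.1, "cpu_ready_pct": 2.0, "resync_minutes": 45}

print(incidents(measured, limits))  # ['resync_minutes']
```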

Step by step: commissioning and safe expansion

A successful ThinkAgile VX deployment often depends not on adding resources but on controlling vSAN background processes: resync, balancing and data re-layout after changes.

Adding a node without surprises

Before expansion, record the baseline: current latencies, IOPS, capacity usage and average resync time after maintenance. This gives a simple indicator: is it better, the same or worse.

Before adding a node, check at minimum:

ESXi/vSAN versions and firmware (BIOS, HBA/RAID, NIC) match the cluster
no vSAN network overload and no MTU errors, drops or CRCs
enough free space for policies (FTT, RAID-1/5/6) with headroom
no ongoing or minimal resync
alert thresholds are configured and responsible people see them
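
One way to make this checklist binding is to treat it as a gate: expansion starts only when the list of failed checks is empty. The sketch below assumes the results are filled in by hand or by your own tooling; the check names simply mirror the list above.

```python
def expansion_blockers(checks):
    """Return the failed pre-expansion checks; an empty list means go."""
    return [name for name, passed in checks.items() if not passed]

# Hypothetical results recorded before a maintenance window.
checks = {
    "versions and firmware match the cluster": True,
    "no vSAN network overload, MTU errors, drops or CRCs": True,
    "free space covers policies with headroom": False,
    "no ongoing or minimal resync": True,
    "alert thresholds configured and visible": True,
}

blockers = expansion_blockers(checks)
print("GO" if not blockers else f"STOP: {blockers}")
# STOP: ['free space covers policies with headroom']
```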

While the node joins, confirm that it connects to the vSAN network and storage correctly. Watch resync and make sure the cluster does not run into a latency peak. If resync grows too fast and impacts production, temporarily throttle it and move heavy operations to off-hours.

After addition wait for stabilization: resync completion and component distribution leveling. Run rebalance only when the system is calm, otherwise you create double load from both resync and data movement.

Expanding disks, cache and planning windows

When increasing capacity, avoid mixing many changes in one window. Either expand disk groups on existing nodes first or add new nodes in batches, but don’t do everything at once. Increase cache when you see write buffer pressure or sustained write-latency spikes.

For critical systems plan short windows with clear rollback: one change at a time, check metrics, then proceed. If firmware or network compatibility is uncertain, hire an integrator experienced in vSAN rather than troubleshooting in production.

Common mistakes that reduce performance after growth

Post-expansion issues usually expose earlier assumptions rather than being caused by the added nodes alone. On vSAN this is especially visible: data synchronization becomes more active, the workload profile changes, and the weak link becomes obvious.

The most common mistake is underestimating the network. On a small cluster, unstable latency or NICs running at limit may be tolerable. After growth, resync traffic increases and the network becomes the main limiter: latency and queues grow, and packet drops appear.

The second mistake is the wrong FTT/RAID choice. People often set conservative redundancy “just in case” and then are surprised by degraded write performance. Some FTT/RAID combinations increase write amplification and with it CPU, network and disk load.

The third mistake is tight or inconsistent cache. If some nodes have smaller or slower cache, placement and component distribution hit the weakest disk groups. Average metrics may look fine, but tail latencies (p95/p99) worsen and users feel it.

The fourth mistake is ignoring background tasks: resync, rebuild, rebalance. After adding nodes, vSAN actively reshuffles data, and if capacity and IOPS were planned without headroom, background operations take resources from production VMs.

The fifth mistake is firmware and driver drift across hardware batches. In practice this causes strange symptoms, from performance instability to rare controller errors.

To avoid degradation after growth: lock down firmware/drivers, align MTU/VLAN/NIC settings, check network headroom for latency and bandwidth, recalculate storage policies against actual write profile and keep capacity/IOPS reserves for background tasks. Keep disk groups of the same class on each node.

Example: a cluster grew from 4 to 8 nodes and users started seeing morning “freezes.” The cause was not the new nodes but a rebalance running at the same time as an aggressive FTT policy that greatly increased writes, combined with a 10GbE network without latency headroom.

Short health checklist after launch and during growth

vSAN health is easiest to track when you check it at the same points every time: right after launch, before adding nodes and after expansion. The idea is simple: record the “normal” state so deviations are obvious.

Quick check after launch

Go through five areas and save results in a log (date, who checked, numbers, screenshots):

network: packet loss and latency between hosts, consistent MTU along the path, real port speeds and uplink load during peaks
disks: read/write latency, SMART errors, SSD wear, controller queues, even load across disk groups
vSAN: active resyncs (or expected ones after changes), Health without critical errors, no signs of congestion, enough free space
hosts: CPU Ready and Co-Stop (if applicable), memory pressure, swapping, HBA/RAID queue depth and latency
processes: update routines (order, windows, rollback), change log, capacity plan for 12–24 months

Check before and after expansion

Before adding nodes compare current metrics to the baseline. After expansion repeat the same checks and separately evaluate how fast the cluster redistributes components.

If you see rising latency together with active resync, that can be normal during a rebuild. It becomes a problem when latency increases while network and disks are already near saturation and free space is low. In that case expansion will only aggravate the issue, so remove the bottleneck first.
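
This decision rule can be written down so that everyone on the team reads a latency rise the same way. The sketch is illustrative: the 80% saturation level and the 25% free-space floor are hypothetical thresholds, treating saturation of either network or disks as sufficient is an assumption, and the third outcome is added here only to cover the remaining case.

```python
def assess_latency_rise(resync_active, network_util, disk_util, free_space_pct,
                        saturation=0.8, min_free_pct=25.0):
    """Classify a latency rise seen before or after expansion."""
    saturated = network_util >= saturation or disk_util >= saturation
    if saturated and free_space_pct < min_free_pct:
        return "fix the bottleneck before expanding"
    if resync_active:
        return "expected during rebuild; re-check after resync completes"
    return "investigate: latency rose without resync"

# Hypothetical readings: utilization as a fraction, free space in percent.
print(assess_latency_rise(True, network_util=0.5, disk_util=0.4, free_space_pct=40.0))
print(assess_latency_rise(True, network_util=0.9, disk_util=0.4, free_space_pct=15.0))
```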

Example scenario: growing a cluster from 4 to 8 nodes without surprises

Initial state: 4 Lenovo ThinkAgile VX nodes (vSAN Ready) running critical VMs: a database, several applications and VDI. The business plans call for 8 nodes within a year. The risk: if you only add capacity while keeping the same cache and 10GbE for vSAN, the load doubles without sufficient I/O or network headroom.

Typical mistake: buying 4 more nodes with large capacity drives but leaving cache small and low-endurance and keeping 10GbE ports for vSAN. The first weeks may be fine, then rebuild/resync activity causes latency spikes. Users experience stalls even though “more hardware” was added.

A working scale plan includes more than capacity growth:

vSAN network: plan required bandwidth (often 25GbE) and identical configuration across 8 nodes; verify MTU, loss and stability
disks: choose cache with IOPS and endurance headroom and make disk groups uniform by profile (don’t mix old and new without understanding implications)
vSAN policies: fix FTT and the failure tolerance method (FTM), stripe requirements and IOPS limits for noisy VMs
expansion order: add nodes in batches (for example +2), let the cluster finish resync before continuing
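
The expansion order from the last item can be laid out as explicit steps with a stabilization gate after each one. A minimal sketch, assuming a fixed batch size; the wording of the gate is illustrative.

```python
def expansion_steps(current_nodes, target_nodes, batch=2):
    """Return the node count after each batch, ending at the target."""
    steps = []
    nodes = current_nodes
    while nodes < target_nodes:
        nodes = min(nodes + batch, target_nodes)
        steps.append(nodes)
    return steps

# Growing from 4 to 8 nodes in batches of 2.
for count in expansion_steps(4, 8):
    print(f"grow to {count} nodes, then wait for resync to finish and compare metrics")
```

For 4 to 8 nodes this gives two steps (6, then 8), each followed by the same checks against the baseline.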

After expansion to 8 nodes confirm there is no degradation by numbers. Typical comparisons:

average and p95 read/write latencies (vSAN and guest OS)
IOPS and throughput on VMkernel vSAN, no packet loss
resync duration and volume, no persistent resync tails
cache fill, write buffer pressure, dedupe/compression stats (if enabled)
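
Comparing "by numbers" can be as simple as a percentage change against the baseline for each lower-is-better metric. In the sketch below the metric names, the sample values and the 10% tolerance are hypothetical.

```python
def regressions(before, after, tolerance_pct=10.0):
    """Lower-is-better metrics that worsened by more than tolerance_pct."""
    worse = {}
    for name, base in before.items():
        change = (after[name] - base) / base * 100
        if change > tolerance_pct:
            worse[name] = round(change, 1)
    return worse

# Hypothetical baseline at 4 nodes and measurements at 8 nodes.
before = {"write_latency_p95_ms": 4.0, "read_latency_p95_ms": 2.0, "resync_minutes": 8.0}
after = {"write_latency_p95_ms": 5.0, "read_latency_p95_ms": 2.1, "resync_minutes": 8.0}

print(regressions(before, after))  # {'write_latency_p95_ms': 25.0}
```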

If these metrics are OK, growth from 4 to 8 nodes is smooth: new nodes add performance rather than creating bottlenecks.

Next steps: prepare data and move quickly to deployment

To deploy Lenovo ThinkAgile VX without surprises, start with accurate input data. Projects slow down not because of hardware but because no one documented current workload and expected growth.

Collect facts on workload and 12–24 month growth: how many VMs, which applications, IOPS and latency peaks, used capacity and growth rate. Capture current CPU, RAM, network and storage metrics so you know where bottlenecks appear.

Agree on storage goals: what fault tolerance you want (e.g. survive a node failure), required availability level, and whether you’re willing to pay capacity for higher reliability. Decide how you will expand: identical nodes or allow mixed profiles.

Before procurement and launch run a joint design review: nodes, network, disks, versions, update procedures and expansion plan. It’s convenient to document this in a short set of artifacts:

workload table and growth forecast
chosen storage policies and latency limits
agreed node, network and disk-group configuration
test plan and acceptance criteria before production

If an independent view is needed, involve a systems integrator for assessment and expansion planning. For example, GSE.kz (gse.kz) can assist with compatibility checks, infrastructure design and 24/7 support so cluster growth does not become a performance and manageability problem.