Pilot tests for server compatibility with a hypervisor
What to check in networking, disks, RAID/HBA, power and live migration so you can buy a batch with confidence.

What to check and why it matters
You need pilot tests because "it boots and installs" does not mean "it will run in production for years." Before buying a large batch, verify that the exact server model, firmware and drivers behave predictably with your hypervisor and configuration set.
In practice, a compatibility pilot checks four areas: functionality, performance, stability and manageability.
Functionality means support for required modes and features (VLAN/LACP, passthrough, multipath, UEFI/Secure Boot, RAID/HBA operation). Performance means real throughput and latency figures, not just datasheet numbers. Stability means absence of hangs, driver errors, unexpected reboots and degradation under load. Manageability means the ability to service nodes remotely (BMC), update firmware and quickly recover after failures.
Issues usually surface closer to production, when load is mixed and constant. For example: the network holds its negotiated link speed but drops packets at peaks; a RAID controller passes synthetic tests but latency grows under sustained random I/O; live migration works until encryption or jumbo frames are enabled; a BMC is "visible" but times out during parallel operations.
To make a confident procurement decision, results must be measurable and repeatable. Define in advance what counts as "pass":
- versions of hypervisor, BIOS/BMC, firmware and drivers
- metrics: IOPS, latency (p95/p99), throughput, rebuild time, migration time
- failure scenarios: link loss, disk failure or removal, node reboot, power loss
- stability criteria: no critical errors in logs and no spontaneous reboots
- repeatability: each run repeated at least twice with consistent results (within a tolerance agreed in advance)
If you plan a cluster for a government organization and buy servers from a local vendor (for example, GSE.kz), record not only "compatible" but the exact set of versions and settings. That simplifies operations and support conversations.
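The pass criteria above can be captured as data so that every run is judged the same way. The following is a minimal Python sketch, not a definitive tool: the metric names, the limits and the 10% run-to-run tolerance are illustrative assumptions, not values from any vendor or hypervisor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One measurable pass condition (names and limits are hypothetical)."""
    metric: str
    limit: float
    higher_is_better: bool


def passes(criterion: Criterion, value: float) -> bool:
    """Return True if a measured value satisfies the criterion."""
    if criterion.higher_is_better:
        return value >= criterion.limit
    return value <= criterion.limit


def repeatable(run_a: float, run_b: float, tolerance: float = 0.10) -> bool:
    """Two runs agree if their relative difference is within the tolerance."""
    reference = max(abs(run_a), abs(run_b))
    if reference == 0:
        return True
    return abs(run_a - run_b) / reference <= tolerance


# Example thresholds only; replace with the values agreed for your project.
CRITERIA = [
    Criterion("random_4k_iops", 50_000, higher_is_better=True),
    Criterion("p99_write_latency_ms", 5.0, higher_is_better=False),
    Criterion("migration_time_s", 30.0, higher_is_better=False),
]


def evaluate(run_a: dict, run_b: dict) -> dict:
    """Judge each metric: both runs must pass and agree with each other."""
    verdict = {}
    for c in CRITERIA:
        a, b = run_a[c.metric], run_b[c.metric]
        verdict[c.metric] = passes(c, a) and passes(c, b) and repeatable(a, b)
    return verdict
```

A metric that passes its limit in both runs but differs by more than the tolerance is still marked as failed, which matches the requirement that results be repeatable and not just good once.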
Test-stand and baseline conditions
A pilot is useful only if the test stand resembles the intended production topology and all conditions are documented. Otherwise you measure random settings, not the server.
A minimal stand should be just large enough to reveal common network, disk and failure-handling issues:
- 2–3 identical hosts (preferably from the same batch and with the same options)
- 2 switches for redundancy (separate uplinks, separate VLANs if needed)
- shared storage for migration tests (if the project won’t use shared storage, test local disks and your replication method)
- a separate management network (BMC/admin) so it’s not mixed with VM traffic
The most important next step is to freeze versions and configuration. A common mistake: one host has updated firmware and the other doesn’t, producing inconsistent results without an obvious cause.
What to record before starting
Put these in a single template (a table) and don’t change them during the pilot without noting the reason:
- BIOS/UEFI, BMC, RAID/HBA firmware, and NIC driver versions
- power profile (performance/energy saving), CPU frequencies, Turbo, C-states
- NUMA settings and memory allocation (especially for large VMs)
- SR-IOV or other offload functions if you plan to use them in production
Define the metrics you will defend when presenting procurement results. Typical metrics include network latency and throughput, disk IOPS and latency under load, recovery time after failure, and percentage of successful migrations.
Keep logging simple and repeatable: enable centralized collection of hypervisor system logs and BMC events, save host configs, capture metrics at consistent intervals and note every change (firmware, cable, port, setting).
Example: if you take 3 nodes for a future GSE cluster, ask that they arrive with identical BIOS/BMC versions and identical NICs. That helps quickly locate whether an issue is hardware, firmware or hypervisor settings.
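Version drift between hosts is easy to check mechanically once the template is filled in. The sketch below assumes the recorded versions are available as a plain dictionary per host; the component names and version strings in the usage note are hypothetical, and how you collect them (BMC export, hypervisor CLI, manual entry) is left open.

```python
from collections import defaultdict


def find_drift(inventory: dict[str, dict[str, str]]) -> dict[str, dict[str, list[str]]]:
    """Return components whose recorded version differs between hosts.

    inventory maps host name -> {component: version}. A component missing
    on a host is reported as "<missing>" so gaps are as visible as mismatches.
    """
    components = sorted({c for versions in inventory.values() for c in versions})
    drift = {}
    for component in components:
        by_version = defaultdict(list)
        for host in sorted(inventory):
            by_version[inventory[host].get(component, "<missing>")].append(host)
        if len(by_version) > 1:  # more than one distinct value means drift
            drift[component] = dict(by_version)
    return drift
```

For example, calling `find_drift` on three hosts where only one has a different `bmc` value returns a single `bmc` entry that groups the host names by version. An empty result means the stand is uniform for every recorded component.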
Network: functionality, performance and resilience
Network issues in virtualization often look like "random freezes": migrations fail intermittently, storage disconnects, or VMs lose connectivity. Include checks not only for speed but for settings that commonly break clusters.
First, ensure the hypervisor correctly sees NICs and uses the proper drivers. Verify key features on all required networks (management, storage, live migration): VLAN tagging, LACP/aggregation, jumbo frames and offload features. Offloads can show great benchmark numbers but cause problems under real load.
Basic checks to do first:
- confirm MTU matches across the chain: host, vSwitch, physical switch and uplinks (no silent fragmentation)
- check for configuration errors: swapped VLANs, duplicate IPs, wrong gateways, asymmetric rules
- ensure management, storage and migration networks are separated and cannot be accidentally mixed
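The MTU check can be made concrete with a little arithmetic. An IPv4 ICMP echo carries a 20-byte IP header (without options) and an 8-byte ICMP header, so the largest ping payload that crosses a path unfragmented is the MTU minus 28 bytes. The sketch below computes that payload and flags hops with a smaller MTU than expected; the hop names are hypothetical and the MTU values are assumed to have been collected by hand or by your own tooling.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def mtu_mismatches(chain: dict[str, int], expected: int) -> list[str]:
    """Return the hops whose MTU is below the expected end-to-end value.

    A hop with a larger MTU (switches are often set higher than hosts)
    is not a problem, so only smaller values are reported.
    """
    return [hop for hop, mtu in chain.items() if mtu < expected]
```

For a 9000-byte MTU the payload is 8972 bytes, and for 1500 it is 1472. Sending a ping of that size with the don't-fragment option set, between every pair of hosts on the storage and migration networks, is the usual way to confirm the recorded values match reality.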
Then measure throughput and stability under load: it’s not just gigabits per second that matter, but behavior with many flows and small packets. A good sign is steady speed without sharp drops or latency spikes during prolonged runs.
Test resilience as it happens in real life:
- pull one link/port from an LACP bundle and see if sessions drop
- reboot a switch (or place it into maintenance) and measure recovery time
- change the active uplink and verify VLANs and MTU remain intact
Simple scenario: two hosts in a cluster, one of which runs VM load while a live migration is in progress. During migration, disconnect one physical port. If the migration fails and the VM network drops, the root cause is often LACP, MTU or network separation. Catch these issues before buying a batch, especially for common configurations that integrators like GSE.kz deploy into government and corporate environments.
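To turn "see if sessions drop" and "measure recovery time" into a number, one simple approach is to probe the target at a fixed interval during the failure test and compute the longest gap afterwards. The sketch below assumes you already have a time-ordered list of probe results; the probing itself (ping, TCP connect, application request) is outside its scope.

```python
def longest_outage(samples: list[tuple[float, bool]]) -> float:
    """Return the longest continuous outage in seconds.

    samples is a time-ordered list of (timestamp_seconds, reply_received).
    An outage runs from the first lost probe to the next successful one.
    An outage still open at the end of the capture is counted up to the
    last sample.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp          # first lost probe
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None               # connectivity restored
    if outage_start is not None and samples:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

The resolution is limited by the probe interval: with one probe per second, a 300 ms LACP failover may not be visible at all, so choose an interval well below the recovery threshold you intend to defend.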
Disks and storage I/O: how to avoid surprises
Disk problems often don’t appear on day one but after a week: speed drops, latency grows, errors appear under sustained load. Include long runs, not just quick benchmarks.
Start with inventory. Record exact drive models, type (SSD/HDD, SATA/SAS/NVMe), firmware, rated endurance (TBW/DWPD for SSDs) and supported modes. The batch you buy later must match what you tested.
Then test performance as in real life: not only sequential read but random small-block access. Run the same profiles across several configurations (for example RAID10 vs RAID5, different SSD models) to see honest differences.
A short mandatory test list:
- read/write: sequential and random, block sizes 4K, 8K, 64K, 1M
- latency and IOPS stability as queue depth increases
- long run of 8–24 hours: drops, SSD throttling, rising errors
- SMART monitoring: remaps, media errors, wear level trends
- behavior during degradation: one disk failing, rebuild, impact on VMs
Look for predictability rather than peak numbers. For example, if write latency on a two-node stand (e.g., S200 Series) spikes every 10–15 minutes, investigate controller cache, power-saving settings and drive temperature before buying a larger batch.
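Predictability is easier to judge from tail percentiles than from averages. The following sketch summarizes a list of per-request latencies using the nearest-rank percentile method; it assumes the benchmark tool can export raw latency samples in milliseconds, which is an assumption about your tooling.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Mean, tail percentiles and maximum for one test run."""
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "max": max(latencies_ms),
    }
```

With 98 samples at 1 ms and two spikes of 40 ms and 50 ms, the mean stays below 2 ms while p99 is 40 ms. This is the pattern described above: the average looks healthy and the tail shows the periodic stalls.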
RAID and HBA: check modes, rebuild and failure scenarios
Choose between a RAID controller and HBA/passthrough based on your hypervisor and storage model, not habit. For local datastores, hardware RAID is often convenient. For SDS, where the platform manages disks itself, an HBA in IT/JBOD mode or passthrough is typical.
Ensure the hypervisor not only sees the controller and disks but that expected features work. Check queue depths, controller cache, write policy (write-back/write-through) and cache protection (battery/supercap) if using write-back.
Minimum checks before procurement:
- controller mode: RAID vs HBA/passthrough; all disks visible and SMART/statuses reported
- cache policies: enable write-back only when safe; settings should not reset after reboot
- performance and latency: compare random read/write and mixed load on a clean array and under background tasks
- events and alerts: disk errors, array degradation and recovery must appear in logs and monitoring
Run rebuild tests separately. It’s not enough that rebuild "starts"—observe what happens to VMs during rebuild. Run typical VM load and simulate a disk failure. Record rebuild time, dependence on disk size and RAID level, IOPS drop and latency rise, and VM behavior (timeouts, filesystem errors, hangs).
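The rebuild measurements can be reduced to a few comparable figures. The sketch below is one possible way to do that, under the assumption that you have steady-state and rebuild-time measurements from the same load profile; the dictionary keys are hypothetical names, not output fields of any particular tool.

```python
def rebuild_impact(baseline: dict[str, float], during: dict[str, float]) -> dict[str, float]:
    """Compare steady-state and rebuild-time measurements.

    Both dicts use the hypothetical keys "iops" and "p99_latency_ms".
    Returns the IOPS drop in percent and the latency multiplier.
    """
    iops_drop_pct = (baseline["iops"] - during["iops"]) / baseline["iops"] * 100
    latency_factor = during["p99_latency_ms"] / baseline["p99_latency_ms"]
    return {"iops_drop_pct": iops_drop_pct, "latency_factor": latency_factor}


def rebuild_hours_per_tb(rebuild_hours: float, disk_size_tb: float) -> float:
    """Normalize rebuild time by disk size to compare different capacities."""
    return rebuild_hours / disk_size_tb
```

Normalizing by capacity makes it possible to compare a pilot run on smaller drives with the larger drives planned for the batch, although the relationship is only approximately linear and should be re-measured on the final drive size.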
Perform failure scenarios manually: pull a disk, insert a new one, verify auto-resume, correct assignment to the array, no "foreign configuration" confusion and clear recovery steps. Also lock firmware versions: controller firmware and hypervisor driver must be compatible. Finding rare bugs in pilot is far cheaper than in production—this is true even for server platforms like GSE S200 Series.
Power management and BMC: ensure remote control works
Even if CPU and disk checks pass, BMC problems often appear later: missing sensors, remote console failures, or lost access after an update. This is a direct risk of downtime since many recovery actions are performed remotely.
First, verify monitoring reflects actual hardware: CPU and chassis temperatures, fan speeds, voltages and power supplies, and System Event Log entries. It’s important that identical servers show similar readings under the same load.
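A quick way to check that identical servers report similar readings is to compare each host with the group median. The sketch below assumes one reading per host for a single sensor, captured under the same load; the tolerance is a value you choose, not a vendor specification.

```python
from statistics import median


def sensor_outliers(readings: dict[str, float], tolerance: float) -> dict[str, float]:
    """Return hosts whose reading deviates from the group median by more
    than the tolerance, together with the signed deviation.

    readings maps host name -> one sensor value (for example, CPU
    temperature in degrees Celsius).
    """
    mid = median(readings.values())
    return {host: value - mid
            for host, value in readings.items()
            if abs(value - mid) > tolerance}
```

A host that runs noticeably hotter than its neighbours under the same load may point to a fan, airflow or sensor calibration problem, and it is worth resolving before treating that host's benchmark results as representative.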
Test power profiles. Switch between performance, balanced and energy-saving modes and measure effects on latency (for example, pings between hosts and simple VM load). Energy-saving modes can cause noticeable latency spikes that later hurt clusters and databases.
Minimal BMC checks before a batch purchase:
- power on/off, reboot and correct power-state reporting
- remote console (KVM), mounting an image and booting from it
- availability over the dedicated management network, separate accounts and roles
- settings preserved after firmware updates and reboots
- proper logs and events on overheating and power loss
Practical scenario: on a test server (for example, a GSE S200 unit) create a controlled thermal rise and confirm BMC records temperature increase, raises an alarm and logs the event. Also simulate a PSU failure and verify the server doesn’t unexpectedly reset and the event is logged.
Live migration and HA: validate migrations and recovery
Live migration and HA often appear to "work" until the first real incident. In the pilot confirm not only that migration completes but that it is predictable in time, doesn’t break the network and doesn’t create noticeable pauses for users.
First, align hosts. Migrations fail when CPU features or BIOS settings differ. Ensure hosts have identical firmware, the required virtualization modes enabled, the same power settings and the same CPU compatibility mode (a common CPU feature baseline) if your hypervisor offers one. Otherwise migrations may only work between "similar" hosts.
Test migration under load. Run 2–3 VMs with real profiles: one with heavy disk activity (journaled writes), one with network traffic and one with a latency-sensitive workload (VDI or terminal server). Migrate them sequentially and in parallel.
Mandatory checks and records:
- migration of idle and loaded VMs: time, percentage of failed attempts, repeatability
- user impact: application pause, packet loss, session timeouts
- migration network failure: disable the migration port or VLAN and ensure VMs don’t crash and errors are clear
- host failure: forcibly reboot a node and verify automatic VM restart on a neighbor
- resource balancing (DRS-like): move VMs under CPU/RAM pressure without "ping-pong"
Example: for a rack build on GSE S200 Series, create at least two migration networks (primary and backup) and confirm that losing one either cleanly switches migrations or safely blocks them without causing service downtime.
Results here should be formal: a list of conditions under which migration is allowed and measurable thresholds (e.g., maximum pause and acceptable recovery time after a node reboot).
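Formal thresholds are easiest to apply when every migration attempt is logged in the same shape. The sketch below is an illustration only: the record fields and the threshold parameters are assumptions, and the pause value has to come from your own measurement (for example, a probe running inside or against the VM).

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class MigrationAttempt:
    duration_s: float  # total migration time
    pause_ms: float    # observed VM pause during switchover
    succeeded: bool


def migration_report(attempts: list[MigrationAttempt],
                     max_pause_ms: float,
                     max_failure_rate: float) -> dict:
    """Summarize attempts and compare them with the agreed thresholds."""
    if not attempts:
        raise ValueError("no migration attempts recorded")
    ok = [a for a in attempts if a.succeeded]
    failure_rate = 1 - len(ok) / len(attempts)
    # With no successful attempt the pause is treated as unbounded.
    worst_pause = max((a.pause_ms for a in ok), default=float("inf"))
    return {
        "failure_rate": failure_rate,
        "median_duration_s": median(a.duration_s for a in ok) if ok else None,
        "worst_pause_ms": worst_pause,
        "passed": failure_rate <= max_failure_rate and worst_pause <= max_pause_ms,
    }
```

Reporting the worst pause and not the average is deliberate: a single long pause is what users notice, and it is the value the acceptance threshold should be compared against.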
Step-by-step pilot plan for 5–10 working days
Run the pilot on a minimal but honest stand: 2 hosts (for migrations) and, if you plan to use a SAN, a test array or at least the same HBA and cables as in procurement. Agree up front on what you record as the "reference": BIOS, BMC, RAID/HBA firmware, drivers, hypervisor version and settings.
Day-by-day plan
Below is an example that usually fits into 5–10 days depending on scenarios and approvals:
- Day 1: record firmware and install the hypervisor on both hosts. Check that devices are discovered without "manual hacks" and drivers install normally.
- Day 2: configure network per your template (VLAN, bonding/teaming, separate management and migration networks). Run MTU tests (including jumbo frames if required), test single-port failure and switch failover (if you have a pair), and measure recovery.
- Days 3–4: assemble storage profiles (RAID levels, cache, write policy, queue settings). Run I/O tests for read and write, check latency under load, then simulate a disk failure and evaluate rebuild (time, performance drop, cache behavior).
- Day 5: enable monitoring (alerts for disks, RAID, temperature, power), update firmware to the agreed set and rerun key tests to rule out regressions.
- Days 6–7: test live migration, HA and failure scenarios: reboot a host, disable migration network, simulate loss of a storage path, and verify VM recovery.
After the technical part produce a protocol another engineer can reproduce: stand conditions, exact versions and settings (commands or screenshots), metrics (throughput, IOPS, latency, rebuild and migration time) and a prioritized list of problems with reproduction steps. If servers come from a local vendor like GSE.kz, attach this protocol to support conversations.
Common mistakes in pilot tests
Pilots often fail not because of hardware but due to small undocumented details. The outcome looks like random failures: migrations intermittently fail, latencies jump, disks suddenly degrade.
Typical mistakes:
- different BIOS/UEFI and firmware versions across cluster nodes. Updating microcode on one node and not the other can destabilize live migration
- testing only average throughput and skipping link-failure and recovery. A network can be "fast" but fail to recover LACP quickly, causing short VM disconnects
- checking RAID/HBA in normal mode but skipping rebuild tests. Rebuild time is when timeouts, latency spikes and driver bugs often appear
- leaving power profiles at defaults. Aggressive C-states and auto power-savings introduce unpredictable latency, especially for storage I/O and peak loads
- not separating management, storage and migration networks. When traffic shares ports, the hypervisor is blamed, while the design is the issue
Short example: two nodes pass load tests, but migrations hang in 1 of 10 attempts. Investigation shows that one node had a different SR-IOV mode and newer NIC firmware. Aligning versions and repeating the tests removes the issue.
To reduce surprises, agree on pilot discipline:
- record a version matrix: BIOS, BMC, NIC, HBA/RAID, and hypervisor drivers
- log parameters of each run (load, VM count, MTU, power profile)
- include negative checks: port failure, disk failure, rebuild, BMC reboot
- compare results only when configuration and settings match
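One way to enforce "compare only when configuration matches" is to store a short fingerprint of the configuration with every run. The sketch below hashes a canonical JSON form of the run parameters; the parameter names are hypothetical, and it assumes the configuration contains only JSON-serializable values.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Stable short hash of a run configuration (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def comparable(run_a: dict, run_b: dict) -> bool:
    """Two runs may be compared only if their configurations match exactly."""
    return config_fingerprint(run_a["config"]) == config_fingerprint(run_b["config"])
```

Writing the fingerprint next to each result in the run log makes it obvious when two numbers come from different settings, for example after an MTU or power-profile change that nobody noted.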
If the pilot is with an integrator, request a final report in this format: what was tested, which versions, what is reproducible and how to fix it. Then procurement decisions rest on facts, not a single good day on the stand.
Short checklist before the procurement decision
Before signing specs and ordering a large batch, run this short checklist. It helps you make decisions based on facts rather than impressions. Record results while the stand is still assembled and easy to re-check.
Verify key areas and mark yes/no for each. If any item fails, stop and investigate rather than fixing problems in production.
- Network: MTU consistent on switches, servers and hypervisor. LACP stable without flaps. When a port is disabled, traffic and management continue without session drops longer than a few seconds.
- Disks and storage I/O: 8–24 hours of stable load without log errors or performance degradation. Disk and controller temperatures stay within safe ranges; no throttling or sudden timeouts.
- RAID or HBA: required mode (RAID, HBA passthrough, cache policy) works and does not reset after reboot. Rebuild fits the acceptable window, and a single disk failure doesn’t crash the host or hang VMs.
- BMC and power: remote console available and stable, events logged, power on/off and reboot work correctly. Passwords and roles are documented and there is a way to recover access without a physical visit.
- Live migration and HA: migrations run in series and under load between hosts. After a test host failure, VMs recover as expected without long pauses or data corruption.
End with a short protocol: firmware and driver versions, NIC and controller models, settings (MTU, LACP, RAID/HBA mode), test results and final pass/fail. This is critical where procurement must be verifiable and reproducible.
Example scenario: pilot before buying cluster servers
A company plans a 24/7 data center upgrade and intends to buy 40 identical servers for a single cluster. A hypervisor compatibility issue at that scale quickly leads to downtime, night interventions and disputes with suppliers. So a pilot runs on a minimal but honest stand.
They assemble three identical servers (the target model), two switches and a set of 10 VMs: a database, two application servers, a file service and a VM with heavy disk load. They immediately record versions: BIOS/BMC firmware, RAID or HBA firmware, drivers, hypervisor version and network settings (bonding, VLAN, MTU).
Initial tests pass, but under load two surprises appear. Some migrations stall or take 3–4 times longer. The disk subsystem shows latency spikes, and after a simulated disk failure the rebuild heavily impacts performance and some VMs start to hang.
Analysis reveals a common combo: outdated RAID firmware and an unsafe cache mode. After updating firmware (including NICs), switching cache policy and tuning I/O queues, tests are rerun.
Before the final decision they run a short acceptance series:
- migrations under concurrent network and disk load
- single link and single switch failure with availability maintained
- array rebuild (or HBA path degradation) with latency measurements
- node reboot and HA restarts for VMs
- BMC power management checks (including access and roles)
They record a reference configuration for procurement: exact NIC and HBA/RAID models, firmware versions, cache settings, network parameters and an acceptance test list. If procurement goes through an integrator like GSE.kz, include these as mandatory delivery and support requirements so the entire batch arrives in the same verified state.
Next steps after the pilot and how to lock results
After the pilot, don’t just say "it works" — document what you tested and under which conditions. Otherwise a delivered batch may contain different NIC revisions, another controller or firmware, and your pilot results become invalid.
Create a compatibility matrix in one document: server model, NICs, RAID or HBA, drive types, firmware versions (BIOS, BMC, controllers), hypervisor version and drivers. This becomes the reference for procurement and support.
Define acceptance criteria and owners in advance: what counts as success (stable live migration, correct RAID rebuild, no driver errors under load) and who signs off. The sign-off matters because it gives the test results official status.
Allow time for updates and re-tests. Updated BIOS or controller firmware? Repeat key scenarios. This is normal in a pilot.
To ensure batch uniformity, create a golden configuration profile:
- BIOS and power modes
- BMC settings and access accounts
- network parameters (MTU, VLAN, bonding, offloads)
- RAID or HBA modes (passthrough, cache policy)
Then lock the result into an implementation plan: who applies the profile, how each delivery is checked against the matrix, and which acceptance tests are repeated on receipt.
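Checking each delivery against the golden profile can be partly automated. The sketch below compares a nested settings dictionary from a delivered host with the golden profile and lists every deviation; the setting names are hypothetical, and collecting the actual values from BIOS, BMC and the hypervisor is assumed to be done by your own tooling.

```python
def flatten(profile: dict, prefix: str = "") -> dict[str, object]:
    """Turn nested settings into dotted keys,
    e.g. {"network": {"mtu": 9000}} -> {"network.mtu": 9000}."""
    flat = {}
    for key, value in profile.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def deviations(golden: dict, actual: dict) -> dict[str, tuple]:
    """Return {setting: (expected, found)} for every golden setting that
    differs on the host. Settings missing on the host are reported with
    None; extra settings that the golden profile does not list are ignored.
    """
    want, have = flatten(golden), flatten(actual)
    return {k: (want[k], have.get(k)) for k in want if have.get(k) != want[k]}
```

An empty result for every host in a delivery is a reasonable precondition for starting the acceptance tests. Any non-empty result is a concrete list to hand back to the supplier.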
If you lack internal capacity or want an independent view, involve GSE.kz as a manufacturer and integrator: they can help select a hypervisor-compatible configuration, lock down the golden settings and run repeatable pilot tests before the full procurement.
FAQ
Why run a pilot if the server "installs and boots"?
A pilot confirms that your exact combination — server model, firmware, drivers, hypervisor and settings — behaves predictably under prolonged mixed load. It reduces the risk of buying a batch that looks fine in synthetic tests but causes timeouts, instability or failed migrations in production.
What must be checked in a compatibility pilot?
Check four areas: functionality (required modes like VLAN/LACP, Secure Boot, multipath), performance (throughput and latency), stability (no hangs, reboots or critical errors in logs) and manageability (BMC, updates, recovery after failures). If any area is unstable, investigate before procurement.
What is the minimal honest test-stand?
Build a stand similar to the planned deployment: at least 2 identical hosts for migrations, ideally two switches for redundancy, a separate management network and the same type of storage that will be used in the project. The closer the stand to production, the fewer surprises.
Which versions and settings should be recorded to make results repeatable?
Freeze versions and settings before you start: BIOS/UEFI, BMC, RAID/HBA and disk firmware, NIC drivers, hypervisor version, power profile, NUMA layout and network options (MTU, offload, SR-IOV). If you change something, note the reason and repeat key runs, otherwise results are not comparable.
How to quickly catch typical network issues in virtualization?
Start with basic hygiene: same MTU end-to-end, correct VLANs, no IP conflicts and clear separation of networks (management, storage, migration). Then check packet loss and latency growth during long runs, and test behavior when ports or a switch fail.
Which disk and storage I/O tests reveal future problems?
Test not only peak numbers but stability under realistic VM profiles: random small-block access, mixed read/write and an 8–24 hour run. Look at p95/p99 latencies and logs — average throughput often hides spikes that throttle applications.
How to decide between RAID controller and HBA, and what to check?
Decide the mode by your storage model: hardware RAID for local datastores, HBA/JBOD/passthrough for SDS. Verify cache policies, queue depths and that settings persist after reboot. Ensure degradation and recovery events are reported to logs and monitoring.
How to properly test array rebuild and disk failure?
Measure rebuild in a realistic environment: run typical VM load, then simulate a disk failure. Record rebuild time, IOPS drop, latency increase and VM behavior (timeouts, hangs, filesystem errors). Rebuild is when firmware, cache and driver issues usually surface.
What to check in BMC to avoid trips to the server room?
Ensure BMC is reliably reachable over the dedicated management network, reports sensors and events correctly, and remote console and image mounting work without timeouts. Verify settings survive firmware updates and reboots — BMC problems often lead to physical visits and downtime.
How to test live migration and HA to get useful results before purchase?
Make hosts as identical as possible in firmware and CPU/BIOS settings so migrations aren’t blocked by feature differences. Run migrations under load, measure duration and failure rate, and test HA by forcing a host reboot and checking VM restart behavior on the neighbor.