Nov 23, 2025·7 min

Disk acceptance tests on servers: fio/DiskSpd and thresholds

Disk acceptance tests on servers: how to configure fio and DiskSpd, choose profiles for databases, files and virtualization, prepare a report and set failure thresholds.

Acceptance testing of disks in a server is a short set of load tests run right after delivery or replacement, before putting the drives into production. The task is simple: make sure drives behave predictably and the batch is uniform. In practice, this check saves hours of downtime and long incident investigations in production.

You can’t rely solely on specification numbers. Those are usually measured in lab conditions, often in a favorable scenario such as sequential read. Real-world load is mixed: small blocks, parallel streams, queue depth, SSD background cleanup, controller caches, RAID, filesystem. So a disk can be “fast on paper” but produce latencies that hurt databases, virtualization and file services.

Acceptance catches typical supply and compatibility issues: wrong items shipped (wrong model or capacity), unsuitable firmware, mixed revisions in one batch, degradation after storage, overheating in a tray, and sometimes outright defects. In server environments this is critical: one “bad” disk in an array can drag down the whole array.

More often than not, the problem is not a lack of peak speed but instability. For example, one of eight new SSDs might periodically spike to hundreds of milliseconds on writes. On a file server this looks like “hangs”; in a virtualization cluster it shows up as VM freezes and higher application response times. Such a disk may pass a superficial SMART check but fail under load.

To avoid turning acceptance into guesswork, agree in advance which metrics you’ll record:

  • performance (IOPS and MB/s) in the required modes
  • latency (average and p95/p99/p99.9)
  • stability over time (are there drops or a “saw” pattern)
  • I/O errors, timeouts, resets
  • variance across disks in the same batch

Then everything comes down to two questions: which workload profiles match your services, and which thresholds count as a failure rather than a configuration peculiarity.

Before tests: what affects results and how not to confuse layers

You can ruin results before fio or DiskSpd runs. Start not with commands but with recording conditions: which drive, which controller, at what level you measure and which caches are enabled.

Expectations depend heavily on drive type. NVMe usually gives low latency and high IOPS. SATA SSDs are limited by interface and controller. SAS often behaves more steadily under sustained load. HDDs are good at sequential tasks and drop sharply on random access. If you hold every drive type to the same expectations, you’ll either reject a normal HDD or miss a problematic SSD.

A separate topic is RAID/HBA and cache. Write-back easily produces good-looking numbers while writes sit in controller cache rather than media. Write-through is more honest but slower. Cache protection (BBU or supercapacitor) matters: without it, write cache is often disabled and system behavior changes. Record cache mode and write policy before testing.

Filesystem also matters. A “file-level” test depends on the FS, block size, mount options and background jobs. Block-device tests are closer to the hardware. For acceptance it’s useful to know exactly what you’re testing: the drive itself, the storage stack, or the setup “as it will be in prod”.

Parallelism shapes the load. Queue depth (QD) and number of threads (numjobs) can either reveal the drive’s real behavior or hide defects. Too small a queue under-reports IOPS; too large a queue inflates latency to tens of milliseconds and can mask rare stalls.

Before running, record:

  • drive type and interface (SATA/SAS/NVMe), capacity, firmware
  • controller mode (RAID/HBA), write-back/write-through, presence of BBU/supercap
  • test level: file on FS or block device, how things were formatted
  • load parameters: block size, QD, numjobs, warm-up duration
  • background: RAID rebuild, scrub, updates, telemetry
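One way to make this record hard to skip is to store it as structured data next to the raw logs. The sketch below is only an illustration: the class name `RunConditions` and all field names are assumptions, not part of fio or DiskSpd.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RunConditions:
    """Conditions captured before a run (illustrative field names)."""
    drive_model: str
    interface: str         # "SATA", "SAS" or "NVMe"
    capacity_gb: int
    firmware: str
    controller_mode: str   # "RAID" or "HBA"
    write_policy: str      # "write-back" or "write-through"
    cache_protected: bool  # BBU or supercapacitor present
    test_level: str        # "block device" or "file on FS"
    block_size: str
    queue_depth: int
    numjobs: int
    warmup_minutes: int
    background_jobs: str   # e.g. "none" or "RAID rebuild"


def conditions_to_json(cond: RunConditions) -> str:
    """Serialize the record so it can be saved alongside the test logs."""
    return json.dumps(asdict(cond), indent=2, sort_keys=True)
```

Saving one such file per run means two results can later be compared field by field before anyone argues about the numbers.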

Don’t confuse measurement layers. The same drive can look different:

  • on the physical disk (minimum "magic")
  • on the RAID/array volume (cache and policy effects)
  • on an OS LUN/volume (drivers and scheduler)
  • inside a guest VM (virtual queues and hypervisor limits)

If you accept servers and storage as an integrator, a practical approach is to run tests at two levels: “hardware” (to find defects) and “as it will run” (to confirm expected production performance).

Testbed preparation and safety rules

First decide what you’re testing: a single drive, a RAID group or the whole storage path (drive - controller - driver - OS - filesystem). Mixing levels makes results wander and defects easy to miss.

Tools and system observation

Minimum toolkit: fio for Linux, DiskSpd for Windows, and monitoring. During a run watch not only throughput but also signs of trouble: rising latencies, controller errors, overheating, reallocated sectors.

For monitoring, standard OS tools and controller/disk utilities are usually enough: SMART/NVMe log, event logs, CPU load, disk queue length, temperature, I/O errors. Enable log collection beforehand so you can tie a test failure to a specific minute.

To avoid confusion in hardware, agree on naming before the first run. A convenient format: server - slot/bay - serial number - disk role (OS, DATA, LOG, CACHE). Example: SRV-DB01 BAY03 S/N XXXXX DATA.

Also fix the environment in advance, otherwise there’s nothing to compare: OS version and power plan, driver and firmware versions (controller, disk), RAID mode and stripe size, cache policy, filesystem and block size (if testing at volume level).

Warm-up and stabilization (precondition)

New SSDs often show “showroom” figures while free and cold. Acceptance needs warm-up: fill and load the drive until latencies stabilize. For most SSDs a reasonable minimum is 30–60 minutes of heavy writes (or 1–2 full fills), then a 5–10 minute pause and only then measurements.
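“Until latencies stabilize” is easier to apply when it is written down as a rule. The sketch below assumes you sample p99 latency once per minute during warm-up; the window of 5 samples and the 10% tolerance are illustrative assumptions, not vendor guidance.

```python
def is_stable(p99_per_minute, window=5, tolerance=0.10):
    """Return True when the last `window` samples all lie within
    +/- `tolerance` (as a fraction) of their own mean."""
    if len(p99_per_minute) < window:
        return False  # not enough data to judge yet
    recent = p99_per_minute[-window:]
    mean = sum(recent) / window
    if mean <= 0:
        return False
    return all(abs(x - mean) <= tolerance * mean for x in recent)
```

Warm-up can end once `is_stable(...)` returns True; if it never does within the planned time, that instability is itself worth recording.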

Main safety rule: test only on an empty disk or a dedicated test volume. One wrong parameter (wrong disk, wrong path) and data is lost.

Set stop-rules before starting:

  • don’t run tests on production volumes or LUNs with important data
  • double-check what will be overwritten (path/letter/disk) before starting
  • disable background jobs that skew results (backups, antivirus scan)
  • monitor temperature and cooling (especially NVMe in dense trays)
  • if I/O errors occur, stop the run and save logs
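The “double-check what will be overwritten” rule can be partly automated on Linux. The sketch below is a minimal guard, assuming the text it receives is in the `/proc/mounts` format (source, mount point, filesystem type, and so on). The function name is hypothetical, and the check does not cover LVM, software RAID members or swap, so it supplements a manual check rather than replacing it.

```python
def mounted_targets(device, mounts_text):
    """Return mount points that use `device` or one of its partitions.

    `mounts_text` is expected in /proc/mounts format:
    "<source> <mount point> <fstype> ...", one entry per line.
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        source, target = fields[0], fields[1]
        suffix = source[len(device):] if source.startswith(device) else None
        # "/dev/sdb" matches "/dev/sdb1"; "/dev/nvme0n1" matches "/dev/nvme0n1p1".
        # The match is deliberately conservative and may also flag a device
        # whose name merely extends this one with digits.
        if suffix == "" or (suffix and suffix.lstrip("p").isdigit()):
            hits.append(target)
    return hits
```

A wrapper script could call `mounted_targets("/dev/sdb", open("/proc/mounts").read())` and refuse to start a write test if the list is not empty.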

This way the testbed becomes predictable and results comparable across batches and servers.

Step-by-step scheme for fio and DiskSpd acceptance tests

Stick to the same scheme so you can compare results between batches, models and testbeds.

1) Choose the scenario and goal

Decide the level: single drive, RAID group, SAN volume or hypervisor datastore. Then pick a profile close to real services: DBs (small blocks and low latency), file service (mixed blocks and sequential behavior), virtualization (mixed read/write and high parallelism).

2) Warm-up and basic stability check

Before measurements do a warm-up: write across the test file/volume and run a short job to let speed plateau. If the graph jumps and latency shows a “saw” pattern, that’s already a signal.

Follow simple rules:

  • run several block sizes (4K, 8K, 64K, 1M) and several queue depths (e.g., QD 1, 4, 16, 32) to see single-thread and peak behavior
  • record not only IOPS/MB/s but latency tails (p95, p99, p99.9)
  • watch for rare pauses of hundreds of milliseconds: those kill DBs and VDI even with good averages
  • save artifacts: fio/DiskSpd logs, testbed configuration, SMART and system logs
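To keep the block-size and queue-depth matrix identical between batches, it helps to generate the commands rather than type them. The sketch below only builds fio argument lists and does not run anything; the job naming scheme, the 70/30 mix and `--numjobs=1` are assumptions chosen for illustration.

```python
from itertools import product


def fio_matrix(target, block_sizes=("4k", "8k", "64k", "1M"),
               depths=(1, 4, 16, 32), runtime_s=180, size="50G"):
    """Build one fio argument list per block size / queue depth pair."""
    commands = []
    for bs, qd in product(block_sizes, depths):
        name = f"rand_{bs}_qd{qd}"  # hypothetical naming scheme
        commands.append([
            "fio", f"--name={name}", f"--filename={target}", f"--size={size}",
            "--direct=1", "--ioengine=libaio",
            "--rw=randrw", "--rwmixread=70",
            f"--bs={bs}", f"--iodepth={qd}", "--numjobs=1",
            "--time_based", f"--runtime={runtime_s}", "--group_reporting",
            f"--output={name}.json", "--output-format=json",
        ])
    return commands
```

Each list can be passed to `subprocess.run(cmd, check=True)` on the test host; with the defaults this yields 16 runs of 3 minutes each, so plan the time budget accordingly.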

Repeatability is mandatory: do 2–3 identical runs and decide based on the median, not the best result.
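A minimal sketch of the median rule, assuming each run has already been reduced to a dict with `iops` and `p99_ms` keys (these key names are illustrative):

```python
from statistics import median


def summarize_runs(runs):
    """Reduce identical runs to medians plus the relative IOPS spread."""
    iops = [r["iops"] for r in runs]
    p99 = [r["p99_ms"] for r in runs]
    mid_iops = median(iops)
    return {
        "iops": mid_iops,
        "p99_ms": median(p99),
        # (max - min) / median: a large value means the runs do not repeat
        "iops_spread": (max(iops) - min(iops)) / mid_iops,
    }
```

A large `iops_spread` is a reason to look at test conditions before drawing any conclusion about the drive.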

Example minimal commands (substitute your path and size):

fio --name=rand4k --filename=/mnt/testfile --size=50G --direct=1 --ioengine=libaio \
 --rw=randrw --rwmixread=70 --bs=4k --iodepth=16 --numjobs=4 \
 --time_based --runtime=180 --group_reporting \
 --output=fio_rand4k_mix.json --output-format=json

DiskSpd.exe -c50G -b4K -d180 -W30 -o16 -t4 -r -w30 -Sh -L C:\test\testfile.dat
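The JSON file written by the fio command above can be reduced to the few numbers that go into the report. The sketch below assumes the layout produced by recent fio 3.x releases, where each job has `read` and `write` sections with `iops` and a `clat_ns.percentile` map keyed by strings such as `"99.000000"`. Check this against the output of your fio version before relying on it.

```python
import json


def fio_summary(report_text, percentile="99.000000"):
    """Extract IOPS and one completion-latency percentile (in ms)
    per direction from the first job of a fio JSON report."""
    job = json.loads(report_text)["jobs"][0]
    summary = {}
    for direction in ("read", "write"):
        stats = job[direction]
        # Assumed layout: clat_ns.percentile maps "99.000000" -> nanoseconds
        tail_ns = stats.get("clat_ns", {}).get("percentile", {}).get(percentile)
        summary[direction] = {
            "iops": stats["iops"],
            "p_ms": None if tail_ns is None else tail_ns / 1e6,
        }
    return summary
```

With `--group_reporting` the first job entry holds the aggregate for the whole group, which is why the sketch reads only `jobs[0]`.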

Which metrics to treat as acceptance criteria: what matters and why

In acceptance, predictability matters more than records. A drive can show high average results but occasionally stall for seconds. That’s what later becomes DB hangs, slow VMs and queues on a file server.

Performance: IOPS, latency and MB/s

IOPS are useful for small blocks (4K–16K) and random access, but don’t look only at a run average. Check stability over time: sharp drops at the same load often point to throttling, caching, firmware or a defect.

Latency is almost always more important than IOPS. The average hides rare but painful latencies, so record p95/p99/p99.9 in your report. For virtualization and DBs p99 and p99.9 usually explain user complaints.
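A small synthetic example shows how the average hides stalls. The latency values below are invented for illustration, and the percentile uses the simple nearest-rank method, which may differ slightly from how fio or DiskSpd interpolate.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic data in ms: 9,980 fast operations and 20 stalls of 300 ms.
latencies = [0.2] * 9980 + [300.0] * 20
mean_ms = sum(latencies) / len(latencies)   # stays below 1 ms
p99_ms = percentile(latencies, 99)          # still a fast operation
p999_ms = percentile(latencies, 99.9)       # lands on a stall
```

Here the stalls affect only 0.2% of operations, so both the mean and p99 look healthy and only p99.9 exposes them. That is the reason to record more than one percentile.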

Throughput (MB/s) matters for sequential flows and large blocks (128K–1M): backups, exports, scanning big files. If throughput “steps down” under constant load, that often indicates SLC cache exhaustion, overheating or controller issues.

Reliability: errors and temperature

Any errors trump “slightly lower IOPS”. Note read/write errors, timeouts, resets, media errors, and for SSDs changes in SMART error counters. Even single timeouts under load can later cause RAID, filesystem or hypervisor problems.

NVMe overheating often shows as steps: normal IOPS and low latency initially, then rising latency and falling throughput at the same queue depth and profile. Record the context in numbers: temperature sensor readings and cooling conditions.

Useful attachments for the report:

  • fio/DiskSpd logs (preferably with percentiles and time breakdown)
  • SMART before and after
  • OS events about resets/timeouts (if any)
  • NVMe temperature during the run
  • driver and controller firmware versions, power mode

Typical test profiles for databases

Databases usually have two loads: small random operations (OLTP) and large sequential reads (analytics). Transaction logs require steady latency and no rare long pauses. So run several short precise profiles rather than one “catch-all”.

A quick set for batch comparison:

  • 4K random read: QD 1 and QD 32, 1–4 threads (watch IOPS and p99)
  • 4K random write: QD 1 and QD 32, 1–4 threads (watch p99 and "steps")
  • 70/30 random (read/write): 8K block, QD 8–32, 4–8 threads (typical OLTP)
  • mixed 16K: 50/50 or 70/30, QD 8–16 (when some requests are larger than 8K)

For analytics check sequential read 64K–1M with QD 4–16 and ensure throughput holds steady. For transaction logs run a separate test: 4K–16K sync write, small queue (QD 1–4), 1 thread. Here predictability matters more than peak.

Typical test profiles for file services

File services (SMB/NFS) load drives differently than DBs: more sequential reads of large files, office mixes and bursts of small-file activity. Test these modes separately.

Basic profiles

  1. Sequential read of large files (backups, media, images):
fio --name=seqread --filename=/mnt/testfile --size=200G --direct=1 --ioengine=libaio \
  --rw=read --bs=1M --iodepth=32 --numjobs=4 --runtime=180 --time_based \
  --group_reporting
  2. “Office” mix: random 64K with read predominance (catches cases where throughput is fine but latencies jump):
fio --name=mix64k --filename=/mnt/testfile --size=100G --direct=1 --ioengine=libaio \
  --rw=randrw --rwmixread=80 --bs=64K --iodepth=16 --numjobs=4 \
  --runtime=300 --time_based --group_reporting
  3. Many small files: here IOPS and latency tails matter, so raise parallelism (numjobs/iodepth) and always record p95/p99.

On Windows you can reproduce the same ideas with DiskSpd, for example for 64K 80/20:

DiskSpd.exe -c100G -d300 -Sh -L -o16 -t4 -b64K -r -w20 C:\testfile.dat

A useful contention check: run a sequential read (1M) simultaneously with a random write (4K–16K). This shows whether reads suffer while writes are in progress.

Note on SMB/NFS

Test disk/volume on the server (fio/DiskSpd) first, then test SMB/NFS from a client. Otherwise you won’t know where the bottleneck is: disk, CPU, network, protocol settings, antivirus or encryption.

Typical test profiles for virtualization and VDI

Virtualization is rarely pure read or write. It’s usually a mix of small random ops where stable latency matters more than peak IOPS.

Profile 1: mixed load “general virtualization”

70/30 or 60/40 read/write, 8K block (4K–16K acceptable), high parallelism. The example targets a raw block device and overwrites its contents:

fio --name=virt-mix --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
 --rw=randrw --rwmixread=70 --bs=8k --numjobs=8 --iodepth=32 \
 --time_based=1 --runtime=300 --group_reporting=1 --lat_percentiles=1

Then run a short queue-depth “staircase”: iodepth 1, 4, 8, 16, 32. If p99 latency jumps in steps, that often reveals firmware, controller or thermal issues.
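One way to make “jumps in steps” measurable is to compare latency growth with queue-depth growth. Once a device is saturated, latency is expected to grow roughly in proportion to queue depth, so growth far beyond that proportion deserves a look. The sketch below encodes this heuristic; the 1.5 factor and the function name are assumptions to tune on a known-good drive.

```python
def find_steps(p99_by_depth, factor=1.5):
    """Flag queue-depth transitions where p99 grows much faster than
    the queue depth itself.

    p99_by_depth: dict {iodepth: p99 latency in ms}
    Returns a list of (previous_depth, depth, normalized_growth).
    """
    depths = sorted(p99_by_depth)
    steps = []
    for prev, cur in zip(depths, depths[1:]):
        latency_growth = p99_by_depth[cur] / p99_by_depth[prev]
        depth_growth = cur / prev
        normalized = latency_growth / depth_growth
        if normalized > factor:
            steps.append((prev, cur, normalized))
    return steps
```

For example, if p99 goes from 0.6 ms at iodepth 16 to 6 ms at iodepth 32, latency grew 10x while the queue only doubled, and the transition is flagged.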

Profile 2: VDI (tight p99 requirement)

80/20 read/write, 4K, many threads but not extreme iodepth:

fio --name=vdi-boot --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
 --rw=randrw --rwmixread=80 --bs=4k --numjobs=16 --iodepth=16 \
 --time_based=1 --runtime=300 --group_reporting=1 --lat_percentiles=1

If p99 drifts, users will experience login or profile-open stutters even when average latency looks normal.

Profile 3: dense consolidation (many VMs)

Increase numjobs (e.g., 24–48), but monitor CPU load and queue lengths on the test machine: if the CPU saturates, you are measuring the host rather than the storage.

If you have a RAID/HBA with cache, run the same profile twice: write-back and write-through. A big difference in latency tails indicates production behavior will depend heavily on cache settings and presence of BBU/supercap.

Acceptance thresholds: how to set values that catch defects

Thresholds are not for pretty reports but to reliably filter hidden issues: unstable latency, overheating, controller errors, drops under sustained load.

Which values to control

Don’t rely on averages. For most roles tails and stability matter more:

  • IOPS and MB/s per profile (and repeatability across runs)
  • p99, and for critical roles p99.9 (separately for read and write)
  • time-based drops relative to median (e.g., by minute)
  • zero tolerance for I/O errors, timeouts, resets, media errors
  • temperature and signs of throttling
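The “drops relative to median” item can be checked mechanically once you have a per-minute IOPS series, for example from fio started with `--write_iops_log` and `--log_avg_msec=60000`. A minimal sketch, with the 30% limit as an assumed default:

```python
from statistics import median


def find_drops(iops_per_minute, max_drop=0.30):
    """Return (minute_index, iops) for every minute that falls more
    than `max_drop` (as a fraction) below the median of the run."""
    baseline = median(iops_per_minute)
    floor = baseline * (1 - max_drop)
    return [(i, v) for i, v in enumerate(iops_per_minute) if v < floor]
```

Regular spacing between the returned indices (for example every fourth minute) hints at a periodic background process rather than random noise.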

Latency thresholds often catch defects better than IOPS thresholds. A drive can hold performance figures but exhibit rare long stalls.

How to set thresholds: reference and role-based

A safer approach is two-layer thresholds: from a reference drive (or validated batch) and from role requirements.

  • Reference: run 1–2 validated drives of the same model with the same profiles, then allow a deviation (e.g., no worse than 10–15% on median and no worse than 20–30% on p99).
  • Batch: compare drives within the batch. An instance consistently worse on p99 or stability is suspicious even if it formally passes.
  • Role: DB thresholds on p99/p99.9 are stricter; file services care more about steady throughput in sequential tests.
  • Stability: recurring drops (e.g., every 3–5 minutes IOPS fall by 30%+) warrant investigation before commissioning.
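The reference layer can be expressed as a small check. The sketch below assumes each drive and the reference are summarized as dicts with `iops`, `p99_ms` and, for the drive, `io_errors`; the key names are illustrative, and the defaults of 15% and 30% are taken from the upper ends of the example ranges above.

```python
def check_against_reference(drive, reference,
                            max_iops_loss=0.15, max_p99_growth=0.30):
    """Return a list of failure reasons; an empty list means pass."""
    failures = []
    if drive["io_errors"] > 0:
        failures.append("I/O errors recorded (zero tolerance)")
    if drive["iops"] < reference["iops"] * (1 - max_iops_loss):
        failures.append("median IOPS below reference threshold")
    if drive["p99_ms"] > reference["p99_ms"] * (1 + max_p99_growth):
        failures.append("p99 latency above reference threshold")
    return failures
```

Returning reasons instead of a single boolean keeps the report explainable when a supplier disputes a rejection.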

For temperature consider the combo: rising temperature plus falling IOPS or rising p99. If repeated under the same conditions, mark as throttling and include in thresholds (or fix cooling if production conditions differ).

Common mistakes and traps in acceptance

Even good tests are easy to spoil by conditions. Then a “bad” drive looks normal or you reject a working one.

Most common causes of false results:

  • test file too small and fits in cache (OS, controller, SSD SLC)
  • test runs too short and doesn’t show garbage collection, warm-up, SLC-cache or throttling
  • volume busy with background jobs (antivirus, indexing, backups, replication)
  • unit confusion (MB/s vs MiB/s) and meaningless IOPS without block size
  • comparing non-comparable setups: different drivers/firmware, different RAID mode, different cache, different QD and threads
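The unit trap is plain arithmetic: throughput equals IOPS multiplied by block size, and the result differs by about 5% depending on whether it is divided by 10^6 (MB) or 2^20 (MiB). A short sketch:

```python
def throughput(iops, block_size_bytes):
    """Convert IOPS at a given block size to (MB/s, MiB/s)."""
    bytes_per_second = iops * block_size_bytes
    mb_s = bytes_per_second / 1_000_000       # decimal megabytes
    mib_s = bytes_per_second / (1024 * 1024)  # binary mebibytes
    return mb_s, mib_s


# 100,000 IOPS at 4 KiB blocks: 409.6 MB/s, which is 390.625 MiB/s
mb_s, mib_s = throughput(100_000, 4096)
```

Stating the block size and the unit next to every figure in the report removes this ambiguity.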

A separate trap is blaming a specific drive when measuring an array. Degradation can be in the controller, cable, slot, backplane or power. A practical example: two identical servers show different 4K write results. Often one has write-back with BBU and a different RAID driver, while the other has write cache disabled after a firmware update.

One housekeeping tip that saves time: record drive serial numbers and firmware versions, plus the controller model and firmware. Without that you can’t repeat the test or quickly localize a bad unit when drives are moved between servers or sites.

Checklist, mini-report template and a short example

To make results comparable and defensible with a supplier, keep a short checklist and a simple report template.

Pre-run checklist

  • Record configuration: RAID/HBA, BIOS/firmware versions, controller mode (write-back/write-through), cache policies.
  • Ensure you test the correct level: drive, array or LUN, and that no rebuild/scrub is running.
  • Use the same test-file size in every run (often 1.5–2x RAM so the host cache cannot hold it).
  • Warm up with the chosen profile before measuring.
  • Check cooling and power: disk and controller temperatures should be normal before and during the run.

Mini report template

The report should let anyone reproduce the test and quickly find a “bad” unit:

  • Identification: model, capacity, type (HDD/SATA SSD/NVMe), serial number, slot/port, date, batch.
  • Environment: server, OS, driver, fio/DiskSpd version, test parameters (block, queue depth, threads, time, file size).
  • Storage settings: RAID level, stripe size, controller cache, filesystem, mount options.
  • Results: IOPS, MB/s, average latency and p95/p99/p99.9, time stability, temperatures.
  • Conclusion: pass/conditional/fail and what was rechecked.

Attach raw logs (fio/DiskSpd text/JSON), timestamped metric snapshots (at least from monitoring), and system logs for disk/controller errors.

Practical example: in a rack server, among 12 identical SSDs one shows normal IOPS but p99 latency 3–5x higher with visible spikes every 30–60 seconds. First repeat the run in the same slot. Then move the drive to another slot to rule out port/cable/backplane. If the problem moves with the drive, mark it defective and request replacement.
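A batch comparison like this can be automated with a simple rule: flag any drive whose p99 exceeds a multiple of the batch median. The sketch below assumes a dict of p99 values keyed by serial number; the serials in the usage example are hypothetical and the 3x factor mirrors the lower bound of the example above.

```python
from statistics import median


def batch_outliers(p99_by_serial, factor=3.0):
    """Return serial numbers whose p99 latency exceeds
    `factor` times the median p99 of the batch."""
    baseline = median(p99_by_serial.values())
    return sorted(
        serial for serial, p99 in p99_by_serial.items()
        if p99 > factor * baseline
    )


# Hypothetical batch: 11 drives around 1 ms, one at 4.2 ms
batch = {f"SN{i:02d}": 1.0 + 0.01 * i for i in range(11)}
batch["SN11"] = 4.2
suspects = batch_outliers(batch)
```

Using the median as the baseline keeps one bad drive from shifting the reference point, which a mean would not guarantee in a small batch.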

If you accept servers and storage as part of deployments, agree on a unified report template, thresholds and recheck procedure in advance. In projects where GSE.kz (gse.kz) handles delivery and integration, such standardization helps: fewer disputes, faster diagnostics and clearer operation after launch.

FAQ

Why run acceptance tests on disks if the drives are new?

Disk acceptance is a short set of load tests run immediately after delivery or replacement, before commissioning. It helps catch unstable latencies, overheating, I/O errors and batch non-uniformity early—issues that later become service “hangs” and lengthy incident investigations in production.

Why can’t we just trust the vendor IOPS and MB/s numbers?

Published specs are usually measured in convenient lab scenarios and don’t reflect your mix of block sizes, queues, threads, RAID caching and SSD background activity. In practice, predictability matters more than peak speed: stable latency without rare pauses and no time-based degradation.

Which metrics matter in acceptance besides throughput?

Measure IOPS and MB/s for the relevant profiles, but always include latency percentiles p95/p99/p99.9 and stability over time. If a drive occasionally “freezes” for hundreds of milliseconds, that will hurt databases and virtualization even when averages look fine.

What typical supply and compatibility issues does acceptance catch best?

Acceptance most often uncovers mixed firmware or revisions inside a batch, hidden defects, degradation after storage, overheating in dense trays, timeouts/resets under load, and mis-sorted models/capacities. Also common: drives that look OK in SMART but fail on latency tails under load.

How to start so test results are comparable and fair?

Start by locking down conditions: model and firmware, controller RAID/HBA mode, write cache policy, test level (block device or file), and background processes. Then do a warm-up and only measure after results stabilize—otherwise you compare showroom numbers, not real behavior.

Is it better to test at the block device or through the filesystem?

If you want to find defective drives and compare them, block-device testing is usually more honest and closer to the hardware. If you want to know how a specific service will behave, test at the same level as production, including RAID and filesystem. Often we do both: hardware level first, then “how it will run in prod”.

How do the RAID controller and its cache distort results, and what can you do about it?

Write-back can show attractive numbers while writes are in the controller cache and not yet on media, which can be misleading. Write-through is usually slower but reflects real writes to disks. Record cache mode and whether cache protection exists (BBU/supercapacitor) because behavior under load will differ without it.

Which fio/DiskSpd profiles to choose if time is limited?

At minimum, warm up the drive with heavy writes until latencies stabilize, then run a few block-size and queue-depth combinations to expose single-thread and peak behavior. A basic acceptance set is 4K random profiles, one mixed workload and one large-block sequential test—judge by repeatability and latency tails.

How to tell a single bad disk in a batch when numbers look similar?

A defective drive often stands out by consistently worse p99/p99.9 or by periodic time-based failures under identical load, even if average IOPS look acceptable. Compare to a known-good reference of the same model and compare drives inside the batch rather than chasing an “absolute” number.

How to run acceptance safely without destroying data or missing errors?

Test only empty drives or a dedicated test volume—one wrong path and you can lose data. If you see I/O errors, timeouts or resets, stop the run and save logs, SMART/NVMe logs and system events to tie the failure to specific hardware and time. Also monitor temperature, especially for NVMe in dense trays, since throttling can look like random drops.
