Tuning BIOS for Database Servers Without Risk
BIOS tuning for database servers: which NUMA, SMT and C-states options to test, how to measure effects and keep a rollback-safe log without losing stability.

Why touch BIOS for a DBMS and what are the risks
BIOS affects how a server distributes work across CPUs and memory, how it enters power-saving states and how quickly it wakes up. For a database this is often about predictability: the same load should produce the same numbers. Tuning BIOS for a DBMS is usually justified when you hit latency limits or see "uneven" performance while the OS and database are already reasonably tuned.
Common BIOS-related symptoms include: queries that are sometimes fast and sometimes suddenly slow under the same load; unexplained drops in TPS or QPS; rare spikes in response time at night or during partial idle periods; differing results for identical tests run at different times; "jumping" CPU utilization (40% one moment, 90% the next) under similar workloads.
There is no universally "correct" configuration. OLTP (short transactions and high concurrency) is usually more sensitive to latency and thread scheduling. NUMA, SMT (Hyper-Threading) and C-states matter more there. Analytics and batch jobs often benefit from maximum throughput and steady CPU load, and the same BIOS settings can have the opposite effect.
Risks are real even if changes look innocuous. You can get rare failures that appear a week later. Or a "hidden degradation": average speed rises, but tail latencies (p95/p99) worsen. Another common problem is rollback complexity if you changed several settings at once and didn't record the original state.
Success here is not a single best run. You need stability and repeatability: the same load should yield the same metrics, and effects should be judged not only by average TPS but also by latencies, the behavior of background tasks and how the system recovers after idle.
Preparation: baseline and safe rollback plan
Before changing NUMA, SMT or C-states, record a starting point. Otherwise you won't know if things improved and won't be able to return quickly to a working configuration.
First collect baseline hardware and firmware data. This matters: identical settings can behave differently across CPU generations and microcode versions.
At minimum, record in one document:
- server model and components: CPU (sockets, cores), RAM size and layout (channels, frequencies), disks/NVMe, network cards;
- BIOS/UEFI, BMC and CPU microcode versions (if visible), boot mode (UEFI/Legacy);
- server role and DB workload type (OLTP/analytics), constraints (latency SLA, maintenance window);
- current BIOS settings (screenshots or exported profile if supported);
- OS parameters that may affect tests: power profile, scheduler settings, huge pages, pinning/affinity, kernel and driver versions.
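Part of this baseline can be captured automatically. The sketch below assumes a Linux host where firmware data is exposed under /sys/class/dmi/id; the function names and the "notes" field are illustrative, and anything the script cannot read stays empty rather than being guessed. BIOS settings themselves are not visible this way and still need an export or screenshots.

```python
import json
import platform
from pathlib import Path

# Linux-specific DMI files; on other systems they will simply be missing.
DMI_FIELDS = {
    "server_model": "product_name",
    "bios_vendor": "bios_vendor",
    "bios_version": "bios_version",
    "bios_date": "bios_date",
}


def read_text(path):
    """Return stripped file contents, or None if the file cannot be read."""
    try:
        return Path(path).read_text().strip()
    except (OSError, ValueError):
        return None


def collect_baseline(dmi_dir="/sys/class/dmi/id", notes=None):
    """Gather a baseline record; unknown values stay None instead of being guessed."""
    record = {
        "hostname": platform.node(),
        "kernel": platform.release(),
        "machine": platform.machine(),
        # /sys/firmware/efi exists only when Linux was booted through UEFI.
        "boot_mode": "UEFI" if Path("/sys/firmware/efi").exists() else "Legacy or unknown",
        # Free-form fields filled in by hand: role, workload type, SLA, and so on.
        "notes": notes or {},
    }
    for key, filename in DMI_FIELDS.items():
        record[key] = read_text(Path(dmi_dir) / filename)
    return record


def save_baseline(record, path):
    """Write the record as sorted, indented JSON so two snapshots can be diffed."""
    Path(path).write_text(json.dumps(record, indent=2, sort_keys=True))
```

Calling collect_baseline(notes={"role": "OLTP"}) before the first change and again after each firmware update gives two JSON files that can be compared line by line.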
Next, schedule a work window. Any BIOS edit requires a reboot, and sometimes a full power cycle to apply settings correctly. Allow time for 2–3 reboots and verification.
Prepare a rollback plan before the first change. Best is to save the current BIOS profile to a file or profile slot. If that's not available, take screenshots of relevant sections and write down the values you will change.
Predefine stop criteria to avoid pushing the system into instability:
- system log errors (Machine Check, WHEA), unexpected reboots, rise in corrected errors;
- hangs, I/O timeouts, anomalous pauses in DB responses;
- deterioration of key metrics beyond a set threshold (e.g., p95 latency);
- inability to reproduce results in two consecutive runs.
If any of these appear, roll back to the baseline profile and investigate the cause separately. Do not keep experimenting in the hope that the problem goes away on its own.
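Stop criteria are easier to follow when they are written down as an explicit check. The following is a minimal sketch: the RunCheck fields and the 10% and 15% thresholds are assumptions chosen for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class RunCheck:
    p95_ms: float               # p95 latency measured in this run
    hw_errors: int = 0          # Machine Check / WHEA entries seen during the run
    corrected_errors: int = 0   # corrected error count read after the run
    unexpected_reboots: int = 0
    io_timeouts: int = 0


def stop_reasons(baseline, runs, max_p95_regression=0.10, max_run_spread=0.15):
    """Return a list of reasons to roll back; an empty list means 'continue'."""
    reasons = []
    for i, run in enumerate(runs, start=1):
        if run.hw_errors or run.unexpected_reboots:
            reasons.append(f"run {i}: hardware errors or unexpected reboot")
        if run.corrected_errors > baseline.corrected_errors:
            reasons.append(f"run {i}: corrected errors rose")
        if run.io_timeouts:
            reasons.append(f"run {i}: I/O timeouts")
        if run.p95_ms > baseline.p95_ms * (1 + max_p95_regression):
            reasons.append(f"run {i}: p95 regressed beyond threshold")
    # Two consecutive runs that disagree strongly count as 'not reproducible'.
    for a, b in zip(runs, runs[1:]):
        if abs(a.p95_ms - b.p95_ms) > max_run_spread * min(a.p95_ms, b.p95_ms):
            reasons.append("consecutive runs are not reproducible")
            break
    return reasons
```

Any non-empty result is a signal to restore the baseline profile before doing anything else.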
Step-by-step testing methodology without risking production
Careful BIOS tuning starts with a goal. Choose one target that matters for your DB: lower p95 latency, higher TPS, or the elimination of rare long pauses.
Then capture a baseline. Run the same workload for the same duration and dataset, with identical OS and DB settings. If testing on a bench, make hardware and firmware as close to production as possible.
Change one BIOS parameter at a time and always reboot the server. This separates the effects of NUMA, SMT (Hyper-Threading) or C-states from random factors. If you must change several items, save them as a separate profile and leave that for later — first analyze each change's contribution.
To avoid false effects, repeat the same run 3–5 times and look at the median and spread, not the single best result. If p95 sometimes falls and sometimes rises, the problem is usually test instability, not the setting.
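Median and spread over repeated runs can be computed with a few lines of Python. This is a sketch under simple assumptions: each list holds one metric (for example p95 in milliseconds) from identical runs, and a change counts as clear only when the two sets of runs do not overlap, which is a deliberately strict illustrative rule.

```python
import statistics


def summarize(values):
    """Median plus spread for one metric across repeated runs."""
    if len(values) < 3:
        raise ValueError("need at least 3 runs to judge repeatability")
    med = statistics.median(values)
    spread = max(values) - min(values)
    return {
        "median": med,
        "min": min(values),
        "max": max(values),
        "spread_pct": 100.0 * spread / med,
    }


def is_clear_change(baseline, candidate):
    """True only if the candidate runs lie entirely above or below the baseline runs."""
    return max(candidate) < min(baseline) or min(candidate) > max(baseline)
```

If is_clear_change returns False, the ranges overlap and the difference may be noise, so more runs or a more stable test are needed before judging the setting.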
Decide in advance what you consider an improvement. A small TPS gain may not be worth it if p95 worsens or power consumption rises. If the effect is ambiguous, revert the change and rerun the baseline to check repeatability.
Record the context of each test so comparisons are fair:
- BIOS/UEFI versions, CPU microcode, RAID/NVMe and BMC firmware;
- exact parameter value (e.g., SMT: On/Off, C-states: Enabled/Disabled);
- power profile, clock/frequency behavior, ambient conditions and fan speeds;
- OS and DBMS versions and key settings (memory pools, parallelism);
- metrics: TPS, p50/p95/p99 latency, CPU steal/iowait, cache misses if available.
The final step is a long stability run. Once you pick the best profile, run the workload for hours, ideally a day, and check logs for errors, WHEA entries, performance degradation and overheating. In production, apply changes only during a maintenance window and with a clear rollback path (saved BIOS profile and tested recovery steps).
NUMA: what to test and how to tell it helped
NUMA in simple terms means the server has several "islands" of memory local to certain CPU cores rather than one shared pool. Access to local memory is usually faster than to remote node memory. For DBMS this matters: the DB constantly reads and writes RAM, and extra micro-latencies to remote memory quickly reduce performance.
Which BIOS NUMA options make sense to test depends on the platform, but common items are:
- NUMA enabled or disabled (sometimes exposed as Node Interleaving, where Enabled makes the system behave more like UMA);
- memory distribution modes between nodes (interleaving, local binding, auto);
- number of visible NUMA nodes (rare, but some platforms allow combining or splitting nodes);
- CPU topology settings (for example, NPS on some platforms).
NUMA effects are more visible on many-core systems with lots of RAM and high thread contention: when background DB tasks, reporting queries and replication run simultaneously. On a dual-socket OLTP server, the DB may miss caches and "walk" remote memory if the OS scheduler and memory policy don't match the real topology.
Test carefully: use one set of runs (with repeats) per parameter, and compare not just averages but tail latencies.
To tell whether it got better, watch three things: latencies (p95/p99), throughput (transactions or queries per second) and the fraction of remote memory accesses. The OS usually exposes how much memory and which threads sit on each NUMA node, so watch whether "remote" usage grows under load.
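On Linux, one rough proxy for remote usage is the per-node numastat file under /sys/devices/system/node. The sketch below assumes that layout and uses the local_node and other_node counters. These count page allocations rather than individual memory accesses, so treat the result as an indicator, not an exact measurement.

```python
from pathlib import Path


def parse_numastat(text):
    """Parse 'name value' lines from a per-node numastat file into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def remote_share(before, after):
    """Share of a node's allocations in the interval made by tasks on other nodes."""
    local = after["local_node"] - before["local_node"]
    other = after["other_node"] - before["other_node"]
    total = local + other
    return other / total if total else 0.0


def read_all_nodes(root="/sys/devices/system/node"):
    """Return {node_name: counters} for every node directory (empty if none)."""
    base = Path(root)
    if not base.is_dir():
        return {}
    result = {}
    for node in sorted(base.glob("node[0-9]*")):
        try:
            result[node.name] = parse_numastat((node / "numastat").read_text())
        except OSError:
            continue
    return result
```

Take one snapshot with read_all_nodes() before the run and one after, then call remote_share per node. A share that grows after a BIOS change suggests the scheduler or memory policy no longer matches the topology.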
Log results so you can reproduce them later:
- server model and BIOS version;
- number of NUMA nodes and whether interleaving is enabled;
- process and memory binding settings at the OS level (if used);
- key DB and OS metrics before and after;
- test description: duration, dataset size, concurrency.
SMT (Hyper-Threading): when to turn it off and how to verify effect
SMT adds logical threads per physical core. It often increases overall CPU utilization, but does not always speed up real DB queries. Evaluate latency and predictability, not just average throughput.
SMT often hurts when the DB is latency-sensitive and CPU-bound: contention for caches and execution units increases, extra context switches occur, and heat output can rise until throttling kicks in. A typical pattern is steady 80–95% CPU with worsening p95/p99 latency.
SMT often helps with mixed workloads that include lots of I/O wait or many lightweight parallel queries. Logical threads can fill idle slots in a core, raising throughput without a noticeable hit to tail latencies.
Test SMT in paired runs: SMT On vs SMT Off with the same dataset and software version. Do not "tune" load levels: keep the same number of active sessions and connection limits, otherwise you are just changing CPU pressure.
Practical test order:
- do a short warm-up, then several identical runs (at least 3) per mode;
- record CPU frequencies and temperatures to rule out hidden throttling;
- do not change NUMA, OS power profile or DB parameters in parallel.
Watch not only TPS/QPS but also p95/p99 latency for key queries, per-core CPU utilization, growth in context switches, and frequency stability (Turbo may behave differently).
Example: on a GSE S200-class server under OLTP, SMT Off sometimes wins if consistent latencies are critical, while SMT On can be better for reporting workloads with parallel scans. Decide based on repeatable measurements, not a single lucky run.
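A paired comparison can be reduced to a small decision helper. This is only a sketch: the dictionary keys, the 3% tail tolerance and the wording of the verdicts are assumptions for illustration, and a real decision should also consider temperatures and frequencies.

```python
import statistics


def compare_modes(smt_on, smt_off, tail_tolerance=0.03):
    """Compare paired SMT On/Off runs.

    Each argument is a list of dicts with 'tps' and 'p99_ms' keys, one per run.
    """
    def med(runs, key):
        return statistics.median(r[key] for r in runs)

    tps_on, tps_off = med(smt_on, "tps"), med(smt_off, "tps")
    p99_on, p99_off = med(smt_on, "p99_ms"), med(smt_off, "p99_ms")
    tps_gain_on = (tps_on - tps_off) / tps_off      # > 0: SMT On has more throughput
    p99_change_on = (p99_on - p99_off) / p99_off    # > 0: SMT On has worse tails

    # Tail latency is checked first: a throughput gain does not excuse worse tails.
    if p99_change_on > tail_tolerance:
        verdict = "SMT Off: tail latency is worse with SMT On"
    elif tps_gain_on > 0:
        verdict = "SMT On: more throughput without a tail penalty"
    else:
        verdict = "no clear winner: keep the current setting"
    return {"tps_gain_on": tps_gain_on, "p99_change_on": p99_change_on, "verdict": verdict}
```

Checking the tail first reflects the priority described above for OLTP. For a reporting workload the order of the checks could reasonably be reversed.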
C-states, Turbo and power limits: latency vs savings
C-states control how deep cores and the package can sleep when idle. For DBMS this can introduce micro-delays: a short query arrives, a core wakes up and response time fluctuates slightly. On long batch jobs (ETL, backups, reports) the effect may be negligible, but on OLTP with many short transactions latency variance is often visible.
Usually compare 2–3 clear modes rather than dozens of tiny options. Common items to try:
- C-states: enabled vs disabled;
- package C-states (socket sleep depth): limit to shallow states or disable deeper ones;
- platform power profile (Performance/Balanced) if available in BIOS.
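After a reboot it is worth confirming from the OS which idle states are actually exposed. The sketch below assumes a Linux host with the cpuidle sysfs interface (name, latency, usage and disable files per state). The latency budget is a hypothetical number you would derive from your own latency target.

```python
from pathlib import Path


def read_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """List idle states exposed by the Linux cpuidle driver for one CPU."""
    base = Path(cpu_dir)
    if not base.is_dir():
        return []  # no cpuidle driver, not Linux, or C-states hidden by firmware
    states = []
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        try:
            states.append({
                "name": (state / "name").read_text().strip(),
                "exit_latency_us": int((state / "latency").read_text()),
                "entries": int((state / "usage").read_text()),
                "disabled": (state / "disable").read_text().strip() != "0",
            })
        except (OSError, ValueError):
            continue  # skip states with missing or unreadable attributes
    return states


def deep_states(states, latency_budget_us):
    """Names of enabled states whose exit latency exceeds the budget."""
    return [s["name"] for s in states
            if not s["disabled"] and s["exit_latency_us"] > latency_budget_us]
```

Comparing the output before and after a BIOS change shows whether the setting really removed the deep states, and the "entries" counter shows whether the cores actually used them during the test.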
Turbo raises frequencies on some cores but is limited by temperature and power limits (PL1/PL2). Under an "all cores at 100%" test Turbo can quickly decay due to heating. Results then depend on whether the server hit throttling, not on the DB itself. Frequency stability matters as much as raw numbers.
How to test without surprises
Change one thing at a time and run tests long enough for thermal equilibrium (often 30–60 minutes, and preferably longer). Keep conditions identical: rack temperature, fan profiles, microcode/BIOS, and the same workload profile.
A "nervous" scenario works well: mix short index queries with background write activity to reproduce spikes.
What to record
To avoid arguments later, log:
- p95/p99 latency and number of timeouts (if any);
- CPU frequencies under load and any drops;
- CPU temperature and fan speeds;
- signs of throttling (thermal or power);
- power consumption if available, to understand the cost of gains.
If disabling C-states improves p99 but raises temperature and triggers throttling, the long-term result may be worse. A good configuration is fast and predictable day after day.
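Frequency decay during a run can be flagged automatically from logged samples. This is a minimal sketch: it assumes you already collect average core frequency in MHz at a fixed interval, and the window size and 5% threshold are illustrative values.

```python
import statistics


def frequency_decay(samples_mhz, window=5):
    """Relative drop between the first and last `window` samples (0.0 = none)."""
    if len(samples_mhz) < 2 * window:
        raise ValueError("not enough samples for two non-overlapping windows")
    start = statistics.mean(samples_mhz[:window])
    end = statistics.mean(samples_mhz[-window:])
    return max(0.0, (start - end) / start)


def looks_throttled(samples_mhz, max_drop=0.05, window=5):
    """Flag a run whose sustained frequency fell by more than `max_drop`."""
    return frequency_decay(samples_mhz, window) > max_drop
```

A run flagged this way says more about cooling and power limits than about the database, so its TPS and latency numbers should not be compared with unthrottled runs.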
Memory and RAS in BIOS: settings to touch cautiously
For databases, stability matters as much as raw numbers. Memory settings are an area where a small gain can be bought at the cost of rare errors, odd hangs or degraded performance under stress.
Memory: what can affect latency and stability
Beginners often change memory frequency and mode. Higher frequency normally gives more bandwidth but can sometimes increase latency or reduce stability margin, especially with all DIMM slots populated.
Test only changes that are easy to explain and roll back:
- memory frequency (Auto vs a fixed supported value);
- memory profiles (start with JEDEC/default profiles rather than aggressive XMP-like settings);
- channel/rank interleaving (often helps even access);
- controller modes labelled "performance"/"balanced" if present.
If BIOS shows timings, record them but don’t tweak manually without reason. Auto settings sometimes pick values "on the edge," and manual changes make rollback harder.
RAS: reliability over speed
RAS settings (Reliability, Availability, Serviceability) help systems run for years and correct single-bit errors. For DBMS this is often worth more than a small performance gain.
Be cautious with:
- ECC modes (usually leave enabled);
- memory scrubbing (patrol/background scrubbing): reduces error accumulation but may add a small constant background load;
- error correction thresholds and policies: aggressive modes may fix issues but also surface more hardware problems.
Simple example: you raise memory frequency and see +3% in a synthetic test, but a week later corrected ECC errors appear. The system still runs, but the risk has increased, so it is better to revert and inspect the DIMMs.
How to test safely and what to log
Compare latency and memory throughput, then run a long stability test under DB load (not in production) for several hours, or better for a full day.
In the log, record:
- server model, BIOS version, DIMM configuration;
- memory frequency/mode and interleaving settings;
- any RAS parameters you changed;
- latency/bandwidth metrics and load test results;
- counts of corrected/uncorrected memory errors.
Rule of thumb: if corrected errors rise or uncorrectable errors appear after a change, roll back immediately even if benchmarks look better.
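The rule of thumb is simple enough to automate. The sketch below assumes a Linux host with an EDAC driver loaded, which exposes ce_count and ue_count per memory controller under /sys/devices/system/edac/mc. On other platforms the counts would have to come from the BMC or vendor tools instead.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Sum corrected (ce) and uncorrected (ue) counts over all memory controllers."""
    base = Path(root)
    if not base.is_dir():
        return None  # EDAC driver not loaded, or not a Linux host
    totals = {"ce": 0, "ue": 0}
    for mc in base.glob("mc[0-9]*"):
        for key in totals:
            try:
                totals[key] += int((mc / f"{key}_count").read_text())
            except (OSError, ValueError):
                pass  # a missing counter is ignored rather than treated as zero errors
    return totals


def must_roll_back(before, after):
    """Any new corrected or uncorrected memory error since `before` means rollback."""
    return after["ce"] > before["ce"] or after["ue"] > before["ue"]
```

Read the counts before the change and again after the long stability run. A None result means the counters are unavailable, which should be recorded as "not checked" rather than "no errors".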
PCIe, network and NVMe devices: BIOS items that can affect I/O
Even with CPU and memory tuned, I/O (disks, network, controllers and how they connect over PCIe) is often the bottleneck for DBMS. BIOS contains options that can affect I/O latency and stability.
PCIe and network features
First check PCIe mode. Auto usually works, but fixing a generation (e.g., Gen3/Gen4) can eliminate rare link negotiation issues and stabilize latency. If you see odd drops, test both modes.
Pay attention to options relevant only in some environments:
- SR-IOV — relevant when passing network or storage to VMs or using advanced NIC features;
- IOMMU — needed for virtualization and device isolation but can add slight overhead and change latencies.
If the server runs without virtualization or device passthrough, avoid changing these without reason.
NVMe and power-saving
NVMe latency is critical. Check for BIOS power-saving modes for PCIe or drives that put devices into deep sleep. On graphs these show as rare but notable latency spikes, especially for small-block reads.
Test in two phases: isolated I/O stress (read/write, small and large blocks), then the same profile alongside typical DB load.
To keep results reproducible, record:
- BIOS and RAID/HBA/NVMe and NIC firmware versions;
- negotiated PCIe generation and lane width (Gen and x8/x16);
- SR-IOV/IOMMU and related options;
- disk latencies: mean and p95/p99, plus IOPS under the same workload;
- test conditions: same power profile, driver versions and DB configuration.
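The negotiated link can be read from the OS and compared with what the device advertises. This sketch assumes Linux, where each PCI device directory under /sys/bus/pci/devices contains current_link_speed, max_link_speed, current_link_width and max_link_width. The device path is something you look up yourself, and some devices lower link speed while idle, so check under load.

```python
from pathlib import Path


def parse_speed(text):
    """Turn '16.0 GT/s PCIe' into 16.0; return None for 'Unknown' or empty text."""
    try:
        return float(text.split()[0])
    except (IndexError, ValueError):
        return None


def link_status(device_dir):
    """Read negotiated vs maximum link speed and width for one PCI device."""
    dev = Path(device_dir)
    status = {}
    for name in ("current_link_speed", "max_link_speed",
                 "current_link_width", "max_link_width"):
        try:
            status[name] = (dev / name).read_text().strip()
        except OSError:
            status[name] = None
    return status


def is_downtrained(status):
    """True if the link negotiated a lower speed or fewer lanes than the maximum."""
    cur = parse_speed(status["current_link_speed"] or "")
    top = parse_speed(status["max_link_speed"] or "")
    slower = cur is not None and top is not None and cur < top
    cw, mw = status["current_link_width"], status["max_link_width"]
    narrower = bool(cw and mw and cw.isdigit() and mw.isdigit() and int(cw) < int(mw))
    return slower or narrower
```

Logging link_status for each NVMe drive and NIC alongside every run makes it obvious when a latency change coincides with a link that came up slower or narrower after a reboot.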
How to record results: metrics, log and repeatability
To avoid guessing, decide in advance what you compare. One parameter at a time, identical load, identical data and identical duration. Then even small effects become visible and random fluctuations won't mislead you.
Metrics to always collect
Pick metrics that reflect speed and stability. A common minimal set:
- TPS or total throughput;
- p95 and p99 latency;
- CPU: utilization, frequency, core migrations, context switches;
- memory and I/O: page faults, read/write throughput, disk queue depth;
- temperatures and power limits (signs of throttling).
TPS can rise while p99 worsens — for DBMS that is often a bad trade.
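Tail percentiles are easy to compute from raw per-request latencies if your benchmark tool does not report them. The sketch below uses the nearest-rank definition, which is one of several common percentile definitions. Tools that interpolate between samples will give slightly different numbers, so use the same method for every run.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # round() guards against floating-point noise such as 999.0000000000001.
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 9)))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """p50, p95 and p99 for one run, computed with the same method each time."""
    return {name: percentile(samples_ms, p)
            for name, p in (("p50", 50), ("p95", 95), ("p99", 99))}
```

Keeping p50 next to p95 and p99 in the same summary makes the trade described above visible: a median that improves while p99 grows shows up in a single row.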
Simple log you can actually keep
A single table in which you record every run works best. Minimum columns:
- date/time and test window;
- server model, BIOS version and important firmware versions (if changed);
- changed BIOS parameter and its value (one at a time);
- environment description: DB version, config, dataset size, client count, duration;
- results of 3–5 runs (median and spread) and notes on anomalies.
Example: testing SMT with five 20-minute runs after identical warm-ups. If one run "failed" due to a background job, mark it and don't draw conclusions from a single number.
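Such a table can be kept as a CSV file that a script appends to. This is a sketch with assumed column names that mirror the list above. It records TPS only, and further columns for p95/p99 would follow the same pattern.

```python
import csv
import io
import statistics

COLUMNS = ["timestamp", "server_model", "bios_version", "parameter", "value",
           "environment", "runs_tps", "median_tps", "spread_tps", "notes"]


def make_row(timestamp, server_model, bios_version, parameter, value,
             environment, runs_tps, notes=""):
    """Build one log row; median and spread are derived, never typed by hand."""
    return {
        "timestamp": timestamp,
        "server_model": server_model,
        "bios_version": bios_version,
        "parameter": parameter,
        "value": value,
        "environment": environment,
        "runs_tps": " ".join(str(v) for v in runs_tps),
        "median_tps": statistics.median(runs_tps),
        "spread_tps": max(runs_tps) - min(runs_tps),
        "notes": notes,
    }


def append_rows(stream, rows, write_header=False):
    """Append rows to any text stream (an open file or io.StringIO)."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    if write_header:
        writer.writeheader()
    writer.writerows(rows)
```

When writing to a real file, open it with newline="" as the csv module documentation recommends, and pass write_header=True only when the file is first created.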
Separately log risk signs: reboots, WHEA errors, frequency drops, overheating, unexpected p99 spikes. If these appear after changing NUMA, C-states or SMT, that's also a result even if TPS increased.
Main rule of repeatability: compare only comparable tests — same data, same workload profile, same interval and minimal external changes.
Common mistakes that lead to false conclusions
False conclusions often come when BIOS is changed as a bundle of "improvements" rather than as hypothesis tests. You may think performance improved but cannot reproduce it.
Typical mistakes:
- changing multiple options at once (NUMA, SMT and C-states together). If performance shifts, you won't know which change caused it or how to roll back safely;
- comparing tests in different conditions: different data, buffer sizes, cache warm-up, background tasks. On short runs a cache hit can produce a deceptively good number;
- looking only at average time or total TPS and ignoring tails. For databases p95/p99 matter because rare latency spikes affect users and request queuing;
- testing Turbo, power limits and power-saving without watching temperature and throttling. A server may show a fast start and then hit limits after 10–20 minutes, skewing results;
- drawing conclusions from a short benchmark without a long run. SMT and C-states need to be observed under sustained load and during maintenance windows, backups or typical background jobs.
Another trap is different BIOS profiles across identical servers in a cluster. Even a single mismatched setting (e.g., SMT on one node and off on another) complicates comparison and can make nodes in the same replication role perform unevenly or show different latencies. Agree on a unified profile and record firmware versions and the full parameter set, especially for homogeneous fleets.
Always verify that rollback actually applied. Sometimes profiles appear "saved" but some parameters remain in Auto, and you end up testing something different than intended.
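If the intended profile and the values read back after reboot are both written down as name/value pairs, the check becomes a simple diff. The setting names below are hypothetical; how you obtain the actual values (profile export, BMC interface or manual transcription) depends on the platform.

```python
def diff_settings(expected, actual):
    """Return {name: (expected, actual)} for every setting that differs.

    Keys missing on one side are reported with None on that side.
    """
    names = sorted(set(expected) | set(actual))
    return {n: (expected.get(n), actual.get(n))
            for n in names if expected.get(n) != actual.get(n)}


# Hypothetical values: the profile you meant to restore vs what the server reports.
expected = {"SMT": "Enabled", "C-states": "Enabled", "NUMA": "Enabled"}
actual = {"SMT": "Enabled", "C-states": "Auto", "NUMA": "Enabled"}
mismatches = diff_settings(expected, actual)
```

Here mismatches contains only the C-states entry, which is exactly the "remains in Auto" case described above. An empty result is the condition for starting the next test.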
Short checklist and next steps after choosing settings
Once you know which BIOS changes help a DBMS, lock them down so you don't lose stability after firmware updates or hardware replacement.
Quick checklist before and during tests
- save the baseline BIOS profile (screenshots or export) and agree on a reboot window;
- prepare a rollback plan: what to restore, in what order, who is responsible and the maximum acceptable downtime;
- change only one parameter at a time (NUMA, SMT/Hyper-Threading, C-states, etc.);
- do 3–5 repeats of the same test and record p95/p99, not just averages;
- monitor temperature, CPU frequencies and any error signs (reboots, timeouts, latency spikes).
After short runs, allow time for stability checks. Practical rule: a 2–3% synthetic gain that adds rare p99 tails under real load is usually a bad trade.
After selecting the "best" profile
Work ends when the result is documented and reproducible.
- do a final long run under typical load (overnight or during a maintenance window) and check logs for hardware errors and memory corrections;
- record the final profile as a standard: BIOS/firmware versions, list of toggled options, test conditions and final metrics;
- describe applicability boundaries: which databases and scenarios the profile fits (e.g., OLTP sensitive to latency) and where a different profile is needed;
- include the profile in change procedures: BIOS updates, CPU/memory replacement or rack moves should trigger re-checks of key metrics;
- standardize profiles by server model if you have many servers.
If you use GSE S200 Series servers or plan a fleet upgrade, it helps to agree on a baseline profile and validation order with GSE.kz (gse.kz) engineers, then keep consistent settings during firmware updates and component replacements.
If your team lacks time for methodical runs, hire a systems integrator to tailor a profile for your specific DBMS and real workload, not a "one-size-fits-all" setting.
FAQ
When does it make sense to change BIOS settings for a DBMS?
Usually when performance is "uneven": the same load gives different TPS/QPS or response time jumps. If the OS and DBMS are already tuned and unexplained peaks remain in the tails (p95/p99), BIOS parameters can be both the cause and a place to improve.
What are the main risks of changing BIOS settings for databases?
The main risks are instability and worsening tail latencies. You can get rare errors that appear later, or an increase in p99 even if average TPS looks better. Another common risk is inability to quickly roll back if several settings were changed at once and the original values were not recorded.
What should be done before the first BIOS change to avoid regret?
Record hardware configuration and firmware versions, then save the current BIOS profile (to a file/slot) or take full screenshots of relevant sections. Next, decide which metrics you will compare and what degradation threshold is a stop signal. Plan a maintenance window with time for 2–3 reboots and log checks after each change.
How to test BIOS settings safely and avoid false conclusions?
Change one parameter at a time and reboot the server after each change. Run several identical runs and compare medians and spread, not the single best result. If an improvement does not repeat in at least two test series, consider the effect unproven and revert the setting.
Which NUMA settings in BIOS should be checked first?
The options most commonly tested are whether NUMA is enabled and whether memory (node) interleaving is on, which makes the system behave more like UMA. Then check whether p95/p99, overall throughput and the share of remote memory accesses change. A good sign is not only higher TPS but also reduced tail latencies under the same load and data set.
When does it make sense to disable SMT (Hyper-Threading) for a DBMS?
Test SMT when the database is CPU-bound and p95/p99 rises or response times become uneven under high load. In such cases, SMT Off sometimes evens out latencies even if peak throughput drops slightly. If the workload is mixed with much I/O wait, SMT On can increase throughput without harming tails — this must be confirmed with measurements.
What to do with C-states if latency matters?
For OLTP with many short queries, deep C-states can introduce micro-delays when cores wake up and make responses less predictable. Usually compare simple modes: C-states enabled vs disabled and the platform power profile in BIOS if available. If disabling C-states improves p99 but causes overheating and throttling on long runs, the overall result may be worse — so long tests are necessary.
Why can Turbo and power limits spoil DBMS tests?
Turbo can give a fast start but then frequencies may fall due to temperature or power limits, making test results unstable. Look at frequency stability under prolonged load, not just TPS. If you see strong frequency drops after 10–20 minutes, address cooling and power limits before drawing conclusions about DBMS parameters.
Which memory and RAS settings should be adjusted with maximum caution?
Memory frequency and modes can give a small gain but reduce stability margin, especially with a full complement of DIMMs. ECC and other RAS functions are usually better left enabled because reliability often matters more to DBs than a few percent in synthetic benchmarks. If corrected ECC errors increase or uncorrectable errors appear after a change, revert immediately even if benchmarks look better.
Which metrics should you collect and how to record results so they are repeatable?
The minimum set is throughput (TPS/QPS), p95 and p99 latency, CPU load and frequencies, signs of I/O wait, temperatures and possible throttling. Keep a simple run log with the exact BIOS parameter changed, test duration, data size and firmware versions. One table is better than scattered notes: it lets you reproduce results and quickly spot where degradation began.