Why do I need a checklist for a DB/DWH server if the specs already look powerful?

A checklist is needed to get not just an "fast" system on average, but a stable and repeatable one in terms of latency and throughput. It helps agree acceptance metrics in advance and catch items that look good on paper but cause drops and "sawtooth" performance under real load.

Where to begin when selecting a configuration so you don't argue "by sight"?

Start by locking down the profile: OLTP, DWH or mixed, read/write ratio and parallelism. Then pick 3–5 representative operations and agree acceptance metrics: for OLTP — TPS and p95/p99 latencies; for DWH — report times and scan throughput; for loads — batch time and behavior under concurrent jobs.

What's more important for DBs: more cores or higher CPU frequency?

For OLTP, single-core frequency and consistent latency usually matter more than raw core count. For DWH, more cores and threads help, but only if memory and storage aren't the bottleneck—otherwise extra cores will idle and provide no benefit.

Why can NUMA "break" performance and how do I know it's the case?

NUMA matters in dual-socket servers: a thread runs faster with the memory local to its CPU than with memory on the other socket. If memory allocation and thread placement are skewed, you get fluctuating latencies and poor scaling as parallelism grows. Check memory symmetry and behavior under load to see if NUMA is your issue.

How to choose and verify DDR5 memory for DB and DWH?

Look not only at total capacity, but at the actual frequency in the chosen population of modules—filling more slots can reduce frequency. For databases, consistent latency and memory bandwidth are key, so verify channel symmetry and practical frequency rather than relying solely on module specs.

What disk layout is recommended for DBs: what must be separated?

Separate roles: data, logs, temp/scratch and backups should not compete in one pool, otherwise peaks in one workload will impact others. Prioritize measuring and controlling latency and its stability, since that often causes sudden stalls and missed load windows.

When to choose RAID and when HBA/JBOD for NVMe in a DB server?

HBA passthrough is often chosen when redundancy and array management are handled at the OS, filesystem or DB level and direct access to drives is needed. Hardware RAID suits simple schemes (for example, mirrors for the OS) or when the team is used to diagnosing and maintaining arrays on the controller; the key is to decide in advance where monitoring and recovery will be done.

What network mistakes are most common in DB/DWH projects?

Estimate concurrent traffic: client requests, replication, backups and ETL can interfere on a single port. Practically, separate flows by ports or VLANs and in acceptance measure not only throughput but latency stability under load—"link up" alone doesn't guarantee predictable DB operation.

What must be checked in BIOS/firmware before installing the DB?

Agree on BIOS, iDRAC and firmware versions for NVMe and NICs, and set the power profile so frequencies and latencies don't "wander" due to hidden power settings. Record these versions and key parameters at acceptance so you can compare later and quickly identify changes.

What does a normal acceptance for a DB/DWH server look like and what about ongoing support?

At minimum: a short set of disk, network and CPU/RAM tests plus a mini load run resembling your profile measuring p95/p99 latencies. If you involve an integrator, formalize an acceptance protocol and responsibilities for updates and support; in Kazakhstan this is often handled through GSE.kz as manufacturer and integrator to keep the configuration consistent in operation.

Dell PowerEdge R760 Configuration Checklist for DB and DWH

Why a checklist specifically for DB and DWH

A server for DB and DWH can look "powerful on paper" but show odd drops in real use. Sometimes everything is fast, other times queries suddenly stall for seconds; a nightly load fits its window one time and misses it the next. A checklist is not a formality — it's how you get a stable, predictable system you can accept from a vendor and maintain.

Typical symptoms of incorrect configuration don't appear immediately but are expensive to fix: I/O latency spikes, IOPS drops under parallel requests, "sawtooth" read/write speeds, storage queues, and CPU spikes with the same data. For example, a DWH may load data steadily for 20 minutes, then suddenly hit an NVMe latency cliff or NUMA imbalance and stretch the job twice as long.

For databases the two most important things are predictable latency and stable throughput. "Many cores" doesn't always help. If memory runs at the wrong frequency, channels are unevenly populated, disks are assembled "haphazardly", or the network can't hold latency under load, performance will fluctuate even with the same vCPU count and gigabytes.

Before procurement agree on three sets of requirements. Without them, any conversation about NUMA, memory, NVMe, RAID/HBA and network is guessing:

Workload profile: OLTP, analytics, mixed mode, read/write ratio, parallelism, working set sizes.
Growth: how fast data, indexes and temp tables grow, and how many concurrent users or ETL jobs there will be.
Resilience: what counts as downtime, whether hot-swap is required, mirroring, backups and how fast recovery must be.

When these inputs are fixed, hardware selection and tuning stop being "magic" and become verifiable acceptance items.

Start by fixing the workload profile and goals

Before assembling a Dell PowerEdge R760 for DB and DWH, describe what the server will do daily. The same "powerful" hardware can produce very different results if the workload profile is missed.

OLTP usually hits latency limits: many short transactions, frequent writes, sensitivity to NUMA and single-core speed. DWH prefers broad parallelism: long queries, large table scans, heavy reads, and intensive sorts and aggregations. A mixed profile (presentation layers and transactions on one node) is the riskiest: it's easy to get conflicts over CPU cache, memory and disk queues.

To avoid arguing "by sight", agree acceptance metrics in advance:

OLTP: TPS and 95/99th percentile latency on key operations.
DWH: scan throughput (GB/s) and runtime of typical reports.
Data loads: batch time and stability under parallel jobs.
Backups: whether they fit the window and how they affect production load.

Also fix site constraints. Power and cooling matter for reliability and for predictable CPU and memory frequencies. Rack space, noise, availability of 10/25/100GbE ports and network segmentation requirements influence the final design. In regulated sectors check procurement and delivery rules in advance: sometimes that favors locally produced systems and integrator services to satisfy procurement and support rules.

A simple target: if you plan a 6-hour nightly load and daytime reports, the goal might be "batch done by 05:00 at parallelism N, daytime report under 2 minutes." Then pick CPU, memory, NVMe and network to meet concrete numbers, not just the maximum configuration.

CPU: frequency, cores and growth plan

Choosing CPU for DB and DWH on a Dell PowerEdge R760 is usually a trade-off: more cores for parallel tasks or higher frequency for fast single-threaded queries and latency-sensitive operations. Plan for current load and expected growth over 12–24 months.

If you have many short queries, active transactions and tight response time requirements with imperfect query plans, higher per-core frequency and fewer cores often win. If the main profile is long analytical queries, scans and aggregates, total core count and the ability to sustain many threads matters more.

Estimating CPU needs for DB and background work

CPU in a DWH is consumed not only by user queries. Background work often eats CPU: loads (ETL/ELT), index builds/rebuilds, compression, statistics, partitioning, encryption and replication. Split CPU budget in advance: how much for interactive queries and how much for nightly jobs that can spill into daytime.

A quick checklist:

Are ETL and reports running at the same time?
Do some queries require predictable latency (SLA)?
Will you use compression, encryption or heavy functions extensively?
Do you need headroom for data and user growth?

Vertical scale or cluster

Decide in advance whether you will scale vertically (add CPU/memory to a single server) or horizontally via cluster/sharding (add nodes). Vertical scaling is simpler operationally but limited by a single-node ceiling and can be costlier to upgrade. Clustering offers flexibility and resilience but requires discipline in networking, storage and administration.

Ask your integrator for two scenarios: "maximum single-node" and "designed for multiple nodes" so you can compare price and growth risks.

NUMA: basics and practical settings

NUMA is simply "memory close to your CPU." In a dual-socket server each CPU has its local memory. Access to local memory is faster than access to memory attached to the other CPU. This is critical for DB and DWH: if a thread frequently uses "remote" memory, latency rises and performance becomes less predictable.

NUMA usually starts to hurt with parallel queries, large scans and mixed loads (ETL and reports running together). Typical symptoms: floating latencies, unexpected response-time spikes, drops as thread count grows, and cases where CPUs look busy yet adding cores gives little benefit.

Affinity (pinning) is useful when repeatability is required: for example, one NUMA node for the main DB and another for background loads. It's not always mandatory, but often helps with strict SLAs and node imbalances.

Quick OS-level checks:

lscpu | egrep "Socket|NUMA"
numactl --hardware

What to look for numerically:

Memory distributed symmetrically across NUMA nodes (not "node 0 90% RAM, node 1 almost empty").
No persistent skew under load: one node saturated in CPU and memory while the other is idle.
If BIOS has Node Interleaving enabled, NUMA is effectively blurred. For DBs it's often better to leave NUMA visible and manage placement deliberately.

Example: a DWH doing parallel NVMe reads and sorts. If some workers run on CPU1 but memory is mostly allocated on CPU0, latency rises and query time becomes "sawtooth" across runs. Symmetric memory and sensible affinity usually remove these surprises.

DDR5 memory: capacity, frequency and channel layout

For DB and DWH memory constraints are about capacity and speed. If queries work mostly in RAM, extra gigabytes may not help — frequency and latency matter more. Faster memory improves scans, hash joins, sorts and index builds.

The trade-off is simple: more modules usually mean more capacity but can reduce frequency. On many platforms adding two modules per channel forces the memory controller to run at a lower frequency than with a single module per channel. In practice this looks like: you add sticks, gain RAM, but some queries slow down, especially those that move lots of data in memory.

Estimating capacity for DB and DWH:

For OLTP estimate the working set (hot tables and indexes). A good goal is to fit it in cache plus some room for internal structures.
For DWH add memory for parallelism: sorts, hash aggregates, temp tables and ETL. The more concurrent heavy queries, the more headroom you need.
Leave a reserve for the OS and spikes. Practically, keep 15–25% free to avoid swapping and memory starvation at inopportune times.

Example: a server for nightly loads and daytime reports. During the day predictable latency matters, so start with correct channel layout and higher frequency. At night some reserve can be used for parallel sorts, but not at the cost of persistent memory drop-offs.

Simple acceptance checks:

Verify actual memory frequency in BIOS/iDRAC and in the OS, not only module specs.
Ensure modules are evenly distributed across CPU channels (symmetry matters for bandwidth).
Check for unexpected frequency drops when all slots are populated.
Run a short memory read/write test and compare to expected levels for the chosen layout.
Record settings and results in the acceptance report for future comparison.

Storage: RAID or HBA, NVMe and separating workloads

Performance audit before upgrade

We'll check your current server for NUMA imbalance, memory frequency, I/O latency and network bottlenecks.

Order an audit

Storage often breaks expectations not by peak IOPS but by latency and predictability. It's important to separate roles and avoid mixing different workload types in one pool.

Minimum separation logic:

OS and system partitions: separate.
Data: dedicated NVMe pool for the main dataset.
Logs (WAL/redo/transaction log): separate pool prioritized for stable low write latency.
Temp/scratch (tempdb, sort, spill): separate from data and logs.
Backups: separate media or pool so they don't throttle production.

Choosing RAID vs HBA passthrough depends on where you want to manage redundancy. If you use software RAID/Storage Spaces/ZFS or DB-level mechanisms, HBA passthrough is common so the OS sees raw drives. Hardware RAID fits when a simple scheme is wanted (e.g., OS mirrors) or if the team expects to manage arrays at the controller level.

For NVMe it's not only drive class but how you assemble performance. For DWH loads and sorts parallelism matters: several mid-sized NVMe drives can give smoother queue distribution than one large drive. For logs, write endurance (DWPD/TBW) and stable write latency matter more than max read speed.

Example: use two SSDs in mirror for OS, a separate pair of NVMe for logs, and 4–8 NVMe for data and temp so heavy sorts don't affect the log pool.

Before ordering and at acceptance ask the integrator: what's the write endurance of chosen NVMe and how long will it last with your volumes, how is wear treated in warranty, what controller mode is planned (RAID/HBA) and where is the failure monitoring point, expected write latency for the log volume under typical load, and how hot-swap and pool rebuild are organized.

Network: bandwidth, segmentation and latency stability

Network is a "quiet" limiter: the server is fast, disks are fast, but loads and replication may hit a port, switch or configuration. Decide how much concurrent traffic you'll have and how to split it.

Choose speed and port count by role. 10GbE often suffices for applications accessing a single medium-load DB. For active nightly DWH loads, replication and fast backups, 25GbE (2 ports) or 100GbE makes more sense if you have a large cluster and tight windows. Prefer separate physical ports for different traffic rather than squeezing everything onto one channel.

Segregating networks reduces interference and simplifies diagnostics:

Client application traffic to the DB.
Replication or inter-node traffic (if any).
Backup/restore and heavy loads (ETL).
Management (iDRAC/host management).

To avoid surprising bottlenecks check basic settings. Use VLANs for isolation, not aesthetics. LACP helps for resilience and aggregate bandwidth but a single large flow won't necessarily speed up—parallel flows benefit more. MTU 9000 (jumbo frames) is useful only if it's consistent from server to destination; otherwise you risk strange losses and throughput drops.

At acceptance measure real numbers, not just "link up." Test throughput in both directions between server and target (storage, replica, backup server) and check latency stability under load. A practical method: run parallel throughput tests while monitoring ping on the same segment for latency and loss. If ping fluctuates, loads and queries will be unstable even with nice gigabit specs.

BIOS, firmware and basic OS settings

Before installing the DB and running tests, bring the server to a predictable state. The same hardware can behave differently because of BIOS version, CPU microcode, NVMe firmware and NIC drivers.

First update core components to agreed versions and document them in the acceptance report:

BIOS and CPU microcode.
iDRAC and Lifecycle Controller.
RAID/HBA and backplane firmware.
NVMe SSD firmware and storage drivers in the OS.
NIC firmware, drivers and offload options.

Then check the power profile. For DB and DWH choose Performance mode and disable aggressive power savings (deep C-states, aggressive power limits) so latency is steadier. The trade-off is higher power and heat; ensure sufficient cooling and power headroom.

If virtualization is planned, enable VT-x/VT-d but don't rely on scheduler automation. For critical DBs reserve vCPU and memory, enable huge pages, pin by NUMA nodes and avoid resource overcommit on the host. This reduces the risk of floating latencies under peak.

Also set up fact collection so you won't have to guess where a bottleneck came from. Minimum set: CPU load by socket, frequencies and throttling, NUMA local/remote memory stats, disk latency (p95/p99), IO queue depth, NVMe errors, NIC load and drops, temperature and power draw. Capture these metrics before and after load tests and save them with firmware and driver versions.

Short acceptance load test plan (step-by-step)

From requirements to working system

We will document OLTP or DWH profile, agree metrics and assemble a solution for concrete numbers.

Start project

Acceptance for DB and DWH is not a benchmark for numbers but a quick check that the build matches spec and there are no surprises under load: latency cliffs, disk errors or network problems.

Verify hardware against the order: CPU model, socket count, RAM amount and channel layout, NVMe composition, NIC model and speed, BIOS/iDRAC and firmware versions.
Run a short disk load for reads and writes (separately). Important are not only speed but latency stability: average and p95/p99, plus errors in logs.
Check memory and NUMA. Ensure test processes use local memory and remote memory is noticeably slower. If there's no difference or it's odd, investigate settings.
Run a network test for throughput and loss. For DWH stability matters: no drops, retransmits or latency spikes.
Execute a mini test that resembles your profile: a short data load plus typical query set for 15–60 minutes, or synthetic workload matching read/write ratio and parallelism.

Run tests under identical conditions: same power profile, BIOS modes, OS and driver versions, and without extraneous background tasks.

Record in the acceptance document:

Configuration and firmware versions, key BIOS/OS parameters (including NUMA policy).
Disk metrics: IOPS/MBps, average and p95/p99 latency, temperature and errors.
CPU/RAM metrics: load, frequencies, signs of throttling, NUMA stats.
Network metrics: throughput, loss, retransmits, latency.
Test conditions: tools and versions, duration, data set size, number of threads.

Common costly mistakes

The most painful case is a server that is "powerful by spec" but hits a bottleneck in the real DB or DWH that could have been avoided at ordering time.

Mistake 1: chasing cores without balance

"More cores" doesn't always mean "faster queries." If memory is insufficient or slow, or storage can't keep up, cores will idle. This is pronounced in ETL and heavy joins where memory bandwidth and steady I/O matter.

Mistake 2: mixing workloads on the same disks

If transaction logs, temp, data and backups share one pool, load peaks interfere. A nightly load can severely degrade daytime response if these roles are not separated.

Other frequent errors:

Buying max cores but skimping on memory and disks so performance doesn't improve.
Putting logs and data together and then blaming the DB for latency spikes.
Using RAID for NVMe by habit without thinking how to recover from a disk failure and how long rebuilds will take.
Installing memory modules of different sizes or speeds so the whole set runs at a lower frequency.
Leaving the network "as is": one port, shared traffic and no real throughput or latency testing.

Example: a DWH loads at night and reports run in daytime. If load and reports share the same disk pool and network interface, users complain even though resources seemed "enough."

At acceptance lock down what is critical for you (IOPS, throughput, latency, CPU) and what settings are mandatory. Otherwise you'll treat symptoms with blind upgrades.

Quick checklist before purchase and before commissioning

Compare vertical scale vs cluster

We will propose options: one powerful node or scaling to multiple nodes considering 12–24 month growth.

Get proposal

This short list filters configurations that look good on paper but later fail due to NUMA, memory or storage issues. Walk it twice: before ordering and at acceptance.

Before purchase: items to fix in the order

CPU: exact model, 1 or 2 sockets, target frequency for your workload, performance profile (no unexpected power-saving modes).
NUMA: how the DB or hypervisor will run (one instance per socket, multiple instances, VMs) and what affinity rules you can actually maintain.
RAM: total capacity and actual DDR5 frequency in the chosen layout. Check channel symmetry and avoid skew.
Storage: roles of volumes (data, logs, temp, backups), what will be RAID and what HBA/JBOD, and hot-swap capability. For NVMe require wear monitoring.
Network: required port speed and medium (fiber/copper), separated networks (clients, replication, backups, management) and target latency metrics.

Do not change architecture "on site" before commissioning. A typical scenario: a DWH server for heavy nightly loads suddenly misses its window because memory runs at a lower frequency due to incorrect slot population.

Before launch: what to confirm on site

Ask the responsible engineer for a single factsheet: BIOS/iDRAC and controller firmware versions, CPU profiles enabled, confirmed NUMA mode, actual memory frequency and capacity, RAID/HBA mode and NVMe state (SMART, wear percent).

Then run a short acceptance test (30–60 minutes): basic network check (throughput and latency stability), disk tests (IOPS/latency per volume type) and CPU/RAM stress (without throttling). Save results and config in documentation for future comparison.

Example scenario: single server for DWH and regular loads

Imagine a Dell PowerEdge R760 used for DWH: nightly bulk loads and transformations, daytime report viewing, and weekly refreshes. Two things matter here: predictable write speed at night and stable read latency during the day.

Separate storage and traffic roles in advance. The most common failure is logs and temp files landing on the same array as main data.

A practical separation scheme to discuss with the vendor:

DWH data: separate NVMe pool (priority — capacity and read performance).
Logs (redo/transaction): separate NVMe (priority — stable writes and low latency).
Temp/scratch (sorts, joins, staging): separate NVMe pool (priority — high mixed IOPS).
Backups: separate storage outside the server or separate drives so backups don't consume NVMe from production.
Replication/exports: separate NIC or VLAN so nightly load doesn't affect reports.

How to choose between "faster" and "bigger": if nightly windows are tight and transforms are heavy, faster NVMe and correct memory frequency/layout usually win. If the main risk is data growth, plan capacity and a clear expansion path.

Mini acceptance plan for this scenario (3–4 checks that answer "ready/not ready"):

Sequential write test for nightly loads on the data pool and separately on the log pool.
Mixed read/write test on temp/scratch (simulate sorts and temp tables).
Parallel run: reporting reads plus background load to observe latency drops.
Network run: actual throughput and latency stability on your VLAN/MTU setup.

Next steps: formalize configuration and acceptance to avoid risks

To avoid "it seems to work", document inputs on one page: DB type, current DB size and growth rate, maintenance windows and service objectives (RPO/RTO). This filters out bad compromises (for example, skimping on memory or network).

Then formalize configuration and acceptance so they can be repeated and verified. Two documents help:

Configuration table: CPU/NUMA plan, DDR5 capacity and frequencies, disk scheme (RAID or HBA), NVMe roles (data, log, temp), network (10/25/100GbE) and port assignments.
Acceptance protocol: list of tests and metrics with thresholds (IOPS/MB/s, p95/p99 latency, CPU%, network loss, typical query time, batch load time) and pass/fail conditions.

An example threshold clear to all: "nightly batch must complete within 2 hours, and p95 write latency on the log volume must not exceed X ms at specified concurrency." Values X and test parameters come from your workload profile, not marketing numbers.

Also decide support to prevent configuration drift. Define who and how applies BIOS/iDRAC/controller updates and on what schedule, what monitoring is in place (capacity, disk wear, latency, temperature) and alerts, disk replacement and rebuild procedures, 24/7 response and escalation.

If you need an integrator, cover not only delivery but assembly, acceptance testing and ongoing support. In Kazakhstan it's often convenient to close this with GSE.kz: as manufacturer and integrator they can provide supply and support under contract so the configuration stays consistent over time.