What exactly does SLA mean for virtualization in the public sector?

SLA in virtualization is an agreement on acceptable service unavailability and on how you recover from incidents. In practice it almost always reduces to three metrics: availability, RTO (recovery time) and RPO (acceptable data loss).

Why can't SLA be replaced with 'just a powerful server'?

Performance answers 'how fast it runs', while SLA answers 'how long you can be down and how quickly you recover'. Fast CPUs and high IOPS won't save you if there's a single point of failure, for example a single PSU, the hypervisor system disk or a single network link.

How to convert SLA into clear configuration requirements?

First define what counts as downtime (a host, a cluster, or a specific service), then specify RTO and RPO and planned maintenance windows. After that, tie every configuration option to a concrete downtime risk or to reducing recovery time — otherwise you'll end up with a list of options that don't help the SLA.

Which DL380 Gen11 options really reduce downtime risk?

Most often it's the RAID controller with protected write cache, a clear disk layout with hot-plug and, if needed, hot-spare, sufficient network ports for redundancy and role separation, and two PSUs in 1+1 mode. These items either prevent downtime for common failures or speed up recovery.

Which RAID to choose for the hypervisor and for VM datastores?

For hypervisor boot it's common to use RAID 1 on two identical SSDs so a single disk failure doesn't stop the host. For datastore choose based on what you fear more: predictable latency and fast rebuilds (RAID 10) or capacity and tolerance for two failures (RAID 6). RAID 5 only makes sense if you've assessed rebuild windows in advance.

Why is protected write cache needed on the RAID controller?

If the controller has a write cache, it must be protected against power loss; otherwise you risk data loss or long array checks after a power failure. Specify protected write cache and that it's enabled — this noticeably affects RPO and recovery time after an unexpected shutdown.

When is NVMe truly justified in virtualization?

NVMe is useful where there are many small random operations and constant writes: hot datastores, logs, VDI or busy databases. If most VMs are 'cold', it's better to first close basic risks around power, disks and network because NVMe can increase complexity and cost without improving availability.

How many network ports are needed per host and how to do redundancy correctly?

As a rule, have at least two physical ports with redundancy and clear role separation, otherwise a single link failure easily becomes downtime. Complex aggregation schemes should be used only if the team can reliably support them across the whole chain, because misconfiguration can itself cause incidents.

Why do '2 PSUs' not guarantee power fault tolerance?

Two PSUs in 1+1 protect against a PSU failure, but they don’t protect against a single power feed to the rack being lost. Real resilience requires two independent power paths A/B to the rack and proper connection of each PSU to its own path; otherwise you formally paid for redundancy but didn't get it.

How not to bloat the proposal while still meeting SLA in procurement?

Require measurable parameters instead of vague phrases: RAID levels by pool, disk class and type, presence of hot-spare, number and speed of physical ports, redundancy scheme, 2x PSU 1+1 and two A/B inputs. If you need 24/7 support, record it and the division of responsibilities in the contract documents; these points should be agreed beforehand.

HPE ProLiant DL380 Gen11 for Virtualization: SLA-focused Options

What SLA means for virtualization in the public sector

SLA for virtualization in the public sector is not about "faster". It's about how long services may be unavailable, how quickly they must be restored, and how much data loss is acceptable. Usually an SLA boils down to three metrics: availability (uptime), RTO (recovery time objective) and RPO (recovery point objective).

Don't confuse SLA with performance. High IOPS or a powerful CPU will speed up virtual machines, but they won't prevent downtime if a single "critical" component fails: power, the hypervisor's system disk or a single network port.

The public sector has its own specifics: approved regulations, scheduled maintenance windows, and requirements for logging and reporting. For example, "updates on Saturdays 02:00–06:00" is an acceptable planned downtime. Unexpected downtime often requires an incident report and investigation, so it's better to eliminate risks with the right architecture in advance.

In virtualization a weak link affects the whole stack. One server might seem "non-critical", but if it runs AD, an accounting database or VDI, its failure becomes a service outage. So when choosing HPE ProLiant DL380 Gen11, the point is not "every possible option", but only those that reduce the chance of downtime or shorten recovery.

To turn SLA into configuration requirements, break it into simple questions:

How much downtime is acceptable and what counts as downtime (host, service, cluster)?
What is the required RTO (how quickly must service be restored)?
What is the acceptable RPO (how much data loss is tolerable)?
What are planned maintenance windows and what must remain online?
What evidence is required (logs, serial numbers, certificates) and who produces it?

With answers, configuration becomes measures against specific risks — not a long shopping list.
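To make the first question concrete, an availability target can be converted into a downtime budget. The sketch below is only arithmetic; the function name and the 30-day period are assumptions, and whether planned maintenance windows count against the budget depends on how your SLA defines downtime.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Return the downtime allowed in a period for a given availability target."""
    if not 0 < availability_pct <= 100:
        raise ValueError("availability_pct must be in (0, 100]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


# Print the budget for a few common targets (30-day period assumed).
for target in (99.0, 99.5, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% over 30 days: {budget:.1f} min of downtime allowed")
```

For example, 99.9% over 30 days leaves roughly 43 minutes, which is less than a single unplanned host rebuild usually takes.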

DL380 Gen11: options that actually change downtime risk

The same server can yield different real uptimes depending on options. For virtualization, what's important is not "maximum specs" but how safely you can survive a disk, port or PSU failure without stopping the host.

What really reduces downtime risk

SLA is most affected by components that provide resilience or speed recovery:

RAID controller and protection for write cache (so writes aren't "lost" on power loss).
Disk layout: RAID level, same class of disks, hot-spare.
Network: enough physical ports to separate roles and provide redundancy.
NVMe: clear purpose (boot or datastore) and hot-replacement without long downtime.
Power and cooling: two PSUs, correct power rating and behavior on failure.

Unnecessary items usually appear as "just in case": maximum NICs "to have them", NVMe "everywhere", the most advanced RAID, larger-than-needed disks. These increase cost and complicate support without always reducing downtime. For example: if you have a single uplink on the switch, a faster port without redundancy changes almost nothing.

Specify requirements with numbers, not vague words. Not "power redundancy", but "2x PSU, 1+1 mode". Not "fast network", but "minimum 2x 10/25GbE per host for VM traffic + separate management". Not "reliable RAID", but "controller with protected write cache, plus 1 hot-spare per node".

Agree on acceptable trade-offs in advance: what's more important for you — survive a disk failure without downtime, survive a port failure, stay within budget, or shorten the rebuild window. This quickly removes "maximum" from proposals and leaves only what supports the SLA and smooth support.

RAID and disks: reduce downtime, don't just raise IOPS

For the SLA, predictability matters more than raw speed: how the host behaves after a disk failure and how long the array runs in degraded mode.

Typically the hypervisor system disk is separated from VM storage. For the system, RAID 1 on two identical SSDs is common: simple and repairable. For datastores choose according to what you fear more — performance loss during degradation or long rebuilds.

Practical guidance:

RAID 1: hypervisor boot and system partitions.
RAID 10: critical VMs and databases where stable latency and fast rebuild matter.
RAID 6: file storage and less sensitive workloads where capacity and tolerance for two failures matter more.
RAID 5: only if you've calculated the rebuild window and accepted the risk.
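The capacity side of this trade-off can be sketched with standard RAID arithmetic. The function name and the 8 x 1.92 TB example are illustrative assumptions, and the tolerated-failures figure is the guaranteed worst case (RAID 10 can survive more failures if they land in different mirror pairs).

```python
def raid_summary(level: str, disks: int, disk_tb: float) -> dict:
    """Usable capacity and guaranteed tolerated disk failures for one array."""
    minimum = {"RAID1": 2, "RAID10": 4, "RAID5": 3, "RAID6": 4}
    if level not in minimum:
        raise ValueError(f"unsupported level: {level}")
    if disks < minimum[level]:
        raise ValueError(f"{level} needs at least {minimum[level]} disks")
    if level == "RAID1":
        if disks != 2:
            raise ValueError("this sketch models RAID 1 as a two-disk mirror")
        usable, tolerated = disk_tb, 1
    elif level == "RAID10":
        if disks % 2:
            raise ValueError("RAID 10 needs an even number of disks")
        # Worst case: the second failure hits the same mirror pair.
        usable, tolerated = disks // 2 * disk_tb, 1
    elif level == "RAID5":
        usable, tolerated = (disks - 1) * disk_tb, 1
    else:  # RAID6
        usable, tolerated = (disks - 2) * disk_tb, 2
    return {"level": level, "usable_tb": usable, "tolerated_failures": tolerated}


# Compare datastore layouts on the same hypothetical set of disks.
for level in ("RAID10", "RAID6", "RAID5"):
    print(raid_summary(level, disks=8, disk_tb=1.92))
```

The numbers say nothing about latency in degraded mode or rebuild duration, which is why they should be read together with the guidance above rather than instead of it.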

Hardware RAID is often easier to operate, but for SLA what's critical is the write cache. If there is a cache, it needs power protection. Otherwise a power loss can cause data loss or long array checks.

Hot-spare shortens reaction time: the drive is already installed in the server, so the rebuild starts immediately. The larger the disks, the longer the rebuild and the longer the window in which a second failure can cause downtime. Avoid mixing different types and sizes of disks in one array: differing speeds and endurance often cause instability and odd errors.
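The relationship between disk size and rebuild window can be estimated roughly. This is a lower-bound sketch: it assumes the whole replacement disk is rewritten once at a sustained rate, and that rate (100 MB/s here) is a placeholder, not a controller specification. Rebuilds under production VM load are typically slower.

```python
def rebuild_hours(disk_tb: float, rebuild_mb_per_s: float) -> float:
    """Lower-bound rebuild time if the whole disk is rewritten at a steady rate."""
    if disk_tb <= 0 or rebuild_mb_per_s <= 0:
        raise ValueError("disk size and rebuild rate must be positive")
    seconds = disk_tb * 1_000_000 / rebuild_mb_per_s  # 1 TB = 1,000,000 MB
    return seconds / 3600


# Doubling disk size doubles the window in which a second failure is dangerous.
for size_tb in (1.92, 3.84, 7.68):
    print(f"{size_tb} TB at 100 MB/s: at least {rebuild_hours(size_tb, 100):.1f} h")
```

Measure the real rate on your controller during the lab tests and substitute it before using the result in a risk assessment.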

To prevent vendors from supplying a "worse equivalent", lock down specifics: RAID levels per pool (system/VM), disk class and type, protected write cache enabled, number of hot-spares, and expected disk replacement time as part of support obligations.

NVMe: where it helps and where it complicates support

NVMe in DL380 Gen11 is chosen not for benchmark numbers but for low and consistent latencies under load. The biggest SLA benefit appears where there are many small operations and constant writes: databases, VDI, logs and the hottest datastores.

NVMe makes sense when a small portion of VMs (say 10–20%) constantly saturates disks and drags the whole cluster down. A fast tier reduces storage queues and helps restore services quicker after peaks.

NVMe is usually unnecessary for 'cold' VMs, archives or workloads that peak once a month. There it's more reliable to invest in redundancy, monitoring and a clear disk scheme than to complicate the drive fleet.

Boot from NVMe: convenient, but agree on the details in advance

Booting the hypervisor from NVMe speeds up startup and can remove the need for separate SATA SSDs. The usual downside is support. Agree in advance what counts as the 'system' disk, how recovery is done on failure, and whether you use mirrored NVMe for boot or a separate module for the OS.

Endurance and temperature

For NVMe pay attention to write endurance (TBW or DWPD) and how it matches your workload. Ask the supplier for the endurance class and how wear is treated under warranty.
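TBW and DWPD describe the same endurance budget in different units, so one can be converted into the other. The sketch below uses the usual relation TBW = DWPD x capacity x 365 x warranty years; the function names and sample values are assumptions, and the warranty period must come from the supplier.

```python
def tbw_from_dwpd(dwpd: float, capacity_tb: float, warranty_years: float) -> float:
    """Total terabytes written implied by a DWPD rating over the warranty period."""
    return dwpd * capacity_tb * 365 * warranty_years


def years_until_rated_wear(tbw: float, daily_writes_tb: float) -> float:
    """How long the rated endurance lasts at a measured daily write volume."""
    if daily_writes_tb <= 0:
        raise ValueError("daily_writes_tb must be positive")
    return tbw / (daily_writes_tb * 365)


# Hypothetical drive: 1 DWPD, 3.84 TB, 5-year warranty, 2 TB written per day.
rated = tbw_from_dwpd(dwpd=1, capacity_tb=3.84, warranty_years=5)
print(f"rated endurance: {rated:.0f} TB written")
print(f"lasts about {years_until_rated_wear(rated, daily_writes_tb=2.0):.1f} years")
```

Take the daily write volume from monitoring of the current datastores rather than from an estimate; that is the figure that decides which endurance class to request.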

A second practical point is temperature. In a dense chassis a drive can throttle, and the "fastest" NVMe can drop performance under sustained peaks. For SLA this looks like random latency spikes, so correct drive cages, airflow and temperature monitoring at server level are important.

Describe NVMe in the proposal clearly: quantity, capacity, interface, endurance class, purpose (boot or datastore), hot-swap requirements and monitoring.

Network adapters and channel redundancy: what matters for SLA

Downtime in virtualization more often comes from a failed port, a poor redundancy scheme or vague procurement requirements than from a lack of gigabits. Describe the network so it's clear what is redundant, how traffic is separated and what exactly must be supplied and configured.

A simple port scale covers most cases:

2 ports — minimum for single-link resilience.
4 ports — practical standard, to separate roles across two pairs.
6–8 ports — makes sense for separate networks for storage, backup, DMZ or strict security requirements.

Choose 10GbE vs 25GbE by two criteria: real peaks during migrations/backups/replication and readiness to support the full chain (switches, transceivers/cables, spare parts). 25GbE provides headroom only if the switching fabric is prepared.

Split traffic at least logically: management, storage, VM traffic, migrations. Then backup traffic won't saturate the network and kill availability.

Be cautious with bonding/LACP: active-active gives throughput but adds configuration risks (LACP, hashing, port profiles). For SLA it's often safer to use active-standby at the hypervisor level and only enable aggregation where the team has proven expertise.

A separate management port (iLO) is not an "extra". It speeds recovery: you can access the server if the OS fails, check logs, reboot the host and avoid a site visit.

To avoid "any adapter of suitable speed", lock down: speed and number of physical ports per server, redundancy scheme (active-standby or LACP and where it's configured), optics/cable requirements and a separate management port with access details.
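The redundancy and role separation described in this section can be checked mechanically against a port map. The sketch below is hypothetical throughout: the port names, roles and switch labels are invented for illustration, and the check only looks at upstream switches, not at which physical adapter a port belongs to.

```python
from collections import defaultdict

# Hypothetical port map for one host: (port name, traffic role, upstream switch).
PORTS = [
    ("nic1-p1", "vm", "sw-a"),
    ("nic2-p1", "vm", "sw-b"),
    ("nic1-p2", "storage", "sw-a"),
    ("nic2-p2", "storage", "sw-a"),  # both storage ports land on the same switch
]


def single_points_of_failure(ports):
    """Return roles whose ports do not reach at least two different switches."""
    switches_by_role = defaultdict(set)
    for _name, role, switch in ports:
        switches_by_role[role].add(switch)
    return sorted(role for role, sw in switches_by_role.items() if len(sw) < 2)


print(single_points_of_failure(PORTS))  # prints ['storage']
```

Here the storage role has two ports but still depends on one switch, which is exactly the case where two ports do not add up to redundancy.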

Power and cooling redundancy: simple solutions against downtime

Many "sudden outages" start with power. Redundant PSUs and fans are among the cheapest ways to reduce the risk of host shutdown and protect SLA.

A 1+1 PSU scheme means the server survives a single PSU failure without shutting down. But it doesn't help if there is a single power feed to the rack, one circuit breaker or one PDU: power will be lost to both PSUs at once. Real redundancy requires two independent power paths A/B.

A practical setup:

2 PSUs in 1+1 mode (same power rating).
2 independent A/B feeds to the rack, separate breakers.
2 PDUs (or PDUs on different sources), separated across inputs.
Documented cabling: which cord goes to which input and PDU.

Size power for realistic peaks, not for excessive headroom. If a server truly consumes 450–550 W, larger PSUs are needed only for specific CPU, RAM, NVMe and card combinations. A common mistake is "we bought larger PSUs" and then hit rack or UPS limits.
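In 1+1 mode the sizing question is whether one PSU can carry the whole server alone at peak. A minimal check, assuming a measured peak draw and an 80% load ceiling on the surviving PSU (the ceiling and the 800 W and 500 W ratings are illustrative values, not vendor figures):

```python
def psu_load_fraction(peak_watts: float, psu_rating_watts: float) -> float:
    """Share of one PSU's rating used when it carries the server alone."""
    if peak_watts <= 0 or psu_rating_watts <= 0:
        raise ValueError("power values must be positive")
    return peak_watts / psu_rating_watts


def survives_psu_failure(peak_watts: float, psu_rating_watts: float,
                         max_load_fraction: float = 0.8) -> bool:
    """True if a single remaining PSU stays under the chosen load ceiling."""
    return psu_load_fraction(peak_watts, psu_rating_watts) <= max_load_fraction


# 550 W peak is the upper end of the range mentioned above.
for rating in (500, 800):
    ok = survives_psu_failure(550, rating)
    print(f"{rating} W PSU, 550 W peak: load {psu_load_fraction(550, rating):.0%}, ok={ok}")
```

The same peak figure, multiplied by the number of hosts, is what should be compared against rack and UPS limits.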

To fit maintenance regulations, PSUs and fans should support hot-swap so that servicing doesn't cause downtime. In the proposal specify number and power of PSUs, 1+1 mode, cable/connector types and the requirement for two independent A/B inputs on the site side.

Also add operational discipline: power and temperature monitoring, checks that both PSUs are on different feeds, and a test procedure for simulating feed failure (disable A, then B) without stopping the host.
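The check that both PSUs sit on different feeds can be run against the cabling documentation before anyone pulls a cable. The record format, host names and PDU labels below are hypothetical; the sketch only shows the idea of validating the documented cabling.

```python
# Hypothetical cabling record: host -> {PSU slot: (feed, PDU)}.
CABLING = {
    "host-01": {"psu1": ("A", "pdu-a1"), "psu2": ("B", "pdu-b1")},
    "host-02": {"psu1": ("A", "pdu-a1"), "psu2": ("A", "pdu-a2")},  # same feed twice
}


def hosts_without_ab_separation(cabling):
    """Return hosts whose PSUs do not use two different feeds and two different PDUs."""
    bad = []
    for host, psus in cabling.items():
        feeds = {feed for feed, _pdu in psus.values()}
        pdus = {pdu for _feed, pdu in psus.values()}
        if len(psus) < 2 or len(feeds) < 2 or len(pdus) < 2:
            bad.append(host)
    return sorted(bad)


print(hosts_without_ab_separation(CABLING))  # prints ['host-02']
```

A clean result here does not replace the physical feed-failure test; it only catches documented mistakes early.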

How to build a configuration without inflating the proposal

The most reliable way to avoid bloat is to tie every option to a downtime risk and an explicit assumption. If an option neither reduces failure probability nor shortens recovery, remove it.

First record SLA and assumptions: allowable downtime per month, maintenance windows, RTO/RPO, and what counts as downtime (one VM, a host or the cluster). Then describe load with simple numbers: the current VM count and the forecast for 12–18 months, total vCPU and RAM, and the 3–5 most critical services.

Next — the resilience strategy. A single host is only for non-critical tasks. For the public sector you'll usually need a cluster and at least N+1 to survive one server failure without service interruption.
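N+1 can be verified with a simple memory check: after the largest host is lost, the remaining hosts must still hold all VM memory. The sketch below assumes RAM is the limiting resource and reserves 10% per host for the hypervisor; both the reserve and the sample sizes are assumptions to replace with your own figures.

```python
def survives_host_failures(host_ram_gb, vm_ram_gb, failures=1, reserve_fraction=0.1):
    """True if all VM memory still fits after losing the largest `failures` hosts."""
    if failures >= len(host_ram_gb):
        return False
    # Worst case: the hosts with the most memory are the ones that fail.
    surviving = sorted(host_ram_gb)[: len(host_ram_gb) - failures]
    usable = sum(surviving) * (1 - reserve_fraction)
    return vm_ram_gb <= usable


hosts = [512, 512, 512]  # hypothetical three-node cluster, GB of RAM per host
print(survives_host_failures(hosts, vm_ram_gb=900))   # fits on two hosts
print(survives_host_failures(hosts, vm_ram_gb=1000))  # does not fit
```

Run the check against the 12–18 month forecast as well as today's load, otherwise N+1 quietly disappears as the VM count grows.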

With that written, hardware requirements become simpler:

Storage: choose RAID by rebuild time. For hypervisor OS — mirroring. For VM datastore — RAID with hot-plug and hot-spare if needed.
Network: count ports by role (management, VM traffic, storage, migrations) and provide two independent paths.
Power: 1+1 PSUs and two inputs. If a second input isn't available, note that as a risk to the SLA.

In the proposal, add a short reason next to each option (one line: "needed to survive disk/PSU/link failure"). That's better than stocking extra controllers and drives with no clear role.

Common mistakes and procurement traps for the public sector

The most frequent problem is "choose the maximum" without linking to SLA. The budget grows, but downtime risk remains because actual failure points aren't addressed.

Overpayment often goes to network and disks. For example, specifying 25GbE everywhere while the real bottleneck is storage or backup. Or installing NVMe for all VMs without specifying replacement procedures and SLAs for failed drives.

Typical traps:

a single path to storage or a single power feed (any failure becomes downtime);
RAID described in general terms, no hot-spare and no rebuild time assessment;
not enough ports for separate networks, ending up funneling everything through one pipe;
speeds (10/25GbE) specified but no redundancy scheme or switch port requirements;
two PSUs present but both connected to the same PDU or UPS.

Also be careful mixing server requirements and site infrastructure. A correctly built server won't help if switches, PDUs and UPS don't provide the needed redundancy. In the proposal separate "what is included in the server" and "what the site must provide".

Avoid vague terms like "no worse than", "equivalent" or "analog". Without specifics they turn into "almost the same". It is better to specify measurable things: number of ports, interface types, presence of hot-spare, number of PSUs and separation of inputs, and what's included in commissioning and support.

Short checklist before finalizing configuration

Before sign-off, check that the SLA has been translated into clear technical requirements.

SLA is measurable: service, availability threshold, RTO/RPO, and who records the incident.
No single point of failure where it affects SLA: disks, network, power.
It's clear where VMs, databases and logs live, where backups go and how restore is tested.
Compatibility confirmed in advance: RAID controller and modes, disk types, NICs, transceivers/cables, PSUs.
An operational plan exists: what can be changed without downtime (disk, PSU, NIC) and what requires a window (e.g. firmware updates).

If procurement is for the public sector, lock into documents who provides 24/7 support, how fast replacements are delivered and where spare parts are stored.

Example: virtualization cluster for an agency without extra options

Inputs: 3 identical HPE ProLiant DL380 Gen11 hosts, typical services (AD/DNS, file services, 1–2 business applications, monitoring, backups). Requirement: survive one host failure without data loss and have a clear recovery procedure.

Don't start with top CPU clock speeds; size core count and memory capacity first. In the public sector predictability and headroom matter more than peak frequency. A typical approach is dual mid-range CPUs, with the savings invested in RAM. In many cases memory limits are hit earlier than CPU limits, so plan extra RAM.

Disk logic: separate boot and data, both on hot-swap drives. Boot: mirrored RAID 1, so a system disk failure doesn't affect the datastore. VM storage: RAID 10 for mixed loads and fast rebuilds, or RAID 6 if capacity matters and higher latency is acceptable. Hot-spare makes sense if you can't guarantee same-day drive replacement.

Network for SLA relies on two independent paths: not just two ports but two switches, plus role separation (management, migrations, storage/replication, VM access).

Short spec (what really affects downtime):

3 hosts: identical CPU, RAM with headroom for N+1, free slots for growth.
Boot: 2x SSD in RAID 1 (hardware RAID), hot-plug.
VM datastore: RAID 10 or RAID 6 + hot-spare as required, controller with cache and cache protection.
Network: minimum 2x 10/25GbE per host + separation across two switches, VLANs by role.
Power: 2x PSU 1+1, two power inputs, A/B scheme at PDU/UPS side.

When an integrator prepares a proposal, ask them to specify these parameters rather than long lists of secondary options that don't change downtime risk.

Next steps: validate, agree and organize support

Before purchase and installation, run a short test on one node or in a lab. The goal isn't synthetic benchmarks but to confirm that the chosen options provide the desired availability and a clear recovery path.

Quick lab checks

Test scenarios that break SLA:

reboot and firmware updates (downtime, rollback availability);
disk failure (degradation behavior, rebuild duration, VM impact);
PSU or input failure (can the server sustain load, are alarms raised);
loss of a port or switch (is management and storage access preserved);
disaster recovery (who replaces components and how fast, what spares are needed).
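The results of these scenarios are easiest to compare when each one records the outage users actually saw and is checked against the RTO. The sketch below assumes a 30-minute RTO and invented measurements; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    scenario: str
    service_down_minutes: float  # outage seen by users, 0 if the service stayed up


def rto_violations(results, rto_minutes):
    """Return the scenarios whose measured outage exceeded the RTO."""
    return [r.scenario for r in results if r.service_down_minutes > rto_minutes]


# Hypothetical lab measurements.
results = [
    DrillResult("disk failure", 0),
    DrillResult("PSU failure", 0),
    DrillResult("switch loss", 3),
    DrillResult("firmware update with rollback", 45),
]
print(rto_violations(results, rto_minutes=30))  # prints ['firmware update with rollback']
```

A scenario that exceeds the RTO either needs a planned maintenance window or a configuration change before the proposal is finalized.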

After tests, lock into the proposal the items that must be mandatory (RAID, power redundancy, minimum ports and interface speeds, disk class) and what can be accepted as equivalent with matching specs.

Support and responsibilities

Define roles in advance to avoid disputes during incidents: who is responsible for hardware and replacements, who approves firmware versions and the update schedule, where alerts are sent and who is on duty, and where boundaries lie between hypervisor, network and storage.

If you need an alternative, compare vendors using the same SLA metrics: recovery time, redundancy schemes, parts availability and clarity of support — not feature counts.

As a system integrator, GSE.kz typically engages at this point: helping align configuration with public sector rules, organizing test plans and providing 24/7 support through its service network. If localization or supply-chain transparency matters, consider locally produced GSE S200 Series server solutions.