Where a high SLA for critical systems begins

A high SLA doesn't start with picking the "most reliable" model. It starts with an honest answer to two questions: what do you consider an outage, and what losses does it cause? For critical applications, define in advance which failures are acceptable and which are not. Those decisions then shape the architecture, the procedures and the support agreement.

Most often SLAs are "broken" not by rare catastrophes but by mundane events: power loss in a rack, a controller reboot, a SAN network drop, incorrect multipath configuration, or an administrator performing a risky action during working hours. Choosing a storage array like Hitachi VSP E-Series for critical applications therefore only makes sense together with a well-thought-out deployment plan: redundant power and ports, correct zoning, a clear change plan and clearly assigned responsibilities.

Identical arrays in two companies can produce very different results. In one, a failure is nearly invisible because there are two independent paths to data and updates follow procedure. In another, there's one switch, one path, "temporary" settings and updates during peak hours. The problem then looks like "the storage failed", although the root cause is the surrounding architecture and operational discipline.

Before discussing IOPS and capacity, lock down basic availability and recovery parameters. At minimum you need:

RPO (how much data loss is acceptable),
RTO (how fast recovery must be),
maintenance windows (when changes are allowed),
acceptable risks (which rare events the business can accept).

Useful questions for the business so the discussion is about SLA, not just "speed":

What counts as downtime: application unavailability, degraded performance, loss of some functions?
How many minutes of downtime per month are truly acceptable for each system?
What data loss is acceptable: 0 seconds, 5 minutes, 1 hour?
Are there "peak" periods when any work is forbidden (reporting, payment windows)?
Who decides to stop, roll back or trigger the disaster plan?

If, for example, a financial system requires 99.99% availability, then "almost always works" isn't enough. You need an agreed failover plan, clear roles and a change practice where every action can be repeated and verified.

How to turn SLA into concrete technical requirements

An SLA is only useful when you can translate it into numbers and tests. Start simply: what downtime is acceptable at all. For example, 99.99% availability is about 52 minutes of downtime per year, while 99.9% is roughly 8 hours 46 minutes. It's important to break these minutes and hours down by cause: failures, degradation, recovery and planned maintenance.
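The arithmetic behind these figures is easy to reproduce. The sketch below is illustrative only: the function names are made up for this example, and a 365-day year is assumed (leap years are ignored).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a 365-day year


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_minutes * (100 - availability_percent) / 100


def format_minutes(minutes: float) -> str:
    """Render a number of minutes as whole hours and minutes."""
    hours, rest = divmod(round(minutes), 60)
    return f"{hours} h {rest} min"


for target in (99.9, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% -> {budget:.1f} min/year ({format_minutes(budget)})")
```

Passing a different `period_minutes` (for example, 30 * 24 * 60) gives a monthly budget, which is often easier to discuss with the business than a yearly one.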

Separate service availability from storage availability. A storage array can be up while the business service is down due to application, SAN network or cluster issues. Therefore requirements should be recorded at two levels: what the storage platform provides (fault tolerance, component replacement times, maintenance modes), and what the whole system needs (including servers, network and processes).

Next, define RPO and RTO. RPO covers data loss (how many minutes of data can be lost); RTO covers the time to return to operation. If RPO approaches zero, you usually need synchronous replication and an architecture without a single point of failure. If RTO is strict, it is not only the copies that matter but also the playbook: how services are brought up, who performs the switch, and where manual steps exist.

To make the SLA testable in operations, agree in advance on metrics and thresholds. It's usually sensible to monitor:

95th/99th percentile read/write latency;
time spent in degraded mode (after a component failure);
actual time to return to normal performance;
speed and success rate of recovery from backups;
response and on-site arrival times for incidents.
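As an illustration of the first metric, percentile latency can be computed from raw samples and compared with agreed thresholds. This is a minimal sketch using the nearest-rank method; the function names and the threshold values in the usage example are assumptions, not recommendations, and real monitoring tools may use a different interpolation.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def check_latency(samples_ms: list[float], thresholds_ms: dict) -> dict:
    """Return {percentile: (observed_ms, limit_ms, within_limit)}."""
    report = {}
    for pct, limit in thresholds_ms.items():
        observed = percentile(samples_ms, pct)
        report[pct] = (observed, limit, observed <= limit)
    return report


# Hypothetical write latencies in milliseconds and hypothetical limits.
latencies = [0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.8, 6.2]
for pct, (observed, limit, ok) in check_latency(latencies, {95: 2.0, 99: 5.0}).items():
    print(f"p{pct}: {observed} ms (limit {limit} ms) -> {'ok' if ok else 'BREACH'}")
```

Percentiles are used instead of averages because a handful of slow requests, which is what applications actually time out on, barely moves the mean.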

Also define what counts as planned work. Do firmware updates, battery replacement, shelf expansion or DR testing count? For critical systems on Hitachi VSP E-Series, teams commonly want non-disruptive servicing and clear maintenance windows so planned actions don't turn into unplanned downtime.

Redundancy: what should survive a failure without downtime

A high SLA starts from a simple rule: any single failure should not stop the service. For storage arrays like Hitachi VSP E-Series this means more than "two controllers" — it requires duplication of the entire path from the application to the disks and back.

What exactly should be duplicated

Look at the system as a chain. Any element without a pair or an alternate path is a future outage.

Controllers: two controllers (active-active or with automatic takeover), cache mirroring and cache protection (batteries/flash) as required by the platform.
Ports and internal paths: front-end and back-end, buses and shelf connections so that failure of one line doesn't make part of the disks inaccessible.
Power: at least two independent power supplies in the array, preferably fed from different PDUs/lines in the rack.
Cooling: redundant fans and predictable behavior when a module fails (no overheating or performance collapse).
Storage network access: two independent SAN fabrics (Fabric A and Fabric B), separate switches, ports and cables.

Even a perfect array won't help if hosts have a single path. Multipath must be configured and tested: two HBAs (or two network ports for iSCSI/NVMe-oF), different slots/buses, different switches. It's important not just to enable multipath but to verify that I/O continues without hangs and without manual intervention when one path fails.

How to check for SPOFs in the design

A useful technique is to perform a "thought experiment" by removing each element and seeing what happens to access to LUNs/volumes.

Disable one controller on the diagram: does access remain, is cache preserved, does performance drop below acceptable levels?
Remove one SAN switch: is there a path through the second fabric?
Disconnect one HBA/port on a server: does the application keep working?
Power off one PDU: will both array PSUs or both server PSUs remain powered?
Cut one cable (FC/Ethernet): is there a physically separate second cable that doesn't follow the same route?

Practical example: a database server connected to both controllers but with both cables going to the same SAN switch — failure of that switch will look like "disks suddenly disappeared". On paper everything is redundant, but in reality a SPOF remains.
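The same thought experiment can be run mechanically on a simple model of the data path. The sketch below treats the design as an undirected graph, removes one component at a time and checks whether the host can still reach the volume. All component names are hypothetical, and the model deliberately ignores power, cabling routes and performance, which still need a manual review.

```python
from collections import deque


def reachable(links, start, goal, removed):
    """Breadth-first search over undirected links, skipping one removed component."""
    if removed in (start, goal):
        return False
    neighbours = {}
    for a, b in links:
        if removed in (a, b):
            continue
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def find_spofs(links, host, volume):
    """Return components whose individual removal cuts the host off from the volume."""
    components = {node for link in links for node in link} - {host, volume}
    return sorted(c for c in components if not reachable(links, host, volume, c))


# The flawed design from the example: both HBAs are cabled to the same switch.
flawed = [
    ("db_server", "hba1"), ("db_server", "hba2"),
    ("hba1", "switch_a"), ("hba2", "switch_a"),
    ("switch_a", "ctl1"), ("switch_a", "ctl2"),
    ("ctl1", "lun"), ("ctl2", "lun"),
]
print(find_spofs(flawed, "db_server", "lun"))
```

Moving the second HBA and the second controller to a separate switch should make the list come back empty, which is the result a reviewed design is expected to produce.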

Protecting data inside the array: more than RAID

RAID is often seen as the main answer for reliability. For high SLAs that's not enough: you must know what happens to data during a disk failure, during a rebuild and during maintenance. This is especially important when Hitachi VSP E-Series are used for databases, virtualization and financial transactions.

Choose RAID level by workload profile. For heavy random writes (e.g., OLTP), predictable latency matters most. For very large drives and capacities, the main risk is long rebuild times, which increase the chance of a second failure before the rebuild completes. In practice the winner is not the "fastest RAID" but the one that survives rebuild without data loss or severe performance degradation.

Hot spares are not just a checkbox. Important factors are how many spare drives are available, whether they match class and capacity, and how reconstruction is configured: rebuild priority, load limits on shelves, and policies for multiple failures. Poor rebuild policies can create hidden outages: the system may be nominally up, but applications start failing due to timeouts.

What to check before purchase and in operation

Practical items often forgotten:

Snapshots: frequency and retention depth. Snapshots should protect against logical errors but not overload the array with metadata and extra writes.
Encryption: where keys are stored, who has access, and how recovery works after controller replacement or failure.
Drive lifecycle: scheduled replacement by age to avoid a series of failures from a single batch.
Degradation monitoring: thresholds, alerts and engineer actions before rebuild starts and before users complain.

Simple example: 5-minute snapshots help recover from accidental deletes but add load. Measure the impact on peak windows ahead of time and agree a compromise with the application owner.

Replication and DR schemes: choose by RPO/RTO, not by fashion

When considering Hitachi VSP E-Series for critical apps, start not from technology names but from two numbers: RPO (how much data can be lost) and RTO (how quickly you must be back). Everything else is a way to meet these numbers.

Local resiliency (within a single site) covers controller, shelf, port and SAN path failures, sometimes even partial network outages. DR (a second site) is needed when you must account for fire, power loss at the site, building unavailability or a city-wide outage.

Synchronous replication is appropriate when RPO is near zero. But it demands low latency and stable links: every write must be acknowledged on both sides, otherwise applications will wait. So verify real latency, jitter and channel redundancy, not just contract bandwidth.
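A rough lower bound on the latency that synchronous replication adds can be estimated from distance alone. The sketch below assumes light covers about 200 km per millisecond in optical fiber (a common approximation, not a figure from any vendor), assumes one round trip per write, and uses hypothetical function names. Real links add equipment delay, and the fiber route is usually longer than the straight-line distance, so measured values will be higher.

```python
FIBER_KM_PER_MS = 200.0  # assumption: ~200,000 km/s propagation speed in fiber


def min_round_trip_ms(fiber_route_km: float) -> float:
    """Propagation-only round-trip time over the given fiber route length."""
    if fiber_route_km < 0:
        raise ValueError("distance cannot be negative")
    return 2 * fiber_route_km / FIBER_KM_PER_MS


def sync_write_latency_ms(local_write_ms: float, fiber_route_km: float,
                          overhead_ms: float = 0.0) -> float:
    """Lower-bound write latency: local write + one round trip + measured overhead.

    Assumes a single round trip per acknowledged write; some protocols need more.
    """
    return local_write_ms + min_round_trip_ms(fiber_route_km) + overhead_ms


for km in (10, 100, 500):
    print(f"{km} km: at least {sync_write_latency_ms(0.5, km):.2f} ms per write")
```

An estimate like this only shows whether synchronous replication is worth testing at all; the decision itself should rest on latency and jitter measured on the actual channel.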

Asynchronous replication is easier on networks but introduces measurable RPO. The achievable RPO depends on three things: the data change rate, the transfer window, and any time replication is paused for maintenance or failures. Size it for the worst case, not the average: always account for peaks (e.g., end-of-day) and the queue growth they cause.
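To see how change rate, link throughput and replication pauses interact, a small simulation can help. The sketch below is a simplified model, not a vendor formula: it works at one-minute granularity, assumes data is replicated strictly oldest-first, and reports the largest age of data that has not yet reached the second site. The function and parameter names are hypothetical.

```python
from collections import deque


def worst_case_lag_minutes(writes_mb_per_min, link_mb_per_min,
                           paused_minutes=frozenset()):
    """Return the largest age, in minutes, of unreplicated data during the run."""
    pending = deque()  # entries of [minute_written, megabytes_left], oldest first
    worst = 0
    for minute, written in enumerate(writes_mb_per_min):
        if written > 0:
            pending.append([minute, written])
        # No data is sent while replication is paused.
        budget = 0 if minute in paused_minutes else link_mb_per_min
        while pending and budget > 0:
            sent = min(budget, pending[0][1])
            pending[0][1] -= sent
            budget -= sent
            if pending[0][1] <= 0:
                pending.popleft()
        if pending:
            # The oldest unsent data may date from the start of its minute,
            # so count that minute in full (an upper bound at this granularity).
            worst = max(worst, minute - pending[0][0] + 1)
    return worst


steady = [10] * 10  # hypothetical: 10 MB of changes every minute
print(worst_case_lag_minutes(steady, link_mb_per_min=10))
print(worst_case_lag_minutes(steady, link_mb_per_min=10, paused_minutes={3, 4}))
```

Feeding the model a change profile taken from production (including the end-of-day peak) gives a more honest RPO estimate than dividing average change rate by contract bandwidth.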

Choosing active-active or active-passive often comes down to operations.

Active-passive is usually simpler for applications and support: one site runs, the other is ready to take over.

Active-active lowers RTO but requires discipline: identical configurations, split-brain control and clear rules for conflict resolution.

Short practical minimum for a DR plan:

Record RPO/RTO per service, not just "for the system as a whole".
Describe failover triggers and who decides, to avoid false switches due to application faults.
Test failover on a schedule and after major changes.
Log results: times, data loss, manual steps and failures.
Separate scenarios: site failure, storage failure, application failure, human error.

Example: a payment system requires RTO 30 minutes and RPO 0–5 minutes. A combined approach often emerges: local resiliency for small failures and asynchronous replication to a second site, with regular tests to confirm that RPO stays within the agreed limit even at peak load.

Firmware update policy: avoid accidental outages

You don't update array firmware for the sake of it. For high-SLA systems it's a way to fix bugs and vulnerabilities, but it's also a common source of unplanned downtime if done without rules. For critical Hitachi VSP E-Series setups define when you update, how you check compatibility and what you do if something goes wrong.

The approach is usually mixed. Scheduled updates (e.g., quarterly) provide predictability and normal maintenance windows. Emergency updates are required for confirmed risks: a critical vulnerability, a bug already observed, or a vendor recommendation for your configuration. The policy should define what counts as an emergency update and who decides.

Before any work, check compatibility not only at the array level but across the stack. Problems often arise from the combination: multipath versions on hosts, HBA drivers, SAN switch firmware, zoning parameters and FC adapter firmware.

Also specify the order in which components are updated: controllers, drives, interface modules. A "safe" controller update can fail because of outdated firmware in a specific batch of drives or a module.

To keep maintenance windows from eating into SLA, define three things in advance:

rollback plan and recovery point (what to do if performance degrades or path errors appear);
stop criteria (which symptoms cause stopping work and rolling back);
responsibility matrix: who prepares the plan, who executes, who approves the result.
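Stop criteria work best when they are written as explicit thresholds that anyone on the bridge can evaluate the same way. The sketch below shows one possible representation; the metric names and limits are hypothetical, and a missing measurement is treated as a violation on the assumption that work should not continue blind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopCriterion:
    metric: str        # key expected in the observed measurements
    limit: float       # the highest acceptable value
    description: str   # what the engineer sees when the criterion fires


def violated(criteria, observed):
    """Return criteria whose metric is missing or above its limit."""
    hits = []
    for criterion in criteria:
        value = observed.get(criterion.metric)
        if value is None or value > criterion.limit:
            hits.append(criterion)
    return hits


# Hypothetical criteria agreed before the maintenance window.
criteria = [
    StopCriterion("p99_write_latency_ms", 5.0, "write latency above agreed limit"),
    StopCriterion("failed_paths", 0, "at least one host path is down"),
    StopCriterion("minutes_elapsed", 90, "window overrun"),
]
observed = {"p99_write_latency_ms": 3.1, "failed_paths": 2, "minutes_elapsed": 40}

for criterion in violated(criteria, observed):
    print(f"STOP and roll back: {criterion.description}")
```

Writing the criteria down before the window removes the temptation to "wait a little longer" when the numbers start to drift.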

In practice it's useful when the integrator produces a scenario and risk list, operations confirms the window and application availability, and the business accepts return-to-normal based on pre-agreed metrics.

Service requirements: choosing the right 24/7 support

24/7 support is not "always required" but should be matched to specific risks. For critical Hitachi VSP E-Series systems decide in advance which events must be handled at night/weekends and which can wait until business hours.

Typically the following require round-the-clock handling: complete loss of volume access, degradation with risk of a second failure (e.g., second controller or path loss), replication failures that threaten RPO, and any alerts that can lead to service outage. Planned work, pool expansion, workload moves and reports can usually wait for business hours.

Response vs. recovery: don't confuse metrics

Contractually separate SLA for response and SLA for recovery.

Response – time until an engineer is on the line and begins diagnostics.

Recovery – time until the service is back in working condition. A temporary workaround counts if it removes downtime.

Make sure the contract specifies recovery, otherwise you may get a fast chat reply but hours of downtime.
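The difference between the two metrics is easy to show with incident timestamps. In this minimal sketch both times are counted from the moment the incident was opened; the function names, the timestamps and the contractual limits are all hypothetical.

```python
from datetime import datetime, timedelta


def measure_incident(opened, engineer_engaged, service_restored):
    """Return (response_time, recovery_time), both counted from incident start."""
    if engineer_engaged < opened or service_restored < opened:
        raise ValueError("timestamps cannot precede the incident start")
    return engineer_engaged - opened, service_restored - opened


def breaches(response, recovery, response_limit, recovery_limit):
    """List which contractual limits were exceeded."""
    result = []
    if response > response_limit:
        result.append("response")
    if recovery > recovery_limit:
        result.append("recovery")
    return result


# Hypothetical night incident: the engineer answers quickly, recovery is slow.
response, recovery = measure_incident(
    opened=datetime(2024, 3, 5, 2, 0),
    engineer_engaged=datetime(2024, 3, 5, 2, 10),
    service_restored=datetime(2024, 3, 5, 5, 30),
)
print(response, recovery)
print(breaches(response, recovery,
               response_limit=timedelta(minutes=15),
               recovery_limit=timedelta(hours=2)))
```

In this example the response target is met while the recovery target is missed, which is exactly the case a response-only contract would record as a success.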

Spares, monitoring, escalation and reporting

Before signing, confirm practical items that affect incident outcomes:

where spare parts are stored (city, warehouse, service center) and how delivery time is confirmed;
what remote diagnostics include: log access, telemetry, who monitors alerts and how;
escalation procedure: levels, contacts and how incident start and handover times are recorded;
on-site dispatch rules: engineer arrival windows and access conditions;
post-incident report: cause, actions taken and measures to prevent recurrence.

Practical example: replication shows a red alert at night but users continue working. If monitoring isn't 24/7 the problem is seen in the morning and you lose RPO. If monitoring exists but there are no local spares, recovery is limited by logistics. So demand measurable items and a clear process, not vague promises.

Step-by-step configuration selection for high SLA

To actually meet a high SLA, start not from model and disk quantities but from how the application runs: workload profile, peaks, allowed minutes of downtime per year and maintenance windows. Fix growth forecasts for 12–36 months up front, otherwise SLA degrades simply due to lack of capacity.

Then choose the target SLA and convert it into money. When the business can see the cost per minute of downtime, it's easier to agree on redundancy level, DR scheme and support.
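A back-of-the-envelope conversion might look like the sketch below. It assumes a flat cost per minute of downtime and a 365-day year, both simplifications; the cost figure in the usage example is invented and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def yearly_downtime_cost(availability_percent: float, cost_per_minute: float) -> float:
    """Loss per year if the whole downtime budget of the target is used up."""
    downtime_minutes = MINUTES_PER_YEAR * (100 - availability_percent) / 100
    return downtime_minutes * cost_per_minute


def value_of_upgrade(current_percent: float, target_percent: float,
                     cost_per_minute: float) -> float:
    """Upper bound on yearly losses avoided by moving to a stricter target."""
    return (yearly_downtime_cost(current_percent, cost_per_minute)
            - yearly_downtime_cost(target_percent, cost_per_minute))


# Hypothetical: one minute of downtime costs the business 1,000 currency units.
print(round(value_of_upgrade(99.9, 99.99, cost_per_minute=1000)))
```

The result is a ceiling for what extra redundancy, DR and support are worth per year, which gives the business a concrete number to weigh against the quote.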

Practical sequence:

gather application requirements: IOPS/latency, capacities, maintenance windows, growth, criticality by service;
define SLA and acceptable downtime, agree a "risk budget";
fix RPO/RTO and choose replication/DR accordingly;
design end-to-end fault tolerance: hosts, SAN, array, power, sites;
agree firmware update policy, test schedule for failover and acceptance criteria.

Don't limit yourself to "an array with redundant modules". SLAs usually fail at interfaces: a single SAN switch, cables following the same route, two PSUs on the same PDU, unaccounted port or queue limits on the host.

Example: a payment system needs near-zero RPO and RTO up to 15 minutes. That usually means synchronous replication within achievable latency between sites, plus a rehearsed application and network failover plan. If sites are far apart, synchronous replication may be impossible. In that case, be honest about it: choose asynchronous replication with an RPO the business accepts, and cover the remaining gap with operational procedures.

Also agree how you will operate with updates and incidents. Contracts should contain verifiable items, not generalities:

firmware update windows and rollback procedures;
DR test frequency and what counts as a successful failover;
response and recovery times (distinct metrics);
availability of spares and engineers 24/7, escalation routing;
measurable acceptance criteria: latency, throughput and behavior under node failure.

If you engage an integrator, ask for a single document with the matrix: "requirement -> technical solution -> test." It quickly shows whether the SLA is covered in practice, not just on paper.
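Such a matrix can be kept in any form, but it is useful to check it mechanically for gaps. The sketch below is one possible representation; the class name, the helper and the example rows are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatrixRow:
    requirement: str
    solution: str = ""
    test: str = ""


def uncovered(rows):
    """Return requirements that lack a technical solution or a test."""
    return [row.requirement for row in rows
            if not row.solution.strip() or not row.test.strip()]


# Hypothetical rows; a real matrix comes from the agreed requirements.
matrix = [
    MatrixRow("Survive loss of one SAN switch",
              "Two independent fabrics, multipath on hosts",
              "Power off switch A under load, I/O continues"),
    MatrixRow("RTO 15 minutes on primary site loss",
              "Replication to second site with failover runbook"),
    MatrixRow("RPO near zero for transactions"),
]

for requirement in uncovered(matrix):
    print(f"Not covered: {requirement}")
```

A requirement with a solution but no test is flagged just like one with nothing at all, because an untested solution covers the SLA only on paper.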

Common mistakes when choosing storage for high SLA

Even an expensive array won't prevent downtime if requirements and architecture are based on gut feeling. Below are common mistakes when selecting Hitachi VSP E-Series for critical apps and when designing storage for high SLA in general.

Architecture and calculation mistakes

One category is when redundancy exists on paper but not in the real data path. You buy a system with two controllers and then leave a single SAN switch, one HBA on a server or one path in multipath settings. The result: failure of a "small" element above the array stops the whole service.

Another frequent mistake is not fixing RPO and RTO as concrete numbers. The phrase "it must always work" leads to disputes during incidents: some tolerate minutes, others require synchronous replication and instant failover.

A third is estimating performance from a test environment. If tests don't reproduce production IOPS/latency profiles, you will get wrong expectations about response and recovery speed.

Operational and support mistakes

High SLA requires change discipline. Without firmware policies, maintenance windows and rollback criteria, an update becomes an experiment and a risk of unplanned downtime.

DR procedures are often underestimated. Replication may be configured, but without regular recovery tests (and checks of access rights, networks, DNS and startup order) a real disaster exposes issues too late.

Saving on support almost always costs more later. In a 24/7 environment you need agreed support levels and responsibility for on-site response and spares, otherwise hours are lost to waiting and approvals.

Short example: a financial system has a 2-hour nightly batch. Without a fixed RTO, after a failure you might "recover by morning", but the business misses its processing window. If RTO is fixed at 15 minutes, replication, spare capacity and support requirements become concrete and testable.

Example scenario: a payment system with 99.99% requirement

Imagine a bank payment system running 24/7 with an SLA of 99.99% — about 52 minutes of downtime per year, including planned work. There are two data centers: primary and secondary. Goal: failure of a node, SAN switch or an entire site must not cause a long incident.

For Hitachi VSP E-Series storage, teams usually define RPO and RTO numerically before choosing the architecture. For example: RPO 0 for most transactions and RTO 15 minutes on primary site loss.

Replication: synchronous or asynchronous?

If inter-site latency is low and the channel is stable, synchronous replication is chosen to achieve RPO 0. If distance is large or latency can't be guaranteed, asynchronous replication is used and an honest RPO (e.g., 30–120 seconds) is accepted in exchange for greater link resilience.

Before the final decision verify:

real latency and jitter on the channel during peak hours;
available bandwidth and its redundancy;
split-brain scenarios and rules for which site becomes active.

Minimum redundancy and tests

To avoid a weak link, include at least:

two independent SAN fabrics and multipath on hosts;
application and DB clusters with clear quorum rules;
redundant power and management networks;
specific plans for zone/switch/controller failures.

Practice switchovers without stopping business: perform scheduled switchovers in low-load windows, run test transactions, then switch back and record the actual RTO.

For acceptance and audit prepare: architecture diagrams, an RPO/RTO matrix per service, a failover runbook, test logs, change journal (including firmware), and 24/7 support contacts with target response and escalation times.

Short pre-purchase checklist

Before buying storage for a high SLA, run a final check not only on the array but on the entire environment. Even a strong storage platform won't help if downtime occurs due to an unnoticed SPOF or an untested DR.

Quick items to verify before signing the spec

Go through these points and record answers in writing: what's done, who's responsible and how it's tested.

SPOFs: two independent power sources and power feeds, two SAN switches, two HBAs per host, separate ports on the array, multipath configured and tested. Also check DNS, accounts, privileges and console access.
DR and replication: scheduled tests (at least as often as your policy requires), clear scenarios (site loss, array loss, logical error), assigned owners and measured failover times.
Firmware: agreed update process, windows, compatibility checks with OS, HBA and SAN firmware. Have a rollback plan and stop criteria.
24/7 service: defined response and recovery times, location of spares, escalation paths and the post-incident reporting you will receive.
Acceptance: predefined failure tests (pull-the-plug), success criteria and sign-off responsibility.

After this review it usually becomes clear what's missing in architecture or processes.

Next step — collect inputs and perform a pre-project survey with an integrator such as GSE.kz to refine the architecture, equipment list and support model. GSE.kz works as a system integrator and manufacturer of computing equipment in Kazakhstan, which makes it convenient to unify infrastructure and support in one plan.

Minimum data set for the survey: list of applications with RPO/RTO, current IOPS/latency, SAN layout, OS and driver versions, and change windows.