Open source virtualization for a government organization: specification and acceptance
Open source virtualization for a government organization: which items to include in the specification and test protocol to validate functionality, resilience and manageability.

What is commonly forgotten in the spec and why acceptance fails
The main reason acceptance fails in open source virtualization projects for government organizations is vague wording. When the spec says “ensure high availability” or “easy management”, everyone interprets it differently. As a result, the contractor thinks the work is done, while the customer cannot prove they expected something else.
Often acceptance is reduced to checking “the VM boots and pings”. That is useful, but it says little about real operation. Actual problems appear later: migrations fail, access roles are unclear, logging is unsuitable for audits, updates break the cluster, and recovery after a failure takes hours.
To make acceptance fair and reproducible, agree in advance what will be documented. Usually this falls into three groups:
- Functionality: full VM lifecycle (create, clone, migrate, snapshots, delete).
- High availability: which failures are tolerated and within what time the service is restored.
- Manageability: who administers the platform and how, and which reports, logs and alerts exist.
The chain is simple: the spec sets measurable requirements, the test program and methodology (PMI) is written from them (what to check and how), and then a test report and acceptance certificate are produced. If the spec has no numbers or scenarios, the PMI becomes “checked visually,” and the certificate is disputable.
A typical mistake: 5 VMs were started at acceptance, and a week later it turned out that VMs do not restart automatically after a host failure. “HA” was never described as a verifiable scenario with recovery-time criteria.
Baseline requirements from the customer that must be fixed in writing
Virtualization acceptance usually fails not on settings but on expectations that were never written down. The same words (“resilience”, “security”) are understood differently by different participants, so requirements must be worded concretely and testably.
First, record service criticality and measurable downtime and data-loss metrics. Specify RTO (recovery time objective) and RPO (recovery point objective) by groups of systems, not “for the cluster overall”. For example: for departmental mail RTO 1 hour, for test environments RTO 24 hours; for databases RPO 15 minutes, for the file archive RPO 24 hours. Then the PMI can check exactly that.
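Per-group targets like these lend themselves to a mechanical check. The following is a minimal sketch, assuming measured values are copied by hand from the test protocol. The `TARGETS` table reuses the example figures from the paragraph above; the function name `check_group` and the dictionary layout are illustrative assumptions, not part of any platform's API.

```python
from datetime import timedelta

# Example targets from the text; None means the text gives no value for that metric.
TARGETS = {
    "departmental mail": {"rto": timedelta(hours=1), "rpo": None},
    "test environments": {"rto": timedelta(hours=24), "rpo": None},
    "databases": {"rto": None, "rpo": timedelta(minutes=15)},
    "file archive": {"rto": None, "rpo": timedelta(hours=24)},
}


def check_group(group, measured_rto=None, measured_rpo=None):
    """Return a list of violations for one group; an empty list means pass."""
    target = TARGETS[group]
    violations = []
    for name, measured in (("rto", measured_rto), ("rpo", measured_rpo)):
        limit = target[name]
        if limit is None:
            continue  # no target stated for this metric, nothing to compare
        if measured is None:
            violations.append(f"{name.upper()} not measured")
        elif measured > limit:
            violations.append(f"{name.upper()} {measured} exceeds {limit}")
    return violations


print(check_group("departmental mail", measured_rto=timedelta(minutes=40)))
print(check_group("databases", measured_rpo=timedelta(minutes=20)))
```

A group with a stated target but no measurement is reported as a violation, so a skipped test cannot pass silently.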
Next, record constraints that directly affect architecture: how many sites are available, which links connect them, whether there is power redundancy, how many administrators and in which hours they are actually on call.
It is also useful to state in one line expectations around import substitution and local support: who provides 24/7 support, where the service is located, response times, and which documents confirm vendor or integrator status (these are customer requirements, not promises from the contractor).
The following points are usually missing from the spec; add them to avoid disputes at acceptance:
- Information classes and access modes (admin roles, segmentation, logs).
- Project boundaries: what is included in virtualization and what belongs to storage, network and applications.
- Requirements for deliverables: diagrams, VM inventory, manuals, protocols.
- Assumptions and dependencies (for example, that links and power are provided by the customer).
- Success criteria: what counts as “delivered” for each block.
This saves weeks of approvals and makes tests honest.
Solution composition and delivery boundaries: what to write in the spec
To prevent acceptance turning into the argument “was this included or separate?”, fix the solution composition and what is considered delivery in advance.
List components as roles rather than product names. It is typically enough to include:
- Virtualization layer (hypervisor) and supported installation options.
- Management system (console, API, access roles).
- Node services and auxiliary infrastructure (monitoring, logs, update repositories).
- Network and storage components the contractor configures (virtual networks, drivers, multipath, etc.).
- Responsibility boundaries: what the contractor does and what remains with the customer’s IT team.
Then specify minimum node characteristics and compatibility: CPU, RAM, disks for system and VMs, number of ports, 10G/25G requirements, supported controllers and NICs. This is critical when the cluster is built on a specific server platform (for example, rack servers of the GSE S200 Series): tie acceptance to measurable parameters rather than “should be fast”.
Describe the cluster as a separate item: number of nodes, failure domains (rack, power, switch), redundancy scheme, and what counts as a quorum.
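To make "what counts as a quorum" concrete, the arithmetic can be written down. This is a minimal sketch, assuming simple majority voting with one vote per node; the actual quorum rules depend on the chosen cluster software and must be taken from its documentation. The function names are illustrative.

```python
def majority_quorum(voting_nodes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    if voting_nodes < 1:
        raise ValueError("a cluster needs at least one voting node")
    return voting_nodes // 2 + 1


def tolerated_failures(voting_nodes: int) -> int:
    """How many voting nodes can be lost while a majority remains."""
    return voting_nodes - majority_quorum(voting_nodes)


for nodes in (3, 4, 5):
    print(nodes, majority_quorum(nodes), tolerated_failures(nodes))
```

Under this assumption a four-node cluster tolerates the same single node loss as a three-node one, which is a reason to state explicitly in the spec how each node participates in voting.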
Don’t forget usage rights: even in open source solutions there are often mixed components (firmware, drivers, management OS, DBMS, backup). In the spec state who receives licenses, keys, repository access, and what remains with the customer after handover.
For acceptance the contractor must deliver a documentation package: architecture and connection diagram, parts list, installation and update instructions, operation procedures (backup, restore, access management), and a “how it was configured” description so the result can be reproduced.
Functional requirements: basic operations and the VM lifecycle
Describe functionality as testable actions with clear results and constraints. Different distributions and management panels may call features differently; you must verify the actual outcome.
Record the VM lifecycle from template to decommission. Specify which roles can create VMs, from which templates, default parameters (CPU, RAM, disk, network), and what counts as a successful operation.
Minimal set of operations to run in tests:
- Create a VM from a template with specified parameters and verify the guest OS boots.
- Clone a VM (full clone and from template) checking uniqueness of name/ID and network settings.
- Migrate a VM between nodes with no downtime or with an allowed downtime (in seconds).
- Delete a VM: delete VM only or VM with disks, confirming resources are freed.
- Snapshots: create, revert, delete, plus limits (max snapshots and retention) so snapshots don’t grow indefinitely.
Also define resource pools and quotas (e.g., by department). At acceptance this is checked by attempting to exceed CPU/RAM/disk quotas and recording the correct failure.
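The quota test can be expressed as a small admission check. The sketch below is an assumption about how such a check might look, not the behavior of a specific platform: the `Resources` type, the field names and all numbers are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Resources:
    vcpu: int
    ram_gb: int
    disk_gb: int


def over_quota(quota, used, request):
    """Names of resources that the request would push above the quota."""
    return [
        f.name
        for f in fields(Resources)
        if getattr(used, f.name) + getattr(request, f.name) > getattr(quota, f.name)
    ]


# Hypothetical department quota and current usage.
quota = Resources(vcpu=32, ram_gb=128, disk_gb=2000)
used = Resources(vcpu=28, ram_gb=100, disk_gb=1500)

print(over_quota(quota, used, Resources(vcpu=8, ram_gb=16, disk_gb=100)))
```

At acceptance the platform's own refusal is what gets recorded; a calculation like this only tells the tester in advance which request is expected to be rejected and for which resource.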
Tie access control to a directory: which roles exist (platform admin, operator, observer), group rights, and how actions are logged.
If operation relies on automation, include API requirements: which operations are available, how authentication works, whether calls are logged, and how to verify a typical scenario (for example, automated deployment of a test VM on the cluster).
Network and storage: requirements without which tests are useless
If network and storage are described in general terms, acceptance becomes a dispute: “it works” will mean different things to each party. Fix both topology and measurable criteria in the spec.
First, separate traffic flows. A cluster usually needs distinct network segments: management, VM traffic, storage and replication/backup. State where separation must be physical (separate ports or adapters) and where VLANs are sufficient. Specify site-level communication rules: which VLANs may talk to each other, by what controls (ACLs or firewall), and which ports must be open for hypervisor, storage, backup and monitoring.
To make tests reproducible, add minimal artifacts in the spec:
- VLAN table and segment purposes (management/storage/VM/replication).
- Connection map: server port — switch port — speed — mode (LACP, trunk/access).
- MTU and QoS requirements (if needed) and multi-pathing (bonding, multipath).
- Storage type (local, SAN, SDS) and the boundary of responsibility for its delivery.
- Target metrics: IOPS, latency, throughput under a typical load.
In the PMI describe the performance test method: load profile (e.g., 70/30 read/write, 4K and 64K blocks), warm-up duration, tool, number of VMs and a “real-life” scenario (several app servers plus one VM with a database). Record results in the protocol: measured latency and degradation when one path fails (link, controller, node).
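How "measured latency" and "degradation" are computed should be agreed before the test. One possible convention is sketched below, assuming latency samples are exported from the load tool as a plain list of milliseconds; the nearest-rank percentile and the relative-increase formula are choices made for this illustration, and the sample values are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def degradation_percent(baseline, degraded):
    """Relative increase of a metric after a path failure, in percent."""
    return (degraded - baseline) / baseline * 100


# Hypothetical latency samples in milliseconds.
normal = [1.0, 1.2, 1.1, 1.3, 5.0, 1.2, 1.1, 1.0, 1.4, 1.2]
one_path_down = [2.0, 2.4, 2.2, 2.6, 9.0, 2.4, 2.2, 2.0, 2.8, 2.4]

base = percentile(normal, 50)
worse = percentile(one_path_down, 50)
print(base, worse, degradation_percent(base, worse))
```

Whichever convention is chosen, writing it into the PMI keeps both parties from computing the same figure in different ways.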
High availability: how to describe HA and how to test it in practice
Start by agreeing terms: what is considered a failure and what the system must do automatically. Acceptance usually focuses not on whether “HA exists”, but on measurable behavior under specific failures.
Specify which failures will be tested: compute node crash, disk or pool failure, network link loss, management component failure (API, scheduler, web console), and database or quorum unavailability if applicable. For each scenario the spec must state expected system actions and allowable time limits.
How to describe expected behavior
Formulate acceptance criteria as metrics: time to detect failure, time to restore service (RTO at VM or application level), automatic VM restart, data preservation, and notifications (where and in what form). Also describe planned maintenance: live migration without downtime, acceptable maintenance windows and what counts as service interruption (for example, a short network blip up to N seconds is acceptable).
How to actually verify during tests
In the PMI list the tests and what is recorded in the protocol. For a departmental cluster on rack servers (like GSE S200) a few tough tests are usually sufficient:
- Cut power to one node and measure time to restart VMs on remaining nodes.
- Pull a network cable and verify storage access and correctness of alerts.
- Simulate disk or volume failure and confirm degraded status and recovery after replacement.
- Stop a management service and check that VMs continue running and management recovers per procedures.
Attach metrics (times, statuses), event/log exports and screenshots or command outputs to the protocol. Then the dispute cannot be “it seemed to work”.
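Downtime per VM can be derived directly from the timestamps attached to the protocol. The following is a minimal sketch, assuming the failure moment and each VM's recovery moment are available as ISO 8601 strings from synchronized logs; the VM names, timestamps and the five-minute limit are hypothetical.

```python
from datetime import datetime, timedelta


def downtime(failure_at, restored_at):
    """Difference between two ISO 8601 timestamps copied from the logs."""
    return datetime.fromisoformat(restored_at) - datetime.fromisoformat(failure_at)


def evaluate(failure_at, restored, limit):
    """Map each VM name to its downtime and a pass/fail verdict."""
    report = {}
    for vm, restored_at in restored.items():
        gap = downtime(failure_at, restored_at)
        report[vm] = {"downtime": gap, "passed": gap <= limit}
    return report


# Hypothetical values for illustration only.
FAILURE_AT = "2024-05-14T10:00:00"
RESTORED = {"vm-mail": "2024-05-14T10:03:10", "vm-db": "2024-05-14T10:06:30"}

report = evaluate(FAILURE_AT, RESTORED, limit=timedelta(minutes=5))
for vm, row in report.items():
    print(vm, row["downtime"], "pass" if row["passed"] else "fail")
```

The calculation is only as reliable as the clocks behind the logs, so time synchronization across nodes should be verified before the failure tests start.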
Backup and disaster recovery: acceptance criteria
Backups are often described in one line, and at acceptance it turns out that backups exist but there is nothing to restore or nowhere to restore to. In the spec lock down what is backed up (hypervisor and cluster configs, VMs, disks, snapshots, templates, metadata), where copies are physically stored and who has access.
Also specify protection requirements: encryption of backups, key management, separation of roles (the virtualization admin should not automatically have access to the backup store), and retention periods by data class.
In the PMI always include a restore check in a separate zone. The test environment must be isolated so recovery does not affect production or the network. The success criterion is not “VM booted” but “the service works”: the OS boots, the network is reachable, data is intact, the time is correct, and accounts are accessible.
For DR between sites specify the minimal service set that must come up first (for example: DNS/AD or other directory, NTP, monitoring, proxy/gateway, 1–2 key application services). If there is no second site, define a simplified scenario: restore to an alternate cluster in the same DC.
Acceptance criteria are easy to formalize:
- N VMs successfully restored from backup within time T with service functionality verified.
- Achieved RPO and RTO from the spec for each criticality group.
- Copies are stored separately from production and encrypted.
- Restoration followed the procedure and steps are reproducible.
- A schedule of regular recovery tests is assigned (e.g., quarterly) with a result protocol.
Security and audit: what to include in the spec and tests
Security in virtualization often “breaks” not because of an intrusion, but due to loose privileges and lack of clear logs. In the spec record: who can do which actions, how user identity is confirmed and how it is later proven that an operation happened.
Start with an RBAC model and ban shared accounts. Typical roles: platform administrator, operator (on-call), auditor (read-only), requester (creates tickets but does not change production), and limited roles by project or department. Specify separation of duties: the person who creates users and rights must not be able to change logs unnoticed.
Define action logs as a separate deliverable: which events are recorded (logins, VM changes, network, storage, rights), format, storage and retention (for example, at least 1 year), time synchronization (NTP), export to an external event system and protection from deletion by ordinary roles.
For access require authentication integration with a directory, password policy, lockout on brute force, and MFA for privileged actions. For management and API require encryption (TLS), certificate requirements and renewal procedure.
Useful acceptance checks:
- The operator attempts to delete a VM or change rights and is denied.
- The “auditor” role can view events but cannot modify anything.
- Login without required MFA is rejected and logged.
- Connection to the management panel without a trusted certificate is blocked.
- After changing VM parameters the log shows who did what and when, and the entry cannot be deleted by a normal role.
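The denial checks above are easier to repeat when the role model is written as a matrix with an explicit list of negative cases. The sketch below is an illustration only: the role keys and action names such as `vm.delete` are hypothetical labels, not identifiers from any real management API, and the operator's allowed actions are an assumption.

```python
# Hypothetical permission matrix; action names are illustrative, not a real API.
PERMISSIONS = {
    "platform_admin": {"vm.create", "vm.delete", "rights.change", "events.view"},
    "operator": {"vm.start", "vm.stop", "events.view"},
    "auditor": {"events.view"},
}


def is_allowed(role, action):
    """Unknown roles and unlisted actions are denied by default."""
    return action in PERMISSIONS.get(role, set())


# Negative cases taken from the checklist: each pair must be denied.
MUST_BE_DENIED = [
    ("operator", "vm.delete"),
    ("operator", "rights.change"),
    ("auditor", "vm.delete"),
    ("auditor", "rights.change"),
]


def failed_denials(cases):
    """Return the cases that were wrongly allowed; an empty list means pass."""
    return [(role, action) for role, action in cases if is_allowed(role, action)]


print(failed_denials(MUST_BE_DENIED))
```

On the real platform each pair becomes a manual or scripted attempt, and the protocol records both the denial and the corresponding audit entry.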
Manageability and operations: requirements that save time
Even if functional tests are passed, acceptance often fails in operations: missing clear metrics, alerts not reaching on-call staff, updates performed “manually at night”. In the spec require more than “a management console” — require operational rules that can be verified.
For monitoring describe what is collected and how long it is retained. Minimum set:
- CPU and RAM usage for hosts and VMs, including oversubscription.
- Storage state: latency, IOPS, utilization, capacity forecasting.
- Network: packet loss, latency, port errors, channel utilization.
- Cluster events: migrations, service failures, degradations.
- Trends and reports over a period (e.g., 30–90 days).
Tie alerts to roles: who receives an incident (on-call, admin, security), by what channel, and what counts as acknowledgement (for example, a delivery receipt or a log entry). Specify response times and escalation if there is no acknowledgement.
Updates and patches: prescribe the update order, compatibility checks, maintenance window and mandatory rollback. For government customers it is important that updates do not require a full cluster shutdown and that they leave a record in change logs.
Acceptance criteria are conveniently specified as operational tasks and completion times:
- Create a VM from a template and start it — within N minutes.
- Create a snapshot and revert — within N minutes.
- Add a disk and extend filesystem — within N minutes.
- Migrate a VM without downtime — within N minutes.
- Produce a capacity and incidents report for a month — within N minutes.
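Completion times for such tasks can be captured with a simple stopwatch wrapper. This is a minimal sketch under stated assumptions: `create_vm_from_template` is a hypothetical placeholder for the real operation, and the five-minute limit stands in for the N that the spec must define.

```python
import time


def run_timed(task, limit_seconds):
    """Run a callable and compare the elapsed time with a limit."""
    started = time.monotonic()
    task()
    elapsed = time.monotonic() - started
    return {"elapsed_seconds": round(elapsed, 3), "passed": elapsed <= limit_seconds}


def create_vm_from_template():
    """Hypothetical placeholder for the real operation or a manual step."""
    time.sleep(0.01)


# The limit is a placeholder; the spec must fix the actual value of N.
result = run_timed(create_vm_from_template, limit_seconds=5 * 60)
print(result)
```

`time.monotonic()` is used rather than wall-clock time so that a clock adjustment during the test does not distort the measurement.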
Also require administrator training and handover time. Without that even a good solution becomes dependent on a single person.
Step-by-step: how to compose the test program and methodology (PMI)
The PMI is needed so each spec item becomes a measurable test: what to do, what evidence to collect, and what counts as success. Without clear proof acceptance easily becomes a dispute “works or not”.
Start with a traceability table with four columns: spec requirement, test scenario, pass criterion, artifacts. Then prepare the PMI so it can be executed step-by-step and repeated on the same bench.
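A traceability table is useful mainly because gaps in it can be found mechanically. The sketch below assumes the table is kept as a list of rows; the requirement identifiers (`HA-01` and so on) and row contents are hypothetical examples.

```python
# Hypothetical requirement identifiers from the spec.
REQUIREMENTS = ["HA-01", "BKP-01", "SEC-01"]

# Each row: spec requirement, test scenario, pass criterion, artifacts.
TRACE = [
    {
        "requirement": "HA-01",
        "scenario": "Power off one node",
        "criterion": "VMs restart within the agreed time",
        "artifacts": ["event log export", "timestamps"],
    },
    {
        "requirement": "SEC-01",
        "scenario": "Operator tries to delete a VM",
        "criterion": "Action is denied and logged",
        "artifacts": ["audit trail"],
    },
]


def coverage_gaps(requirements, trace):
    """Find requirements with no test and tests with no evidence."""
    covered = {row["requirement"] for row in trace}
    without_test = [r for r in requirements if r not in covered]
    without_artifacts = [row["requirement"] for row in trace if not row["artifacts"]]
    return {"without_test": without_test, "without_artifacts": without_artifacts}


print(coverage_gaps(REQUIREMENTS, TRACE))
```

In this example the backup requirement has no scenario yet, which is exactly the kind of gap that is cheaper to find before the bench is assembled.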
Basic PMI structure
Five blocks are usually enough:
- Composition and boundaries of the bench: nodes, network, storage, software versions, access roles, what is included and excluded from tests.
- Initial data: VM templates, test accounts, load profiles, “reference” settings.
- Test steps: operator actions in sequence, with no ambiguity.
- Expected result: concrete values (switch time, cluster status, migration success, audit entry presence).
- Result recording: where and how evidence is stored, who signs.
Avoid “check operability”. Prefer: “create a VM from template X, start it, obtain IP, log in, record time, save event Y in the log”.
What to use as evidence
Agree on artifacts acceptable to the committee:
- Screenshots of key screens (statuses, errors, versions).
- Cluster and network configuration export.
- Service logs and audit trail.
- Metrics (CPU/RAM/IOPS/latency) for the test period.
- Command and operator action log with timestamps.
Pass criteria should be rigid: “pass/fail” and what counts as a defect (loss of management, config rollback, version mismatch, missing logs). Acceptable deviations belong to a separate list of remarks: description, risk, fix deadline, temporary measure, committee decision.
Common mistakes in the spec and acceptance protocol
The most common reason for failing acceptance is not the technology but the wording. When the spec says “should work stably”, each side imagines different results and there is nothing measurable in the protocol.
Mistakes that prove costly later:
- Requirements without metrics: no numbers for availability, recovery time, maximum downtime, or acceptable data loss.
- Mixing levels: platform requirements combined with specific departmental application requirements. The committee then cannot clearly decide what to test.
- “Functionality checked, so everything is fine”: they boot and migrate VMs but do not test node, network and storage failures or recovery criteria.
- Roles and rights not described: who creates VMs, who changes network, who views logs, who approves changes.
- Updates left out: no patching order, maintenance windows, rollback plan or compatibility checks.
To avoid this, include mandatory scenarios in the PMI: host failure and auto-start of VMs, restore from backup, audit logging checks and an update test with rollback. In system integration (SI) projects these scenarios are usually agreed before work starts so that the committee accepts a reproducible result, not a demonstration; GSE.kz, for example, records such checks before work starts.
Short checklist before agreeing the spec and before acceptance
Agree in advance what counts as “working” and how it will be proven. The checklist below helps quickly find gaps in documents and tests.
Before agreeing the spec
Check the spec contains measurable requirements and clear delivery boundaries:
- Solution composition and boundaries: what is included (hosts, network, storage, software, support) and what is not.
- VM lifecycle functions: create, start, stop, migrate, clone/templates, snapshots, delete.
- Metrics and thresholds: capacity (VMs, vCPU, RAM, storage), target performance, RTO/RPO and allowed downtime.
- Security and audit: roles, minimal privileges, admin action logging, log storage and export requirements.
- Acceptance documents: PMI, protocol format, list of artifacts for the acceptance certificate (configs, logs, reports).
Each item must be testable: test, expected result and pass/fail criterion.
Before signing the acceptance certificate
At acceptance check not only “boots” but also “is manageable”:
- High availability: simulate host/node failure and confirm VM recovery within agreed time.
- Backup and restore: test restore of one VM and one file, record actual time.
- Roles and access: login with different roles, forbidden actions blocked, audit events recorded.
- Monitoring and alerts: visibility of hosts/VMs/storage, alerts for capacity and unavailability.
- Evidence package: config exports, key logs, test reports, signatures of responsible parties.
If any item cannot be confirmed by a document or reproducible test, fix it before signing.
Example realistic acceptance for a departmental cluster
A sample baseline: cluster of 4 nodes (3 compute + 1 for quorum or spare), 30 VMs. Of these 10 critical (accounting systems, directory, mail), 15 medium (web services, internal reporting), 5 test. In the spec fix which VMs must survive failures without downtime and which may have a short interruption.
Acceptance scenarios are best written as short practical tests with measurable success criteria:
- Live migration of a VM between nodes under load: sessions and network do not break, average migration time stays within the agreed limit.
- Node power-off: critical VMs automatically start on other nodes, each VM’s downtime does not exceed X minutes, events are logged and notifications sent.
- Restore a VM from backup: recovery follows procedure, service passes checks, actual RTO and RPO meet spec.
- Role-based access and audit: admin, operator and auditor have different rights, forbidden actions are blocked, configuration changes are logged.
Agree in advance what is written in the protocol. Minimum set:
| What we record | How we record it | Why |
|---|---|---|
| Start and end time | timer, logs | compare against the agreed limit |
| Load (CPU, RAM, I/O) | platform metrics | prove the test was realistic |
| Errors and event codes | event log | find causes of deviations |
| Notifications | screenshot or export | prove manageability |
Summarize results in a table per scenario (actual vs criterion) and a separate list of remarks: what to fix, who is responsible, deadline, and whether a retest is needed.
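The per-scenario summary can be generated from the recorded numbers so that the verdict column is never filled in by eye. This is a minimal sketch; the scenario names, limits and measured values are hypothetical, and times are assumed to be recorded in seconds.

```python
# Hypothetical measurements; limits come from the spec, actual values from the protocol.
RESULTS = [
    {"scenario": "Node power-off", "criterion_s": 300, "actual_s": 190},
    {"scenario": "Live migration", "criterion_s": 60, "actual_s": 75},
]


def summary_table(results):
    """Render a table of actual vs criterion with a pass/fail verdict."""
    lines = [
        "| Scenario | Criterion, s | Actual, s | Result |",
        "|---|---|---|---|",
    ]
    for r in results:
        verdict = "pass" if r["actual_s"] <= r["criterion_s"] else "fail"
        lines.append(
            f"| {r['scenario']} | {r['criterion_s']} | {r['actual_s']} | {verdict} |"
        )
    return "\n".join(lines)


table = summary_table(RESULTS)
print(table)
```

Every row marked as failed then goes into the separate list of remarks with an owner, a deadline and a retest decision.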
Next steps: pilot, documents and handover to operations
The main risk is when pilot and acceptance live separately from real operations. After a draft spec and PМI, run a short but honest verification cycle on a pilot bench.
Assemble a working group immediately: IT (architecture), security (requirements and audit), operations (on-call and procedures), procurement (delivery boundaries and documents). Assign one owner of requirements who records decisions and disputed items so they do not surface at acceptance.
A simple plan helps:
- Deploy a pilot (even 2–3 nodes) and run the PMI before purchasing the main batch.
- Agree minimal metrics: VM deployment time, RTO/RPO for critical services, node-failover time.
- Approve a set of failure tests: host power-off, network cut, storage degradation, loss of management.
- Prepare the documentation set: operations procedures, role matrix, update plan, report templates for audits.
- Schedule training: administrators, on-call team, security, and a short “what to do at 02:00” scenario.
If the pilot needs help with server selection and implementation, it is convenient to have one team handle it: a vendor and system integrator responsible for both hardware and support. In Kazakhstan this format is offered by GSE.kz (gse.kz): the company supplies its own S200 Series servers and system integration services, including round-the-clock technical support.
FAQ
How to state “high availability” in the specification so it can be accepted during tests?
Record measurable criteria: which failures are tolerated, what the system must do automatically, and how quickly services must be restored. At minimum — a scenario for host failure with automatic VM restart, allowed downtime in minutes, and how this is documented (events, logs, timestamps).
Is it mandatory to include RTO and RPO in the spec, and how to do it without excessive theory?
Specify RTO and RPO by groups of systems, not “on average across the cluster”. That way, during acceptance you can restore specific VMs or services and compare actual recovery time and data loss against the spec.
Which VM operations should be included in acceptance tests first?
Describe the VM lifecycle as testable actions with expected results: create from a template, start, clone, snapshot, revert, migrate, delete with or without disks. Specify limits and success criteria in advance, for example allowed downtime during migration in seconds and how resource release on deletion is verified.
What is the difference between the spec and the test plan (PMI), and why does acceptance turn into a dispute without a PMI?
The spec sets requirements; the test plan (PMI) turns them into step-by-step checks, and the test protocol records the outcome with evidence. If the spec lacks numbers and scenarios, the test plan often becomes a visual check and the acceptance record turns into a dispute.
What must be described about the network so virtualization acceptance does not fail?
Separate traffic into segments (management, VM traffic, storage, replication/backup) and state where physical separation is required and where VLANs suffice. Add a VLAN table, a connection map (server port, switch port, speed, mode), and MTU and bonding/multipath requirements so tests are reproducible.
How to set storage and performance requirements so they can be realistically tested?
Specify the storage type and the boundary of responsibility, plus target metrics under a representative load: latency, IOPS, throughput. Lock the test methodology in the PMI: read/write profile, block size, warm-up duration and a single-path failure scenario so results can be compared and repeated.
What RBAC and access requirements should be included in the spec for a government organization?
Start with a role model and ban shared accounts, then tie access to the user directory and list allowed actions per role. During acceptance this is checked by attempting forbidden operations as an operator or auditor and recording the denial and log entry.
How to formulate logging and audit requirements so checks can be passed later?
Describe which events must be audited: logins, VM changes, network, storage and rights changes, retention periods and protection against deletion by normal roles. Record time synchronization (NTP) and how logs are exported for checks — otherwise proving actions at acceptance is difficult.
What must be specified about backup and recovery to avoid failing acceptance?
Define exactly what is backed up (hypervisor and cluster configs, VMs, disks, snapshots, templates, metadata), where copies are stored and who can access them, including encryption and role separation. In the PMI include a real recovery test in an isolated zone and require “service works” as the criterion, not just “VM booted”.
How to include updates and patches in the spec and acceptance so operations do not remain manual and night-time?
Document the update procedure: maintenance window, compatibility checks, mandatory rollback plan and recording of changes. During acceptance it is useful to run at least one controlled update scenario on an agreed version to confirm the cluster does not require a full stop and that there is a clear return plan.