Dec 25, 2024·7 min

Choosing a Virtualization Platform: A Checklist for Your Organization

Choosing a virtualization platform: a practical checklist on resources, licensing, high availability and backing up virtual machines.


Where to start: what you need virtualization to achieve

Virtualization is almost never about "technology for technology's sake." It's about a clear result: less downtime, faster rollout of new services, easier maintenance, and sometimes hardware cost savings. Start by defining which problems you want to solve over the next year.

Common scenarios: server consolidation (fewer physical machines), VDI for employees, separate test environments for developers, or a recovery site for disaster recovery. Each scenario has different bottlenecks. VDI stresses network and graphics first, test environments need flexibility and fast snapshots, and critical services require high availability.

Begin with a short list of "pain points." These are usually growing loads and lack of resources, downtime from single failures, complex updates, an overloaded team, and budget or procurement limits. These pains directly affect the platform choice: you'll quickly run into resource calculations, licensing models, and support levels.

To avoid redoing the setup in a year, decide up front which services are critical and what downtime is acceptable. Decide where virtualization will run: a single server or a cluster, one site or two. Define who will operate the platform and the required support response time. Plan a growth horizon (typically 12–36 months). If equipment origin, certifications and procurement specifics matter to you, record them early.

Example: if the organization runs accounting, mail and a document system, 4 hours of downtime may be unacceptable, while test environments can tolerate overnight outages. In that case it's wiser to plan a cluster and a clear backup strategy from the start rather than "one big server for everything."

Inventory and requirements: the numbers you need to avoid guessing

Choosing a virtualization platform starts not with brand comparison but with a simple question: what exactly will you virtualize, and what loads must it handle? Without an inventory it's easy to buy too much or hit a ceiling in six months.

Record the basic picture: how many VMs you already have (if virtualization exists), how many physical servers you plan to migrate, and expected growth over 12–36 months. Growth matters not only in VM count but also in "weight": new databases, reports, analytics and integrations often consume resources faster than expected.

Next you need metrics. Ideally, pull them from monitoring, the hypervisor (if present), OS counters and storage. If there's no monitoring, spend 1–2 weeks measuring; it's still better than guessing.

Usually 80% of clarity comes from five groups of numbers:

  • CPU: average and peak utilization, and headroom for peaks.
  • RAM: average and peak usage, and how often swapping occurs.
  • Disks: IOPS and latency for read/write, hottest volumes, data growth.
  • Network: peak traffic and its directions (clients, branches, DR site).
  • Criticality: RTO/RPO per service (how much downtime and data loss is acceptable).
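
The first four groups above boil down to the same three figures per metric: average, peak and headroom. A minimal sketch of that calculation, assuming you have already exported utilization samples for one metric. The `summarize` helper and the sample values are hypothetical and not tied to any monitoring tool:

```python
from statistics import mean


def summarize(samples, capacity):
    """Return average, peak and headroom (as a share of capacity) for one metric."""
    peak = max(samples)
    return {
        "avg": mean(samples),
        "peak": peak,
        "headroom": (capacity - peak) / capacity,
    }


# Hypothetical CPU utilization samples in percent; capacity is 100%.
cpu_percent = [35, 42, 90, 55, 38]
summary = summarize(cpu_percent, capacity=100)
print(summary)
```

An average that looks comfortable can hide a peak that leaves almost no headroom, which is why both figures belong in the inventory.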

Analyze peak loads and the most sensitive systems separately. For example, 1C and databases often choke on disk latency rather than CPU. Mail and file services may demand capacity and throughput. In healthcare and finance, predictable latencies and strict access requirements are crucial.

It helps to make a table of "critical services" and note what's unacceptable for each: downtime over N minutes, data loss over N minutes (RPO), disk latency above X ms at peak, storing data outside approved sites, lack of access logs or role separation.

Don't forget regulatory and internal policy requirements: where data must be stored, which logs are required, who can administer, and whether public cloud is allowed. In Kazakhstan this often comes down to rules about data location and provenance, procurement procedures, and confirmation of equipment origin.

Platform architecture: hypervisor, management, security

The platform architecture determines how easy daily operations will be. Mistakes here are rarely obvious in a pilot but become clear as you scale: more VMs, more admins, more security and reporting needs.

Start with the hypervisor. It should support your current hardware and what you plan to buy in the next 3–5 years. Check compatibility with guest OSes and drivers. If you have specialized systems (e.g., medical or financial), confirm their OS is supported and whether special settings are required.

Run through a few checks: is there official compatibility with your servers, network cards and storage arrays? Are needed guest OS versions supported (including legacy ones if critical)? How do VM migrations and updates work with minimal downtime? Are resource limits clear and how is oversubscription calculated? Is there a clear path to clustering if you currently start with 1–2 nodes?

Next comes management. You need simple daily operations: create a VM from a template, update it, allocate resources, view utilization, and quickly find performance bottlenecks. Roles and permissions should be flexible: one admin manages the network, another only VMs, a third views reports. Check audit logs: who did what and when.

For security look at network segmentation, access control and auditability. The minimum to require: clear roles and separation of duties, MFA for admin accounts, event logs exportable to monitoring or SIEM, separate networks for management and storage, and encryption support where policy requires it.

Don't forget integrations: user directory, monitoring, VM backups. Sometimes a platform "technically supports everything" but the configuration you need is costly due to licensing or limitations. A common example is agentless backup and frequent restore points that are available only in higher editions.

Resources and hardware: how to avoid miscalculations

Resource calculation errors usually show up after 3–6 months: VMs slow down and adding hardware becomes more costly and complex than planning ahead. First assess the load, then choose specific servers, disks and network.

Minimum numbers for planning

Rely on measurements and add headroom for growth. If you have no metrics yet, start with a clear classification: how many VMs are planned, which are "heavy" (databases, 1C, VDI), which are uptime-critical.

Then check five areas:

  • CPU: how many cores and how much headroom.
  • Memory: real VM needs plus management services and an expansion plan.
  • Storage: IOPS profile vs capacity, plus space for snapshots and metadata.
  • Network: speeds for the actual load and separate management/storage networks.
  • Power and cooling: UPS, power supplies, rack capacity and ventilation.

Snapshots are not backups, but they consume space. If you take frequent snapshots before updates, budget space for them or storage can unexpectedly fill up.

Short example

Suppose an organization has 40 VMs, 6 of them critical (databases and file services). If you build a 3-node cluster, design it so that when one server fails the remaining two can handle the load without CPU saturation or RAM shortage. In practice, this means avoiding "just enough" configurations and deciding how you'll scale: by adding resources to existing nodes or by installing another server. Also verify that network and storage can handle that scenario.

Licensing and TCO: honestly calculating ownership cost

Licenses are easy to compare by upfront price, but a platform will live 3–5 years and most costs hide there. Calculate TCO (total cost of ownership) over the period: licenses, support, upgrades, training, plus downtime and admin labor.

The main trap is the licensing model. Some vendors license by sockets, others by cores, hosts, or feature sets. Check how physical cores are counted: is there a minimum per CPU, is Hyper-Threading considered, what happens on CPU upgrade, and do you need to buy additional licenses when adding a host to a cluster?

Clarify what the base license includes. Often what organizations expect as default costs extra: centralized console, roles and audit, encryption, replication, live migration, automatic load balancing, and backup integration. Request a list of options by edition and mark what you cannot operate without.

To make the TCO comparison fair, include support and subscription for 3–5 years, guest OS licenses (e.g., Windows Server rights), database and other server product licenses (often costlier than the hypervisor), the cost of moving between editions, and vendor lock-in risks (disk formats, portability, support conditions).

Practical example: a plan for a 3-server cluster with a fourth added in a year. If licensing is per host and key HA features are only in a higher edition, budget for expansion up front rather than assuming "we'll sort it out later."
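
The arithmetic behind this example can be sketched as follows. This is only an illustration, assuming an annual per-host subscription. The prices and the `license_cost` helper are made up, so substitute real quotes and your own planning period:

```python
def license_cost(hosts_by_year, price_per_host_per_year):
    """Total per-host subscription cost over the planning period."""
    return sum(hosts * price_per_host_per_year for hosts in hosts_by_year)


# Three hosts in year 1, a fourth added from year 2 (3-year horizon).
hosts_by_year = [3, 4, 4]

# Hypothetical prices per host per year, in arbitrary currency units.
base_edition = license_cost(hosts_by_year, price_per_host_per_year=1000)
ha_edition = license_cost(hosts_by_year, price_per_host_per_year=1800)

print("base edition:", base_edition)
print("edition with HA:", ha_edition)
print("difference to budget up front:", ha_edition - base_edition)
```

The same structure extends to per-core or per-socket models by replacing the host count with the licensed unit count for each year.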

High availability: surviving failures without long outages

High availability exists to guarantee predictable downtime. First decide what failures you need to tolerate: one host failure, a switch outage, a storage controller fault, or an entire site failure.

Start with HA levels. The simplest is automatic VM restart on another cluster node. Next level is live migration between hosts for maintenance (ideally with minimal pause). Beyond that are load-distribution scenarios and, for high requirements, geo-redundancy: a second site to fail over to.

Then check whether the cluster tolerates a one-node failure. A common mistake is building three nodes, filling them to 80–90%, and calling the result an HA cluster. If one node fails, the others may not physically handle the load. Design for N+1 so critical VMs keep running with headroom in CPU, RAM and IOPS.
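
To see why 80–90% utilization on three nodes does not survive a failure, here is a rough sketch. It assumes identical nodes, a load that spreads evenly across survivors, and a single resource dimension. The function names are illustrative only:

```python
def load_after_failure(nodes, utilization, failed=1):
    """Per-node utilization once `failed` nodes drop out and load spreads evenly."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return nodes * utilization / survivors


def max_safe_utilization(nodes, failed=1):
    """Highest per-node utilization that still fits on the surviving nodes."""
    return (nodes - failed) / nodes


# Three nodes at 85% each: the two survivors would need more than 100%.
print(f"after failure: {load_after_failure(3, 0.85):.1%}")
print(f"N+1 ceiling:   {max_safe_utilization(3):.1%}")
```

Run the same check separately for CPU, RAM and IOPS, since the tightest resource sets the real limit. The ceiling is an upper bound, not a target: leave extra room for peaks and growth.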

Single points of failure often hide around the hypervisor: storage (one array or one path), network (one switch or uplink), power (one UPS or feed), management (one management server with no fallback access plan), and also DNS/AD and time sources (no synchronization or failover plan).

Maintenance scenarios should be as clear as failure scenarios. Plan how to update the hypervisor and management components: can you do it node by node, moving VMs to neighbors, and who decides when to start work?

Record targets in advance. At a minimum, set RTO (how quickly you must restore a service) and RPO (how much data loss is acceptable). These numbers drive architecture: whether VM restart is enough, or whether you need migration-capable clusters or a second site.

VM backup: what to check first

People often confuse snapshots with backups. Snapshots are handy for short tasks (e.g., before an update) but do not replace a full copy. Snapshots live with the VM, grow over time and won't help if the storage or site is lost.

A full backup must survive worst-case scenarios: array failure, ransomware, admin error or fire. So focus on how backups are implemented, not just whether they exist.

Where and how to store copies (3-2-1)

A practical guideline is the 3-2-1 rule: three copies of data, on two different media types, and one copy offsite. In practice this means fast local copies for quick recovery, a separate backup repository, and one copy off the primary site.
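
The rule is simple enough to check mechanically. A minimal sketch, assuming each copy is described by its media type and whether it sits offsite. The `Copy` record and the example inventory are hypothetical, and whether the production data counts as one of the three copies is a policy decision (here it does):

```python
from dataclasses import dataclass


@dataclass
class Copy:
    name: str
    media: str      # e.g. "disk", "tape", "object storage"
    offsite: bool   # stored outside the primary site?


def meets_3_2_1(copies):
    """True if there are >= 3 copies, on >= 2 media types, with >= 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


inventory = [
    Copy("production data", media="disk", offsite=False),
    Copy("local backup repository", media="disk", offsite=False),
    Copy("second-site copy", media="tape", offsite=True),
]
print(meets_3_2_1(inventory))
```

A snapshot on the same array would not add a media type or an offsite location, so it does not move an inventory closer to passing.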

Check if this can be done without manual workarounds: isolate the backup repository from the production network, store copies separately from the virtualization storage, set clear retention periods (e.g., 7 days, 4 weeks, 12 months) and have a recovery plan if the primary site is unavailable.
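
A retention scheme such as "7 days, 4 weeks, 12 months" can be read in several ways. The sketch below shows one possible interpretation: keep the 7 most recent backups, plus the newest backup from each of the 4 most recent ISO weeks and each of the 12 most recent months. The function names and the sample history are assumptions for illustration, not the behavior of any specific backup product:

```python
from datetime import date, timedelta


def newest_per_period(ordered, period_key, limit):
    """Pick the newest backup from each of the `limit` most recent periods.

    `ordered` must be sorted newest first.
    """
    chosen = {}
    for day in ordered:
        key = period_key(day)
        if key in chosen:
            continue  # a newer backup already represents this period
        if len(chosen) == limit:
            break  # enough periods covered
        chosen[key] = day
    return set(chosen.values())


def backups_to_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates that the retention policy keeps."""
    ordered = sorted(set(backup_dates), reverse=True)
    keep = set(ordered[:daily])  # the most recent `daily` backups
    keep |= newest_per_period(ordered, lambda d: tuple(d.isocalendar())[:2], weekly)
    keep |= newest_per_period(ordered, lambda d: (d.year, d.month), monthly)
    return keep


# Hypothetical history: one backup per day for 400 days.
last = date(2024, 12, 25)
history = [last - timedelta(days=n) for n in range(400)]
keep = backups_to_keep(history)
expired = sorted(set(history) - keep)
print(len(keep), "kept,", len(expired), "eligible for deletion")
```

Whatever interpretation you choose, write it into the backup policy so that the retention configured in the tool can be compared against it.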

What to back up and how to test restores

Backing up the "whole VM" is convenient but not always sufficient. If a VM runs a database or critical app, you need application-consistent backups to avoid restoring a corrupt DB. Define in policy which VMs are backed up as full images, where application-level backups are required, and which configurations must be included (network settings, templates, access rules).

Also check backup security: encryption of backups, separate access roles, and operation logs (who started or deleted backups, what was restored). For government, banking and healthcare this is often mandatory.

And the thing most often forgotten: regular restore tests. Assign a responsible person and a simple schedule (e.g., once a month restore a random VM and one service from backup). Without testing, backups easily become "a folder of hope."

How to choose a platform: a 2–4 week step-by-step plan

Good platform selection follows the same pattern every time: requirements and numbers first, then testing on real scenarios. That way you avoid overbuying and hitting limits in six months.

A 2–4 week plan that often works:

  1. In 2–3 days collect inventory and requirements: VM list, CPU/RAM/disk, workload types, network and storage requirements. Separately record RTO and RPO for key services.

  2. In 3–5 days pick 2–3 candidates and verify compatibility. Check support for your processors, NICs, HBAs, storage arrays, and management/monitoring capabilities.

  3. Spend 1–2 weeks running a pilot on real scenarios. Build a small cluster, migrate 3–5 representative VMs, enable backup and do a test restore. Test VM migration between hosts, behavior on host failure, and recovery speed.

  4. In 2–3 days calculate TCO and prepare a phased migration plan. Include licenses, support, storage and network requirements, and downtime costs. Migrate services in waves: noncritical first, then core services, and last the most sensitive ones.

  5. In 2–3 days document operations and acceptance: roles (virtualization, backup, network, security), update procedures, maintenance windows, incident response steps and minimal training.

Example: if you run 1C and a file server at HQ and two branches, the pilot should include not only VM startup but backup and a recovery test in an isolated network. Often at this stage you discover that a promised 15-minute RPO actually requires a different storage and replication approach.

If you need a single vendor for "hardware + deployment + support," it's convenient to work with one systems integrator. For organizations in Kazakhstan local delivery and nationwide service are often important. For example, GSE.kz as a technology manufacturer and systems integrator can cover server and infrastructure selection and 24/7 support, but the initial measurements and requirements should still be yours—then the project becomes predictable in time and budget.

The next step is simple: agree on 2–3 target architecture options, run them against the checklist above, and only then finalize specifications and budget.

Common mistakes when choosing and launching virtualization

The first typical mistake is buying servers and storage "just enough" for current VMs. You need capacity for maintenance and HA: host reboots, updates and a single-node failure. If planned without headroom, maintenance becomes risky.

The second mistake is focusing only on hypervisor license price. Ownership costs are often driven by support, updates, training, backup and added features (clustering, migrations, encryption). The "cheap" option can become costly when you must urgently buy missing components.

The third mistake is postponing backups and relying on snapshots. Snapshots don't protect against storage corruption, admin error, or ransomware. In a failure you may end up with only an old copy and recovery measured in days.

Another problem is not verifying hardware and version compatibility. Drivers, NICs, HBAs, firmware, BIOS and controller versions directly affect stability and performance.

Finally, many put the platform into production without rehearsing restores and updates. Rehearse these scenarios first: one host failing and VMs being redistributed, restoring a VM from backup and measuring RTO/RPO, updating the hypervisor and management components, restoring access rights and network settings, and monitoring free space and snapshot growth.

Quick checklist and next steps for your organization

If time is short, use this brief checklist to avoid the most common mistakes right after launch.

Quick pre-decision checklist

  • Resources: real peak CPU/RAM/disk usage plus headroom for growth and maintenance.
  • Licensing and limits: how licenses are counted (hosts, sockets, cores), which features are included (clustering, migration, encryption, management), and what must be purchased separately.
  • High availability (HA): what happens on a single host, switch or storage element failure; how many minutes of downtime are acceptable.
  • Backup: copies separated from the primary storage, regular restore testing, clear retention periods.
  • Security and support: roles and permissions, logging, updates, availability of vendor support and clear response rules.

Mini-scenario: 50–100 VMs and migration without downtime for key services

Suppose you have 50–100 VMs: domain controllers, file services, 1C or another accounting system, mail, several internal web services and test environments. The goal is to migrate to a new platform so critical services don't fail during working hours.

Practical approach: build the new environment in parallel (hosts, network, storage, management, backup), migrate noncritical VMs and test recovery, then move critical systems one by one with agreed maintenance windows and rollback plans. Finally enable HA and verify by actually shutting down a host: services should start on other nodes within the agreed RTO.

To ensure everyone uses the same language, prepare a short document set in advance: target architecture with failure points, resource specs (now and in 12–24 months), roles and access matrix, DR plan (RPO/RTO and recovery scenarios), and a migration plan with success and rollback criteria.

If you prefer a single point of responsibility for "hardware + deployment + support," discuss this with one systems integrator. For organizations in Kazakhstan local supply and nationwide service are often important. Again, GSE.kz can cover server selection, infrastructure and 24/7 support, but initial measurements and requirements should come from you so the project is predictable in time and budget.

Next step: agree on 2–3 target schemes, run them through the checklist above and only then lock the specification and budget.
