When is an HSM really necessary, and when can you do without it?

An HSM is needed when you want to reduce the risk of private key leakage or substitution and make key operations auditable. It stores keys inside a protected module and performs crypto operations so the private key never leaves the device.

Which keys and scenarios should be protected first?

Start by protecting the most critical assets: CA keys (root and intermediates), TLS keys for critical services, transaction and document signing keys, code and update signing keys, and keys used to encrypt databases and backups. List where these keys are used and which outages or compromises would be most damaging.

How does an HSM differ from a KMS and from “server-side encryption”?

An HSM is where keys live and where crypto operations are executed, so a key cannot be trivially stolen from the OS. A KMS is typically a service for key management and policies and may use an HSM as a root of trust. “Server-side encryption” means keys and operations depend on the regular OS, so the compromise risk is usually higher.

Which HSM performance metrics matter in practice, and how should you request them?

Do not ask for a single "peak" number; ask for tests tailored to your operations, algorithms and modes. Define the load profile (average and peak), target latency, and test conditions, including firmware/SDK versions and whether auditing is enabled. That makes results comparable and useful instead of just a brochure number.

Why can an HSM “slow down” at peak even if its rated ops/sec look good?

Slowdowns usually come from signing and decryption load (e.g., TLS), the number of parallel sessions, and unstable latency under peak load. RSA vs ECC, key sizes, and enabled policies/auditing also have a big impact. Focus on 95th-percentile latency and behavior during peak windows, not only ops/sec.

Which HA/cluster and failover questions should be asked before purchase?

Start by choosing the scheme: active-active provides scaling and survives node failures without stopping but is more complex; active-standby is simpler but you must know failover time and session behavior. Ask how keys and policies are synchronized, what happens on network partition, and how long applications take to switch. Include failure and maintenance tests in acceptance criteria.

How do you verify HSM compatibility with PKI (e.g., Microsoft AD CS or EJBCA)?

Document your CA/RA and tech stack, then verify support for required interfaces—commonly PKCS#11, JCE for Java, CNG/KSP for Windows and AD CS, and KMIP if key management is needed outside PKI. Request a compatibility matrix for your OS and virtualization versions because driver-level issues often surface there. A minimal test CA with typical certificate templates usually shows whether integration will work.

How do you organize the key lifecycle correctly: generation, import, rotation and destruction?

Prefer generating keys inside the HSM so the private key never leaves the protected environment. Imports are needed for migrations or legacy systems but should be tightly controlled by policy and roles. Plan rotation, backups and recovery so they can be executed by procedure and proven in logs.

How do you prepare for emergency key replacement and migration to a new HSM?

Create a short runbook: who declares an incident, who participates in critical operations, and how quickly a new key is issued and deployed. Check whether old and new keys can run in parallel during cutover and how certificates are revoked so the old key stops being used. Rehearse these steps in a pilot—otherwise real replacement happens under stress and with downtime risk.

HSM for Key Management: Procurement Questions Checklist

Why you need an HSM and what exactly to protect

People buy HSMs not for the “hardware” itself but to mitigate clear risks. The main one is leaking or tampering with cryptographic keys. If keys live in files, a database or server memory, they are easier to steal during a breach, by an admin mistake or via malware. An HSM stores keys inside a protected module and performs operations so that the private key never leaves the device.

Another common motive is auditability and regulator requirements. When you need to prove who created a key and when, who had access, how rotation and destruction were performed, "just encrypting on the server" usually doesn’t provide the required level of verifiability. With an HSM it's easier to implement roles, auditing and procedures.

Before comparing Thales, Entrust and Utimaco, pin down the scenarios you actually need to protect. Otherwise procurement turns into a debate about abstract specs, and later you discover missing interfaces, no HA mode or incorrect separation of duties.

A practical start is to list where keys are used and who is responsible. Most often this is PKI and certificate issuance, TLS keys for critical services, code and update signing, database and backup encryption, and integrations with external systems and government services.

It’s important to distinguish terms. An HSM is a protected module that stores keys and performs crypto operations. A KMS is more about key and policy management at the service level and sometimes uses an HSM as a root of trust. “Encryption on the server” means keys and operations live in a regular OS, which carries a higher compromise risk.

Example: an organization has an internal CA and services in a datacenter. If strict auditability and separation of duties are required (common in government and finance), describe in advance which keys are most critical, what downtime is unacceptable and which reports an auditor will request in a year.

Gather requirements before comparing Thales, Entrust, Utimaco

Selection mistakes usually start not with the vendor but with vague expectations. Define what operations the HSM will perform daily and what you will consider success in 6–12 months.

Collect a list of scenarios and mark what matters most in each.

TLS for web services and APIs often hinges on operation rate, latency and easy connection to load balancers or web servers. Document and transaction signing depends more on roles, audit and issuance/revocation procedures. Database encryption often comes down to integration with specific DBMSs and where applications run (on-prem, VMs, containers). If you run your own CA, describe the HSM scenario for the CA separately: certificate issuance, storage of root and intermediate keys, ceremony requirements and access by multiple responsible parties.

Next, map applications and platforms. The same HSM may be great for PKI but inconvenient for a container environment if there’s no supported integration for your stack.

For vendor meetings a short questionnaire helps:

Which systems will use the HSM and via which interfaces.
Which OSes and environments are mandatory (physical servers, virtualization, containers) and who is responsible for post-update support.
How many sites are needed and are there requirements for geographic separation or operation during loss of connectivity.
Expected growth: number of keys, ops/sec, services and admin users.
Procurement and operational constraints (certifications, regulator requirements, maintenance windows, 24/7 operation).

Example: a bank with two datacenters in Kazakhstan may start with TLS and payment signing, then add an internal CA next year. If that isn’t recorded upfront, you can pick a device that supports the crypto but not the required number of sites or the scaling you need.

Performance: which numbers to request and how not to be misled

HSM performance rarely reduces to a single brochure figure. Identify which operations will be bottlenecks: signing (TLS, documents), decryption, key generation, certificate issuance, and session handling for containers.

Start with a simple calculation: how many operations per second you need on average and at peak, and what latency the most critical service tolerates. For example, a portal does 300 signatures/sec in normal hours and up to 3,000 during reporting. If the SLA requires responses within 200–300 ms, you need not just peak throughput but stable latency at peak.
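
The calculation above can be sketched in a few lines. This is a minimal sketch, not a sizing method: the per-unit throughput (1,200 signatures/sec), the 20 ms HSM latency and the 30% headroom are assumptions chosen for illustration, and only the 3,000 signatures/sec peak comes from the example. The second function applies Little's law to estimate how many operations are in flight at once, which hints at how many parallel sessions the clients need.

```python
import math

def required_units(peak_ops: float, unit_ops: float, headroom: float = 0.3) -> int:
    """Number of HSMs needed to serve peak_ops while leaving `headroom`
    (a fraction of each unit's capacity) unused."""
    usable = unit_ops * (1.0 - headroom)
    return math.ceil(peak_ops / usable)

def in_flight(ops_per_sec: float, latency_sec: float) -> float:
    """Little's law: average concurrent operations = rate x time per operation."""
    return ops_per_sec * latency_sec

PEAK = 3000         # signatures/sec during reporting (from the example)
UNIT_OPS = 1200     # ASSUMPTION: sustained signatures/sec measured for one HSM
HSM_LATENCY = 0.02  # ASSUMPTION: 20 ms per signature at the HSM under load

print(required_units(PEAK, UNIT_OPS))  # units needed with 30% headroom
print(in_flight(PEAK, HSM_LATENCY))    # operations busy at the same time
```

Replace the assumed figures with numbers measured on your own bench; the point is that unit count and session count both follow from the peak, not from the average.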

How to request a benchmark that is useful

Ask for tests on your algorithms, key sizes and modes, not a generic RSA/ECC number. Signing rates for RSA-2048 and RSA-4096 typically differ several-fold, and ECDSA P-256 and P-384 also differ noticeably. Test conditions matter: operational mode, enabled auditing, network vs PCIe connection, and number of parallel sessions.

In your vendor request (Thales, Entrust, Utimaco or any other) lock down and later verify in the report:

Load profile: operations, their share, required latency, peak duration.
Crypto parameters: algorithms, key sizes, modes (sign or decrypt).
Test stand conditions: firmware version, drivers, SDK, security mode, enabled policies and auditing.
Architecture: network HSM or PCIe, number of clients and sessions, channels, protection of the channel between app and HSM.
Licenses and limits: optional modules that affect parallelism and performance.

Don’t accept "up to N ops/sec" without conditions. If the report lacks mode, versions and settings, that number doesn’t help decision-making.
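
One way to apply this rule is to check every vendor report against the same list of conditions before looking at the headline number. The sketch below is illustrative only: the field names are hypothetical and should be adapted to whatever report template you agree with vendors.

```python
# ASSUMPTION: field names are hypothetical; align them with your own report template.
REQUIRED_FIELDS = {
    "algorithm", "key_size", "operation", "firmware_version", "sdk_version",
    "audit_enabled", "connection", "parallel_sessions",
}

def missing_conditions(report: dict) -> list:
    """Return the required test conditions the report omits or leaves empty."""
    return sorted(f for f in REQUIRED_FIELDS if report.get(f) in (None, ""))

brochure = {"algorithm": "RSA", "key_size": 2048, "operation": "sign",
            "ops_per_sec": 10000}
print(missing_conditions(brochure))
```

A report that returns a non-empty list is not comparable with the others until the vendor fills the gaps.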

Practical scenario: if a CA issues certificates in batches in the morning and the module serves TLS for internal services in the afternoon, request two test profiles and verify one workload doesn’t degrade when the other runs.

If you procure HSMs as part of an integration project, arrange a pilot on your workload and apps. Integrators like GSE.kz can help build a testbench, reproduce peak requests and compare results under equal conditions.

HA and fault tolerance: cluster and recovery questions

If the HSM supports critical services (signing, TLS, DB encryption), any outage can become a system-wide outage. Discuss HA before choosing model and licenses. You are buying not just a device but an operational scheme for failures.

How the cluster works in practice

First clarify which mode you need.

Active-active usually provides scaling and survives node failures without stopping, but it’s more complex and requires synchronization. Active-standby is simpler, but understand failover time and what happens to existing application sessions.

Ask vendors direct questions and get written answers with conditions:

Which modes are supported and what the license covers (cluster, node count, distance limits between sites).
What happens on network partition, reboot, power failure: is failover automatic and how do applications behave.
How keys and policies are synchronized: is a separate secure channel required, acceptable latency, and behavior on desynchronization.
What RTO and RPO you can expect for your services.
How planned maintenance is handled with no downtime: updates, node replacement, certificate rotation, and failover testing.

Tests and operations

Request test scenarios: simulate a node failure, degraded performance, full power-off recovery, and return to normal. Test not only the HSM but client connectors, PKI and applications.

Example: a bank has two nodes in one datacenter and a backup at another site. The question is not just whether the system survives a node loss but whether signature throughput remains sufficient at peak and whether the support team can quickly confirm keys and rights are synchronized.
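
The throughput part of that question reduces to a simple check. This is a rough sketch with assumed numbers (1,800 signatures/sec per node, a 3,000 signatures/sec peak); it counts only nodes that actively serve traffic, so a standby at another site is excluded unless it takes load.

```python
def survives_failures(nodes: int, per_node_ops: float, peak_ops: float,
                      failures: int = 1) -> bool:
    """True if the nodes left after `failures` losses can still carry the peak."""
    remaining = nodes - failures
    return remaining > 0 and remaining * per_node_ops >= peak_ops

PER_NODE = 1800  # ASSUMPTION: measured sustained signatures/sec per node
PEAK = 3000      # ASSUMPTION: peak signatures/sec for the service

print(survives_failures(2, PER_NODE, PEAK))  # two active nodes, one lost
print(survives_failures(3, PER_NODE, PEAK))  # three active nodes, one lost
```

With these figures a two-node cluster stays up after a node loss but cannot carry the peak, which is exactly the case a failover test should expose.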

PKI integrations: CA, interfaces and compatibility

A good HSM won’t help if it doesn’t integrate well with your PKI. Start by stating your current CA and RA and future plans. Integration with Microsoft AD CS, EJBCA or other CAs often determines which modules and clients you need.

Ask for an exact list of supported interfaces and modes. Common checks: PKCS#11, JCE for Java, CNG/KSP for Windows and AD CS, and KMIP if the module participates in key management outside PKI. Also confirm compatibility with your OS and virtualization versions.

Then review CA key issuance and storage and OCSP. Key questions: are keys generated inside the HSM, can export be disallowed, how are operator roles configured and which operations require dual control. Verify algorithm and key-length constraints not from marketing slides but against client and provider limits.

Limits often appear at the driver level: PKCS#11 versions, KSP specifics, or incompatibility with certain Windows Server or Linux builds. Request a compatibility matrix and a list of known issues.
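
Once a vendor sends a compatibility matrix, comparing it with your own stack can be done mechanically. The sketch below assumes a hypothetical environment and hypothetical vendor data; the platform names and interface pairs are placeholders for your real inventory.

```python
# ASSUMPTION: hypothetical environment and vendor data, for illustration only.
REQUIRED = {
    ("Windows Server 2022", "CNG/KSP"),  # e.g. AD CS
    ("RHEL 9", "PKCS#11"),               # e.g. EJBCA or custom applications
    ("RHEL 9", "JCE"),
}

def compatibility_gaps(required: set, vendor_matrix: set) -> list:
    """Return required (platform, interface) pairs the vendor matrix does not list."""
    return sorted(required - vendor_matrix)

vendor_matrix = {
    ("Windows Server 2022", "CNG/KSP"),
    ("RHEL 8", "PKCS#11"),
    ("RHEL 9", "PKCS#11"),
}
print(compatibility_gaps(REQUIRED, vendor_matrix))
```

Every pair that comes back is a question for the vendor or an item for bench testing.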

To avoid blind purchases, agree a test plan and a minimal testbench. Usually one test CA with a typical certificate template, OCSP (if used), a sample application, issuance/load tests and logging checks for typical operations are sufficient.

If you lack a lab, an integrator like GSE.kz can usually assemble such a bench and document results before purchase.

Key lifecycle: creation, storage, rotation, destruction

Most HSM-related risks stem from the key lifecycle, not cryptography itself. If creation, storage and rotation rules aren’t defined, a good module can turn into a source of outages and audit issues.

First question: where are keys created? For most tasks, best practice is generation inside the HSM so the private key never leaves the protected environment. Importing keys is needed for migration or legacy systems; decide in advance which imports are banned, which are allowed and who may perform them.

Next comes rotation. Time-based rotation (every 6–12 months) is not the only trigger. Events like account compromise, employee termination, algorithm changes or new regulations also require rotation. Plan mass replacement and assess impact on TLS, document signing, DB encryption and tokens.
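
The two kinds of trigger can be expressed as one rule. This is a minimal sketch: the event names and the 365-day limit are assumptions standing in for whatever your policy defines.

```python
from datetime import date, timedelta

# ASSUMPTION: event names are illustrative labels for the triggers listed above.
ROTATION_EVENTS = {"account_compromise", "employee_termination",
                   "algorithm_change", "new_regulation"}

def rotation_due(created: date, today: date, max_age_days: int, events: set) -> bool:
    """A key is due for rotation if it is too old OR a triggering event occurred."""
    too_old = (today - created) > timedelta(days=max_age_days)
    return too_old or bool(events & ROTATION_EVENTS)

created = date(2024, 1, 1)
print(rotation_due(created, date(2024, 6, 1), 365, set()))
print(rotation_due(created, date(2024, 6, 1), 365, {"employee_termination"}))
print(rotation_due(created, date(2025, 2, 1), 365, set()))
```

Keeping the rule explicit makes it easier to show an auditor why a given key was, or was not, rotated.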

Backup is often underestimated. Understand how backups are made, where they are stored, who has access and whether you can recover to another device or site.

Mini checklist of questions:

Are keys generated only inside the HSM or is import allowed? What import restrictions are enforced by policy?
How is rotation organized: is batch replacement possible and can it be done safely without downtime?
How are backups and restores performed: encryption, access control, compatibility between models?
How is key destruction handled: are there confirmations (logs, reports) to show an auditor?
How are roles separated (admin, security officer, operator, auditor) and which actions require multi-person control?

Example: a bank rotates payment signing keys. If rotation relies on manual work by a single admin, you get a one-person risk window. With role separation and formal procedures, replacement follows the plan and leaves a clear trail.

Replacement procedures and emergency scenarios (key rollover)

Key replacement often comes up at the worst moment: compromise, expiry or migration to a new module. So understand not only “how keys are created” but how you act when a key must be replaced quickly and safely.

If a key is compromised: who decides and what is done

Agree in advance who can declare an incident and start replacement: service owner, InfoSec, PKI admin, or shift lead. Without this, precious time is lost in approvals.

Ask the vendor and integrator to outline the steps and clearly show what is quick and what requires manual work:

who participates and how many people are needed for key-critical operations
how fast a new key can be issued and how correct installation is verified
how certificates are revoked and how you ensure the old key is no longer used
how the process is recorded in logs for audit
typical timelines per step (preferably in a table)

Replacement without downtime and migration to a new HSM

"No downtime" often means no HSM downtime, but applications still need switching. Clarify whether old and new keys can operate in parallel, how long you can keep two active keys, and what happens to client sessions.

In migration, the key question is whether keys can be moved and in what form. If export is forbidden by policy, you’ll need to reissue certificates and change dependencies. Ask directly: are key wrapping formats compatible, are key attributes preserved, and what if you change vendors in a year?

Also account for operational details that break plans: delivery and replacement lead times, availability of spare parts, and whether a spare unit or at least spare PSUs/network modules is required.

The outcome should be documents, not oral agreements: a short runbook with roles, steps, timing and checkpoints, plus a separate emergency migration instruction.

Security and audit: access, logs, compliance

HSM security is not only "crypto inside the box" but also how you later prove who did what. Address audit and access questions before the pilot, not after procurement.

Access: who can do what

Start with a role model. Verify whether the device enforces separation of duties and dual control: for example, a single admin should not be able to create a master key, change policies and export a backup alone. Ask about strong admin authentication (smart cards, hardware tokens, MFA) and how operator credentials are stored and rotated.

Practical question: can you define crypto-officer, operator, auditor and integration admin roles, and are some role combinations forbidden?
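
Both ideas, forbidden role combinations and dual control, are easy to state as checks. The sketch below is a model for discussion, not any vendor's role scheme: the role names and the forbidden pairs are assumptions.

```python
# ASSUMPTION: role names and forbidden pairs are illustrative, not a vendor's model.
FORBIDDEN_COMBINATIONS = [
    frozenset({"crypto_officer", "auditor"}),
    frozenset({"integration_admin", "auditor"}),
]

def role_conflicts(assigned: set) -> list:
    """Forbidden combinations fully contained in one person's assigned roles."""
    return [combo for combo in FORBIDDEN_COMBINATIONS if combo <= assigned]

def quorum_met(approvers: list, m: int = 2) -> bool:
    """Dual control (m=2) or M-of-N: only distinct approvers count."""
    return len(set(approvers)) >= m

print(role_conflicts({"crypto_officer", "auditor", "operator"}))
print(quorum_met(["alice", "alice"]))  # same person twice is not dual control
print(quorum_met(["alice", "bob"]))
```

Ask the vendor whether the device enforces equivalent rules itself or whether they exist only in your procedures.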

Logs and verifiability

You need events beyond "successful login": all key and policy operations should be logged—creation, import/export (if allowed), activation, rotation, destruction, role changes and forbidden attempts. Ask how logs are protected from tampering, whether they are signed or chained for integrity, and how they are exported to a SIEM including during connectivity loss.
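
To make "chained for integrity" concrete, here is a minimal sketch of a hash-chained log using only the Python standard library. It illustrates the general technique, not how any particular HSM implements it; the event fields are hypothetical. A plain hash chain detects edits and deletions inside the log, but someone who can rewrite the whole chain can recompute it, which is why real systems also sign entries or anchor the latest hash externally.

```python
import hashlib
import json

GENESIS = "0" * 64  # fixed starting value for the first entry

def _digest(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append_entry(chain: list, event: dict) -> None:
    """Append an event whose hash covers the previous entry's hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev": prev, "hash": _digest(prev, event)})

def verify(chain: list) -> bool:
    """Recompute every hash; any edited or removed entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True

log = []
append_entry(log, {"actor": "officer1", "action": "key_create", "key": "ca-int-01"})
append_entry(log, {"actor": "officer2", "action": "key_activate", "key": "ca-int-01"})
append_entry(log, {"actor": "officer1", "action": "policy_change", "key": "ca-int-01"})
print(verify(log))                      # untouched log
log[1]["event"]["actor"] = "someone_else"
print(verify(log))                      # after tampering with the second entry
```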

Lock expectations with the vendor on:

which events are logged by default and what can be added
how quickly logs reach an external system and what happens on overflow
which audit reports are available (by key, admin, policy changes)
how dual-control is evidenced in reports
retention periods on-device and off-device

Example: auditors in a bank or government may ask to see the chain of actions for a certificate key: who created it, who approved it, who put it into production and when rotation occurred. If such reports cannot be produced natively or exported to SIEM, manual work and audit findings follow.

Common procurement and deployment pitfalls

A typical mistake is assuming that because a device is certified and "supports crypto", everything will work. In practice complexity lies in load, processes and integrations.

Ask the vendor or integrator to list common failures for your scenario: transaction signing, certificate issuance, DB encryption, TLS keys. A good answer is a list of real issues and mitigation steps, not marketing wording.

Load is often underestimated: bulk signing in the morning and end-of-day peaks create pressure. If the application queues operations, users notice latency and teams fix it with workarounds. Request numbers and test conditions: algorithms, key sizes, parallel sessions, 95th-percentile latency.
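
A short sketch shows why the 95th percentile is worth requesting alongside the average. The latency samples are made up for illustration, and the function uses the nearest-rank definition of a percentile; vendors may use a different definition, so ask which one a report uses.

```python
def percentile(samples_ms: list, p: int):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p/100 * n
    return ordered[max(rank, 1) - 1]

# ASSUMPTION: invented samples, 18 fast operations and two slow ones under peak.
samples = [12] * 18 + [40, 400]
mean = sum(samples) / len(samples)
print(mean)                      # the average looks acceptable
print(percentile(samples, 50))   # the median too
print(percentile(samples, 95))   # the tail is where users notice delays
print(percentile(samples, 100))  # worst case
```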

The second pitfall is roles and procedures. When "the password is with one person" and initialization, recovery, import/export and backup procedures are undocumented and untested, you get risk and downtime. Procure processes as well as the device.

The third source of technical debt is the lack of a rotation and replacement plan. Keys and certificates live for years. Without defined schedules, maintenance windows and responsibilities, replacement becomes a risky project in 2–3 years.

Before signing, check:

which 3–5 typical deployment mistakes the supplier sees and how to avoid them
how performance will be validated on your bench and who owns the methodology
which roles are required (M of N, separation of duties) and which procedures must be approved
what the rotation and destruction plan looks like and what happens on certificate expiry
which integrations are supported and which need bench testing

If you work through an SI, lock these points into the work and acceptance plan instead of leaving them for later.

Short checklist for RFPs and vendor meetings

To compare Thales, Entrust and Utimaco fairly, ask the same questions in the same terms and request answers with numbers, diagrams and test conditions. It’s important not only whether a feature is supported, but how it behaves under your load.

Five question areas that quickly clarify the picture

Agree a typical scenario before the meeting: the most common operations, how many apps will connect and expected growth in 2–3 years.

Performance: which operations were measured, on which model and firmware, with what settings, and what is the 95th-percentile latency.
Scaling: how throughput changes with more sessions/clients, and are there connection or license limits.
HA and maintenance: cluster scheme, ability to update and replace nodes without downtime, declared RTO/RPO and failure-test details.
Integrations: available interfaces (PKCS#11, KMIP, etc.), which PKI/CA were tested for compatibility, and who supports connectors.
Key lifecycle: rotation and policy mechanisms, what exactly is backed up, secure destruction, and M-of-N implementation.

What to ask for as proof, not a promise

Unverifiable answers are hard to defend at procurement or audit. Request:

a load test report (conditions, operation profiles, latency results)
an HA diagram with failover description and recovery procedures
a compatibility matrix (your OS, hypervisor, PKI, apps) and list of limitations
draft procedures: commissioning, rotation, emergency key replacement, role control
log descriptions and SIEM integration: which events are logged, how integrity is ensured, retention

If your team cannot deeply validate responses, involve an integrator early: nuances in tests, HA and operations often surface only in hands-on verification.

Example scenario and next steps for deployment

Imagine a bank or government body: an internal PKI issues certificates for staff and systems, plus document signing, channel encryption and continuity requirements (signing must work during node update or failure).

Make procurement verifiable by phrasing requirements as tests. For example: "the cluster survives a single-node failure without service interruption", "rotation follows the procedure and doesn’t break integrations", "recovery after device loss is possible using a multi-role procedure". Define what counts as downtime—e.g., signature unavailability for critical systems longer than 1 minute.
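
A downtime definition like this can be turned into an acceptance check. The sketch below assumes a hypothetical probe that attempts a test signature every 10 seconds during the failover test and records whether it succeeded; the 60-second limit comes from the example definition above.

```python
def longest_outage_seconds(probes: list) -> float:
    """probes: (timestamp_seconds, signing_succeeded) pairs sorted by time.
    An outage runs from the first failed probe to the next successful one."""
    longest = 0.0
    outage_start = None
    for ts, ok in probes:
        if not ok and outage_start is None:
            outage_start = ts
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None:  # still failing when the test ended
        longest = max(longest, probes[-1][0] - outage_start)
    return longest

MAX_DOWNTIME = 60  # seconds, per the example definition of downtime

# ASSUMPTION: invented probe results from a node-shutdown test.
probes = [(0, True), (10, False), (20, False), (30, False), (40, True), (50, True)]
outage = longest_outage_seconds(probes)
print(outage, "PASS" if outage <= MAX_DOWNTIME else "FAIL")
```

The probe interval limits the precision of the result, so choose it well below the downtime threshold you intend to accept.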

A 1–2 week pilot should focus on real operations. Deliver measurable results: performance on typical operations, HA verification (node shutdown, failover, recovery), rotation scenario, log and audit trail checks, plus clear admin and on-call instructions.

Then proceed in short steps: approve 3–5 key scenarios, prepare an RFP and build a bench as close to prod as possible. Assign owners for procedures (PKI, infra, security) and those who will accept tests.

Integration can be done by your team but often goes faster with a system integrator experienced in PKI and datacenters with 24/7 support. In Kazakhstan, projects including design, deployment and ongoing operation with a vendor-neutral approach can be delivered by GSE.kz.