Jun 07, 2025·8 min

PKI Server: signing capacity, redundancy and HSM considerations for getting started

How to choose a PKI server: estimate signing load, calculate peak issuance, the role of HSM, redundancy options and readiness checks.

Where choosing a PKI server begins

You don't choose a PKI server by headcount, but by the operations the system must handle in the busiest hours. During mass issuance a PKI receives and validates requests, creates certificates, signs them, publishes data to repositories and keeps logs. The most common bottleneck is signing and the operations around it.

A typical early mistake is to look only at the average request rate. For PKI, peaks are almost always more important: morning employee logins, mass token replacement, reissues before expiration, migration to a new algorithm, or opening a branch. In such windows a 30–60 second delay quickly becomes an hours-long queue.
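To see how a short burst turns into a long wait, here is a minimal sketch of a simplified queue model. It assumes constant arrival and service rates and ignores retries, which would only make things worse. The function name and the numbers are illustrative, not measurements.

```python
def wait_at_end_of_burst(arrival_per_s: float, service_per_s: float, burst_s: float) -> float:
    """Seconds a request arriving at the end of the burst waits for its turn.

    Simplified fluid model: constant arrival and service rates, no retries.
    """
    backlog = max(0.0, (arrival_per_s - service_per_s) * burst_s)
    return backlog / service_per_s


# Hypothetical numbers: 7 requests/s arrive for 15 minutes, the CA signs 2/s.
wait = wait_at_end_of_burst(arrival_per_s=7, service_per_s=2, burst_s=15 * 60)
print(f"{wait / 60:.1f} minutes")  # 37.5 minutes
```

Even in this optimistic model, the last request of a 15-minute burst waits more than half an hour. Client retries add to the arrival rate and lengthen the queue further.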

The most painful failures are often not "the server crashed" but the chain of consequences: dependent services can't issue or validate certificates; issuance time rises, so users and services retry requests and overload the system further; and components fall out of sync (a certificate is issued but not published on time, or CRL/OCSP updates are delayed).

Before buying hardware, and especially an HSM, document several decisions in writing:

  • Which mass operation scenarios are realistic (initial issuance, scheduled reissue, revocation, service certificates)?
  • What is the time objective: how many seconds are acceptable to issue one certificate at peak?
  • Will there be one CA or several (separated by certificate type and trust zone)?
  • How will you ride out failures (hot spare, cold spare, or accepted scheduled downtime)?
  • Do you need an HSM immediately, and what matters more: maximum security, maximum signing speed, or a balance?

A simple guideline: if you reissue 30,000 certificates once a quarter over two working days, the server can be "quiet" all month but critical during those two days. That is where capacity calculation and architecture choices start.

PKI components and where load appears

PKI almost never reduces to a single server. Even if you deploy everything on one node at the start, different roles live inside it and their loads differ: some hit cryptography, some the database, and some the network and disks.

Typical components include a certification authority (CA) that signs certificates and revocation lists; a registration authority (RA) that verifies requests and identities; OCSP for online status checks; CRL publication; an application portal or API; a database for requests and state; and audit logs.

Where bottlenecks most often appear

The most noticeable load usually arises in three places:

  • signing and related operations at the CA (especially during mass issuance or reissue);
  • OCSP, when many clients check certificate status regularly;
  • the database and disks when detailed logging, request storage and frequent writes from the RA and portal are enabled.

OCSP often stresses network and request handling rather than cryptography. CRL publication can suddenly become heavy on disk and CPU at publish time if the list is large or updates are frequent.

What can be combined and what should be separated

In small deployments RA, portal and database are often combined to simplify administration and reduce cost. But some roles are better separated sooner rather than later. The signing CA is usually made separate and more protected, sometimes completely isolated from the network. OCSP is worth placing on a separate node if you have many workstations, VPN clients or services that check certificates on each connection.

Regulatory and audit requirements strongly influence architecture: separate administrator roles, immutable logs, long retention of events, precise time synchronization and strict key access control. This adds load on storage, backups and control processes even when certificate issuance is not daily.

How to describe real load: not only averages

Average load almost always deceives. PKI can be "silent" for hours and then receive a flood of requests in 10 minutes that multiplies the signing queue. It is therefore more convenient to describe load as a set of scenarios with peaks and time constraints.

Describe what will actually happen: initial issuance (onboarding employees or devices), scheduled reissue, mass replacement due to a crypto policy change, revocation and emergency reissue after an incident. For each scenario record not only the total count but also the time window in which it must be completed.

Which metrics to collect

Instead of the vague "lots of requests" keep 2–3 clear numbers:

  • certificates per hour for key scenarios;
  • signing operations per second at peak (not average);
  • acceptable queue length (how many requests can wait without violating SLAs).

Example: if 2,000 employees arrive around 9:30 and must get access, the relevant metric is not daily volume but the window from 9:00 to 10:00. Even if the daily number is "2,000 per day", the real task is "2,000 within 60 minutes".
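The difference between the two framings can be made concrete with a small sketch. The helper names are hypothetical, and the service rate and wait target in the queue example are assumptions chosen only for illustration.

```python
import math


def window_rate(requests: int, window_minutes: float) -> float:
    """Average requests per second needed to finish within the window."""
    return requests / (window_minutes * 60)


def max_queue_length(service_per_s: float, max_wait_s: float) -> int:
    """Largest queue that still drains within the wait target."""
    return math.floor(service_per_s * max_wait_s)


per_day = window_rate(2000, 24 * 60)  # about 0.023 requests/s
per_hour = window_rate(2000, 60)      # about 0.56 requests/s
print(f"per-hour framing is {per_hour / per_day:.0f}x the per-day framing")

# Assumed: the CA signs 5 certificates/s and a request may wait at most 5 s.
print(max_queue_length(service_per_s=5, max_wait_s=5))  # 25
```

The same 2,000 requests demand 24 times the throughput once the one-hour window is taken seriously.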

Where peaks occur

Peaks are often tied to routine events: morning logins, shift starts, mass updates, academic term starts, and reporting deadlines. Also note campaigns: annual certificate replacement and emergency reissue "by the end of day".

Validity periods and reissue policy directly determine workload. Short lifetimes (e.g., 3–6 months) and strict rotation policies increase signing and status-check frequency. If reissue is allowed "30 days before expiration" load spreads out. If "3 days" is allowed, you will create a peak yourself.
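A rough sketch of both effects follows. It assumes renewals are spread evenly across the allowed window; in practice users tend to cluster near the deadline, so real peaks are higher. The population of 30,000 and the function names are illustrative.

```python
def annual_issuance(population: int, lifetime_days: int) -> float:
    """Certificates signed per year if each holder renews once per lifetime."""
    return population * 365 / lifetime_days


def daily_reissue_load(batch: int, renewal_window_days: int) -> float:
    """Reissues per day if a batch expiring together renews evenly over the window."""
    return batch / renewal_window_days


population = 30_000  # assumed population, for illustration only
print(annual_issuance(population, lifetime_days=365))  # 30000.0
print(round(annual_issuance(population, lifetime_days=90)))  # 121667

print(daily_reissue_load(population, renewal_window_days=30))  # 1000.0
print(daily_reissue_load(population, renewal_window_days=3))   # 10000.0
```

Shortening the renewal window from 30 days to 3 multiplies the daily load by ten without changing the total number of certificates.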

At this stage decide whether the scheme includes an HSM: it improves security but changes performance and queue planning, especially during mass issuance.

Estimating signing operation load

In PKI load is rarely just "certificates per day." Different operations have different cost and therefore different requirements for the PKI server and crypto components.

Split signing into three streams: certificate signing (issuance and reissue), OCSP response signing (real-time status checks), and CRL signing (revocation list publication). OCSP often imposes a steady "small" load while mass issuance creates peaks.

Algorithm and key parameters greatly affect speed. RSA-2048 is widely compatible, but RSA signing cost grows steeply with key length: RSA-4096 typically signs several times slower than RSA-2048. ECDSA produces smaller signatures and usually signs faster at a comparable security level, but older clients and devices may not support it. Therefore evaluate not in abstract "operations per second" but in concrete operations for the chosen algorithms and key sizes.

Next determine concurrency: how many requests arrive simultaneously and what latency is acceptable. For OCSP, stable response time is often more important than occasional peak throughput.

It is convenient to summarize estimates in a table and align them with the business:

Operation | Frequency (avg/peak) | Target response time | Criticality
Certificate signing | e.g., 2,000/hour during a mass reissue | up to 2–5 s | high during campaign
OCSP signing | e.g., 50–200 req/s during working hours | up to 200–500 ms | continuously critical
CRL signing | e.g., once every 1–6 hours | minutes acceptable | medium

To avoid mistakes check four things: how many applications and users actually perform status checks (OCSP); whether there will be batch operations (mass reissue, onboarding); whether there are tight SLAs for key systems; and which algorithms/key lengths are approved by policy or regulators.

If an organization simultaneously runs a 30,000 certificate reissue while OCSP checks for workstations and servers continue, count these streams separately. Otherwise you will either overprovision or experience unstable OCSP responses on normal days.
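One way to keep the streams separate is to record each with its own peak rate and latency target. The sketch below uses the example figures from the table and assumes, as a worst case, that every OCSP request triggers one signature with no caching or pre-signed responses. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    name: str
    peak_per_s: float        # signing operations per second at peak
    target_latency_s: float  # acceptable response time


# Example figures from the table; the OCSP rate assumes no response caching.
STREAMS = [
    Stream("certificate signing", 2000 / 3600, 5.0),
    Stream("OCSP signing", 200.0, 0.5),
    Stream("CRL signing", 1 / 3600, 60.0),
]


def total_peak(streams: list[Stream]) -> float:
    """Combined signing rate if all peaks coincide."""
    return sum(s.peak_per_s for s in streams)


def tightest(streams: list[Stream]) -> Stream:
    """Stream with the strictest latency target."""
    return min(streams, key=lambda s: s.target_latency_s)


print(f"combined peak: {total_peak(STREAMS):.1f} signatures/s")
print(f"strictest latency: {tightest(STREAMS).name}")
```

With these figures OCSP dominates both the rate and the latency requirement, which is why a reissue campaign should not be allowed to starve it.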

HSM at the start: what to check before design

An HSM (hardware security module) is needed not only to "hide" keys. It sets the rules for the whole project: where CA keys live, how crypto signing is performed and how to later prove who signed what (audit and operation logs).

Before design decide which operations must run inside the HSM. Usually the most sensitive keys (e.g., issuing CA keys) go inside the HSM while less critical operations stay on the server. This affects security, cost and performance.

HSM parameters that often become bottlenecks

On paper an HSM can "support PKI", but in real mass issuance concrete limits matter:

  • signing performance (signatures per second for your algorithm and key size);
  • number of concurrent sessions and the rate of opening new sessions;
  • high availability mode: active‑active or active‑passive, and failover behavior;
  • audit: whether protected logs are enabled, how to export and where to store them.

Example: if you need to issue 10,000 certificates in an hour at peak, the average load (about 2.8 certificates per second) looks manageable. But when requests arrive in waves (morning logins plus service certificate updates), throughput may be limited not by server CPU but by the HSM's signature or session limits.
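A simplified way to reason about these two limits: each session can have one signing call in flight, so the session count divided by the round-trip time gives a session-bound rate, and the effective rate is the lower of that and the rated figure. This is a sketch under that assumption, not a vendor formula, and all numbers are hypothetical.

```python
def effective_signing_rate(rated_sig_per_s: float, sessions: int, round_trip_s: float) -> float:
    """Lower of the rated signing rate and the session-bound rate.

    Assumes one in-flight signing call per session (simplified model).
    """
    session_bound = sessions / round_trip_s
    return min(rated_sig_per_s, session_bound)


# Hypothetical HSM: rated at 500 signatures/s, but the application opens
# only 4 sessions and each signing round trip takes 20 ms.
rate = effective_signing_rate(rated_sig_per_s=500, sessions=4, round_trip_s=0.02)
print(f"{rate:.0f} signatures/s")  # 200 signatures/s
```

In this example the datasheet figure is never reached: the session limit caps throughput at less than half of it.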

How not to hit limits a year later

Plan for growth: new departments, integrations, automatic rotation, more service certificates. Include a margin for HSM performance and sessions, and check scalability: extra licenses, adding a second HSM, or moving to a clustered HSM.

Typical deployment options: HSM inside the same server (compact and simple but harder for availability), a separate device in the rack (easier for maintenance and HA), or an HSM cluster for large loads and strict availability requirements.

If you choose a server with an HSM from the start, tie that choice to redundancy: HSM failure often stops issuance and revocation even if other servers keep running.

Selecting the server and infrastructure for PKI

When choosing a server for PKI look at where the bottleneck will be: signing, the database, log storage, CRL publication or OCSP responses. The system is often quiet for months and then hits resource limits during mass issuance or reissue.

CPU, memory, disk and network: what will be limiting

CPU matters for two reasons. Clock speed is important when signing and other crypto operations run sequentially and you need minimal per-request latency. Core count matters when issuance runs in parallel (multiple workers, request queues) and the database and publication services run alongside. If signing is offloaded to an HSM, server CPU load drops but requirements for network stability and latency between server and HSM increase.

Memory is usually used by DB caches, queues and service processes. In a peak, lack of RAM causes swapping and a persistently slow PKI even with a good CPU.

Disk often surprises teams. Logs, databases, queues, audit, backups and retention require not only capacity but stable IOPS. Separate fast volumes for database and logs often bring more benefit than simply "bigger SSDs." Decide early where logs are stored (locally or centrally), how many months to retain them and how quickly you must restore from backup.
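Retention requirements translate into capacity with simple arithmetic. The sketch below is a first-order estimate that ignores compression, indexes and backup copies. The event volume and record size are assumptions to be replaced with measured values.

```python
def log_storage_gib(events_per_day: int, bytes_per_event: int, retention_days: int) -> float:
    """Raw log volume in GiB, before compression, indexes or backups."""
    return events_per_day * bytes_per_event * retention_days / 2**30


# Assumed: 1 million audit events per day, 1 KiB each, kept for one year.
size = log_storage_gib(events_per_day=1_000_000, bytes_per_event=1024, retention_days=365)
print(f"{size:.0f} GiB")  # 348 GiB
```

Capacity is only half of the picture: the same logs also generate a steady write load, which is why the IOPS of the log volume matters as much as its size.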

Network requires predictable latency. PKI needs time synchronization (NTP) and access to segments where clients, directories and publication points live. Segmentation reduces risk but adds routing and access requirements.

Environment and trust boundary separation

To avoid stopping issuance due to tests or administration, separate at least logically: production (issuance, CRL/OCSP, DB), test/sandbox for updates and maintenance, the signing boundary (where signing and key storage live), and an admin boundary (strict access rules and audit).

For a quick start teams often choose a standard server node and scale later. For projects in Kazakhstan it can be reasonable to consider locally manufactured rack servers like the GSE S200 and then tailor storage and network to your peak issuance and audit requirements.

Redundancy strategy: what really affects availability

PKI availability usually fails because of wrong expectations, not capacity. Start with simple goals: how much downtime you can tolerate (RTO) and how much data loss is acceptable (RPO). These figures differ across PKI components because issuance, OCSP and CRL follow different rules.

Document requirements separately:

  • issuance and reissue: how many minutes or hours of downtime are acceptable during a business day;
  • OCSP: how many seconds of downtime are acceptable before checks start failing at scale;
  • CRL: how often it is updated and how quickly it must be available again after a failure;
  • admin and audit: how fast access to logs and reports must be restored.
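Written down as data, these objectives can be checked against recovery times measured in drills. The values below are placeholders rather than recommendations, and the structure is only one possible way to record them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    rto_s: int  # maximum tolerated downtime, seconds
    rpo_s: int  # maximum tolerated data loss, seconds


# Placeholder targets; agree the real ones with service owners.
OBJECTIVES = {
    "issuance": Objective(rto_s=4 * 3600, rpo_s=15 * 60),
    "ocsp": Objective(rto_s=60, rpo_s=0),
    "crl": Objective(rto_s=3600, rpo_s=0),
    "audit": Objective(rto_s=24 * 3600, rpo_s=0),
}


def rto_violations(measured_recovery_s: dict[str, float]) -> list[str]:
    """Components whose measured recovery time exceeded the RTO."""
    return sorted(
        name
        for name, seconds in measured_recovery_s.items()
        if seconds > OBJECTIVES[name].rto_s
    )


# Example drill: OCSP took 2 minutes to recover, issuance took 10 minutes.
print(rto_violations({"ocsp": 120, "issuance": 600}))  # ['ocsp']
```

The same recovery time can be acceptable for one component and a violation for another, which is the reason to record the objectives separately.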

Choose a redundancy model based on those numbers, not "by habit." Active‑standby is simpler and predictable, often suitable for a CA. Active‑active fits where many checks occur (e.g., OCSP) but requires careful synchronization and change discipline. Geo‑redundancy and distributed sites make sense when site failure is unacceptable, but consider latency, links and the failover procedure.

What to always back up:

  • HSM and its keys (HSM cluster or spare module, plus a clear recovery procedure);
  • the database if in use (replication, backups and tested restores);
  • service configurations, policies and certificate templates;
  • audit logs and security events (store separately and protect against modification);
  • publications (CRL, OCSP responses) and checks for freshness.
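A freshness check for a published CRL can be as simple as comparing its nextUpdate time with the current time plus a safety margin. The sketch assumes a monitoring job has already parsed nextUpdate into a timezone-aware datetime. The function name and the 30-minute margin are illustrative.

```python
from datetime import datetime, timedelta, timezone


def crl_is_fresh(
    next_update: datetime,
    now: datetime,
    safety_margin: timedelta = timedelta(minutes=30),
) -> bool:
    """True while the published CRL stays valid for longer than the margin."""
    return now + safety_margin < next_update


now = datetime(2025, 6, 7, 12, 0, tzinfo=timezone.utc)
print(crl_is_fresh(datetime(2025, 6, 7, 18, 0, tzinfo=timezone.utc), now))   # True
print(crl_is_fresh(datetime(2025, 6, 7, 12, 10, tzinfo=timezone.utc), now))  # False
```

Alerting before the CRL actually expires leaves time to republish it while clients can still validate certificates.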

Remove single points of failure around PKI: two independent power feeds, redundant network and routing, and a reliable time source. Time synchronization errors can "break" certificate validation as badly as a server outage.

Step‑by‑step plan to choose and verify capacity

Start the plan with numbers and operating rules, not hardware, so your PKI server is sized to real signing and peaks rather than guessed headroom.

Steps before procurement

  1. Collect scenarios and numbers: how many users and systems, certificates per year, peak windows (e.g., Monday morning or mass reissue), and availability/response time requirements (SLA).

  2. Describe architecture roles and trust boundaries: where the root CA will be, where issuing CAs will be, where OCSP/CRL live, and who administers what. This determines which parts can scale independently.

  3. Clarify the HSM role and redundancy of key nodes. Fix which operations go into the HSM (certificate signing, CRL and OCSP signing), limits on performance and sessions, failover options (second HSM, cluster, cold replacement) and the switchover procedure.

  4. Describe operations, otherwise capacity calculations quickly fail in practice: monitoring of signing queues and errors, configuration and DB backups per procedure, maintenance windows and rotation plans, logging and access control for audits.

  5. Run a pilot with load testing and refine the specification. Model at least 10–20% of the peak: mass request submission, concurrent signing, CRL publication. Based on results decide whether to scale CPU, disks, network or HSM. If buying from an integrator, ask for a test bench or pilot build to confirm numbers before delivery.

Typical mistakes when designing a PKI server

The most common mistake is sizing by the daily average. PKI lives on peaks: reissues before expiration, branch onboarding, policy changes, or urgent key replacement. Designing for the average leads to signing queues during peaks, user delays and potentially service outages.

Another problem is choosing an HSM that is "generally suitable" but not checking details. Verify limits on parallel sessions and signing operations, HA modes, and failover procedures beforehand. An HSM might look fast on paper but be constrained by connection limits or an inconvenient recovery process.

A further risk is placing all roles on one node "for simplicity." CA, DB, CRL/OCSP publication, web registration and monitoring on one machine make failures expensive: one component fails and everything becomes unavailable.

Often overlooked: the volume and growth rate of audit logs, request archives, CRLs and their retention; recovery time not only for the server but for keys, policies and trust chains; regular recovery tests (not just “backups exist” but “we restored and signed a test certificate”); campaign runbooks and maintenance windows; separate environments and admin accesses to prevent accidental changes.

If you buy servers and integration from a vendor or integrator like GSE.kz, ask them to show a typical high‑availability schematic and a recovery test plan in advance. This often saves months of fixes after the first real peaks.

Short checklist before procurement

Before picking a concrete server configuration, gather facts to reduce the chance of buying a powerful but unsuitable solution — for example an HSM without required availability or an OCSP bottleneck.

Load and peaks

Document peaks and what happens simultaneously. It is convenient to frame this as numbers for the worst 10–15 minute window.

  • Peak operation counts: issuance and reissue, OCSP responses, CRL publication/downloads, admin actions (audit, reports, backups).
  • Clarity on priorities: is campaign issuance speed more important than stable daily OCSP?

Crypto parameters and HSM

Agree crypto parameters with business owners and security. Changing algorithms or key lengths later is usually more expensive than accounting for them upfront.

  • A list of algorithms (RSA/ECDSA), key sizes, formats, lifetimes, certificate profiles and compatibility requirements.
  • HSM selected by performance (signatures/sec), HA mode, key management (backup/restore, role separation, m-of-n) and how key rotation is performed.

Availability, processes and control

Define what counts as downtime and what data loss is acceptable. This directly affects redundancy design and budget.

  • RTO/RPO and site redundancy are defined: what is duplicated, how to fail over, and which components can operate read‑only.
  • Monitoring is planned (queues, signing latency, HSM health), updates and maintenance windows, role-based access rules and audit (who approves key and policy operations).

If procuring infrastructure from a local vendor and integrator, clarify 24/7 support responsibilities and replacement timelines for failed nodes. Otherwise the designed high availability may remain theoretical.

Example scenario: mass reissue in an organization

Imagine a bank or agency runs a reissue campaign for 20,000 employees. Users were warned, but many arrive in the first hours: without a certificate mail, VPN or internal systems don't work.

Estimate the peak. Suppose 50% of staff (10,000) try to reissue within the first 2 hours. Average throughput would be 10,000 / (2 * 3600) ≈ 1.39 requests per second. However, load is uneven: the first 10–15 minutes can be 3–5 times higher, or roughly 4–7 requests per second. For calculation use the upper end of that range, e.g., 7 requests per second.

Then determine how many signing operations each request triggers. If issuance includes signing the certificate and signing a response/token, assume 2 signatures per request. At the peak you would need about 14 signatures per second plus a 30–50% margin for retries, client errors and queueing, or roughly 18–21 signatures per second.
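The whole estimate can be kept as one small function so it is easy to rerun with different assumptions. This sketch uses the scenario's figures without intermediate rounding, so the results land slightly below numbers obtained by first rounding the peak to 7 requests per second. The function and parameter names are illustrative.

```python
def peak_signatures_per_s(
    requests: int,
    window_s: float,
    peak_factor: float,
    sigs_per_request: int,
    margin: float,
) -> float:
    """Peak signing rate: average rate x burst factor x signatures x (1 + margin)."""
    average = requests / window_s
    return average * peak_factor * sigs_per_request * (1 + margin)


# 10,000 requests in 2 hours, bursts up to 5x the average, 2 signatures each.
low = peak_signatures_per_s(10_000, 2 * 3600, peak_factor=5, sigs_per_request=2, margin=0.3)
high = peak_signatures_per_s(10_000, 2 * 3600, peak_factor=5, sigs_per_request=2, margin=0.5)
print(f"{low:.1f} to {high:.1f} signatures/s")  # 18.1 to 20.8 signatures/s
```

Changing the peak factor or the number of signatures per request shows quickly which assumption the final figure is most sensitive to.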

To avoid a single point bottleneck split roles across nodes:

  • CA node (issuance) with HSM access;
  • separate OCSP node closer to clients;
  • separate database/log server (or dedicated VM/instance);
  • HSM in HA if availability is critical.

Minimum redundancy to keep issuance running: a backup CA (cold or warm per procedure), duplicated OCSP, tested DB restore and a spare path to HSM (second card/module or HA cluster).

Before the campaign run three tests: a peak load run with real certificate templates, a node failure test (for example OCSP or DB), and a recovery measurement (time to resume issuance and log integrity). This is cheaper than discovering system limits on the first live day.
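The shape of a peak load run can be sketched with the standard library alone. In this sketch `fake_sign` is a stand-in that only sleeps; a real test would replace it with a call to the issuance API using real certificate templates. The worker count and request volume are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def fake_sign(_request_id: int) -> float:
    """Stand-in for a signing call; returns its own service time in seconds."""
    start = time.perf_counter()
    time.sleep(0.005)  # placeholder for the real CA/HSM round trip
    return time.perf_counter() - start


def run_load(total_requests: int, workers: int) -> tuple[float, float]:
    """Return (throughput in requests/s, 95th percentile service time in s)."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(fake_sign, range(total_requests)))
    elapsed = time.perf_counter() - started
    p95 = quantiles(latencies, n=20)[-1]  # last of 19 cut points = 95th percentile
    return total_requests / elapsed, p95


throughput, p95 = run_load(total_requests=200, workers=8)
print(f"{throughput:.0f} requests/s, p95 service time {p95 * 1000:.1f} ms")
```

Note that the measured latency covers only the call itself, not the time a request spends waiting for a free worker. A fuller test would also record the queueing delay.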

Next steps: from calculation to implementation

After calculations document what you are protecting and acceptable downtime. For PKI this is not only "fast signing" but predictable behavior under peaks: mass issuance, reissue, revocation, CRL updates and OCSP handling.

Collect inputs and approve rules: who has key access, how key changes are performed, who approves issuance, and how long recovery may take. These decisions influence architecture more than the number of CPU cores.

Then prepare a specification for the PKI server, HSM and redundancy. Summarize it in a short document:

  • target SLAs for availability and recovery (RTO/RPO);
  • peak signing load (transactions per second) and peak duration;
  • HSM requirements (interfaces, clustering, operation limits, key backup/restore, m-of-n procedures);
  • failover scheme (active‑active or active‑passive) and switch points;
  • logging, time, backup and log retention requirements.

Before production launch run a pilot and load tests. Test your peak scenario, e.g., concurrent reissue across multiple branches plus CRL publication. In the pilot watch HSM behavior under queues, the effects of node failover and how long manual steps take (PIN entry or involvement of a trusted person).

If your team lacks PKI design experience, engage a system integrator during design and testing rather than after failures. For projects that value local manufacturing and on-site support consider infrastructure and integration from GSE.kz: the company offers S200 rack servers, data center solutions and 24/7 technical support to help move from planning to stable operation faster.

FAQ

Where should I start choosing a PKI server if I don't have exact numbers yet?

Start from peak scenarios, not an average day. Record the windows when requests arrive simultaneously (morning logins, reissue campaigns, branch onboarding) and set a time target: how many seconds are acceptable for issuance at the peak and what queue length is tolerable.

How do I correctly estimate signing load instead of just “certificates per day”?

Count certificate signing, OCSP response signing and CRL signing separately — they have different load profiles. Convert your campaign into "signing operations per second" for the worst 10–15 minutes and add a margin for retries and client errors.

Why can't I design PKI based on the average number of requests?

Because PKI operates in peaks. If the system handles the average load but fails during morning logins or a mass reissue, delays quickly turn into hours-long queues and start breaking dependent services due to retries and timeouts.

Where do bottlenecks usually appear in PKI?

Most often the bottlenecks are CA signing, handling many short OCSP requests, and the disk subsystem/database because of logs and audit. In practice, slowness is caused not by a server crash but by queues, delayed CRL publication and desynchronization between components.

Which PKI roles can be combined on one server and which should be separated?

At the start teams sometimes combine RA, portal and database to simplify launch. The CA that performs signing is usually better isolated and more protected, while OCSP should be put on a separate node when many clients check status during working hours and stable response latency is important.

How do I know whether the OCSP infrastructure will cope with a large number of checks?

For OCSP predictable response time and stability under high request counts matter more than a one-off peak throughput number. Practically, measure the target latency in milliseconds during working hours and ensure clients don't begin to retry en masse on brief slowdowns.

Do I need an HSM right away, and how does it affect server choice?

HSM adds security and changes performance: the server may be fast but limited by HSM signing throughput or session counts. Before procurement decide which operations must be done inside the HSM, what availability mode you need and how failover will work.

Which HSM parameters most often become problematic at peak?

Look not only at "signatures per second" but at the performance for your chosen algorithm and key size, plus limits on parallel sessions and session open rate. Also verify audit, key backup and the recovery procedure — these often determine the real availability of PKI.

How do I choose a redundancy strategy for PKI?

Define RTO/RPO separately for issuance, OCSP and CRL publication because their criticality and acceptable downtime differ. Typically CA fits active-standby, while OCSP can be load‑balanced; but HSM or database failure can halt issuance even if other services are running.

Which tests should I run before production PKI launch?

The load test should simulate your peak: mass request submission, parallel signing, concurrent OCSP checks and CRL publication. Also include a failover test (OCSP node, database or HSM access) and a recovery test to confirm not only speed but manageability and audit integrity.
