Why can't scoring be evaluated only by average latency?

Focus on *p95* and especially *p99* across the whole request chain, not the mean. The tails are what cause visible "hangs", timeouts and conversion losses, even when p50 looks great.

How do you break scoring latency down into stages?

Build a "latency budget" by steps: request intake, feature retrieval, inference, rules, logging. For each stage record target p50/p95/p99 under a realistic load profile — otherwise the debate about CPU/RAM/NVMe has no data to rely on.

Which matters more for inference: CPU frequency or number of cores?

As a rule of thumb, prefer CPUs with high per-core frequency and predictable behavior under load. Many cores matter when there are many concurrent requests or heavy post-processing, but under overload queues and context switches often worsen p99.

What is NUMA and why does it make p99 drift?

NUMA matters when a process runs on one socket while memory is allocated on another — accessing that RAM is slower and latency starts to "jump." A practical pilot approach is to keep a single inference service within one NUMA node and pin CPU and memory, then compare p95/p99 before and after.

Should turbo and power-saving settings be changed for latency?

Turning off power-saving or tweaking turbo should be done after measurement: these modes can improve averages but worsen stability under sustained load. Check frequencies and throttling over long runs and compare tails, not just p50.

How much RAM should be provisioned to avoid rare long latencies?

If RAM is tight, the system can look normal for a long time and then produce rare but very long latencies due to GC, allocations and page faults. A safe guideline is to keep a noticeable memory headroom so peaks don't push the system into swap and raise p99.

When should one choose GPU for scoring and when is CPU sufficient?

CPU is usually better for compact models and strict single-request latency SLAs because it’s easier to keep p99 steady. GPU makes sense when the model is heavy and you can keep the device highly utilized; batching increases throughput but can hurt p99 due to queuing delays.

How can the network ruin p99 even if inference is fast?

Network tails come from short queue buildups, micro-overloads and background traffic, so the average looks fast while p99 suddenly spikes. Segment traffic types, and test packet loss, retransmits and jitter under load; also ensure clock synchronization across nodes.

Where does disk become the bottleneck: features or logs, and does NVMe help?

p95/p99 often grow because of feature access and synchronous logging, not the model itself. NVMe helps with many small random reads and stable latency. Buffer logs and, if possible, place logs and feature storage on separate volumes so writes don't block feature reads.

How do you run a hardware pilot so that the results are reproducible before procurement?

Decide pass/fail on p99, timeouts and fallback rates in advance; run not only steady traffic but peaks and spikes; measure stages separately. If testing on a GSE S200 server, save BIOS, power profiles and NUMA pinning so the result can be reproduced during procurement and rollout.

On-prem infrastructure for scoring and anti-fraud: what matters in hardware

Latency and model quality: where hardware becomes the limit

Scoring and anti-fraud almost always operate in a "respond now" mode. You need not only to compute a probability, but to do it in time: accept the request, collect features, run inference, apply rules, return the decision and record the event in logs.

The main trap is looking only at average latency. Users and the business get hit by the tail of the distribution: p95 and especially p99. If p99 jumps, you get rare but noticeable "stalls": a payment fails, a customer calls support, an extra check triggers, or conversion drops.

Hardware affects not only milliseconds but also decision quality. When infrastructure is overloaded, services start to "cut corners": features come with timeouts, stale values are taken from cache, some sources become unavailable. The model is formally the same, but inputs are worse — errors increase and metrics drift. Reproducibility matters too: the same request with the same data should yield the same result, without random deviations caused by fluctuating timeouts and load.

Common bottlenecks repeat:

CPU — insufficient single-thread frequency or contention for cores at peak.
RAM and caches — cache misses, garbage collection, page faults.
Network — extra hops between services, overloaded interfaces, latency to feature stores.
Disk — slow feature access, write queues for logs, poor NVMe configuration.

Before buying hardware, agree with the business on simple things: target SLA for p95 and p99, maximum RPS at peak, what to do on degradation (reject, simplify checks, fall back to a spare model), and which data are mandatory for a decision. These answers define the latency budget and indicate where on-prem infrastructure for scoring and anti-fraud must be strongest.

Build a latency budget by steps

Until scoring is decomposed into stages, the argument about CPU, RAM and NVMe is blind. On-prem this is especially true: it’s easier to create a latency budget once and lock it as a contract — what should fit into p50, p95 and p99 under real load.

Take the chain of a single request and measure each step separately. Typically this includes:

request intake and parsing (API, serialization, validation)
feature retrieval (cache, DB, key-value store, queues)
inference (feature preparation, runtime call)
rules and postprocessing (thresholds, explanations, merging with business logic)
logging and audit (synchronously or asynchronously)

Then set goals not only for the mean but for the tails. p50 governs perceived speed, p95 governs stability, and p99 reflects real user pain and anti-fraud losses. Add a load profile: average RPS, peak RPS and the nature of spikes (e.g., 3–5 minutes with 5x traffic).
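A per-stage budget check can be sketched as follows. This is a minimal illustration, not a reference implementation: the stage names, sample values and budget numbers are hypothetical, and the nearest-rank percentile is just one of several common definitions.

```python
def percentile(samples, q):
    """Nearest-rank percentile for an integer q (0 < q <= 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q*n/100
    return ordered[rank - 1]

def stage_report(stage_samples_ms, budget_ms):
    """Compute p50/p95/p99 per stage and list the percentiles that exceed the budget."""
    report = {}
    for stage, samples in stage_samples_ms.items():
        measured = {f"p{q}": percentile(samples, q) for q in (50, 95, 99)}
        over = [name for name, value in measured.items()
                if value > budget_ms[stage][name]]
        report[stage] = {**measured, "over_budget": over}
    return report

# Hypothetical measurements (ms) and targets, for illustration only.
samples = {
    "feature_retrieval": [2.0] * 90 + [9.0] * 8 + [30.0] * 2,
    "inference": [3.0] * 99 + [6.0],
}
budget = {
    "feature_retrieval": {"p50": 3.0, "p95": 8.0, "p99": 15.0},
    "inference": {"p50": 4.0, "p95": 6.0, "p99": 10.0},
}
report = stage_report(samples, budget)
# report["feature_retrieval"]["over_budget"] == ["p95", "p99"]
```

In this made-up data the mean feature-retrieval time is about 3 ms and p50 is well inside the target, yet p95 and p99 both break the budget, which is exactly the case that an average hides.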

Account separately for external data sources. Any external call — bureau, processing, or even a "neighboring" system — makes p99 less predictable. In the pilot, fix behavior under degradation: cache responses, drop features, reduce precision or fall back to rules.

Agree upfront what counts as a failure: a timeout (for example, 50–100 ms for the whole scoring), exceeding p99, rising fallback share, or missed logging. Define pilot metrics: latency distribution by stage, RPS at a target p99, timeout share, queue lengths, CPU and memory utilization, and feature-store response time.

CPU for inference: frequency, cores and NUMA

For scoring and anti-fraud, predictable low latency often matters more than peak aggregate throughput. Therefore CPUs with high per-core frequency and modern architecture often win: one inference request usually follows a chain of operations where accelerating a single thread matters more than "all cores at once."

More cores become critical when many requests arrive concurrently, when multiple models share a node, or when rules, feature enrichment and validations run alongside the model. But there’s a price: under high contention queues and context switches grow, and p99 drifts.

NUMA in simple terms

On two-socket servers each processor has its own "local" memory. If a process runs on one socket but its memory is allocated on the other, every RAM access crosses the inter-socket link, so it is slower and latency starts to jump. This usually shows up in p95–p99.

A practical rule for a pilot: one inference service — one NUMA node (and its memory). Or clearly partition models by nodes. On-prem this often yields more effect than buying a CPU with more cores.
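One way to try the "one service, one NUMA node" rule on a Linux bench is sketched below. It reads the node's CPU list from sysfs and restricts the process to those CPUs. The helper names are illustrative, the parser handles only plain ranges and single ids, and the sketch pins CPUs only: binding memory to the same node needs a separate mechanism (for example launching the service under `numactl --cpunodebind=0 --membind=0`).

```python
import os

def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def pin_to_numa_node(node, pid=0):
    """Restrict a process (0 = the current one) to the CPUs of one NUMA node. Linux only."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(pid, cpus)  # CPU affinity only; memory policy is not changed here
    return cpus

# Example of the parsing step on a typical two-socket layout with hyperthreads:
node0_cpus = parse_cpulist("0-3,8-11\n")
```

Call `pin_to_numa_node(0)` at service start on the bench, run the same load with and without it, and compare p95/p99 between the two runs.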

Power-saving, turbo and stability

Power-saving modes and aggressive turbo can improve average numbers but hurt stability. Under sustained load you may see throttling and a "sawtooth" frequency pattern that makes response times drift.

On the pilot bench check:

per-core utilization (is one core saturated while others idle?)
context switches (spikes often indicate thread contention or too many processes)
frequency and throttling (does clock stay steady during long runs?)
NUMA locality (is memory landing on a remote node?)
p95 and p99 (compare before and after CPU/memory pinning)

If an anti-fraud service on a dual-socket server sometimes jumps from 8 ms to 25 ms, first check NUMA binding and sustained frequency. Only then consider adding cores.

RAM and caches: often underestimated

In on-prem setups for scoring and anti-fraud RAM often matters more than expected. When memory is tight the system can look fine for a long time, then produce rare but very long latencies. That tail in p99 comes from garbage collection, queue overflows, large buffer allocations and page-ins.

CPU cache and data locality matter because models and features are read many times in small pieces. If features are laid out so the CPU frequently goes to RAM instead of L3, inference time grows and becomes less predictable. The same happens when you pull "wide" features from disparate structures, make many small calls and create many short-lived objects.

Memory speed and channel count show up when inference is limited by RAM bandwidth: many parallel requests, large feature vectors, several models simultaneously. In such modes memory becomes the bottleneck even with a strong CPU.

To estimate required memory, sum: model(s) in memory, hot-feature cache, working buffers, queues, system overhead and headroom. A simple guideline: keep 30–50% free RAM so peaks don’t push the system into swap.
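The sum above turns into a short calculation. The component sizes here are hypothetical placeholders; only the 30–50% free-RAM guideline comes from the text.

```python
def required_ram_gb(components_gb, free_fraction):
    """Installed RAM needed so that `free_fraction` of it stays free at peak usage."""
    if not 0 <= free_fraction < 1:
        raise ValueError("free_fraction must be in [0, 1)")
    used = sum(components_gb.values())
    return used / (1 - free_fraction)

# Hypothetical sizing for one scoring node, in GB.
components = {
    "models": 6,
    "hot_feature_cache": 40,
    "working_buffers": 4,
    "queues": 2,
    "system_overhead": 8,
}
low = required_ram_gb(components, 0.30)   # about 85.7 GB
high = required_ram_gb(components, 0.50)  # 120.0 GB
```

With 60 GB of estimated peak usage, the guideline points at roughly 86 to 120 GB of installed RAM, which in practice means rounding up to the next available DIMM configuration.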

On the pilot monitor:

whether swap is in use and whether any swap-in or swap-out activity occurs
major page faults and latency growth at peaks
L3 cache miss rate and IPC (signs of memory-bound workload)
memory pressure during p99 moments (queues, allocations)
stability after several hours under load

If the pilot runs on a GSE S200-class server, record metrics before and after changing RAM size and feature-caching policies. Often this improves p99 more than upgrading the CPU.

CPU or GPU: how to choose for scoring and anti-fraud

The CPU vs GPU decision is not about "who’s faster" but about how you measure success. Anti-fraud usually prioritizes predictable p99 and low single-request latency. Scoring sometimes prioritizes peak throughput.

CPU is often sufficient if the model is compact (gradient boosting, logistic regression, small embeddings), computations are simple, and single-request latency must be stable. CPUs are easier to isolate by cores, simpler to control background work and easier to debug during model switches.

GPU makes sense when the model is heavy, involves many matrix ops, and you can keep the device highly loaded. Typical use-case: streaming transaction checks with many features and a consistent compute graph where batching gives cost and throughput wins.

The compromise is almost always batching. It speeds batch processing but can hurt single-request latency: a request waits until a batch fills or the queue drains. If your SLA is "each transaction within 20 ms", aggressive batching can worsen p99 even if the mean improves.
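The waiting cost of batching can be estimated before any hardware is involved. The sketch below assumes a batcher that dispatches when the batch is full or when the oldest request has waited `max_wait_ms`, and it assumes evenly spaced arrivals; both are simplifications, and the function name and numbers are illustrative.

```python
def first_request_wait_ms(batch_size, arrival_rps, max_wait_ms):
    """Extra wait of the first request in a batch before the batch is dispatched."""
    if batch_size <= 1:
        return 0.0
    # Time until the remaining (batch_size - 1) requests arrive at a steady rate.
    fill_ms = (batch_size - 1) * 1000.0 / arrival_rps
    return min(fill_ms, max_wait_ms)

# Batch of 32 with a 10 ms dispatch timeout, against a 20 ms per-transaction SLA.
peak = first_request_wait_ms(32, arrival_rps=4000, max_wait_ms=10)   # 7.75 ms
quiet = first_request_wait_ms(32, arrival_rps=500, max_wait_ms=10)   # 10.0 ms (timeout hit)
```

Under these assumptions the first request in each batch spends 8 to 10 ms just waiting, about half of a 20 ms budget, before inference even starts. At low traffic the batch never fills and the timeout alone sets the added latency.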

In the pilot check not only max numbers but stability:

cold start and warm-up (how long until stable p50/p99)
model version switch (do delays or errors spike on hot reload?)
queues and batching (how does p99 change with QPS and batch sizes?)
environment effects (GPU drivers, CUDA versions, power modes, clocks)
degradation under background load (logging, metrics, background jobs)

A good test is to run identical traffic on CPU-only and CPU+GPU and compare the worst 1% of requests, not the best numbers. For anti-fraud, the option with a steady p99 often wins, even with lower peak throughput.
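Comparing the worst 1% can be as simple as the sketch below. The latency lists are invented to show the effect and are not measurements of any real CPU or GPU setup.

```python
def worst_percent_mean(latencies_ms, percent=1):
    """Mean latency of the slowest `percent` of requests (at least one request)."""
    if not latencies_ms:
        raise ValueError("no samples")
    k = max(1, len(latencies_ms) * percent // 100)
    worst = sorted(latencies_ms, reverse=True)[:k]
    return sum(worst) / k

# Hypothetical results for identical traffic replayed on two configurations.
cpu_only = [8.0] * 990 + [20.0] * 10
cpu_gpu = [5.0] * 990 + [60.0] * 10

mean_cpu = sum(cpu_only) / len(cpu_only)   # 8.12 ms
mean_gpu = sum(cpu_gpu) / len(cpu_gpu)     # 5.55 ms
tail_cpu = worst_percent_mean(cpu_only)    # 20.0 ms
tail_gpu = worst_percent_mean(cpu_gpu)     # 60.0 ms
```

In this example the second configuration wins on the mean and loses badly on the worst 1%, so a decision made on averages would pick the setup with the heavier tail.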

Network and time: don’t lose milliseconds

Even if model and CPU are fast, the network often makes p99 unpredictable. For scoring and anti-fraud the tails are dangerous: short queue bursts on a switch, full buffers, micro-overloads during peak hours. As a result normal 5–10 ms can turn into 50–150 ms for some requests.

At the NIC level check not only gigabits but latency and stability. Offload features sometimes help, but enable/disable them in the pilot and compare p95/p99. If multiple services share a host, ensure drivers and queue settings don’t create CPU contention.

To make the network quieter, segment traffic types up front: online scoring, feature and storage access, admin and monitoring, backups and replication. This reduces the chance that a nightly backup suddenly raises scoring latency in the morning.

In anti-fraud timing matters as much as network. Events come from different systems and if clocks drift, rules and models misorder events. You need unified time synchronization across nodes and drift monitoring.

On the pilot bench test:

packet loss and retransmits under load
jitter between nodes
RTT between services within and across segments
impact of background traffic on p99

A useful scenario: run scoring load and add "noise" (logging, replication, backup) in parallel. Comparing p99 before and after segmentation quickly shows where milliseconds are lost even before procurement.
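For the before/after comparison, jitter needs a concrete definition. The sketch below uses a deliberately simple one, the mean absolute difference between consecutive RTT samples; that definition is an assumption of this example, and the RTT values are invented.

```python
def mean_jitter_ms(rtts_ms):
    """Mean absolute difference between consecutive RTT samples."""
    if len(rtts_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    return sum(diffs) / len(diffs)

# Hypothetical RTT samples (ms) between the scoring service and the feature store.
quiet = [5.0, 6.0, 5.0, 6.0, 5.0]        # scoring traffic only
noisy = [5.0, 50.0, 6.0, 120.0, 5.0]     # same path with a backup running in parallel

jitter_quiet = mean_jitter_ms(quiet)     # 1.0 ms
jitter_noisy = mean_jitter_ms(noisy)     # 79.5 ms
```

Collect the samples over the same time window and from the same pair of hosts in both runs, otherwise the two numbers are not comparable.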

Storage and feature access: NVMe, cache and logging

In scoring and anti-fraud, latency often originates not in the model but in data access: features, client profiles, lists (blacklist, whitelist), recent events by card or device. If reads hit slow storage, p95 and p99 rise even with a fast CPU.

NVMe often gives more benefit than a large RAID on slow disks. Scoring performs many small random reads. Here IOPS and stable latency matter more than gigabytes or peak sequential speed in a spec.

Disk is involved in three places: online feature storage (key-value or local DB), profiles and state (e.g., per-client counters), and event logs for analysis and training. If all of this lives on one volume, logging can quietly consume I/O queue and spoil read latency.

To avoid cold-start pain after a restart, have a warm-up plan: keep hot features and lists in memory, and on startup probe typical keys. Estimate how much RAM is needed for cache vs model and service.

Write logs so they don’t block scoring: separate online decisioning from detailed audit, buffer and send in batches, and store logs and features on different NVMe drives or volumes where possible. Watch fsync and write frequency — a frequent source of p99 issues.
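A buffered, non-blocking log writer along these lines might look like the sketch below. The class name, its parameters and the drop-on-overflow policy are assumptions made for the example; a real audit trail may need to block or spill to disk instead of dropping.

```python
import queue
import threading

class BufferedEventLog:
    """Accepts events without blocking the caller; a background thread writes them in batches."""

    def __init__(self, sink, batch_size=100, flush_interval_s=0.05, capacity=10_000):
        self._queue = queue.Queue(maxsize=capacity)
        self._sink = sink  # callable that receives a list of events
        self._batch_size = batch_size
        self._flush_interval_s = flush_interval_s
        self.dropped = 0   # simple counter, not synchronized across producer threads
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, event):
        """Called on the scoring path: enqueue and return immediately."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1  # make missed logging visible in pilot metrics
            return False

    def _run(self):
        batch = []
        while not (self._stop.is_set() and self._queue.empty()):
            try:
                batch.append(self._queue.get(timeout=self._flush_interval_s))
            except queue.Empty:
                pass
            # Flush when the batch is full or there is nothing more to collect right now.
            if batch and (len(batch) >= self._batch_size or self._queue.empty()):
                self._sink(batch)
                batch = []
        if batch:
            self._sink(batch)

    def close(self):
        """Stop the worker after it has written everything already queued."""
        self._stop.set()
        self._worker.join()

# Usage with an in-memory sink; a real sink would append to a file or send to a collector.
written = []
event_log = BufferedEventLog(sink=written.extend, batch_size=10)
for i in range(25):
    event_log.log({"request_id": i})
event_log.close()
```

The scoring path pays only for an in-memory enqueue, while write and fsync costs land on the background thread. The `dropped` counter matters in a pilot because "missed logging" is one of the failure criteria.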

On the pilot track not only averages:

p95/p99 for feature read latency
I/O queue depth and growth at peak
how much read latency degrades when logs are written concurrently
warm-up behavior after restart (how long to warm caches)
stability on real keys, not synthetic ones

If you’re building a pilot on an S200-class platform, ask for measurements of time-to-feature rather than only inference speed. In anti-fraud milliseconds are often lost on the way to the data.

Reliability and isolation: keep p99 steady

In scoring and anti-fraud it’s not average latency but p99 that matters. On-prem this often hinges not on "stronger CPU" but on spare capacity, reliability and who shares resources with your service.

A single node failure equals downtime. Production needs a cluster of nodes and a clear plan: what happens if a server, network link or disk fails. Decide in advance whether it’s better to continue scoring with reduced capacity or to turn off some features to preserve quality and consistency.

Headroom directly affects p99. If a server constantly runs at 80–90% CPU, any spike (GC, background jobs, interrupts, request queues) becomes long tails. It’s more practical to plan for lower steady utilization and keep headroom for peaks and degradation when one node fails.
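The effect of a node failure on utilization is simple arithmetic. The sketch assumes that load is balanced evenly and that utilization scales linearly with traffic, which is only an approximation; the node counts and percentages are illustrative.

```python
def utilization_after_failure(nodes, utilization, failed=1):
    """Per-node utilization once the load of failed nodes is spread over the survivors."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return utilization * nodes / survivors

def max_steady_utilization(nodes, ceiling, failed=1):
    """Highest steady per-node utilization that stays under `ceiling` after a failure."""
    if nodes - failed <= 0:
        raise ValueError("no surviving nodes")
    return ceiling * (nodes - failed) / nodes

after_4 = utilization_after_failure(4, 0.60)   # about 0.80
after_3 = utilization_after_failure(3, 0.85)   # about 1.275, i.e. overload
target = max_steady_utilization(4, 0.70)       # about 0.525
```

Under these assumptions, a four-node cluster that must stay below 70% CPU with one node down should run at roughly 52% in steady state. A three-node cluster at 85% has no room for a failure at all.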

Isolation from noisy neighbors is mandatory. Scoring must not share a host with heavy tasks like training, exports, reports or backups. Even rare background processes can steal disk, cache and CPU for seconds.

On the pilot test:

node failure (p95/p99, request queueing, recovery time)
storage degradation (are features and logs lost?)
noisy neighbor (how does p99 change?)
service restart (time to warm caches)
audit and access controls (who can access, is there an action log?)

A practical test for a bank: simulate a transaction peak together with the failure of one node. If p99 grows severalfold, you lack headroom or have poor isolation.

Pilot step by step: how to validate hardware before purchase

A pilot is not for pretty graphs but to catch what later breaks p99: spikes, rare heavy requests, CPU and memory contention. Start with base numbers: target requests/sec (with growth headroom), share of complex cases (e.g., 5–10% suspicious with many rules and features), average and max feature and response size.

Build a test environment that resembles production but is smaller: 1–2 models, a rules layer (if any), a feature source (cache or store) and mandatory logging. The chain must be honest: feature retrieval, preprocessing, inference, postprocessing, event recording.

Generate load not as a flat line but like real life: morning peaks, short bursts, night windows, and a mix of light and heavy requests.
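A non-flat load schedule for a generator can be described as a base rate plus bursts. In this sketch the function name and the burst tuple format are assumptions; the 5x burst lasting a few minutes and the jump from 200 to 800 RPS mirror figures used elsewhere in this article.

```python
def load_profile(duration_s, base_rps, bursts):
    """Per-second target RPS: a base rate with multiplicative bursts on top.

    bursts: list of (start_s, length_s, multiplier) tuples.
    """
    profile = [base_rps] * duration_s
    for start, length, multiplier in bursts:
        for t in range(start, min(start + length, duration_s)):
            profile[t] = max(profile[t], base_rps * multiplier)
    return profile

# One hour at 200 RPS with a 4-minute 5x peak and a 10-second 4x burst.
profile = load_profile(
    duration_s=3600,
    base_rps=200,
    bursts=[(600, 240, 5), (1800, 10, 4)],
)
```

Feed the resulting per-second targets to whatever load tool the bench uses, and keep the schedule in version control so every configuration is tested against the same peaks.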

On each run record: p50/p95/p99 for the whole chain and separately for inference, CPU per-core usage and frequency, context switches and queues, memory, network and I/O behavior. Also log errors: timeouts, degradations, latency growth during warm-up.

Then compare 2–3 configurations on the same test: a CPU-oriented setup (high frequency and correct NUMA pinning) and a GPU variant if models truly benefit from batching. Often the answer is mixed: change hardware selectively (add RAM or NVMe) and tune settings (thread pools, pinning, feature cache, timeouts).

If working with an integrator, agree on pass/fail criteria for p99 and error rates. Integrators like GSE.kz can help assemble a bench on S200 servers and record measurements so they can be reproduced before procurement.

Example scenario: an anti-fraud pilot in a bank with peak loads

A bank builds on-prem infrastructure for real-time payment scoring and anti-fraud. On normal days traffic is steady, but on payroll days load can triple. Latency requirements remain: p99 no higher than 60 ms for the full decision, including feature retrieval and event recording.

The pilot is a mini-production copy: an inference server, a separate feature service and simulated external dependencies (queue, APIs, logging). From the start they measure the p95/p99 tails, because in anti-fraud these tails become timeouts and lost transactions.

They run two inference modes: single requests (truer for latency) and micro-batches (when incoming flow is high). They also compare two feature approaches: RAM cache for hot keys and disk reads on miss.

Typical findings: as QPS rises, p99 grows because of CPU contention and cross-NUMA memory access, even if mean latency looks fine. Aggressive synchronous logging makes p99 spike because of I/O waits; NVMe and asynchronous logging help here. A feature cache in RAM often brings larger p99 gains than trying to speed up the model. Micro-batching reduces CPU load but can harm latency if batch sizes aren’t chosen with peak traffic in mind.

The pilot result is usually not "buy the most powerful server" but a predictable configuration: enough CPU frequency for single-threaded parts, RAM headroom for feature cache, NVMe for logs and local data, and a low-latency network between services. This keeps p99 within bounds even on peak days without rewriting the model, through configuration and operating modes alone. If testing on GSE S200-class servers, record BIOS, NUMA and power profile settings so results reproduce during procurement and deployment.

Common mistakes when choosing on-prem hardware and running tests

Pilots most often fail because the team trusts average latency. In scoring and anti-fraud the tails matter: p95 and especially p99. They are what turn "fast" into "the user waits" and hit conversion, or in anti-fraud cause missed attacks and excess blocks.

The second frequent mistake is to economize by putting everything on the same nodes. When heavy background jobs (training, exports, reports, log scanning) run nearby, inference queues grow and p99 spikes with no obvious cause.

Typical errors when selecting on-prem hardware for scoring and anti-fraud:

choosing CPUs with many cores but low frequency — single requests take longer
underestimating the network (switch queues, wrong MTU, rare packet loss)
drawing conclusions from cold start (cache not warmed, model and features not in memory)
testing with synthetic load lacking real peaks and data

An example: at 200 RPS everything looks fine, but short bursts to 800 RPS triple p99. The root cause is often not the model but CPU contention and network/feature-store queues.
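Why a short burst hurts so much can be shown with a deterministic queue estimate. The sketch ignores randomness and treats the service as a single queue with a fixed capacity. The 500 RPS capacity and 2-second burst length are hypothetical; the 200 and 800 RPS figures are the ones from the example above.

```python
def burst_backlog(capacity_rps, base_rps, burst_rps, burst_s):
    """Backlog built during a burst, wait of the last queued request, and drain time."""
    backlog = max(0.0, (burst_rps - capacity_rps) * burst_s)
    worst_wait_s = backlog / capacity_rps          # time to serve everything ahead in the queue
    spare_rps = capacity_rps - base_rps            # capacity left once traffic is back to base
    drain_s = backlog / spare_rps if spare_rps > 0 else float("inf")
    return backlog, worst_wait_s, drain_s

# Service that sustains 500 RPS, normal traffic 200 RPS, a 2-second burst to 800 RPS.
backlog, worst_wait_s, drain_s = burst_backlog(500, 200, 800, 2)
# backlog = 600 requests, worst wait about 1.2 s, queue drains in about 2 s
```

Even with 2.5x headroom over normal traffic, a two-second burst leaves some requests waiting more than a second in this model, far outside any millisecond-level SLA, which is why the pilot has to include bursts and not only a steady rate.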

Good practice: measure warmed and cold modes separately, add realistic peaks and record not only latency but CPU per-core utilization, memory, network queues and I/O.

Short checklist and next steps

Build on-prem infrastructure for scoring and anti-fraud from measurable metrics. One short pilot often beats months of configuration debates.

Pilot: must-check items

Make the pilot realistic: same timeouts, same peaks, same "bad" requests. Minimum set:

p50/p95/p99 for the whole path and separately for inference, feature fetch and network
behavior under peaks (RPS spikes, queues, p99 degradation)
cold start (after service restart, cache warm-up, model loading)
single-node failure (failover, latency growth, lost requests)
time control (event time, time zones, clock skew)

After these checks you usually see where time is lost: NUMA and CPU caches, lack of RAM, disk waits or network hops between segments.

What to finalize before procurement

To avoid guessing at purchase time prepare a short set of validated configs and headroom plans:

2–3 confirmed configurations with CPU/RAM/IOPS headroom calculations and target p99
CPU requirements (frequency, NUMA pinning rules)
RAM requirements (models, features, cache to avoid swap) and NVMe for logs and feature store
network requirements (NIC, segmentation, gateway and load balancer placement)
scaling plan: when to add nodes and how to verify p99 remains stable

If you need an external view, a systems integrator can help choose equipment, build the pilot bench and support it. GSE.kz, for example, offers an S200 server line for racks and workstations/PCs for teams, which is useful when you need to quickly turn a pilot into a working environment on hardware from a single vendor.