On-prem voice assistant for contact centers: requirements
On-prem voice assistant for contact centers: how to choose ASR/TTS, meet latency targets, store recordings and connect CRM without overloading infrastructure.

Where the problem begins: when voice hits infrastructure
An on-prem voice assistant for a contact center means that recognition, synthesis, recordings and integrations run inside your network and on your servers. Organizations choose this when data control, predictable latency and security requirements are critical. This is also where many projects stumble.
Pilots usually look simple: one scenario, a few agents, test calls. Everything works. Then reality hits: many lines, noise, different headsets, load peaks, mandatory call recordings, strict access rights. At this point it turns out the issue isn’t the script text but infrastructure readiness: insufficient compute, a jittery network, storage filling up fast, and integrations blocked by security restrictions.
Almost always several teams are involved. The business defines what the assistant must do and which results matter. IT owns servers, networking, virtualization and backups. Security sets rules for access, logging and data retention. The contractor or in-house development team ties it all together and builds integrations with telephony and CRM.
Before choosing specific ASR/TTS engines it helps to make several decisions. Otherwise you’ll be comparing vendors blind and constantly changing requirements.
First, decide where services will live: dedicated servers or a virtual environment, one datacenter or several. Second, fix latency and quality targets for real calls, not lab tests. Then decide what you will record and how (audio, transcripts, metadata, retention periods), which integration methods are acceptable (APIs, queues, files) and who owns CRM changes. Also describe fault tolerance: what happens if ASR, TTS or the CRM fails.
For example, if you plan to keep voice processing on-prem, clarify in advance whether suitable servers and 24/7 support exist in your perimeter. In Kazakhstan this is often solved on local infrastructure, including racks with high-performance servers that are easier to service on-site.
ASR: requirements for recognizing real calls
Recognition in contact centers fails not on polished demos but on real lines: handset noise, fragments of phrases, fast speech, interruptions. So agree early on what counts as “good enough.” Otherwise pilots get stuck in endless debates about quality.
Quality is easier to discuss through business errors. WER is useful, but more important is the share of correctly recognized intents and entities: contract number, IIN, address, amount. Be sure to test resilience to noise (office, street, loud speakerphones), accents and typical telephony artifacts.
Languages and code-switching frequently become blockers. If clients switch between Russian and Kazakh in one sentence, require support for code-switching rather than two separate modes. Otherwise the system will jump between languages and lose meaning.
Custom dictionaries and rules are almost always needed: full names, toponyms, brands, tariff names, abbreviations. A good practice is to version dictionaries and measure how changes affect entity errors.
Behavior on pauses and interruptions is critical for real calls: where the system marks the end of an utterance, what happens on silence, how it reacts to backchannels like “uh-huh” and to the operator’s background remarks.
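The end-of-utterance decision can be illustrated with a minimal sketch. It assumes a voice activity detector already labels each 20 ms frame as speech or silence; the function name, the frame size and the 600 ms silence threshold are illustrative assumptions, not values from any specific engine.

```python
from typing import List, Optional


def find_utterance_end(frames: List[bool], frame_ms: int = 20,
                       min_silence_ms: int = 600) -> Optional[int]:
    """Return the time in ms where the utterance ended, or None if it is still open.

    frames holds one flag per audio frame: True for speech, False for silence.
    The utterance is closed once silence lasts at least min_silence_ms.
    """
    needed = min_silence_ms // frame_ms
    silent_run = 0
    heard_speech = False
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run = 0          # a short "uh-huh" resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                # The utterance ended where this run of silence began.
                return (index - silent_run + 1) * frame_ms
    return None


# 100 ms of leading silence, 200 ms of speech, then 600 ms of silence.
frames = [False] * 5 + [True] * 10 + [False] * 30
print(find_utterance_end(frames))
```

A lower threshold makes the assistant answer faster but cuts off customers who pause while dictating numbers, so the value is worth tuning per scenario step.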
To improve quality without access to raw audio, agree beforehand what to log. A minimal set usually includes:
- timestamps for utterance start and end, pauses and interruptions;
- recognition text and confidence at word or phrase level;
- detected language (if multiple);
- found entities and reasons for failure (e.g., “could not parse number”);
- technical codes, model version and dictionary version.
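One possible shape for such a log record is sketched below. The field names and the 0.85 confidence threshold are assumptions made for illustration; the real schema depends on what the chosen ASR engine exposes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AsrLogRecord:
    """One recognized utterance, logged without the raw audio."""
    call_id: str
    start_ms: int                    # utterance start, relative to call start
    end_ms: int                      # utterance end
    text: str                        # recognition result
    confidence: float                # phrase-level confidence, 0.0 to 1.0
    language: Optional[str] = None   # detected language, if several are possible
    entities: Dict[str, str] = field(default_factory=dict)
    failure_reason: Optional[str] = None   # e.g. "could not parse number"
    model_version: str = "unknown"
    dictionary_version: str = "unknown"


def needs_reprompt(record: AsrLogRecord, min_confidence: float = 0.85) -> bool:
    """Ask again when entity parsing failed or confidence is below the threshold."""
    return record.failure_reason is not None or record.confidence < min_confidence


record = AsrLogRecord(call_id="call-1", start_ms=1200, end_ms=4800,
                      text="my IIN is ...", confidence=0.62, language="ru")
print(needs_reprompt(record))
```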
Example: a client dictates an address and IIN mixing Russian and Kazakh while a colleague speaks nearby. If logs show the model was “confident” in a wrong IIN and did not flag low confidence, you won’t be able to tune reprompts and safe data checks correctly.
TTS: requirements so the customer understands on the first listen
In voice scenarios TTS quality is often more about clarity than “a pretty voice.” A customer needs to grasp meaning quickly, especially on a noisy line.
The main point is intelligibility. Naturalness is nice, but if endings are swallowed or stress patterns vary, clarification requests and frustration increase. A neutral delivery usually works better: not a radio-host style, no excessive emotion, with clear consonants.
Pacing and pauses must feel conversational. Too fast and phrases become a blur; too slow and handling time increases. Check pauses before important items and after numbers: “Your code is 4 1 7 9.” Long replies are better broken into short sentences.
Voices and roles: one voice may be enough, but for confirmations, warnings and status messages it’s often useful to have 2–3 roles (main, “service”, more formal). Ensure the style remains consistent without tone jumps.
A recurring pain is pronunciation of terms, names, dates and bilingual names. You need dictionaries and reading rules: how to speak amounts and numbers (digit-by-digit or grouped), how to read dates (02.01.2026 vs “January 2”), how to read acronyms (CRM, IT, ЖКХ), and how to pronounce Kazakh and Russian names and toponyms correctly.
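Reading rules of this kind are usually applied as text normalization before synthesis. The sketch below shows two such rules under stated assumptions: codes are read digit by digit, and dates arrive as DD.MM.YYYY. The function names are hypothetical, and a real deployment would need per-language rules.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def spell_digits(code: str) -> str:
    """Render a numeric code digit by digit so TTS pauses between digits."""
    return " ".join(ch for ch in code if ch.isdigit())


def read_date(text: str) -> str:
    """Turn a DD.MM.YYYY date into a spoken form such as 'January 2, 2026'."""
    day, month, year = (int(part) for part in text.split("."))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {text!r}")
    return f"{MONTHS[month - 1]} {day}, {year}"


print(spell_digits("4179"))      # 4 1 7 9
print(read_date("02.01.2026"))   # January 2, 2026
```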
On the technical side, agree the format with telephony. For PSTN you often need 8 kHz, 8-bit mu-law or A-law, mono. Internally you can use 16 kHz PCM for better quality, but conversion and normalization must be controlled so the assistant isn’t quieter than the caller and the signal does not clip.
A practical test: take 20 typical phrases (with amounts, dates, full names) and run them through a real call queue, listening to recordings. If operators have to “translate” what the assistant said, TTS needs work. Running TTS on dedicated on-prem servers (for example, rack platforms like GSE S200) makes it easier to iterate without quality dropping at peaks.
Latencies: goals and how to verify them
Dialogue latency is the sum of small parts across the chain: telephony (capture and codec), network, preprocessing (VAD, noise suppression), ASR (partial and final hypotheses), dialog logic, TTS (generation of the first fragment) and delivery back to the call. In on-prem setups a single stage is rarely the culprit: several stages each add roughly 50–200 ms, and the customer hears the total.
Set targets separately for customer dialogue and operator hints:
- Dialogue: from the end of the customer’s phrase to the start of the assistant’s reply — aim for 0.8–1.2 s (p95) to avoid awkward pauses and interruptions.
- Operator prompts: 1–2 s (p95) is often acceptable because the operator values accuracy and stability more than instant response.
- TTS: the first sound should ideally appear within 0.3–0.5 s even if the full reply is longer.
Network and telephony quality strongly affect latency. Bottlenecks include jitter, packet loss, transcoding, heavy codecs, and processing queues when requests pile up under load. VAD and noise suppression improve recognition but can add latency due to waiting for an utterance end and segmenting speech.
Test latency early on a staging environment and in a pilot, using consistent metrics. Timestamp the stages (audio in, ASR start, ASR final, dialog decision, TTS start, first audio frame on the line). Look at p50/p95/p99, not just averages. Run loads close to production: concurrent calls, typical noises, different codecs. In the pilot measure real conversations and look for queues that build up over time, not just CPU load.
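A minimal sketch of this measurement follows. The stage names are hypothetical labels for the timestamps listed above (with "speech_end" standing for the end of the customer's phrase), the sample values are made up, and percentiles use the simple nearest-rank method.

```python
import math
from typing import Dict, List

# Hypothetical stage names, in the order they occur during one dialog turn.
STAGES = ["speech_end", "asr_final", "dialog_decision", "tts_start", "first_audio_frame"]


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def stage_durations(stamps: Dict[str, float]) -> Dict[str, float]:
    """Milliseconds spent between each pair of consecutive stages."""
    return {f"{a}->{b}": stamps[b] - stamps[a] for a, b in zip(STAGES, STAGES[1:])}


def dialogue_latency(stamps: Dict[str, float]) -> float:
    """From the end of the customer's phrase to the first audio frame of the reply."""
    return stamps[STAGES[-1]] - stamps[STAGES[0]]


# Illustrative timestamps in ms for three dialog turns.
turns = [
    {"speech_end": 0, "asr_final": 250, "dialog_decision": 330, "tts_start": 380, "first_audio_frame": 700},
    {"speech_end": 0, "asr_final": 400, "dialog_decision": 520, "tts_start": 600, "first_audio_frame": 1150},
    {"speech_end": 0, "asr_final": 300, "dialog_decision": 390, "tts_start": 450, "first_audio_frame": 900},
]
latencies = [dialogue_latency(t) for t in turns]
print("p50:", percentile(latencies, 50), "p95:", percentile(latencies, 95))
print(stage_durations(turns[1]))
```

Per-stage durations show where the time goes; the end-to-end percentile shows whether the target is met.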
If you deploy on your own servers (for example, in racks at a datacenter), verify that network links between telephony, ASR/TTS and CRM don’t introduce extra hops or bottlenecks.
On-prem architecture: what the solution consists of and how to deploy it
On-prem solutions break not because models are bad but at the seams: where telephony runs, where recordings are stored, how to scale ASR/TTS and who owns fault tolerance.
A minimal working setup usually includes call ingress (SIP/telephony), speech recognition (ASR), understanding (NLU), a dialog engine (scenario logic), speech synthesis (TTS) and a recording subsystem with metadata. Decide early where the single source of truth for a call is stored: audio, transcript, classification results and dialog events.
Where to place orchestration depends on your operations rules. Containers are convenient for fast updates and scaling, VMs are easier for conservative IT policies, and dedicated nodes give more predictable latency. A common compromise is containers for ASR/TTS and dialog, with telephony and storage on a more stable base.
GPUs are not always required but often help in two places: they speed up streaming ASR and reduce TTS latency during peak calls. Even if a pilot runs on CPU, keep the ability to add GPU nodes without redesigning the entire architecture.
To avoid a complete service outage when one server fails, check three things: run at least two instances of key services (ASR/TTS/dialog) with fast failover, use independent storage for recordings and logs, and design graceful degradation—for example, temporarily disable analytics but keep answering the customer.
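The failover part can be sketched very simply. This is an illustration only: it assumes each service instance is reachable through a callable, tries them in order and moves on when one raises. Real deployments usually delegate this to a load balancer or service mesh with health checks.

```python
from typing import Callable, Sequence


def call_with_failover(instances: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each instance in order and return the first successful response."""
    last_error = None
    for instance in instances:
        try:
            return instance(request)
        except Exception as exc:   # in production, catch only transport errors
            last_error = exc
    raise RuntimeError("all instances failed") from last_error


def broken_asr(_audio_ref: str) -> str:
    raise ConnectionError("instance is down")


def healthy_asr(audio_ref: str) -> str:
    return f"transcript for {audio_ref}"


print(call_with_failover([broken_asr, healthy_asr], "call-1/chunk-7"))
```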
Separate environments: production, test, model training and analytics should not compete for the same resources and data. For on-site deployment, rack servers (like the S200) and an integrator experienced in building 24/7 operational environments are often suitable.
Recording and logs: how not to get surprised by growth
In on-prem contact-center projects the issue is often not ASR quality but that storage and retention rules weren’t calculated in advance. Data accumulates quietly and disks run out or search performance degrades.
There are typically several data layers: call audio (inbound and outbound), transcripts, metadata (number, time, queue, agent, result), dialog events (which steps ran, where a failure occurred), technical logs from ASR/TTS and integrations.
Estimate volume with a simple formula: average call length × calls per day × retention days × audio bitrate. Then multiply by a reality factor: often there are two channels, plus service backups. Example: 10,000 calls/day at 4 minutes each with 90 days retention is about 1.7 TB for mono 8 kHz G.711 and close to 14 TB for two-channel 16 kHz 16-bit PCM, before backups. Raw logs and full transcripts increase volume further.
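The estimate is easy to turn into a small calculator. The sketch below covers audio only and uses uncompressed byte rates: 8,000 bytes/s for 8 kHz G.711 and 32,000 bytes/s per channel for 16 kHz 16-bit PCM. The function name and the `copies` parameter (number of stored copies, including backups) are illustrative.

```python
def audio_storage_tb(calls_per_day: int, avg_call_minutes: float, retention_days: int,
                     bytes_per_second: int, channels: int = 1, copies: int = 1) -> float:
    """Audio volume in terabytes (10**12 bytes) held at the end of the retention window."""
    seconds = calls_per_day * avg_call_minutes * 60 * retention_days
    return seconds * bytes_per_second * channels * copies / 1e12


G711_BYTES_PER_SECOND = 8_000       # 8 kHz, 8-bit mu-law or A-law
PCM16K_BYTES_PER_SECOND = 32_000    # 16 kHz, 16-bit linear PCM, one channel

# 10,000 calls/day, 4 minutes each, 90 days of retention.
print(round(audio_storage_tb(10_000, 4, 90, G711_BYTES_PER_SECOND), 2))
print(round(audio_storage_tb(10_000, 4, 90, PCM16K_BYTES_PER_SECOND, channels=2, copies=2), 2))
```

The first figure is the mono telephony-quality case; the second adds two channels and one backup copy at 16 kHz, which shows how quickly the reality factor dominates.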
Split retention by value. Always keep metadata and dialog outcome (for reports and incident reviews). Store audio and full transcripts according to SLA, case types and legal requirements. Build anonymization from the start: mask phone numbers, IINs, cards and addresses in transcripts and logs. Keep technical logs short, storing aggregations and errors.
Search and retrieval must work without manual exports: by number, period, agent, queue, topic and result. Otherwise incident reviews become constant manual work.
Plan backups before the pilot and test restores: recover a single call and its related events, restore a selected day and verify search, check integrity (hashes, record counts), recovery time and impact on contact-center operations.
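The integrity part of a restore test can be sketched as a hash manifest: hashes are recorded at backup time and compared after the restore. The in-memory dictionaries stand in for real storage, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(recordings: Dict[str, bytes]) -> Dict[str, str]:
    """Map each recording id to the hash of its content at backup time."""
    return {rec_id: sha256_hex(blob) for rec_id, blob in recordings.items()}


def verify_restore(manifest: Dict[str, str], restored: Dict[str, bytes]) -> List[str]:
    """Return a list of problems; an empty list means the restore matches the manifest."""
    problems = []
    for rec_id, expected in manifest.items():
        blob = restored.get(rec_id)
        if blob is None:
            problems.append(f"{rec_id}: missing")
        elif sha256_hex(blob) != expected:
            problems.append(f"{rec_id}: hash mismatch")
    for rec_id in sorted(set(restored) - set(manifest)):
        problems.append(f"{rec_id}: unexpected")
    return problems


original = {"rec-1": b"audio bytes 1", "rec-2": b"audio bytes 2"}
manifest = build_manifest(original)
print(verify_restore(manifest, {"rec-1": b"audio bytes 1"}))
```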
Security and compliance: access, encryption and audit
When building an on-prem voice assistant, security must be included before the first recordings appear. Otherwise the project will quickly hit information security restrictions: where audio is stored, who can listen, and whether transcripts can be exported.
Start with a simple access model. Separate rights not only by role but by data type: audio, text, reports, training datasets. An agent usually needs the case card and a short summary. Full recordings and transcripts should be for supervisors and quality teams. Analysts usually need aggregated reports, not raw calls.
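Such a model can be written down as a plain role-to-data-type matrix before any access system is configured. The role and data type names below are illustrative assumptions that follow the split described above.

```python
# Hypothetical role names and data types; adjust to the real org structure.
ACCESS = {
    "agent": {"case_card", "summary"},
    "supervisor": {"case_card", "summary", "audio", "transcript"},
    "quality": {"case_card", "summary", "audio", "transcript"},
    "analyst": {"aggregated_report"},
}


def can_access(role: str, data_type: str) -> bool:
    """Deny by default: unknown roles and unlisted data types get no access."""
    return data_type in ACCESS.get(role, set())


print(can_access("agent", "audio"))        # False
print(can_access("supervisor", "audio"))   # True
```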
Encrypt data both in transit and at rest. Use secure channels between telephony, ASR/TTS, storage and CRM. For storage enable volume or object encryption and keep keys separate from the data (in a dedicated service or secure module). Decide who can rotate keys and what happens on rotation.
Audit is mandatory: log who searched for a call, who listened to a recording, who exported a transcript and what filters were applied. This helps investigate incidents and answers compliance questions: “who saw the customer data.”
Segment the network: a dedicated contact-center zone, separate subnets for speech services and storage, minimal open ports and access only where needed.
To reduce leakage risk, implement masking of sensitive data in speech. For example, when a client dictates an IIN or card number the system can replace the fragment in the transcript with “[hidden]” while storing audio under stricter access rules.
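Transcript masking can be sketched with two regular expressions. This assumes the transcript already contains numbers as digits, that an IIN is a 12-digit number and a card number is 16 digits, optionally grouped by four. Numbers spoken as words or split across utterances need handling at the entity-extraction level, which this sketch does not cover.

```python
import re

# Assumption: digits are already normalized in the transcript text.
CARD_RE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")   # 16 digits, optional grouping
IIN_RE = re.compile(r"\b\d{12}\b")                    # 12 consecutive digits


def mask_transcript(text: str, placeholder: str = "[hidden]") -> str:
    """Replace card numbers and IINs in a transcript with a placeholder."""
    text = CARD_RE.sub(placeholder, text)
    return IIN_RE.sub(placeholder, text)


print(mask_transcript("My IIN is 123456789012 and card 1234 5678 9012 3456"))
# My IIN is [hidden] and card [hidden]
```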
Integration with telephony and CRM: what to plan before development
If an on-prem assistant “doesn’t take off,” it’s often not the ASR or script but that telephony and CRM don’t provide required events or accept results. Before development agree integration points and data formats, otherwise the team will be fixing infrastructure during the project.
Clarify which telephony interfaces are actually available and who controls them. It’s critical not only to bring up SIP but to control the call via IVR/ACD or CTI: transfers, hold, queue, recording and agent status. Without this the assistant can’t understand its position in the servicing chain.
Fix the minimal event set that must be delivered to the system:
- call start (call-id, number, direction, queue);
- transfer or queue change (reason, target, who initiated it);
- agent connect (agent-id, answer time);
- call end (end code, who hung up);
- link to recording (recording-id, storage location).
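The event set can be pinned down as a small contract that both sides validate against. The event type and field names below are hypothetical, and the sketch assumes every event carries the call-id so events can be joined into one call history.

```python
from typing import Dict, Set

# Hypothetical event and field names derived from the list above.
REQUIRED_FIELDS: Dict[str, Set[str]] = {
    "call_start": {"call_id", "number", "direction", "queue"},
    "transfer": {"call_id", "reason", "target", "initiated_by"},
    "agent_connect": {"call_id", "agent_id", "answer_time"},
    "call_end": {"call_id", "end_code", "hung_up_by"},
    "recording_link": {"call_id", "recording_id", "storage_location"},
}


def missing_fields(event: dict) -> Set[str]:
    """Return the required fields that are absent from a telephony event."""
    event_type = event.get("type")
    if event_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown event type: {event_type!r}")
    return REQUIRED_FIELDS[event_type] - set(event)


event = {"type": "call_start", "call_id": "call-1", "number": "+70000000000", "direction": "inbound"}
print(missing_fields(event))   # {'queue'}
```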
Then decide what you need from CRM: lookup by phone or IIN, pull interaction history, create a ticket and populate fields (subject, reason, result, tags). Agree reference lists for reasons and statuses so the assistant doesn’t write free text where CRM expects codes.
Describe routing: when the assistant must pass to an agent (negative sentiment, cancellation request, verification error) and what context is sent with the handover. Usually that’s a short summary, recognized data and a note “what was already asked,” so the agent doesn’t start from scratch.
Reliability matters. Integrations should survive failures: message queues between components, retries with timeouts and idempotency by key (e.g., call-id + action). That avoids duplicate tickets and broken case histories on repeated deliveries.
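Idempotency by key can be sketched as follows. The `TicketWriter` class and the fake CRM call are hypothetical, and the in-memory dictionary stands in for a durable store that a real integration would need so that keys survive restarts.

```python
from typing import Callable, Dict


class TicketWriter:
    """Creates a CRM ticket at most once per (call_id, action) pair."""

    def __init__(self, create_ticket: Callable[[dict], str]) -> None:
        self._create_ticket = create_ticket
        self._results: Dict[str, str] = {}   # idempotency key -> ticket id

    def submit(self, call_id: str, action: str, payload: dict) -> str:
        key = f"{call_id}:{action}"
        if key in self._results:
            # Repeated delivery: return the earlier result, do not call the CRM again.
            return self._results[key]
        ticket_id = self._create_ticket(payload)
        self._results[key] = ticket_id
        return ticket_id


created = []


def fake_crm_create(payload: dict) -> str:
    """Stand-in for a CRM API call; records every ticket it creates."""
    created.append(payload)
    return f"T-{len(created)}"


writer = TicketWriter(fake_crm_create)
first = writer.submit("call-1", "create_ticket", {"subject": "no internet"})
retry = writer.submit("call-1", "create_ticket", {"subject": "no internet"})
print(first, retry, len(created))   # T-1 T-1 1
```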
Load and observability: keeping quality at peaks
Contact-center peaks are usually predictable: lunch, end of day, paydays, seasonal campaigns. If an on-prem solution isn’t sized for those spikes, quality silently degrades: customers repeat themselves, transfers to agents increase, and teams see only “services are up.”
First agree which signals indicate a problem. Collect metrics not only on hardware but on speech and dialogs:
- share of utterances recognized successfully versus “didn’t understand” responses (ASR), by topic;
- recognition and synthesis latency (p50 and p95) by dialog step;
- errors and failures: timeouts, queue overflows, model crashes;
- reasons for escalation to an agent (customer request, low confidence, missing data);
- “on-audio” problem spots: where customers repeat phrases and which words are commonly confused.
Then add infrastructure monitoring: CPU, RAM, GPU (if used), disk, network, queue lengths and concurrent sessions. A common pattern: average resource metrics look fine, but ASR queues grow at peaks and latency spikes.
Define load plans numerically: current concurrent calls, peak, expected channel growth in 6–12 months. Run load tests on realistic scenarios, e.g., a lunch peak with 200 calls where 30% of clients speak fast and interrupt.
Updates require discipline. Good rules: staged rollout (part of lines or scenarios), quick rollback of models and prompts, compare metrics before/after on identical time windows, and version dictionaries, normalization rules and integrations.
If the solution runs on your servers (often S200 racks), observability must be 24/7 like the contact center. Otherwise customers notice issues before monitoring.
Example scenario: recording a customer and creating a CRM case
Imagine an on-prem assistant handling an inbound call about “no internet.” The goal is to collect data without annoying the customer and hand a prepared case to an agent.
A short dialog flow could be: greeting and consent to record, identification (contract number or IIN) with 1–2 confirmation questions, clarify the issue (what happened, when it started, address or connection point), confirm key data and ask to say “yes,” then finish by giving a case number and transferring to an agent if needed.
High ASR accuracy is critical where errors are unacceptable: contract number, IIN, address, equipment model, dates. For problem description, capturing intent and slots is often enough: the assistant recognizes “no internet” or “low speed” and saves details for the agent if needed.
Context transferred to CRM should be concise and useful: subject, urgency, recognized identifiers, start time and a note “what was already asked.” The agent sees a 3–5 line summary and has access to the recording inside the corporate perimeter.
For disputes keep both audio and transcript with timestamps. It’s useful to save the recognition version before agent edits and the final corrected version to analyze model errors.
Anticipate typical input failures: noise and echo (especially on mobile), long pauses and interruptions, digits dictated run together or one by one, customers giving wrong or guessed data, and language switches mid-utterance.
If infrastructure is local, plan a performance buffer so recording, recognition and CRM writes don’t slow even at peak.
Step-by-step plan: taking a pilot to production
A pilot looks convincing until real queues, recordings, access rights and dozens of integrations appear. To avoid infrastructure roadblocks, lock a production transition plan before the first demo.
First agree what you are testing. Prefer 3–5 simple scenarios that deliver business value and are easy to measure: application status, appointment booking, balance check, transfer to agent. For each scenario set success criteria: share of correctly recognized key words, percent of completed dialogs, acceptable transfer rate to agents.
Next capture data requirements: which audio and transcripts are stored, retention times, who has access, and where logs are kept (including ASR/TTS and integration errors). This quickly reveals hidden blockers: a ban on unencrypted storage, or a requirement for separate test and prod environments.
A practical sequence of work is:
- Fix scenarios and quality metrics in plain language so business and security understand them.
- Describe data flows: call recording, transcript, client identifiers, retention, roles and audit.
- Estimate pilot and production load: concurrent calls, peaks, headroom for CPU/GPU, network and disks.
- Run a pilot measuring latencies, failures and degradation at peak, not only on “ideal” calls.
- Prepare a production environment: monitoring, redundancy, update procedures, rollback plan and 24/7 on-call.
Simple example: a clinic’s “book appointment” pilot ran on 10 lines, but production added 90-day audio retention and CRM integration. Disk needs and access rules increased dramatically, and without a prepared production environment the project would have stalled on approvals and procurement. If you already have on-prem infrastructure (e.g., S200 platforms), create a separate test bench and repeat measurements before each release.
Common mistakes that cause schedule and budget overruns
The most expensive problems in on-prem contact-center projects aren’t ASR models or voice quality but leaving infrastructure and processes for later. The pilot “flies,” but the production system slows, consumes disks and breaks scenarios.
Typical mistake one: buy ASR/TTS and not plan storage. Recordings, training audio, recognition logs, reply texts and metadata grow faster than expected. If retention isn’t defined up front (what stays 7 days, 90 days, or forever), you’ll suddenly hit disks, backups and maintenance windows.
Second mistake: not measuring latency in the pilot. Demos sound fine, but real calls reveal a pause between the customer and the reply. Customers interrupt, agents have to pick up, and automation loses value. In the pilot log numbers for each step: from audio reception to text, text to reply and total dialogue latency.
Third mistake: postponing CRM integration. Later you may find required fields missing, mismatched statuses, or scenarios relying on data the CRM doesn’t expose. For example, the assistant promises to “create a case” but can’t set category or owner, and the case goes to a catch-all queue.
Organizational errors also cause problems. Mixing test and prod (one server, shared accounts) makes bugs hard to reproduce and access separation difficult. No support and update plans lead to expired certificates, mismatched dependency versions and single-person knowledge. Failing to agree access rules for recordings and logs early forces redesigns of storage and audit.
Discuss these things before development and the project reaches production faster without sinking into infrastructure rework.
Quick checklist and next steps before project start
Before buying hardware or coding the bot, fix a minimal set of numbers and constraints. On-prem projects need this: mistakes in audio, latency or storage quickly block pilots.
Checklist for speech quality and latency
Test on real call recordings, not studio examples, otherwise estimates will be overly optimistic.
- Languages and variety: main client languages, accents, code-switching, frequent names and addresses, stop-words.
- Noise and channel: call-center background, mobile links, interruptions, pauses, emotional speech.
- Dictionaries and entities: product names, branches, full names, codes, contract numbers and how often they change.
- Audio format: sample rate, codec, mono/stereo, operator/client channel separation, recording requirements.
- Target latencies: allowable time to first reaction and between utterances and how these will be measured in tests.
Checklist for data, integrations and operations
These are often forgotten in pilots and then must be redone.
- Storage: what is stored (audio, transcripts, logs), retention policy, estimated volume per day/month, deletion policy.
- Backup and audit: who has access, what is logged, how fast can you recover, how to validate changes.
- Telephony and CRM integrations: events (incoming, transfer, end), context passing (number, client, subject), where dialog ID is stored.
- Load and resiliency: peak concurrent calls, headroom for CPU/GPU, degradation plan under overload.
- Monitoring: recognition quality metrics, successful scenario share, latencies, integration errors.
Next steps: collect requirements in one document (audio, latencies, storage, integrations, peak loads) and hold a short session with an integrator and the team responsible for servers. If you plan local deployment in Kazakhstan, GSE.kz (gse.kz) can help estimate server configurations and the integration setup so the pilot is close to a production setup from the start.
FAQ
Where should I start an on-prem voice assistant project so it doesn't get stuck on infrastructure?
Start by fixing real constraints: where services will run (dedicated servers or virtualization), acceptable p95 latency for dialogues, what exactly and how long you store (audio, transcripts, metadata), and which integration methods with telephony and CRM are allowed by information security. These decisions affect architecture more than the choice of a specific ASR/TTS model.
Which ASR quality metrics truly matter for a contact center?
In contact centers, business errors matter more than attractive WER numbers. Ensure key entities and intents are recognized reliably: IIN, contract number, address, amounts. Set requirements for the share of correctly extracted entities on real calls and define behavior for low confidence (how the system should reprompt instead of "confidently" making mistakes).
What if customers speak mixed Russian and Kazakh?
If clients switch between Russian and Kazakh within the same utterance, require support for code-switching in a single recognition stream. Two separate modes (“either Russian or Kazakh”) often cause language jumps, loss of meaning and entity errors, especially in names and addresses.
What TTS requirements are critical so the customer understands on the first try?
Intelligibility beats naturalness in TTS for voice scenarios. Agree on rules for pronouncing amounts, codes, acronyms and names in advance and test on your typical phrases—otherwise operators will end up "translating" what the assistant said.
What latencies are reasonable and how should they be measured?
Target p95 latency from the end of the customer's phrase to the assistant's reply around 0.8–1.2 s for dialogue. For operator prompts 1–2 s (p95) is often acceptable. Measure timelines with timestamps at each stage because latency accumulates across telephony, network, preprocessing, ASR/TTS and queues.
What components make up an on-prem solution and what must be considered in the architecture?
A minimal on-prem setup includes call ingress (SIP/telephony), ASR, NLU, a dialog engine, TTS and a recording subsystem with metadata. Decide where the source of truth for a call lives (audio, transcript, dialog events) and how redundancy will be handled, otherwise the pilot will work but the production environment will fail at the integration points.
How to estimate storage for recordings and avoid sudden disk growth?
Estimate volume with: average call duration × calls per day × days of retention, then adjust for reality (two-channel audio, service copies, logs). Split retention by value: keep metadata and dialog outcomes longer, store full audio and transcripts per SLA, and apply anonymization for personal data in transcripts and logs.
What baseline security and audit requirements are needed for an on-prem voice assistant?
Separate access by data type: operators usually only need the case card and a short summary; full recordings and transcripts should be limited to supervisors and quality teams. Encrypt data in transit and at rest, keep keys separately, and log who searched for, listened to or exported data. Decide masking rules for IINs, card numbers and addresses in transcripts.
What must be prepared for telephony and CRM integration before development?
Agree on which telephony interfaces are actually available and who controls them. Beyond SIP, you need call control via IVR/ACD or CTI: transfers, hold, queue information, recording and agent status. Also agree a minimal event set (call start, transfers, agent connect, end and link to recording) so the assistant understands its place in the flow.
How to maintain quality at load peaks and detect degradation early?
Collect metrics not only for hardware but for speech and dialogs: share of recognized text vs "didn't understand", p95 latencies by dialog step, reasons for escalation to an agent, timeouts and queue growth. Enforce staged rollouts and quick rollback procedures so degradations are visible immediately, not only through user complaints.