Why does on-prem RAG run fast in a pilot but slow down in production?

Because pilots usually run on a small dataset and don't reflect real operations. As the collection grows, the real limits turn out to be RAM, disk latency, long reindexing and complex backups, not benchmark search speed.

Where to begin when choosing a vector DB for a closed perimeter so you don't guess?

Start by recording scenarios and numbers: how many documents and chunks now and in a year, embedding dimensionality, share of queries with filters, target p95 and QPS. Then add perimeter constraints: maintenance windows, allowed downtime, RPO/RTO and data access rules.

Which performance metrics matter most for on-prem RAG?

Measure not only p50 but also p95 for search, both with typical filters and without. Also measure how latency changes during imports and updates, and how long it takes to build or rebuild an index after loading.

Why do metadata filters affect search performance so much?

Filters are usually not free: they can slow searches and even reduce quality if implemented on top of results or if the index is not designed for them. Test with real conditions like department, access level, date and document type; otherwise results will look good but be misleading.

How to estimate RAM needs for vectors and index in advance?

Start with a rough estimate of raw vector size: N × D × B, where N is the number of vectors, D the dimensionality and B the bytes per value. Then add overhead for the index, caches and auxiliary structures, multiply by the number of replicas, and leave headroom for the OS to avoid instability.

What to clarify about index mode: RAM or disk/mmap?

Ask whether the index must live fully in RAM or if a disk/memory‑mapped mode is available. Also check if caches and memory limits can be configured so the database doesn't crowd out other services like LLMs, embedding services or the orchestrator.

Which disk requirements are critical for an on‑prem vector DB?

Look at latency and IOPS, not just capacity. Under concurrent search and background tasks like ingest or compaction, slow disks create long p95 tails. NVMe helps with high parallelism; if updates are rare, a good SATA SSD may suffice and budget is better spent on redundancy and backups.

How to organize backup and recovery so you don't lose the index or time?

Define RPO and RTO first, then choose an approach: logical exports, snapshots or file copies. Importantly, regularly run restore drills on a testbed and verify not only data but also index, metadata and access rights — otherwise real downtime will be much longer than expected.

What matters most in replication and fault tolerance for on‑prem RAG?

At minimum you need replication across nodes, predictable automatic failover, and protection against divergent states after network problems. Keep in mind replication multiplies disk and often RAM needs, since each replica holds its own index and caches.

How to prepare operations in a closed perimeter: updates, monitoring and security?

You need a repeatable update process without internet: a local package repository, testbed verification, a scheduled maintenance window and rollback procedures. Also set up monitoring for simple signals: RAM and cache growth, disk fill and I/O rate, p95/p99 latencies and errors so on‑call staff can quickly find the cause of degradation.

Vector database for on‑prem RAG: choosing for a closed perimeter

Why choosing a vector DB for on‑prem RAG often stalls

In an open environment many start with: “let's pick the most popular DB and copy someone else's recipe.” For on‑prem this rarely works. In a closed perimeter what's important is not flashy benchmarks but how the system behaves for weeks without access to cloud services, marketplaces and “quick fixes.”

The first reason for stalling is the gap between a pilot and production. On a test set of 50k documents everything flies. After launch you suddenly realize the vector DB for on‑prem RAG is limited by RAM, not CPU: indexes bloat, the cache evicts useful data, and unpredictable latency spikes appear.

The second reason is neglected “boring” requirements: disks, backups and recovery. Pilots rarely check how big indexes will be in 3–6 months, how long a rebuild takes, or what happens on power failure or node loss. That's where costly surprises appear: long recovery times, inability to roll back to a point, and difficulty creating consistent backups without stopping the service.

The third reason is that the business expects predictability. In a closed perimeter stable operation, a predictable total cost of ownership and data control are usually more important than peak speed in ideal conditions. Selection stalls until requirements are formalized as SLAs and infrastructure limits. For example, if the cluster runs on local racks, you need to know in advance which disks are required, how replication is arranged and how much RAM to plan as headroom.

How on‑prem RAG is structured and the role of the vector DB

On‑prem RAG means search over your documents and answer generation work inside a closed perimeter, without sending data to external services. A typical flow has two modes: ingestion (indexing) and query-time (search and generation).

During ingestion documents are taken from file stores, mail or internal systems, then cleaned and chunked, and an embedding is computed for each chunk. Embeddings together with the source text (or a pointer to it) are saved in storage. This is where the vector DB comes in: it stores vectors, builds indexes for fast search and helps filter fragments by metadata (document type, department, classification, date).

Almost always there are three other components nearby: an embeddings service (sometimes on GPU), an LLM for the final answer and an orchestrator app that manages the pipeline and access control.

In production bottlenecks often show up where you don't expect them. Typical pain points are: index settings and type, RAM (can the index and cache fit), disk (IOPS, latency, the effect of large backups), network (if services are on different nodes) and metadata filters (when conditions are many and complex).

If RAG runs on your own racks the role of the vector DB becomes more pronounced: it largely determines whether the system will survive peaks and how predictably it will handle failures.

What to specify before comparing: data, SLA and constraints

Comparing vector DBs easily turns into a contest of “who is faster.” Without describing the data and operational rules, test numbers mean little. The same engine can be great for static regulations and unsuitable for a high‑write ticket flow.

Start with scenarios. For example: employees search internal regulations (rare updates, accuracy matters), IT knowledge base (daily updates), tickets and requests (continuous writes, fresh content required). These scenarios set requirements for indexing speed, update latency and quality of filters.

Then fix volumes in numbers: how many documents now and in a year, average chunk size (characters or tokens), chunks per document, embedding dimensionality (e.g., 768 or 1536). Separate “hot” data (accessed daily) and “cold” data (rarely touched).
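The volume figures above can be turned into a vector count with a few lines of arithmetic. The sketch below is only an illustration: the function name, the growth model (compound yearly growth) and the sample numbers are assumptions, not figures from a real deployment.

```python
def projected_vectors(documents: int, chunks_per_doc: float,
                      yearly_growth: float, years: float = 1.0) -> int:
    """Estimate the number of vectors (one per chunk) after a growth period.

    yearly_growth is a fraction: 0.5 means the corpus grows by 50% per year.
    """
    future_docs = documents * (1.0 + yearly_growth) ** years
    return round(future_docs * chunks_per_doc)


# Hypothetical inputs: 50,000 documents, 20 chunks each, 50% growth per year.
now = projected_vectors(50_000, 20, yearly_growth=0.0)
in_a_year = projected_vectors(50_000, 20, yearly_growth=0.5)
print(now, in_a_year)
```

The resulting vector count is the N used later when sizing RAM and disk, so it is worth computing it separately for hot and cold data.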

Write a short SLA: response time for 95% of requests and peak load (QPS), allowed downtime and maintenance windows, recovery requirements (RPO and RTO).

Finally, think through isolation in the closed perimeter. How many collections and teams will there be? Are different access rights needed? Are filters by department or classification required? The answers will show whether you need multi‑tenancy, a strict role model and data separation that works without manual effort.

Performance: what to look for in tests

Performance in RAG rarely reduces to a single number. For an on‑prem vector DB you need to know not only how fast the “nearest” vector is found but what happens with real filters, updates and concurrent load.

In tests record p50 and p95. p50 shows the “normal” request; p95 shows the tail where users notice problems. Add QPS at target accuracy and index build time after loading or reindexing.
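A minimal sketch of how p50 and p95 can be computed from raw latency samples, assuming the nearest-rank definition of a percentile (load-testing tools may interpolate and give slightly different values). The sample latencies are made up for illustration.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical search latencies in milliseconds from one test run.
latencies_ms = [42, 45, 47, 50, 51, 53, 55, 60, 64, 70,
                72, 75, 80, 85, 90, 95, 110, 140, 380, 900]

print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
```

In this sample the median looks healthy while p95 is several times higher, which is exactly the tail that a p50-only report would hide.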

Test metadata filters separately. In closed perimeters you almost always filter by department, classification, project, language, date. Filters can greatly slow search if applied on top of results or if the index isn't designed for them. Measure speed on queries with typical conditions, not only on “bare” vectors.
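The difference between filtering on top of results and filtering before the search can be shown with a toy brute-force example. This is a sketch under stated assumptions: real engines use approximate indexes and apply filters inside the index, and the chunk data, field names and helper functions here are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query, items, k):
    """Brute-force nearest neighbours by cosine similarity."""
    return sorted(items, key=lambda it: cosine(query, it["vec"]), reverse=True)[:k]


def post_filter(query, items, k, dept):
    # Search first, then drop results the user may not see.
    return [it for it in top_k(query, items, k) if it["dept"] == dept]


def pre_filter(query, items, k, dept):
    # Restrict the candidate set first, then search inside it.
    return top_k(query, [it for it in items if it["dept"] == dept], k)


chunks = [
    {"id": 1, "dept": "legal", "vec": [1.0, 0.0]},
    {"id": 2, "dept": "legal", "vec": [0.9, 0.1]},
    {"id": 3, "dept": "hr", "vec": [0.8, 0.2]},
    {"id": 4, "dept": "hr", "vec": [0.2, 0.8]},
    {"id": 5, "dept": "legal", "vec": [0.7, 0.3]},
]
query = [1.0, 0.0]

print([it["id"] for it in post_filter(query, chunks, 2, "hr")])
print([it["id"] for it in pre_filter(query, chunks, 2, "hr")])
```

With k = 2 the post-filter variant returns nothing for the "hr" department, because both nearest chunks belong to "legal", while the pre-filter variant returns the two "hr" chunks. This is the quality loss that only shows up when tests include real filter conditions.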

Agree in advance how bulk loading will work: documents per day, nightly loads, whether deletions and updates occur. Know how much the DB degrades during import and when a reindex will be required.

For a pilot keep hardware, dataset and query set constant so that runs are comparable. For example, in a government scenario the query set can be fixed at 70% of queries with a department and access level filter and 30% without.
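One way to keep the query set constant between runs is to generate it from a fixed seed. The sketch below assumes a 70/30 split between filtered and unfiltered queries; the function name, the filter values and the placeholder questions are hypothetical.

```python
import random


def build_query_set(questions: list[str], filters: list[dict],
                    filtered_share: float = 0.7, seed: int = 42) -> list[dict]:
    """Return the same mix of filtered and unfiltered queries on every run."""
    rng = random.Random(seed)  # fixed seed makes the set reproducible
    n_filtered = round(len(questions) * filtered_share)
    shuffled = questions[:]
    rng.shuffle(shuffled)
    queries = []
    for i, text in enumerate(shuffled):
        flt = rng.choice(filters) if i < n_filtered else None
        queries.append({"text": text, "filter": flt})
    rng.shuffle(queries)  # interleave filtered and unfiltered queries
    return queries


# Hypothetical inputs: 100 logged questions and two typical filter conditions.
questions = [f"question {i}" for i in range(100)]
filters = [
    {"department": "legal", "access_level": 2},
    {"department": "hr", "access_level": 1},
]
query_set = build_query_set(questions, filters)
print(sum(q["filter"] is not None for q in query_set), "of", len(query_set), "queries use a filter")
```

Saving the generated set next to the test results makes it possible to rerun the same workload against another candidate or after a configuration change.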

A protocol with a few metrics is enough: p50/p95 read latency (with and without filters), QPS at target load, load and index build time, degradation during imports/updates, CPU and disk at peak.

RAM consumption: how to estimate ahead and avoid mistakes

Memory often becomes the main limiter in on‑prem projects: buying disks is usually easier than renegotiating server profiles because of RAM shortages. So evaluate a vector DB for on‑prem RAG by memory before the pilot.

RAM is consumed by four things: the raw vectors, the search index (often HNSW or similar graph structures), caches (disk pages, query results) and auxiliary structures (metadata, filters, dictionaries, background task buffers). The worst scenario is an index that must live entirely in memory; collection growth then quickly hits a ceiling.

What to ask the vendor

Ask directly: does the index have to be fully in RAM or is a memory‑mapped mode supported (part of the data stays on disk and is loaded on demand)? Also ask if caches can be limited and memory caps set so the DB won't push out other services on the node.

Rough sizing starts with: raw vectors ≈ N × D × B, where N is vector count, D dimensionality, B bytes per value (4 for float32, 2 for float16). Add overhead for the index and auxiliary structures.

Practical rules:

index and overhead often add 1.5–4× the raw vectors;
multiply the result by the number of replicas (each has its own index);
leave 20–30% RAM for the OS and caches, otherwise the system becomes unstable.
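The sizing rules above can be combined into a small calculator. This is a rough sketch, not a vendor formula: it assumes the 1.5–4× rule is read as a total multiplier on raw vectors (if your vendor quotes index overhead on top of raw data, add 1 to the factor), and the function name, defaults and sample numbers are hypothetical.

```python
GIB = 1024 ** 3


def estimate_ram_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4,
                       index_factor: float = 2.0, replicas: int = 1,
                       os_reserve: float = 0.25) -> dict:
    """Rough RAM estimate following N x D x B plus overhead, replicas and OS reserve."""
    if not 0 <= os_reserve < 1:
        raise ValueError("os_reserve must be a fraction in [0, 1)")
    raw = n_vectors * dim * bytes_per_value      # raw vectors: N x D x B
    per_replica = raw * index_factor             # vectors + index + auxiliary structures
    node_ram = per_replica / (1 - os_reserve)    # keep os_reserve of the node's RAM free
    return {
        "raw": raw,
        "per_replica": per_replica,
        "node_ram": node_ram,
        "cluster": node_ram * replicas,          # each replica holds its own index
    }


# Hypothetical collection: 10 million float32 vectors of dimension 768, 2 replicas.
est = estimate_ram_bytes(10_000_000, 768, replicas=2)
for name, value in est.items():
    print(f"{name}: {value / GIB:.1f} GiB")
```

Running the estimate at both ends of the overhead range (index_factor 1.5 and 4.0) gives a realistic bracket to discuss with the vendor before the pilot.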

The tradeoff is simple: saving memory usually means slower search or slightly worse quality (e.g., stronger compression and a “lighter” index). Decide what matters for your SLA in advance rather than fixing RAM shortages in production.

Disks and storage: requirements often forgotten

For an on‑prem vector DB disk is almost as important as CPU. A typical error: the demo is fast but when real data is loaded the system starts to “think” during inserts, reindexing and recovery.

Look not only at capacity but at disk behavior under load. Vector indexes cause many small reads and noticeable writes during updates. So IOPS and latency, write endurance (TBW) and controller quality matter. On cheap SSDs you may hit limits in latency or wear rather than capacity.

NVMe excels under high parallelism: many concurrent search requests plus background writes (ingest, compaction, merges). If queries are few and updates rare, a good SATA SSD may be sufficient and budget could be better spent on redundancy and backups.

Storage growth is often underestimated: an index can grow nonlinearly due to segments, graph settings and auxiliary structures. Plan space with margin and account for all parts: DB and indexes, WAL and service files, temporary files for builds/compaction, replicas, backups and space for recovery.
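The parts listed above can be summed into a first-pass disk budget. The sketch assumes that temporary build space and WAL are needed on every replica, that backups are roughly the size of the data plus indexes, and that one extra copy of that size is reserved for recovery. All names and numbers are hypothetical.

```python
def plan_disk_gb(data_and_index_gb: float, wal_gb: float, temp_build_gb: float,
                 replicas: int, backup_copies: int, margin: float = 0.3) -> float:
    """First-pass disk budget in GB, with a growth margin on top."""
    per_replica = data_and_index_gb + wal_gb + temp_build_gb
    cluster = per_replica * replicas
    backups = data_and_index_gb * backup_copies
    restore_scratch = data_and_index_gb  # room to unpack one backup during recovery
    return (cluster + backups + restore_scratch) * (1 + margin)


# Hypothetical inputs: 200 GB of data and indexes, 20 GB WAL, 100 GB for builds
# and compaction, 2 replicas, 3 retained backups, 30% margin.
total = plan_disk_gb(200, 20, 100, replicas=2, backup_copies=3)
print(f"plan for about {total:.0f} GB")
```

In this example the data itself is 200 GB but the plan comes out several times larger, which is the gap that capacity-only estimates tend to miss.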

Backups and recovery: avoiding surprises

In on‑prem projects backups are often treated as “admins' work,” and later you discover recovery takes hours or the copy lacks the index. For vector DBs this is painful: losing the index means long rebuilds and downtime.

There are three common approaches. Logical backups (exporting data and metadata) are convenient for migrations and partial restore but can be slow at scale. Disk snapshots are fast and good for point‑in‑time state but require discipline around the filesystem and a write freeze. File copies (copying the data directory) are simple but risky if the DB writes during the copy.

Backup frequency should follow RPO/RTO, not habit. If collections update once a day a daily full copy may suffice. If documents and embeddings change continuously you need more frequent recovery points and a clear plan.

To make backups reliable, perform regular restore checks: once a month spin up a testbed and restore from the latest backup; verify data, search (queries, filters, relevance); measure actual recovery time against RTO; store configuration and schema versions alongside backups.
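The outcome of a restore drill can be compared with RPO and RTO in a few lines. This is a minimal sketch: it assumes worst-case data loss is roughly equal to the interval between backups (backup duration is ignored), and the function name and sample values are hypothetical.

```python
from datetime import timedelta


def drill_report(backup_interval: timedelta, measured_restore: timedelta,
                 rpo: timedelta, rto: timedelta) -> list[str]:
    """Return a list of violated targets; an empty list means the drill passed."""
    problems = []
    # A failure just before the next backup loses everything since the last one.
    if backup_interval > rpo:
        problems.append("backup interval exceeds RPO")
    if measured_restore > rto:
        problems.append("measured restore time exceeds RTO")
    return problems


# Hypothetical drill: daily backups, restore took 5 h 30 min,
# targets are RPO 24 hours and RTO 4 hours.
report = drill_report(
    backup_interval=timedelta(hours=24),
    measured_restore=timedelta(hours=5, minutes=30),
    rpo=timedelta(hours=24),
    rto=timedelta(hours=4),
)
print(report or "drill passed")
```

Here the backup schedule satisfies RPO but the measured restore misses RTO, which is the kind of result that only a real drill on a testbed reveals.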

In a closed perimeter keep copies in at least two locations: a separate storage and a medium isolated from the production network (for ransomware scenarios). Access should follow least privilege, with operation auditing and separate credentials for recovery procedures.

Replication and fault tolerance: what to choose for on‑prem

In on‑prem RAG a vector DB failure quickly becomes a search outage: answers dry up and users lose trust. Design fault tolerance from the start, not as an afterthought when the DB has grown.

The minimal useful set: replication across nodes (on different servers), automatic failover to a live node, a mechanism to prevent split‑brain (quorum or equivalent), and observability (replica status and replication lag in metrics and logs).
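The quorum idea behind split‑brain protection can be illustrated with a generic majority rule. This sketch is not the mechanism of any particular database; the function names are hypothetical and real systems add leader election, terms and timeouts on top.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def can_accept_writes(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition may accept writes only if it still sees a majority."""
    return reachable_nodes >= quorum_size(cluster_size)


# Three-node cluster split 2 / 1 by a network failure:
print(can_accept_writes(2, 3))  # majority side keeps working
print(can_accept_writes(1, 3))  # minority side must stop accepting writes

# Two-node cluster split 1 / 1: neither side has a majority.
print(can_accept_writes(1, 2))
```

The two-node case shows why a pair of servers alone cannot fail over safely: without a third voter neither side can prove it is the majority, so an odd number of voting nodes is the usual starting point.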

Then you face the consistency vs availability tradeoff. In an outage (e.g., network loss between racks) strict consistency can halt writes but avoid divergence. A more available mode continues working but requires later conflict resolution. In closed perimeters for government or finance predictability and strict rules usually win, even at the cost of temporary pauses.

Remember replication multiplies resource needs: two replicas roughly double disk and usually require noticeably more RAM since each replica holds indexes and caches.

Plan growth in advance: will you shard by collection or topic, run separate clusters for teams or classification levels, or split environments (prod for queries and a separate environment for reindexing)?

Operations and security in a closed perimeter

In a closed network a vector DB for on‑prem RAG is evaluated not only by speed. Daily behavior matters just as much: who can see what, what goes into logs, how to update without internet and how quickly on‑call staff can detect issues.

Start with isolation. Check for projects or namespaces, roles and permissions, and whether you can separate access not only to collections but also to metadata. A common risk is users seeing service fields (sources, tags, IDs) even though they lack access to documents.

Next — audit. Ask whether reads, writes, deletes, schema changes and auth errors are logged. Decide how long to retain logs (7, 30, 180 days) and where, so investigations aren't blocked by lack of disk.

Updates in a closed perimeter must be repeatable: import packages into an internal repository, verify dependencies and compatibility, test on a staging cluster, then schedule a prod window.

Build monitoring around simple charts: RAM and cache growth, disk fill and I/O rate, search latency (p95/p99), errors and timeouts. For government projects RBAC and audit logs are often more important than “+10% search speed.”
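The simple signals above map naturally onto a threshold check. The sketch below is an illustration only: the metric names and limits are hypothetical and should be replaced with values derived from your own SLA and hardware, and in practice this logic usually lives in the alerting rules of a monitoring system rather than in a script.

```python
def check_signals(metrics: dict[str, float], limits: dict[str, float]) -> list[str]:
    """Return the names of signals that are missing or above their limit."""
    return sorted(
        name for name, limit in limits.items()
        if name not in metrics or metrics[name] > limit
    )


# Hypothetical limits and a hypothetical morning snapshot.
limits = {
    "ram_used_pct": 85,
    "disk_used_pct": 90,
    "search_p95_ms": 500,
    "search_p99_ms": 1500,
    "error_rate_pct": 1,
}
metrics = {
    "ram_used_pct": 78,
    "disk_used_pct": 95,
    "search_p95_ms": 640,
    "search_p99_ms": 1200,
    "error_rate_pct": 0.2,
}
print(check_signals(metrics, limits))
```

A missing metric is reported as a problem on purpose: in a closed perimeter a silent exporter is often the first sign that something is wrong.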

Maintainability: what saves time each month

Even if you win performance tests, production is usually decided by maintainability: how easy it is to install, update and repair the DB without internet and without extra dependencies. The vector DB is part of the search chain and directly impacts answer quality.

What to ask before choosing

Before a pilot collect practical answers and ask for a demo on a testbed, not a slide deck: how installation works (single binary or multiple services), any dependency on Kubernetes or external DBs, whether rolling updates without downtime and rollbacks are possible, minimal cluster size and behavior on node loss, available metrics and logs, how restore on a new server looks and how long it takes.

What a “typical month” looks like

Incidents are more often about degradation than full failure. For example, an index grows overnight, free disk space drops to 5% and p95 latency climbs in the morning. In a well‑designed system you quickly see the cause in metrics and logs, can limit background tasks, add disk and safely run compaction during a maintenance window.

Another common scenario is a node failure. Normal operation includes clear replication status, predictable rebuild procedures and documented runbooks: who is on duty, how long partial search is acceptable, when to switch traffic.

To avoid relying on one person, formalize backup and recovery runbooks, maintenance windows, a list of responsible people and a short diagnostic checklist.

Example scenario: RAG for an organization with a closed network

Imagine an organization with a closed network: internet is blocked, documents live on internal resources, and infrastructure updates are allowed only at night on weekends. The team wants to enable search over regulations, contracts and instructions via RAG inside the closed perimeter, without moving data outside. They need to select a vector DB for on‑prem RAG.

Start the pilot not on the whole corpus but on a slice that represents realistic load. Example: 50,000 documents (PDF, DOCX), 2–3 departments, 10 typical queries and 200–300 real search requests from support logs.

In the pilot record the dataset and cleaning rules, configure metadata (department, document type, date, access level), pick 10–20 benchmark questions and quality criteria, run ingestion tests (docs per hour) and search tests (p95 latency).

Problems appear quickly. The index may grow more than expected due to chunking and duplicates. Loading can be slow because of low IOPS or because reindexing interferes with normal work. Recovery is a special pain: in a closed network you can't rely on external storage, and a single‑server full backup doesn't protect against disk subsystem failures.

Document results as clear targets and resource cushions. For example: p95 search ≤ 300–500 ms at N concurrent users, nightly load window ≤ 2 hours, RPO 24 hours, RTO 4 hours. Add disk and RAM headroom (for index growth and new collections), a scaling plan and regular restore tests.

Common mistakes when choosing an on‑prem vector DB

The most common mistake is choosing a solution by one metric: “search speed.” On‑prem RAG lives in the cycle “ingest — index — search — updates — backups — recovery.” If a DB searches fast but loads slowly, backs up poorly or needs complex manual maintenance, predictability is lost.

Second mistake — not planning for growth. Collections grow faster than forecast: new document versions, emails, OCRed scans. Data duplication (replicas, test environments, temporary indexes) means what fit at the start will pressure disk and memory in 6–12 months.

Third mistake — testing without metadata. In production you almost always need restrictions like “only branch X,” “only docs after date Y,” “only type: order.” Filters change latency and relevance, so comparing systems without them yields pretty but wrong results.

Fourth mistake — confusing RAM and disk requirements. If an index needs in‑memory structures, adding disk won't help: the DB will still choke on RAM and cache. Verify what is kept in memory, what is on disk and how behavior changes when RAM is low.

Finally, don't skip recovery tests. Backups are often made “for the record” but never used to bring a clean machine up. A simple test: simulate a node failure, restore the DB and confirm access rights, metadata and indexes return and RAG answers within SLA.

Step‑by‑step selection plan and pilot checklist

Treat vector DB selection for on‑prem RAG as a small engineering project: clear inputs, a pilot on your hardware and fixed outcomes. This way you won't buy a product that looks great in demos but hits RAM, disk or maintenance limits in production.

Typical sequence: describe loads (collection size, growth, target p95, indexing and backup windows, availability), shortlist 2–3 candidates suited to the closed perimeter (on‑prem support, OS requirements, minimal external dependencies), run a pilot on your infrastructure and data, evaluate operations (monitoring, alerts, updates, roles, disaster recovery), and calculate resources and total cost of ownership for 6–12 months including growth.

To make results comparable, agree on success criteria in advance. Example: “p95 search latency ≤ 250 ms, restore from backup ≤ 60 minutes, clear replication scheme.”

A short pilot checklist is enough: p95 latency on typical queries under concurrent load, index growth and reindex time, steady‑state and peak RAM per node (indexing, compaction), SSD/NVMe requirements and behavior under disk degradation, a working restore and the chosen replication scheme (what is lost on failure).

Finalize selection with a weighted criteria table, a list of risks and a scaling plan. The next step after choosing software is to approve the server configuration and the deployment environment for pilot and prod. If the project is constrained by selecting on‑prem hardware and ongoing operations, such tasks are often completed through a system integrator and an infrastructure vendor like GSE.kz (gse.kz), especially when delivery, integration and local support in Kazakhstan are important.
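A weighted criteria table can be as simple as the sketch below. The criteria, weights, candidate names and scores are all hypothetical placeholders; the point is only to show how pilot results can be reduced to one comparable number per candidate.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores (same scale for every criterion)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Hypothetical weights (importance) and pilot scores on a 1-5 scale.
weights = {"p95_latency": 3, "ram_per_node": 2, "restore_time": 3, "operations": 2}
candidates = {
    "candidate_a": {"p95_latency": 5, "ram_per_node": 2, "restore_time": 2, "operations": 3},
    "candidate_b": {"p95_latency": 3, "ram_per_node": 4, "restore_time": 4, "operations": 4},
}

ranking = sorted(candidates, key=lambda c: weighted_score(candidates[c], weights), reverse=True)
for name in ranking:
    print(name, round(weighted_score(candidates[name], weights), 2))
```

In this made-up example the candidate with the fastest search loses to the one that restores and operates better, which matches the article's point that search speed alone should not decide the choice.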