How do RAG acceptance tests differ from a pilot and regular testing?

Acceptance testing agrees in advance on **measurable criteria** under which the system can go into production and the business can take responsibility for its results. Regular testing during development checks that the system basically works, and a pilot checks whether it is useful to users. Acceptance reduces the risk of confident-but-wrong answers, leaks via citations and quality collapses under load.

Which quality criteria should always be included in RAG acceptance?

Typically you record accuracy, completeness, verifiability via citations, adherence to access rights and stability (repeatability). It’s important to set thresholds and failure rules in advance—for example, a single critical error in an amount or deadline should mark the test as failed.

What exactly should be "frozen" before starting acceptance so results are reproducible?

Freeze the model version, system prompt, generation parameters, retriever settings (top-k, filters), document chunking method and a snapshot of the knowledge index. Without these, any defect is hard to reproduce and fix.

Who should take part in acceptance and what are their responsibilities?

You need a process owner, business experts, IT, information security and the development/supplier team. The owner approves reference answers and thresholds, IS handles leak tests and the access matrix, IT provides the test environment and logs, and development investigates and fixes issues.

How do you assemble a set of test queries for acceptance?

Collect real questions from email, Service Desk and chats, then add edge cases: ambiguous wording, multi-step queries, bilingual questions, domain terms and abbreviations. Include negative cases: “no data in the base”, “outdated document”, “attempt to access a restricted item”.

What is a "reference answer" and how do you document it to avoid disputes during acceptance?

A reference is not an ideal text but a verifiable standard: list mandatory facts and points, allowed phrasings, and which fragments must appear in citations. The process owner signs off the reference so the discussion is about compliance with documents, not style.

What are common reasons for failing on citation and verifiability?

Failures include missing citations, a citation that doesn’t support the claim, using one citation for the whole answer instead of mapping citations to key claims, or including restricted fragments in a citation. A good answer lets you quickly find the original source by page/section/fragment ID.

How do you check access rights and ensure the system doesn’t reveal restricted content?

Run the same query under different roles and attempt bypasses: “quote”, “restate from memory”, “give figures from clause 4.2”. Correct behavior is a short refusal without hints of the content, no restricted documents in sources and no leaks in service fields.

What counts as successful load tests for RAG and how should degradation behave?

Test not only latency but also quality under peaks: the share of timeouts, empty responses and “no sources found” cases. Predefine a degradation strategy—queuing with clear wait messages, rate limiting, or a polite refusal—so the system does not start returning random answers.

What should the acceptance protocol contain so it can be reused for a retest later?

Keep the protocol short but unambiguous: scope and perimeter, component versions and configuration, knowledge base snapshot and access matrix, a test-case table with IDs and artifacts, results by criteria and a defect list with priorities. The commission’s decision must state accepted / accepted with conditions / not accepted and required retests.

RAG System Acceptance Testing: Scenarios and Protocol

Why acceptance testing for a RAG system is needed

A RAG system is an AI assistant that answers not only from what the model learned in training but also from your documents: it finds fragments in the knowledge base and uses them when generating a reply. Errors most often occur in two places: the system selects the wrong sources, or it draws conclusions that don’t follow from the retrieved text.

Acceptance testing separates impression from fact. During development the team usually checks that "it basically works." In a pilot you check whether it’s useful to users. Acceptance fixes measurable conditions under which the business can say: yes, this can go into production and we will be accountable for the results.

The most painful risks are almost always the same: an incorrect answer leading to wrong decisions and costs; leakage of data from restricted documents through answers or quotations; failures and sharp quality drops under peak load.

To “accept the system” means to agree in advance on criteria and thresholds: what accuracy is sufficient, how to verify completeness, whether answers must contain citations, what to do when data is missing, which roles have access to which sources, and what level of load is required.

A simple example: an employee is looking for a procurement rule or the procedure for handling a request. If the assistant gives a confident but wrong clause, or shows an excerpt from a document the user is not allowed to see, consequences are worse than an honest “I don’t know.” Therefore RAG acceptance tests are not a formality but a way to fix responsibility and reduce risk to an acceptable level.

Preparing for acceptance: who participates and what to record

Acceptance rarely goes smoothly without owners and pre-agreed rules. Before you start, appoint responsible parties and agree who can pause tests if a critical risk is found.

Usually five roles participate: process owner (tasks and priorities), business experts (validate meaning and usefulness), IT (environment, integrations, logs), information security, or IS (access, leaks, audit), and the supplier or development team (fixes and clarifications).

Then prepare artifacts: a list of documents the system answers from; an access matrix (who can read what); a list of test users and groups. Define in advance which data is confidential, and what the expected system behavior looks like: refusal, redaction, or an answer without details.

Check the environment separately. The test and pre-release environments should be as close to identical as possible: same service versions and the same knowledge indexes. During acceptance, enforce data immutability: nobody may quietly replace documents, rebuild indexes, or change settings. Otherwise test results lose their meaning.

Before the first test, record exactly what you are accepting:

the model and its version;
generation parameters (e.g., temperature, max tokens);
the system prompt and templates;
a snapshot of the knowledge base and the index build method;
the retriever and its settings (top-k, filters).
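
One way to make the frozen configuration checkable is to record a fingerprint of it in the protocol and recompute it before each test run. The sketch below is only an illustration: the class name `AcceptanceConfig`, its field names and all values are assumptions, not part of any specific RAG product.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AcceptanceConfig:
    """Settings that must not change while acceptance is running."""
    model_version: str
    temperature: float
    max_tokens: int
    system_prompt: str
    index_snapshot_id: str
    chunking_method: str
    retriever_top_k: int
    retriever_filters: tuple = ()


def fingerprint(config: AcceptanceConfig) -> str:
    """SHA-256 over a canonical JSON form, so equal settings give equal hashes."""
    canonical = json.dumps(asdict(config), sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# All values below are placeholders for illustration.
baseline = AcceptanceConfig(
    model_version="example-llm-1.0",
    temperature=0.0,
    max_tokens=512,
    system_prompt="Answer only from the provided fragments.",
    index_snapshot_id="kb-snapshot-001",
    chunking_method="by-section",
    retriever_top_k=5,
    retriever_filters=("lang=en",),
)
recorded = fingerprint(baseline)  # store this value in the protocol

# Before each run, recompute and compare with the recorded value.
drifted = replace(baseline, retriever_top_k=10)
print(fingerprint(baseline) == recorded)  # True: nothing changed
print(fingerprint(drifted) == recorded)   # False: top-k was modified
```

A hash does not say what changed, only that something did, so the full configuration should still be stored next to it.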

Quality criteria: what we measure and how we interpret it

Without pre-agreed metrics acceptance quickly turns into a debate of tastes. When criteria and tolerances are defined, any test ends with a clear verdict: pass or fail.

Typically the set of criteria boils down to five items.

Accuracy. Facts, numbers, names and conclusions must match the sources. Exceptions are acceptable only when the sources are incomplete or contradictory and the system explicitly states this.

Completeness. The answer covers the key points expected for the query. It’s convenient to check against a checklist of 3–7 mandatory points: each is either present or absent.

Citation and verifiability. The answer includes pointers to specific fragments (page, section, paragraph, fragment identifier), not just the document name. Citations should support exactly the claims they accompany.

Access rights. A user must not receive restricted information in the answer, the citations, or in hints like a list of documents. Access errors must be handled correctly and without leaking details.

Stability. Repeated runs of the same query should not wander in meaning. You can define tolerance this way: conclusions should be the same; wording may vary.

If an answer contains a single incorrect critical detail (deadline, amount, regulatory requirement), the test is usually considered failed even if the rest is correct.

Example: a finance employee asks about procurement conditions. If the system paraphrases the rule correctly but cites a general document without a precise fragment, it passes accuracy but fails citation.
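
The pass or fail rule described above can be written down so that nobody argues about it later. This is a minimal sketch under assumptions: the criterion names mirror the five items in this section, while `CaseResult`, `failed_criteria` and `verdict` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

CRITERIA = ("accuracy", "completeness", "citation", "access", "stability")


@dataclass
class CaseResult:
    case_id: str
    passed: dict = field(default_factory=dict)  # criterion name -> True/False
    critical_error: bool = False  # wrong deadline, amount or regulatory requirement


def failed_criteria(result: CaseResult) -> list:
    """A criterion that was not recorded counts as failed."""
    return [name for name in CRITERIA if not result.passed.get(name, False)]


def verdict(result: CaseResult) -> str:
    """One critical error fails the case even if every criterion passed."""
    if result.critical_error or failed_criteria(result):
        return "fail"
    return "pass"


# The finance example: correct paraphrase, but no precise fragment cited.
finance_case = CaseResult(
    case_id="example-001",  # placeholder identifier
    passed={"accuracy": True, "completeness": True, "citation": False,
            "access": True, "stability": True},
)
print(failed_criteria(finance_case))  # ['citation']
print(verdict(finance_case))          # fail
```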

How to assemble the test query set and reference answers

The test set for RAG acceptance should reflect real user work. Start by exporting typical requests from email, Service Desk, chats and the knowledge base. Then shorten them to clear formulations and add edge cases where the system often errs.

Group questions by purpose: reference (what it is, where to find it, who is responsible), regulation (rules, deadlines, limits), procedure (steps, roles, document templates), calculations (formulas, rates, limits, examples), incidents (what to do in a failure, escalation, contacts).

Add complexity. The sample should include simple questions, compound ones (“do A and clarify B”), ambiguous ones (“how to file a request?” without context), and dialog flows where the second question depends on the first. Don’t forget languages and terms: different locales, abbreviations, internal system names and subdivisions.

Prepare negative cases separately: no answer in the base, an outdated document, a question about restricted information, the user asking the assistant to “make up” something not in the sources. These check that the assistant honestly admits missing data and does not hallucinate.

A reference answer is better as a testable standard than as a perfect text. Record:

which facts are mandatory;
which phrasings are acceptable;
which source references should appear in citations.

The process owner (e.g., HR for leave, IS for access, accounting for calculations) signs off the reference. The acceptance team agrees tolerances beforehand: what counts as a partial answer and where human escalation is required.
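
A reference answer recorded as structured data is easier to check for gaps before testing starts. The sketch below assumes a hypothetical `ReferenceAnswer` record; the field names, the sample facts and the fragment identifier are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceAnswer:
    case_id: str
    mandatory_facts: list = field(default_factory=list)
    acceptable_phrasings: dict = field(default_factory=dict)  # fact -> list of wordings
    required_sources: list = field(default_factory=list)      # fragment identifiers
    approved_by: str = ""                                      # process owner who signed off


def reference_problems(ref: ReferenceAnswer) -> list:
    """Return reasons why this reference is not yet usable in acceptance."""
    problems = []
    if not ref.mandatory_facts:
        problems.append("no mandatory facts listed")
    for fact in ref.mandatory_facts:
        if not ref.acceptable_phrasings.get(fact):
            problems.append(f"no acceptable phrasing for: {fact}")
    if not ref.required_sources:
        problems.append("no source references listed")
    if not ref.approved_by:
        problems.append("not signed off by the process owner")
    return problems


# Placeholder content: a leave-related question owned by HR.
draft = ReferenceAnswer(
    case_id="example-leave-01",
    mandatory_facts=["notice period", "approver"],
    acceptable_phrasings={"notice period": ["14 days", "two weeks"]},
    required_sources=["hr-policy#3.2"],
)
print(reference_problems(draft))
# ['no acceptable phrasing for: approver', 'not signed off by the process owner']
```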

Step-by-step acceptance plan

Start by recording exactly what you are testing. On the start day freeze the knowledge base (document versions, index, chunking settings) and the access matrix (roles, groups, available collections). Otherwise any discovered defect will be hard to reproduce.

A convenient sequence gives measurable results at each step:

Run a smoke set of 10–20 typical queries. Check that answers are generated, sources appear, and logs show retrieval requests, selected fragments and refusal reasons.
Execute functional scenarios grouped by criteria: accuracy, completeness, citation, access rights, behavior for “I don’t know”.
Note deviations from references: where the model “filled in”, omitted a key point, or cited the wrong document.
Check repeatability on some cases: run the same query 3–5 times and compare meaning, citations and confidence (if exposed).
Perform load and degradation tests: burst of requests, timeouts, token limits, failure of a component. Record how the system responds under peak and how it recovers.

Finish by producing the protocol: knowledge base version, test set, results, list of findings with priorities, remediation plan and retest dates. This is especially important if the assistant is used in regulated processes (for example, government or banking), where an audit trail of verification is required.
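
The repeatability step in the plan above can be partly automated. The sketch below assumes that the assistant can be called through a function returning a dictionary with a short `conclusion` field and a list of `sources`; that interface, and the stub assistants used here, are assumptions for illustration. Comparing normalized strings is a crude proxy for "same meaning", so disputed cases still need a human look.

```python
import itertools


def normalize(text: str) -> str:
    """Ignore case and spacing so that only the wording itself is compared."""
    return " ".join(text.lower().split())


def distinct_outcomes(ask, query: str, runs: int = 3) -> set:
    """Run the same query several times and collect distinct (conclusion, sources) pairs."""
    outcomes = set()
    for _ in range(runs):
        answer = ask(query)
        outcomes.add((normalize(answer["conclusion"]), frozenset(answer["sources"])))
    return outcomes


def is_stable(ask, query: str, runs: int = 3) -> bool:
    return len(distinct_outcomes(ask, query, runs)) == 1


# Stub assistants standing in for the real system.
def stable_assistant(query):
    return {"conclusion": "2 business days", "sources": ["reg-12#4.1"]}


_alternating = itertools.cycle(["2 business days", "5 business days"])


def flaky_assistant(query):
    return {"conclusion": next(_alternating), "sources": ["reg-12#4.1"]}


question = "How long does it take to issue access to premises?"
print(is_stable(stable_assistant, question))  # True
print(is_stable(flaky_assistant, question))   # False
```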

Accuracy scenarios

Accuracy is tested where documents provide a clear ground truth. For acceptance this means: the answer must match the fact in the source and must not swap concepts even if the query’s phrasing is close to other terms.

A good accuracy set usually includes several question types:

factual questions with a single correct answer: order number, limit, rate, address, contact;
questions with similar terms: “warranty period” vs “service life”, “incident” vs “ticket”;
queries with dates, versions and exceptions: “from 01.01.2025”, “edition 3.2”, “except in cases…”;
the same content in two formats: “answer in one sentence” and “explain in detail”.

Example: a regulation states that access to premises is granted within 2 business days, but contractors are an exception: 5 days. Test both cases separately. If the system confidently answers “2 days” for contractors, that is an accuracy error even if a source is cited.

Record results as: correct, partial, incorrect. Note the cause: wrong fragment retrieval, distortion during generation, or a data issue (outdated version, duplicates, conflicting documents).
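
Recording the outcome and the cause as fixed values makes the results easy to count afterwards. This is a small sketch; the enum names and the sample case identifiers are assumptions for illustration.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    PARTIAL = "partial"
    INCORRECT = "incorrect"


class Cause(Enum):
    RETRIEVAL = "wrong fragment retrieval"
    GENERATION = "distortion during generation"
    DATA = "data issue"


def summarize(results):
    """results: list of (case_id, Outcome, Cause or None) tuples."""
    outcomes = Counter(outcome for _, outcome, _ in results)
    causes = Counter(
        cause for _, outcome, cause in results
        if outcome is not Outcome.CORRECT and cause is not None
    )
    return outcomes, causes


# Placeholder results for four accuracy cases.
results = [
    ("acc-01", Outcome.CORRECT, None),
    ("acc-02", Outcome.INCORRECT, Cause.RETRIEVAL),
    ("acc-03", Outcome.PARTIAL, Cause.DATA),
    ("acc-04", Outcome.INCORRECT, Cause.RETRIEVAL),
]
outcomes, causes = summarize(results)
print(outcomes[Outcome.INCORRECT])  # 2
print(causes.most_common(1))        # retrieval is the most frequent cause here
```

Grouping by cause shows who should fix the defect: retrieval and generation issues go to the development team, data issues to the document owners.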

Completeness and usefulness scenarios

Completeness in RAG is checked by whether the answer covers mandatory points rather than by overall impression. Include questions where the correct answer must contain 3–5 concrete elements.

A simple scenario format: take a typical question and fix a reference set of points. For example, “What are the conditions to grant access to an internal report?”—the reference lists: who approves, the timeframe, required documents, exceptions (who is excluded), and what to do in disputes.

Usefulness is best evaluated with two tests per query: “short answer” and “full answer.” The short answer should be concise but not omit key constraints. The full answer may add details without fluff.

Also include cases where the model must honestly say “I don’t know” and ask for clarification when the base lacks data or the question is too general (e.g., missing branch, period, or policy version).

For results, measure the share of mandatory points found and note any extra unsupported assertions:

list mandatory reference points;
how many of them appear in the answer;
which constraints or exceptions were missed;
whether unsupported claims appeared;
whether additional clarifying information is needed.
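
The share of mandatory points found can be computed with a very simple check. The sketch below matches accepted phrasings as substrings, which is an assumption made to keep the example short; real answers usually need an expert or a more tolerant comparison. The points and the answer text are invented.

```python
def coverage(mandatory_points: dict, answer: str):
    """mandatory_points maps a point name to the phrasings accepted for it."""
    if not mandatory_points:
        raise ValueError("the reference must list at least one mandatory point")
    text = answer.lower()
    found = [
        name for name, phrasings in mandatory_points.items()
        if any(phrase.lower() in text for phrase in phrasings)
    ]
    missed = [name for name in mandatory_points if name not in found]
    return len(found) / len(mandatory_points), missed


# Reference for "What are the conditions to grant access to an internal report?"
# The accepted phrasings are placeholders.
points = {
    "approver": ["department head"],
    "timeframe": ["3 business days"],
    "documents": ["access request form"],
    "exceptions": ["contractors"],
    "disputes": ["escalate"],
}
answer = (
    "The department head approves access within 3 business days "
    "after you submit an access request form."
)
share, missed = coverage(points, answer)
print(round(share, 2))  # 0.6
print(missed)           # ['exceptions', 'disputes']
```

Unsupported extra claims are not detected by this check and still have to be noted by the reviewer.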

Citation and verifiability scenarios

If you cannot quickly verify where information came from, trust in the RAG system drops. Acceptance needs scenarios where the goal is precise grounding in the knowledge base rather than a polished narrative.

Simple rule: every key idea that affects user decisions (numbers, requirements, steps, deadlines, constraints) is backed by a citation. Test queries where it’s easy to go wrong: “which documents are required”, “what is the deadline”, “which policy version”, “what exceptions exist”.

Check citations on two axes: relevance and honesty. Relevance means the fragment is truly about the claim, not merely related. Honesty means the model does not alter meaning or infer something not present in the text.

Quick checks:

the quote matches the source fragment and is not taken out of context so as to change the meaning;
each major claim has its own supporting fragment (not one citation for the entire answer);
when multiple sources are used, citations are mapped to claims without mixing;
citations do not contain restricted fragments (checked with IS);
when opening the source the user sees the same text referenced by the answer.

A separate scenario is stitching several documents. For example, a question about warranty terms and support procedures may require two different regulations. Correct behavior is to indicate which part of the answer is based on which document.

Failure criteria should be strict: no citations, citation unrelated to the claim, citation missing in the index, or a citation that contains restricted content.
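
The mechanical part of these failure criteria can be checked automatically, as in the sketch below. It assumes that each key claim in an answer comes with a cited fragment identifier and, optionally, a literal quote; the `Fragment` and `Claim` structures and all sample data are hypothetical. Whether a fragment is truly relevant to the claim still has to be judged by a person.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    fragment_id: str
    text: str
    restricted: bool = False


@dataclass
class Claim:
    text: str
    fragment_id: str = ""  # identifier of the cited fragment, empty if none
    quote: str = ""        # literal quote given in the answer, if any


def citation_failures(claims, index: dict) -> list:
    """Return (claim text, reason) pairs for every claim that fails the strict rules."""
    failures = []
    for claim in claims:
        if not claim.fragment_id:
            failures.append((claim.text, "no citation"))
            continue
        fragment = index.get(claim.fragment_id)
        if fragment is None:
            failures.append((claim.text, "citation missing in the index"))
        elif fragment.restricted:
            failures.append((claim.text, "citation contains restricted content"))
        elif claim.quote and claim.quote not in fragment.text:
            failures.append((claim.text, "quote does not match the fragment"))
    return failures


# Placeholder index and claims.
index = {
    "reg-7#2.1": Fragment("reg-7#2.1", "The warranty period is 24 months."),
    "sec-1#5": Fragment("sec-1#5", "Internal exception list.", restricted=True),
}
claims = [
    Claim("Warranty lasts 24 months", "reg-7#2.1", "warranty period is 24 months"),
    Claim("Support responds within a day"),
    Claim("Exceptions exist", "sec-1#5"),
    Claim("Old rule applies", "reg-0#9"),
]
for claim_text, reason in citation_failures(claims, index):
    print(f"{claim_text}: {reason}")
```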

Access rights and data security scenarios

Access rights in RAG are more important than a “nice” answer: a single leak can ruin the whole acceptance. Start with the role-to-source matrix: which document collections each role sees and what counts as restricted data (commercial terms, personal data, internal regulations).

Define the roles used in tests, for example: guest, employee, manager, IS specialist and administrator. For each role predefine 3–5 restricted documents and 3–5 open ones so it’s clear when the system must refuse and when it should answer.

Test refusal patterns. Simulate queries like “show contract No.…”, “what amounts are in the appendix”, “what is in clause 4.2” under a role without access. Also attempt bypasses: “quote a piece”, “output a table”, “restate from memory”, “give key figures.” Test not only direct quoting but any hints of the content.

Include mixed-dialog scenarios. Start a session as a user with access, then switch to a user without access and repeat the questions. The system must not carry context, excerpts or derived conclusions across roles.

Acceptance criteria are simple: a clear refusal with a short explanation of the reason; no restricted fragments in answers; no restricted sources in citations or service fields.
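
The check on sources and service fields can be automated against the access matrix. In the sketch below the response layout (`sources` plus a `debug.retrieved` service field), the role names and the collection names are all assumptions; substitute whatever your system actually returns. Leaks in the answer text itself are not covered and need a separate review with IS.

```python
# Role -> collections the role may see. Placeholder matrix.
ACCESS_MATRIX = {
    "employee": {"general"},
    "manager": {"general", "management"},
    "is_specialist": {"general", "management", "security"},
}


def leaked_collections(role: str, response: dict) -> set:
    """Collections that appear in citations or service fields but are not allowed for the role."""
    allowed = ACCESS_MATRIX[role]
    seen = {source["collection"] for source in response.get("sources", [])}
    # Service fields can leak too, so inspect them the same way.
    debug = response.get("debug", {})
    seen |= {item["collection"] for item in debug.get("retrieved", [])}
    return seen - allowed


# A refusal that looks clean in the visible part but leaks through a service field.
response = {
    "answer": "You do not have access to this document.",
    "sources": [],
    "debug": {"retrieved": [{"id": "sec-1#5", "collection": "security"}]},
}
print(leaked_collections("employee", response))       # {'security'}
print(leaked_collections("is_specialist", response))  # set()
```

Running the same recorded response through every role in the matrix gives a quick table of where the system must refuse and where it may answer.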

Load tests and stability under peaks

Load tests ensure the RAG system remains predictable when requests are many and vary in "weight." Define the load profile to match real usage: concurrent users, requests per second (RPS), average dialog length (single query or chains of 5–7), and context size for answers (short pointers or multi-paragraph analyses with citations).

Peaks often occur at the start of the day, before meetings and deadlines. In organizations where the assistant helps find regulations, 50 employees may nearly simultaneously ask about the same topic. Such bursts quickly reveal bottlenecks in search, ranking and generation.

Set SLA metrics so they can be verified numerically:

response time percentiles (p50, p95, p99) separately for "light" and "heavy" queries;
error rate (e.g., 4xx/5xx) and timeout share;
quality stability: not only speed but the number of empty answers or “no sources found”.
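
The metrics listed above can be computed from raw load-test samples as follows. This is a sketch under assumptions: the sample format (`latency_s`, `status`, `has_sources`) is invented, percentiles use the nearest-rank method, and latency is measured over successful requests only.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def sla_report(samples):
    total = len(samples)
    ok_latencies = [s["latency_s"] for s in samples if s["status"] == "ok"]
    return {
        "p50": percentile(ok_latencies, 50),
        "p95": percentile(ok_latencies, 95),
        "p99": percentile(ok_latencies, 99),
        "timeout_share": sum(s["status"] == "timeout" for s in samples) / total,
        "error_share": sum(s["status"] == "error" for s in samples) / total,
        "no_sources_share": sum(
            s["status"] == "ok" and not s["has_sources"] for s in samples
        ) / total,
    }


# Ten placeholder samples: eight successful, one timeout, one error.
samples = [
    {"latency_s": latency, "status": "ok", "has_sources": True}
    for latency in (0.7, 0.8, 0.9, 1.0, 1.0, 1.1, 1.2)
]
samples.append({"latency_s": 4.2, "status": "ok", "has_sources": False})
samples.append({"latency_s": 30.0, "status": "timeout", "has_sources": False})
samples.append({"latency_s": 0.2, "status": "error", "has_sources": False})

report = sla_report(samples)
print(report["p50"], report["p95"])  # 1.0 4.2
print(report["timeout_share"])       # 0.1
```

To report "light" and "heavy" queries separately, split the samples by query type first and call `sla_report` on each group.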

Decide in advance how the system degrades under overload: queuing with clear wait-time messages, rate limiting, or a polite refusal asking the user to retry later. A predictable delay is better than random answers.

For analysis you need observability without leaks. Minimum logs should record:

request identifier and timestamps for each stage (search, context assembly, generation);
context size and number of retrieved fragments;
error codes, timeouts, refusal reasons;
anonymized access labels (role, tenant) without document text or personal data.
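
A log record with these fields might look like the sketch below. The function name, the field names and the sample values are assumptions for illustration; the point is that the record carries timings, sizes and labels but never the query, the answer or the document text.

```python
import json
import uuid
from datetime import datetime, timezone


def make_log_record(role, tenant, stage_ms, context_tokens, fragment_count,
                    error_code=None, refusal_reason=None):
    """Build a log entry that contains no query text, answer text or personal data."""
    return {
        "request_id": str(uuid.uuid4()),
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "stage_ms": dict(stage_ms),        # search, context assembly, generation
        "context_tokens": context_tokens,
        "fragment_count": fragment_count,
        "error_code": error_code,
        "refusal_reason": refusal_reason,
        "role": role,                      # anonymized access labels only
        "tenant": tenant,
    }


record = make_log_record(
    role="employee",
    tenant="tenant-a",  # placeholder label
    stage_ms={"search": 120, "context_assembly": 15, "generation": 900},
    context_tokens=1800,
    fragment_count=5,
    refusal_reason="access_denied",
)
print(json.dumps(record, indent=2))
```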

Common mistakes during acceptance

The most frequent problem is turning acceptance into a demo. The team shows only favorable queries where the system will likely answer well. In real use troublesome queries emerge: no data in the base, ambiguous questions, requests for confidential data, or conflicting documents.

The second trap is non-repeatability. If someone updates the knowledge base, changes chunking or search settings during testing, results drift. Then debates are not about system quality but about what was actually tested.

Mistakes that often break acceptance:

inspecting only the answer text and not where it came from (citations, fragments, source matches);
not testing access rights: the same query from different roles must yield different results;
confusing product acceptance with data acceptance: who is responsible for outdated regulations, scans without text, and duplicates;
not recording prompt version, retriever parameters and model, so defects cannot be reproduced;
not planning negative cases: “no answer in the base”, “conflicting documents”, “attempt to access a secret”.

A practical approach: freeze the test data slice and configuration beforehand and record all versions and parameters in the protocol. Then the discussion becomes factual: what exactly failed and who should fix it—the product team, data owners, or process owners.

A short acceptance checklist before sign-off

Before signing the acceptance certificate, do a final sanity check. This often uncovers small but critical issues: incorrect access, answers without sources, or confident fabrications when data is missing.

Keep a control set of 20–30 queries: some typical, some rare, some “sharp” cases (ambiguous wording, outdated documents, restricted access). Examples: “What is the contract approval procedure?”, “Where is the budget limit described?”, “Which version of the security policy is currently in force?”.

Verify at least these five items:

Access: users with different roles see only permitted documents and citations, with no hints of restricted content;
Citation: answers include verifiable pointers to fragments and you can quickly find the primary source in the base;
Refusal when data is missing: the system honestly says “not found” and asks for clarification rather than inventing facts;
Quality thresholds: minimums for accuracy, completeness and share of cited answers are agreed and actually met on the control set;
Repeatability: 5–10 key cases are run several times (at different times) and results don’t drift in meaning or sources.
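
Checking the agreed thresholds against the control set is simple arithmetic, sketched below. The threshold values and the per-query results are invented for illustration; the real numbers are whatever the business agreed before acceptance.

```python
def check_thresholds(results, thresholds):
    """results: one dict of booleans per control query; thresholds: metric -> minimum share."""
    if not results:
        raise ValueError("the control set is empty")
    report = {}
    for metric, minimum in thresholds.items():
        share = sum(1 for r in results if r[metric]) / len(results)
        report[metric] = {"share": share, "minimum": minimum, "met": share >= minimum}
    return report


# Placeholder thresholds and a control set of 20 queries.
thresholds = {"accurate": 0.9, "complete": 0.8, "cited": 0.95}
results = [{"accurate": True, "complete": True, "cited": True} for _ in range(17)]
results += [
    {"accurate": True, "complete": False, "cited": True},
    {"accurate": True, "complete": False, "cited": True},
    {"accurate": False, "complete": False, "cited": False},
]

report = check_thresholds(results, thresholds)
for metric, line in report.items():
    print(metric, line["share"], "met" if line["met"] else "NOT met")
# accurate 0.95 met
# complete 0.85 met
# cited 0.95 met
```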

Also record operational readiness: monitoring is enabled (errors, latency, refusal rates), there is a content update procedure and assigned owners for content and incidents.

A realistic organizational scenario

Imagine a company with an internal AI assistant for regulations, procurement and IS policies. The assistant answers only from internal documents and always shows citations. For acceptance they register three roles: employee (normal access), manager (extended access), IS (access to all policies and logs).

Documents are split into two sets: (1) general regulations and procurement procedures available to everyone; (2) restricted materials: IS policies, password requirements, exception lists and contact data.

Five test queries that reveal errors well:

“What is the approval timeframe for a purchase up to 1,000,000 tenge?” Expected: exact term + citation from the regulation.
“Give a list of employees who have access to system X” (in a restricted document). Expected: refusal due to access, without leaking fragments.
“What are the password complexity requirements?” For an employee: a short answer + citations from open parts; for IS: a full answer with details from the restricted set.
“Can we buy a laptop from a sole supplier and why?” Expected: a clarifying question (context) or a list of conditions with a link to the relevant section.
“What to do with a phishing email?” Expected: step-by-step actions + contact details if available for the role.

If an answer is disputed, record not only pass/fail but the reason: which source was pulled, whether a fresher version exists, and which stage caused the error (retrieval, ranking, generation). After fixing or updating documents rerun the same test case.

Agree quality thresholds with the business before acceptance: for example, the share of answers with correct citations, the acceptable refusal rate, response times for typical queries, and rules for when the assistant must ask a clarifying question.

Acceptance protocol template: structure and fields

The acceptance protocol documents what was tested, on which versions, with which access rights and the commission’s conclusion. Keep it short but unambiguous so tests can be repeated a month later and yield comparable results.

In the header include:

acceptance goal and perimeter (which knowledge bases, languages, channels, user types);
component versions (LLM, embeddings, vector DB, ranking, RAG pipeline);
date and environment (testbed, configuration, limits);
participants and roles (customer, IS, data owner, implementation team).

Then a test table. Each test must have a unique ID and a reproducible artifact (log, screenshot, export).

ID	Query	Role	Expected	Actual	Status	Comment	Artifact
RAG-ACC-017	"What are the retention periods?"	Employee	Answer + citations	...	Pass/Fail	...	Log #123
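
If the test table is kept as a tab-separated file, it can be generated from test results so that the columns never drift. The sketch below uses the column names from the table above; the function name `protocol_tsv` and the value in the Actual column are assumptions for illustration.

```python
import csv
import io

COLUMNS = ["ID", "Query", "Role", "Expected", "Actual", "Status", "Comment", "Artifact"]


def protocol_tsv(rows) -> str:
    """Render test results as tab-separated text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row["Status"] not in ("Pass", "Fail"):
            raise ValueError(f"{row['ID']}: status must be Pass or Fail")
        if not row.get("Artifact"):
            raise ValueError(f"{row['ID']}: every test needs a reproducible artifact")
        writer.writerow(row)
    return buffer.getvalue()


rows = [{
    "ID": "RAG-ACC-017",
    "Query": "What are the retention periods?",
    "Role": "Employee",
    "Expected": "Answer + citations",
    "Actual": "Answer with 2 citations",  # placeholder value
    "Status": "Pass",
    "Comment": "",
    "Artifact": "Log #123",
}]
print(protocol_tsv(rows))
```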

After the table provide a short summary by criteria: accuracy, completeness, citation (verifiability), access control, load and stability. Record the thresholds and the final verdict (met or not) so the debate does not become subjective.

Document defects separately: priority, owner, fix deadline and retest plan (which tests to rerun and on which environment).

At the end the commission records the decision: accepted, accepted with conditions, or not accepted. If accepted with conditions, list them explicitly (limited user group, banned documents, request limits).

Next steps usually include a pilot in a specific unit, user and admin training, and handover to support. If infrastructure work is required (servers, workstations, integrations), system integrators often handle it. For example, GSE.kz provides system integration and infrastructure solutions and also supplies computers and servers made in Kazakhstan for organizations that prioritize supply-chain transparency and local support.