Feb 11, 2025·8 min

Multilingual RAG quality testing: metrics and tests

Quality testing of a Kazakh–Russian multilingual RAG: metrics, test sets, terminology and checks for mixed-language queries.

What “quality” means in a bilingual RAG in practice

RAG is not just search and not just a chatbot. Search must find the relevant passages in documents, and generation must assemble a coherent answer from them. A chatbot without document grounding can sound confident and still be wrong. RAG quality is therefore always about the pair: the system found the right source and answered correctly from it.

In practice, people do not complain about a "bad model"; they complain about very specific failures:

  • didn’t find the needed document or section;
  • found something similar, but not on the topic (semantic error);
  • answered correctly but in the wrong language;
  • mixed up terms between Kazakh and Russian or "blended" them;
  • gave a confident answer without relying on found text.

Bilingualism breaks simple checks. The same question can be asked in Russian, in Kazakh, or in a mix of both. “Correctness” depends not only on facts but also on form: which language was expected, how job titles, abbreviations and names should be written, and whether original quotations must be preserved.

Before starting tests for a multilingual RAG, agree on rules. Otherwise metrics will disagree with people’s expectations:

  • which language is the default for answers and when it changes;
  • which sources are considered "allowed" (only internal docs or external knowledge bases too);
  • how a source reference should look: quotation, document title, date;
  • what to do when RU and KZ versions conflict (which is the "primary");
  • what a good answer looks like: concise, step-by-step, with warnings where needed.

These decisions turn “quality” from a feeling into verifiable criteria.

Types of queries that must be in tests

For a fair check, the question set should reflect how employees actually speak. Otherwise the system will show good metrics on "training" queries but fail in live dialogs.

The first essential class is cross-language queries. An employee writes in Russian but the needed regulation or order is in Kazakh, and vice versa. In such tests it’s important to check not only that the system produced an answer but that it found the correct fragment of the source.

The second class is mixed phrasing. A single utterance often combines RU and KZ with transliteration, typos and colloquial abbreviations. For example: “Маған керек акт сверки по контрагенту, где шаблон?” (“I need a reconciliation statement for a counterparty, where is the template?”) or “Сатып алу бойынша лимит кто утверждает?” (“Who approves the procurement limit?”). These queries quickly reveal weaknesses in tokenization and dictionaries.

It’s useful to fix role-based scenarios to cover different tasks and styles. A minimal set of roles:

  • accounting: acts, invoices, period closing timelines;
  • HR: vacations, business trips, hiring and dismissal;
  • procurement: thresholds, approvals, standard forms;
  • IT support: access, tickets, security rules;
  • managers: summary policies and process responsibilities.

A separate risk area is narrow terminology and abbreviations. These are internal codes, system names, department acronyms, equipment models. For example, in a company like GSE.kz staff may ask about “S200”, “M200”, “L200”, “rack server”, “24/7 support”, and documents may record these differently in RU and KZ.

For such cases it’s useful to have tests where the correct answer is impossible without an exact term match and where the term appears in several documents. This is where you most often see whether the system can distinguish similar concepts and not just "guess" the answer.

How to build a bilingual document test corpus

A test corpus for Kazakh–Russian RAG should reflect the real “paper life” of the organization, not an ideal set of a few PDFs. Start with documents employees search for most often: orders, regulations, instructions, FAQs, templates, and internal announcements. For system integrators and manufacturers like GSE this often includes procurement requirements, service process descriptions, warranty terms, and equipment manuals.

To prevent quality from drifting after each update, version documents. Store not only the file but its date, revision number, source (department or system), and a short description of changes. When a document is updated, add the new version instead of immediately deleting the old one: this makes it easier to catch regressions and explain why answers changed.

Prepare bilingual pairs and partial translations separately. In reality some materials exist only in Russian, some only in Kazakh, and some are mixed (headings in KZ, body in RU or vice versa). Try to keep that structure: it matters both for term search and for mixed queries.

A practical minimum corpus:

  • 20–30 key regulations and instructions (KZ, RU, mixed);
  • 30–50 FAQs and short references (they often reveal errors fastest);
  • 10–20 typical letters and phrasings (important for exact quotations);
  • 5–10 documents with partial translations (to test “holes”);
  • 5–10 outdated versions (for freshness tests).

Finally, designate a “golden set”: a small, stable subset that you run on every index, embedding, dictionary, or model change. It should cover the most frequent queries, critical policies and several complex mixed-language cases.

How to make a test question set and expected answers

Test questions should sound like how employees write in chat: short, with typos, and without “perfect” terminology. For bilingual RAG this is especially important because real queries often mix Russian and Kazakh while the facts live in documents in different languages.

Start with a small set that covers the most frequent work topics. Initially 30–50 questions is usually enough if they are truly "real". Expand the set weekly by adding new questions from ticket history and examples where the system failed or answered ambiguously.

To make expectations verifiable, define them not as a single “ideal answer” but as a set of conditions:

  • which sources must be found (specific documents or sections) and which are acceptable fallbacks;
  • which facts must be present (2–5 short items) and which phrasings are acceptable;
  • which answer language is expected (the question language or the language of the last user turn);
  • important constraints (for example, “do not invent dates”, “if there is no data, state that nothing was found in the documents”);
  • 2–3 acceptable answer variants if multiple are reasonable.

Simple example for a bilingual base: the user asks “Серверлер S200 бойынша кепілдік қалай беріледі? 24/7 қолдау бар ма?” (“How is the warranty provided for S200 servers? Is there 24/7 support?”). Expectations can be described as follows: the system should cite materials about the S200 line and support, and the answer must mention 24/7 technical support and the existence of a service network, with no invented warranty periods if they are not in the documents.

This format makes tests fair: you check not for a “nice text” but that the assistant found the right passages and conveyed the meaning without distortion.
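
The conditions listed above can be stored as a small structured record instead of free text. The sketch below is one possible shape, not a fixed schema: the `RagTestCase` fields, the helper names and the document IDs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RagTestCase:
    """One test question with verifiable expectations (hypothetical schema)."""
    question: str
    expected_language: str                  # "kk" or "ru"
    required_sources: list[str]             # document IDs that must be retrieved
    fallback_sources: list[str] = field(default_factory=list)
    required_facts: list[str] = field(default_factory=list)  # short phrases


def check_sources(case: RagTestCase, retrieved_ids: list[str]) -> str:
    """Return 'pass', 'fallback' or 'fail' for the retrieval expectation."""
    retrieved = set(retrieved_ids)
    if set(case.required_sources) <= retrieved:
        return "pass"
    if retrieved & set(case.fallback_sources):
        return "fallback"
    return "fail"


def missing_facts(case: RagTestCase, answer: str) -> list[str]:
    """List required facts that do not occur in the answer (case-insensitive)."""
    lowered = answer.lower()
    return [fact for fact in case.required_facts if fact.lower() not in lowered]


# The S200 example from the text; the document IDs are invented placeholders.
s200_case = RagTestCase(
    question="Серверлер S200 бойынша кепілдік қалай беріледі? 24/7 қолдау бар ма?",
    expected_language="kk",
    required_sources=["s200-line-overview", "support-policy"],
    required_facts=["24/7"],
)
```

Substring matching on facts is deliberately crude. It suits short tokens such as “24/7” or a form number, while longer facts usually still need a human reviewer or a separate judging step.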

Retrieval metrics: check that the system finds what’s needed

In RAG, retrieval fails more often than generation: the model can answer confidently but rely on the wrong fragments. So evaluation usually starts with retrieval: did the system find the correct documents and rank them near the top?

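The most useful metric here is Recall@K. It answers: “Of the documents that are actually relevant to the question, what share appeared in the top K results?” When each question has a single correct document, this reduces to whether that document made it into the top K at all. For a bilingual corpus this is critical: a Russian query may require a Kazakh regulation and vice versa.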

Precision@K complements the picture: what share of the top K results is actually useful. Low precision means noisy results: the system mixes in documents that are lexically similar but wrong (often because of shared terms, dates or order numbers). In practice an employee asks about warranty periods for a workstation and the results include a price list, a presentation and an outdated spec.

When you need to know how high the correct sources are ranked, use ranking metrics:

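  • MRR (mean reciprocal rank): the average over all questions of 1/rank of the first correct document, so 1.0 means the correct document is always ranked first;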
  • nDCG: accounts for multiple relevant sources with different importance;
  • source coverage: share of questions where at least one correct document appears in top-K.

MRR and nDCG are especially useful if you have several document types (instructions, contracts, specs) and you need the normative document to have priority.
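
The retrieval metrics above are short enough to implement directly. The following is a minimal sketch using the standard textbook definitions; it assumes each result list contains unique document IDs, and the IDs and gain values in the example are invented.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k slots filled with relevant documents."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR: mean of the reciprocal ranks over all test questions."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """nDCG@k with graded relevance; `gains` maps doc ID to importance."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Invented example: a Russian query whose best source is the Kazakh regulation.
retrieved = ["price-list", "warranty-kk", "old-spec"]
relevant = {"warranty-kk", "warranty-ru"}
print(recall_at_k(retrieved, relevant, 3))     # 0.5
print(reciprocal_rank(retrieved, relevant))    # 0.5
```

Giving the normative document a higher gain than secondary sources in `gains` is one way to express the priority described above.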

Don’t forget speed and stability. Look not only at the mean latency but at the median and tails (e.g., 95th percentile): users notice rare slow responses. If tails grow after an index update, perceived quality drops even if Recall@K stays the same.
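
To see why the mean alone is misleading, compare it with the median and the 95th percentile on the same samples. This sketch uses the nearest-rank percentile definition, which is one of several common conventions; the latency values are invented.

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Mean, median (p50) and tail (p95) latency in milliseconds."""
    return {
        "mean": statistics.mean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
    }


# Nine fast responses and one slow outlier (invented numbers).
samples = [120, 130, 125, 140, 135, 128, 132, 138, 126, 2400]
print(latency_summary(samples))
# {'mean': 357.4, 'p50': 130, 'p95': 2400}
```

Here the median stays at 130 ms while one slow response pulls the mean to 357.4 ms and defines the tail. Users remember that one response.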

Generation metrics: check answer quality

Even if retrieval works, bilingual RAG often fails during answer generation: the model may hallucinate facts, mix languages or miss parts of the question. It’s convenient to rate answers along simple scales and then combine them into an overall score.

Facts must come from sources

The main metric here is factual correctness relative to the retrieved passages. Verify that each claim in the answer is supported by documents the system returned. If the model adds details not present in sources (dates, numbers, conditions), that’s an error even if it sounds plausible. Track the share of answers without hallucinations.
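
Full claim verification needs human review or a separate judging step, but invented numbers can be caught with a cheap automatic pre-check. The sketch below is a heuristic under stated assumptions: it only compares numeric tokens, and the passage and answers are made-up samples.

```python
import re

# Digits, optionally joined by . , / or : (covers "24/7", "9001", "12.05").
NUMBER = re.compile(r"\d+(?:[.,/:]\d+)*")


def unsupported_numbers(answer: str, passages: list[str]) -> list[str]:
    """Numbers in the answer that occur in none of the retrieved passages."""
    source_numbers = set()
    for passage in passages:
        source_numbers.update(NUMBER.findall(passage))
    return [n for n in NUMBER.findall(answer) if n not in source_numbers]


def grounded_share(results: list[tuple[str, list[str]]]) -> float:
    """Share of (answer, passages) pairs with no unsupported numbers."""
    if not results:
        return 0.0
    clean = sum(1 for answer, passages in results
                if not unsupported_numbers(answer, passages))
    return clean / len(results)


# Made-up sample data: the passage mentions no warranty period at all.
passages = ["S200: техническая поддержка 24/7, сервисная сеть."]
good = "Для S200 доступна поддержка 24/7."
bad = "Гарантия на S200 составляет 36 месяцев, поддержка 24/7."
print(unsupported_numbers(bad, passages))   # ['36']
```

A clean result here does not prove that an answer is grounded. It only means no number was invented, so treat it as a filter that routes suspicious answers to a reviewer.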

Completeness, language and verifiability

To make checks fast and consistent across reviewers, use a simple rubric (0–2 points per item):

  • completeness: the answer addressed all parts of the question, including exceptions and conditions;
  • source linking: there are quotes or precise references to specific text fragments to verify against;
  • answer language: RU or KZ exactly as expected, without unnecessary mixing;
  • caution: if documents lack data, the model says “no information found in the documents” and asks for clarification.

Example: a user asks in Russian, but the term appears in Kazakh in the document. A good answer will provide a correct Russian explanation while preserving the official term as in the source and show where it was found. A bad answer will translate the term loosely and add nonexistent requirements.
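
The rubric can be turned into a single comparable number per answer. This is a minimal sketch: the item keys mirror the four rubric lines above, and equal weighting of the items is an assumption that you may want to change.

```python
RUBRIC_ITEMS = ("completeness", "source_linking", "answer_language", "caution")
MAX_POINTS = 2


def rubric_score(scores: dict[str, int]) -> float:
    """Normalize 0-2 points per rubric item to an overall score in 0.0-1.0."""
    missing = [item for item in RUBRIC_ITEMS if item not in scores]
    if missing:
        raise ValueError(f"missing rubric items: {missing}")
    for item in RUBRIC_ITEMS:
        if not 0 <= scores[item] <= MAX_POINTS:
            raise ValueError(f"{item} must be between 0 and {MAX_POINTS}")
    total = sum(scores[item] for item in RUBRIC_ITEMS)
    return total / (MAX_POINTS * len(RUBRIC_ITEMS))


# A complete answer in the right language with only a vague source reference.
review = {"completeness": 2, "source_linking": 1, "answer_language": 2, "caution": 2}
print(rubric_score(review))   # 0.875
```

Keep the per-item scores next to the total. An overall 0.875 says nothing about whether the lost point was language or citation.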

Terminology and dictionaries: separate tests for KZ and RU

In bilingual RAG, terminology often matters more than “polished” text. Otherwise good retrieval breaks when the same object appears in Russian, in Kazakh, in abbreviations and in variant spellings. Allocate a dedicated set of dictionary checks.

First, collect KZ and RU terms from real documents: job titles, department names, document types, product names and standards. For each term record 2–3 variants: translation, synonym, common spellings (Latin/Cyrillic, joined/hyphenated, with/without dots). For example, you may see “S200”, “С200” and “server S200”, and for standards “ISO 9001” and “ИСО 9001”.

Create a small "lexical" test set where correct answers depend on exact term understanding. Usually 30–80 critical terms are enough.
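
One simple way to record term variants is an alias table that maps every known spelling to a canonical form. The sketch below assumes such a table is maintained by hand; the entries, including the hyphenated “S-200”, are illustrative, and `canonical_term` is a hypothetical helper name.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical alias table: canonical term -> spellings seen in documents.
# "\u0421200" is "С200" written with the Cyrillic letter Es.
TERM_VARIANTS = {
    "S200": ["S200", "\u0421200", "server S200", "S-200"],
    "ISO 9001": ["ISO 9001", "ИСО 9001", "ISO9001"],
}


def _key(text: str) -> str:
    """Normalize Unicode, ignore case, drop spaces, hyphens and dots."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"[\s\-.]+", "", text)


ALIAS_INDEX = {
    _key(variant): canonical
    for canonical, variants in TERM_VARIANTS.items()
    for variant in variants
}


def canonical_term(raw: str) -> Optional[str]:
    """Return the canonical term for a spelling, or None if it is unknown."""
    return ALIAS_INDEX.get(_key(raw))


print(canonical_term("iso-9001"))    # ISO 9001
print(canonical_term("ИСО 9001"))    # ISO 9001
print(canonical_term("M200"))        # None
```

An explicit table is preferable here to blanket character mapping: Cyrillic “С” in “С200” stands for Latin “S” by sound, although it looks like Latin “C”, so a lookalike-based mapping would normalize it to the wrong letter.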

What to test exactly

Verify that the system searches and answers equally well across forms of the same concept:

  • synonyms and spelling variants (KZ/RU, Cyrillic/Latin);
  • abbreviations and internal short forms (full form and 1–2 short variants);
  • close-in-meaning words that must not be confused (position vs department, order vs directive);
  • "noisy" matches where the term appears in a different context.

Word forms without excessive demands

Check inflections and word forms separately. Don’t require perfect literary grammar, but require preserved meaning. Practical rule: minor case errors are acceptable if (1) the term is recognized correctly, (2) the answer does not change roles or responsibilities, (3) the source link or quote preserves meaning.

Simple test scenario: the user asks the mixed-language “Кім жауапты за 24/7 қолдауды?” or the Russian “Кто отвечает за 24/7 поддержку?” (both mean “Who is responsible for 24/7 support?”). In both cases the system should retrieve the same policy fragment and not substitute “support” with “implementation” because the words look similar.

Mixed queries and language switching in dialog

Mixed queries look like: “Find the occupational safety order and give a summary in Kazakh” or “Сколько гарантия на сервер S200? жауапты бөлім кім?” (“How long is the warranty on the S200 server? Which department is responsible?”). Here you must define the expected behavior in advance: which answer language takes precedence, whether mixing languages in one reply is allowed, and what counts as an error.

Practical rule: the answer language is set by the user’s last explicit request (“in Kazakh”, “in Russian”). If there is no explicit request, answer in the language the dialog started with. For mixed queries this is critical, otherwise the model will “float” and choose inconsistently.
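
The rule above can be written down as a small function so that tests compute the expected language the same way every time. This is a rough sketch: the trigger phrases are an assumed, incomplete list, and the Kazakh detector only looks for letters that exist in the Kazakh Cyrillic alphabet but not in the Russian one.

```python
from typing import Optional

# Letters used in Kazakh Cyrillic but not in Russian.
KAZAKH_ONLY_LETTERS = set("әғқңөұүһі")

# Hypothetical trigger phrases for an explicit language request.
EXPLICIT_REQUESTS = {
    "kk": ("in kazakh", "на казахском", "қазақша", "қазақ тілінде"),
    "ru": ("in russian", "на русском", "орысша", "орыс тілінде"),
}


def detect_language(text: str) -> str:
    """Crude heuristic: any Kazakh-specific letter implies Kazakh."""
    return "kk" if KAZAKH_ONLY_LETTERS & set(text.lower()) else "ru"


def explicit_request(text: str) -> Optional[str]:
    """Return the language the user explicitly asked for, if any."""
    lowered = text.lower()
    for lang, phrases in EXPLICIT_REQUESTS.items():
        if any(phrase in lowered for phrase in phrases):
            return lang
    return None


def expected_answer_language(user_turns: list[str]) -> str:
    """Last explicit request wins; otherwise the language of the first turn."""
    if not user_turns:
        raise ValueError("dialog has no user turns")
    for turn in reversed(user_turns):
        requested = explicit_request(turn)
        if requested:
            return requested
    return detect_language(user_turns[0])


dialog = ["Где форма заявки на закупку?",
          "Қысқаша түсіндіріңізші",
          "Ответь коротко на русском"]
print(expected_answer_language(dialog))   # ru
```

The detector labels any turn containing a Kazakh-specific letter as Kazakh, so a mostly Russian mixed query counts as Kazakh. For such test cases it is safer to annotate the expected language by hand.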

Test language switching across 2–3 turns. For example: the employee first asks in Russian about a procurement form, then clarifies in Kazakh, then asks for a short answer again in Russian. In tests record not only the final reply but whether the system preserved context.

Partial translation is acceptable when a term is an official name, model code or legal phrase. For example, leave “S200 Series”, “ISO 9001” or a department name as is, but provide a short plain-language explanation nearby.

Short checklist for a test case:

  • the answer language follows the rule (by request or default);
  • terms aren’t translated beyond recognition or distorted;
  • quotations from documents are not paraphrased when precision matters;
  • after a language change the model keeps the meaning of prior turns;
  • the answer does not include unnecessary code-switching.

Set strict boundaries for formal materials (internal orders and regulated replies): tests should expect answers in a single language with minimal rephrasing.

Step-by-step testing process for multilingual RAG

To avoid subjective debates, start by fixing scenarios and acceptance rules. Example: an employee asks in Russian, but the required answer is in a Kazakh order. The system must cite the correct fragment and not invent details.

Then freeze the data. Save document versions used to build the index and note date, source and language. Otherwise you won’t know whether improvements came from the model or changes in PDFs.

Practical process:

  • define 8–12 scenarios and pass/fail criteria (accuracy, completeness, answer language, mandatory source citation);
  • collect the corpus and fix its version: files, OCR, chunking, index settings;
  • prepare 50–150 questions, including mixed (KZ+RU) and internal terminology; annotate expected documents and fragments;
  • run tests and record retrieval and generation metrics separately;
  • analyze failures by category and apply targeted fixes: chunking, language filters, term dictionary, prompt, ranking.

After fixes, rerun the same tests. If metrics improved but users feel worse, acceptance criteria were likely chosen incorrectly.

A small example for a company like GSE.kz: the question “гарантия на S200 сервер қанша?” (“How long is the warranty on the S200 server?”) should lead to the correct regulation and produce an answer in the user’s language, preserving exact terms from the source. This case checks language switching, terminology and citation honesty at once.

Common mistakes and traps when testing bilingual RAG

The most common trap is judging the system by the “beauty” of the answer and overlooking retrieval failure. The model may write confidently but reference irrelevant fragments. In a bilingual corpus this is masked by similar wordings in Russian and Kazakh.

Another mistake is a test set that is too small. If you evaluate on 10–20 questions, you almost inevitably tune prompts, thresholds and dictionaries to them. Later, real employee queries bring up topics missing from the tests, and metrics drop.

A separate problem class is document updates. A regulation, spec or instruction changes and RAG keeps answering “as before”: due to cache, stale indexes, wrong deduplication or because an older document still matches keywords better.

To avoid mixing errors:

  • separate checks: first retrieval (did it find the fragment), then generation (how it phrased the reply);
  • keep two sets: a small fast set and a large regression set that you don’t change weekly;
  • add freshness tests: run the same question before and after a doc update;
  • test language and terminology separately: KZ-terms, RU-terms and mixed queries.

Also agree in advance how to behave when documents contain no answer. A good result is an honest “no data in the corpus” with a prompt to clarify, not a guess or a made-up reference.

Quick checklist before rollout and after each update

Before a release or after data updates, check specific risks rather than “it works in general.” Make sure quality holds for both Russian and Kazakh, not just one of them.

First, make sure tests cover real employee language: RU, KZ and cross-language queries (a question in Russian whose answer must cite a Kazakh document, and vice versa). Add mixed formulations, transliteration and common typos. These often break retrieval before generation fails.

Minimum checks:

  • separate metrics for retrieval and generation, not a single combined score;
  • the test set includes RU, KZ and cross-language questions, plus in-dialog language switching;
  • tests for mixed queries, translit, typos and short "telegraphic" phrasing;
  • a terminology dictionary: abbreviations, positions, form numbers, system names, with tests;
  • fixed document and index versions: you know exactly what you’re testing and have a post-update check.

Then set quality thresholds and a stop-release rule. For example: if retrieval recall on key topics falls below an agreed level, or the share of answers without verifiable sources increases, do not release the update.
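
A stop-release rule is easier to enforce when it is executable. The sketch below assumes two hypothetical metric names, `recall_at_k` and `unsourced_share`, and an illustrative threshold of 0.85; the real level is whatever your team agrees on.

```python
def release_blockers(metrics: dict[str, float],
                     baseline: dict[str, float],
                     min_recall: float = 0.85,
                     max_unsourced_increase: float = 0.0) -> list[str]:
    """Return reasons to stop the release; an empty list means it may proceed."""
    blockers = []
    if metrics["recall_at_k"] < min_recall:
        blockers.append(
            f"recall_at_k {metrics['recall_at_k']:.2f} is below {min_recall:.2f}")
    increase = metrics["unsourced_share"] - baseline["unsourced_share"]
    if increase > max_unsourced_increase:
        blockers.append(
            f"share of answers without sources grew by {increase:.2f}")
    return blockers


# Invented numbers: recall dropped after an index update.
baseline = {"recall_at_k": 0.90, "unsourced_share": 0.05}
candidate = {"recall_at_k": 0.80, "unsourced_share": 0.05}
print(release_blockers(candidate, baseline))
# ['recall_at_k 0.80 is below 0.85']
```

Returning the reasons rather than a bare boolean makes the decision easy to log and to explain to whoever approves the release.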

A practical tip: keep a small smoke set of 20–30 questions (balanced RU/KZ with mixed cases) and run it for every change to model, index, dictionary or document set. This quickly shows whether search or wording broke.

Example realistic scenario: an internal assistant for employees

Imagine an internal assistant for a manufacturing and IT organization in Kazakhstan, where documents and communication are in Russian and Kazakh. It’s used for quick answers on regulations, procurement and IT tickets. These cases most often expose hidden problems in multilingual RAG testing.

Typical questions look mundane but contain many traps: variant phrasings, abbreviations, language mixing and local terminology. For example:

  • “What’s the approval timeline for a procurement request under 1 million tenge?”
  • “Маған ноутбук беру тәртібі қандай? Қандай құжат керек?” (“What is the procedure for issuing me a laptop? Which document is needed?”)
  • “Where can I check the Service Desk ticket status and who owns the SLA?”
  • “Is information security approval required to install software if it’s a driver update?”
  • “Қоймаға қабылдау актісін кім бекітеді?” (“Who approves the warehouse acceptance certificate?”)

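There are also mixed queries: “Жөндеу үшін RMA қалай ашамын и какие поля обязательны?” (“How do I open an RMA for a repair, and which fields are required?”) and “Для S200 серия серверов гарантия қанша жыл?” (“How many years is the warranty for S200 series servers?”).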

What failure looks like: retrieval returns a “similar” document (general regulation instead of the current instruction), the model confidently answers, adds nonexistent steps and doesn’t show precise citations. The user follows it and later gets rejected during approval.

What usually helps: rework chunking so procedure steps are not split, add a terminology dictionary and abbreviations (SLA, RMA, IS, KZ/RU variants), and strengthen citation checks (flag answers without supporting fragments as risky).

After fixes, retrieval metrics often rise (e.g., the share of correct documents in top results) and citation accuracy improves. The hard cases that remain are mixed queries with colloquial forms and rare Kazakh–Russian term pairs. They are usually resolved with dedicated test cases and dictionary updates.

Next steps: make testing a regular process

Regularity matters more than one-off checks. Multilingual RAG quality usually drifts after one of two events: a knowledge base update (new documents, chunking, dictionaries) or a new version of the model, ranker or prompt.

Working rule: run the same test suite (1) before release, (2) right after content updates, (3) on schedule, e.g., weekly, to catch slow degradations.

To make the process stick, assign roles in advance:

  • subject owner: proposes typical employee questions in Kazakh and Russian;
  • annotator: records expected answers and acceptable phrasings, marks terminology;
  • engineer or analyst: runs tests and inspects retrieval and generation metrics;
  • quality approver: decides whether to release changes if metrics dropped.

Store metric history as a “quality passport” by version: date, what changed, metric values, failure examples. This shows exactly what broke: retrieval failing on KZ, answers becoming too generic in RU, or confusion on mixed queries.
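
To see which language broke, split at least one retrieval metric by query language before storing it in the history. This is a minimal sketch; the record format with `lang` and `hit` keys is an assumption about how per-question results are logged.

```python
from collections import defaultdict


def hit_rate_by_language(results: list[dict]) -> dict[str, float]:
    """Share of questions per query language with a correct document in top-K."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for row in results:
        totals[row["lang"]] += 1
        hits[row["lang"]] += int(row["hit"])
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Invented per-question results: "hit" means a correct document was in top-K.
results = [
    {"lang": "ru", "hit": True},
    {"lang": "ru", "hit": True},
    {"lang": "kk", "hit": True},
    {"lang": "kk", "hit": False},
    {"lang": "mixed", "hit": False},
]
print(hit_rate_by_language(results))
# {'ru': 1.0, 'kk': 0.5, 'mixed': 0.0}
```

An overall hit rate of 0.6 on this sample would hide the fact that Russian queries are fine while Kazakh and mixed queries are not.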

Expand the test set when a new error type repeats or reality changes: new abbreviations, internal names or document templates. Add 5–10 focused cases per new category and mark them as "do not break."

If you plan to deploy RAG infrastructure, decide where data will live, how to scale compute and how to organize support. In such projects GSE.kz can help as a system integrator: select servers and a platform for load, design vendor-agnostic architecture and provide 24/7 support.
