Why does the model confidently give a wrong answer about a regulation?

“Hallucinations” usually happen when the context contains the wrong clause, or a clause stripped of important conditions, and the model fills the gaps with familiar wording. First, separate a retrieval problem from a generation problem: check whether the correct fragment appears in the top-k results before generation.

Which chunking strategy works best for regulations and policies?

For regulations, start by chunking along semantic boundaries: keep the section heading together with the numbered point or subpoint. This preserves conditions, exceptions and references, and makes it easier to link an answer to a specific clause number.

How do you choose chunk size without guessing?

Pick a size that typically fits a rule together with its clarifications, not just a single phrase. In practice, 120–220 words often work for short instructions, and 250–450 words for long regulations. Then refine sizes using real-question tests.

When is overlap needed and how much?

Add overlap when a key idea often spans two adjacent subpoints, for example roles in one place and deadlines in the next. Start with roughly 10–20% overlap and check that search doesn’t return near-duplicate chunks; too much overlap adds noise and makes the model blend neighboring wording.

How can you tell if chunks are too small or too large?

Too-small chunks produce confident but incomplete answers lacking conditions and exceptions because context is cut off. Too-large chunks worsen precise retrieval: the model paraphrases a similar paragraph instead of the exact requirement. Both are fixed by chunking adjustments and retrieval tests.

What in embeddings most often breaks retrieval of the correct clause?

Check that documents and queries use the same embedding model (or a supported encoder pair), that vectors are normalized consistently, and that preprocessing is identical. Rebuild the index after changing model or dimensionality. Also ensure sentence tails aren’t cut by token limits and that the model handles Russian legal/administrative style if your texts are in Russian.

Which retrieval settings most affect answer accuracy?

Start with a small top-k, usually 3–5 fragments, to avoid dragging in similar-but-irrelevant clauses. Add a similarity threshold: if the best fragment is below the threshold, prefer to ask for clarification or return “no basis” rather than guessing. Use reranking when clauses are lexically similar but differ in conditions.

Why use metadata if embeddings exist?

Metadata prevents mixing versions and scopes when the same term appears in multiple policies. Keep and filter by version/status/date, department or procedure type; also store the clause number and heading in the chunk text so the clause doesn’t float disconnected from context.

How can you quickly test RAG quality on regulations without relying on “it feels better”?

First evaluate retrieval separately: for a set of real questions, check whether the correct clause is in the top-k and its rank. Then assess generated answers on three simple criteria: not contradicting the regulation, including important conditions, and avoiding an overly categorical tone when the textual basis is weak.

What should you consider before a pilot and for on-prem deployment?

Start with one or two key documents and a fixed test set of questions with reference clauses. Track retrieval metrics (hit in top-k) and a few error types to run regressions whenever chunking, embeddings or retrieval change. For on-prem, ensure automated reindexing and regression checks when documents update. For full projects, system integrators and infrastructure from GSE.kz (gse.kz) often help move from a notebook demo to a stable service.

Chunking and Embeddings for RAG: Fewer Hallucinations in Regulations

Where “hallucinations” come from in regulation answers

Regulations and instructions are structured so that one missed nuance can change the meaning of an entire clause. Models try to answer coherently and often fill gaps with familiar wording. That’s why text can sound confident even when it’s factually wrong.

Context is most often lost where the document’s logic matters more than sentence proximity: definitions, exceptions ("except when…"), notes, footnotes and references to other clauses. Numbering and nested subpoints are a separate issue. If you pull only the middle of a clause, the answer may have the right tone but the wrong conditions.

It helps to distinguish two error types:

Retrieval error: the system didn’t find the needed fragment or found a similar fragment about something else.
Generation error: the fragment was retrieved, but the model paraphrased it in a way that changed the meaning or added extra content.

For internal documentation, “accuracy” is not about elegant phrasing but verifiability. A good answer can be quickly matched to a specific clause: it is either quoted (showing reliance on exact wording) or explicitly tied to a section number and conditions (who, when, and what exceptions apply). Without such anchors, the risk of hallucination grows even if the text seems convincing.

Chunking in simple terms: why answers depend on it

Chunking is splitting a document into pieces (chunks) that participate in embedding search. The model does not read the whole regulation every time. First the system finds several relevant chunks, and then assembles an answer from them.

If chunks are too small, they lack context. For example, you may find the phrase “agree with the manager” but without conditions: for which cases, within what deadlines, and what exceptions. The answer sounds confident but is incomplete or incorrectly generalized.

If chunks are too large, retrieval returns a "wall of text" where the needed clause drowns among similar formulations. The model may paraphrase a nearby paragraph, mix up step numbers, or pick a general section instead of a precise requirement.

Overlap helps when an important idea falls on the boundary between two pieces: the end of one paragraph and the start of the next. But overlap doesn’t fix a poor structure. It’s useful selectively: in instructions with short steps, frequent "see below" references, and formulations that continue into the next point.

Good chunking directly improves citation: it’s easier to bring exactly the clause that states the requirement and repeat the wording without guessing. The practical goal is simple: the retrieved chunk should contain enough context to answer precisely and show where it’s written.

Chunking strategies: what to choose for instructions and policies

For regulations it’s usually best to start by chunking along headings and numbered subpoints. This preserves semantic boundaries: item 3.2.1 stays whole instead of being smeared across neighboring fragments. That matters most when you want retrieval to hit a specific clause rather than a general section.
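
Splitting by numbered subpoints can be sketched in a few lines of Python. This is a minimal illustration, not a production parser: it assumes clause numbers such as 3.2.1 appear at the start of a line, and the function name chunk_by_clause, the separator and the sample text are invented for the example.

```python
import re

# Assumption: a clause starts with a dotted number such as "3.2" or "3.2.1".
CLAUSE_RE = re.compile(r"^(\d+(?:\.\d+)+)\.?\s+(.*)$")


def chunk_by_clause(text: str, section_heading: str) -> list[dict]:
    """Split a section into one chunk per numbered clause.

    Lines without a clause number are treated as a continuation
    of the previous clause.
    """
    clauses: list[dict] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = CLAUSE_RE.match(line)
        if match:
            clauses.append({"clause": match.group(1), "lines": [match.group(2)]})
        elif clauses:
            clauses[-1]["lines"].append(line)
    # Keep the heading and the clause number inside the chunk text itself.
    return [
        {
            "clause": c["clause"],
            "text": f"{section_heading} | {c['clause']} " + " ".join(c["lines"]),
        }
        for c in clauses
    ]


# Hypothetical sample section.
sample = """3.2.1 Purchases up to the limit are approved by the department head.
Approval takes up to two working days.
3.2.2 Urgent repairs may be approved after the fact."""

chunks = chunk_by_clause(sample, "3.2 Purchase approval")
for chunk in chunks:
    print(chunk["text"])
```

Each chunk carries its clause number and section heading, so a retrieved fragment can be matched to the regulation directly.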

Fixed-length chunking (by characters or tokens) is useful when the document structure is broken: scanned text, poor markup, many empty headings. It gives predictable sizes but often loses context: different topics end up together, and one semantic block may be cut in half.

Paragraph-based chunking sounds logical, but in instructions it often breaks lists of steps. “Step 1” and “Step 2” alone quickly lose meaning. It’s usually better to glue a sequence of steps into one chunk until the subheading or procedure changes.

Complex blocks are often better handled separately. Convert tables to text so values don’t stand without column headers. In forms keep a field and its filling rule close together. Appendices are better indexed as separate sections with explicit labels like “Appendix A.”

Pay special attention to “Definitions,” “Exceptions,” and “Responsible parties.” It’s convenient to isolate them into standalone chunks and tag the section type. Then a query like “who approves” won’t drown in a process description.

Example: in an access policy the “Exceptions” section often refers to roles and deadlines from the main text. If you chunk by headings, keep the referenced paragraph together with the nearest subpoint it depends on. That reduces the chance the model will invent a rule instead of quoting.

Chunk size and overlap: how to choose without guessing

Chunk size and overlap determine what ends up in context. If chunks are too large, retrieval returns similar text but not the exact clause. If they are too small, conditions, exceptions and cross-references disappear. When tuning chunking and embeddings for RAG, this is often the first setting that yields a noticeable accuracy improvement.

Rule of thumb: start with semantic boundaries (headings, numbered points, subpoints), then pick a size according to document style. For instructions with short items, 1–2 items per chunk (roughly 120–220 words) often suffices. For regulations with long descriptions, 250–450 words captures both the rule and its clarifications.

Overlap is necessary when an answer must rely on two neighboring points (for example, “who approves” in one subpoint and “deadlines” in the next). In such cases start with 10–20% overlap: that usually catches the boundary without turning every chunk into near-duplicates.
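
For text without usable structure, a word window with overlap might look like the sketch below. The helper name split_with_overlap and the numbers (200 words, 15% overlap) are illustrative values taken from the ranges above, not recommended settings.

```python
def split_with_overlap(words: list[str], size: int, overlap_ratio: float) -> list[list[str]]:
    """Slide a window of `size` words; neighbors share about overlap_ratio of it."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = round(size * overlap_ratio)
    step = max(1, size - overlap)  # always move forward
    windows = []
    start = 0
    while start < len(words):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
        start += step
    return windows


# Synthetic 500-word text, only to show the window arithmetic.
words = [f"w{i}" for i in range(500)]
windows = split_with_overlap(words, size=200, overlap_ratio=0.15)
print(len(windows), [len(w) for w in windows])
```

With structured regulations, a window like this is better applied inside a section or clause than across the whole document, so that overlap never glues two unrelated procedures together.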

Simple signals that parameters are wrong:

Chunk too large: search results contain a lot of general text, and answers confuse conditions and exceptions.
Chunk too small: answers sound confident but are often incomplete and lack caveats.
Overlap too small: important phrases at boundaries are missed.
Overlap too large: retrieval returns duplicates and the variety of sources in context decreases.

Mini-test: take 10 typical regulation questions and see whether the correct clause appears in the top-3 results. If you often get “near but not right,” reduce chunk size. If you get “right but missing conditions,” increase chunk size or add overlap.

Embedding settings: what affects hitting the right clause

Embeddings compare the meaning of a query and a text fragment, not just word overlap. If the embedding model handles your language or document style poorly, retrieval starts returning pieces that are similar in wording but miss the needed clause. In RAG that quickly becomes hallucinations: a confident answer that relies on the wrong paragraph.

The first thing to check for regulations and instructions is language and domain. A general-purpose model may work well on news but miss text with numbering, abbreviations, terms and bureaucratic style. A typical symptom: a query about a specific duty repeatedly returns the general “Terms and Definitions” section.

There are also technical settings that are easy to break silently. For example, if you change the model or dimensionality but keep the old index, quality drops sharply.

What to check in embedding settings

Check the basics:

The same model is used for documents and queries (or a correct paired scheme if encoders differ).
Vector normalization is enabled and consistent everywhere (an inner-product index matches cosine similarity only when vectors are L2-normalized).
Dimensionality matches what the index stores (rebuild the index after switching models).
The model actually performs well on the language of your documents (Russian, for example) and on your phrasing.
Input length: important sentence tails aren’t cut by token limits.

A common trap is differing text preprocessing. If during indexing you remove dots, point numbers and "extra" spaces, but leave the query unchanged, embeddings become less comparable. Keep rules identical: same layout, same cleaning, same normalization.
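
One way to keep the two paths identical is to route both indexing and queries through the same helper functions. The sketch below assumes plain Python lists as vectors; the names preprocess, l2_normalize and dot are made up for the example, and a real embedding model would produce the vectors.

```python
import math
import re


def preprocess(text: str) -> str:
    """Single cleaning function used for both documents and queries."""
    # Only whitespace is collapsed; clause numbers and dots are kept on purpose.
    return re.sub(r"\s+", " ", text).strip()


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# After L2 normalization, the dot product equals cosine similarity.
doc_vec = l2_normalize([3.0, 4.0])
query_vec = l2_normalize([4.0, 3.0])
print(preprocess("  3.2.1   Approval \n deadline "))
print(dot(doc_vec, query_vec))
```

If the index is built on inner product and only one side is normalized, scores stop being comparable across chunks, which is an easy mistake to miss without a test like this.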

If you’re configuring chunking and embeddings for RAG, start simple: take 20 real questions, inspect the top-5 retrieved fragments for each, and mark whether the exact regulation clause is among them. This quickly shows whether the issue is embeddings or chunking and search.

Search settings: top-k, thresholds and metadata

Even with good chunking, RAG often breaks at retrieval: the context contains wrong fragments or too many fragments, and the model begins to "stitch" an answer. So check search settings as carefully as chunking and embeddings.

Top-k is how many fragments you put into context. More is not always better. If the regulation is long and clauses are similar (for example, “approval deadline” across procedures), extra fragments create noise and the model chooses the "most convincing" rather than the correct one. Starting with 3–5 is convenient; increase only when a question regularly requires several clauses together.

A similarity threshold helps avoid guessing. If the best fragment is below the threshold, it’s better to refuse to answer or ask for clarification (document version, department) than to produce a polished but incorrect statement. For regulations this is critical: a single number or condition can change meaning.
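
The refusal rule can be expressed as a small gate between search and generation. This is a sketch under assumptions: the Hit type, the select_context name and the 0.75 threshold are hypothetical, and a workable threshold has to be found on your own test questions.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    clause: str
    score: float  # similarity score, higher means closer


def select_context(hits: list[Hit], top_k: int = 3, min_score: float = 0.75) -> Optional[list[Hit]]:
    """Return the top_k hits, or None when even the best hit is too weak."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    if not ranked or ranked[0].score < min_score:
        return None  # caller should refuse or ask a clarifying question
    return ranked[:top_k]


# Made-up scores for illustration.
hits = [Hit("3.2.1", 0.82), Hit("4.1", 0.64), Hit("3.2.2", 0.79), Hit("5.3", 0.40)]
context = select_context(hits, top_k=2)
weak = select_context([Hit("4.1", 0.50)])
print(context)
print(weak)
```

Returning None instead of an empty list makes the “no basis found” branch explicit in the calling code.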

Reranking is a “second judge.” Fast embedding search finds candidates, then a reranker re-examines and reorders them by relevance. It’s useful when clauses are lexically similar but differ in conditions.

Metadata reduces errors more than people expect. Useful filters include:

document version and effective date
department or branch
procedure type (procurement, leave, info security)
document language
status (draft/approved)

Finally, hybrid search (keywords + embeddings) works well for terms, codes and clause numbers: “ISO 9001”, “p. 4.2.1”, “S200”, internal form codes. Embeddings catch meaning; keywords pull exact matches where semantic search misses.
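
The article does not prescribe how to merge keyword and embedding results. One common option is reciprocal rank fusion (RRF), sketched below. The function name and the sample rankings are illustrative, and k=60 is the constant usually quoted for RRF rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one list.

    Each list contributes 1 / (k + rank) for every id it contains.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score; ties are broken by id to keep the order stable.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results for a query that mentions clause 4.2.1 explicitly.
keyword_ranking = ["4.2.1", "7.1", "4.2"]
embedding_ranking = ["4.2", "4.2.1", "3.5"]
fused = reciprocal_rank_fusion([keyword_ranking, embedding_ranking])
print(fused)
```

Because RRF works on ranks rather than raw scores, it avoids having to calibrate keyword scores against cosine similarities.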

How to check quality: short tests instead of “it feels better”

When you change chunking or the embedding model, the impression that “it’s more accurate” can be misleading. Use short, repeatable checks that show whether the system actually retrieves the correct clause before generation.

Assemble a small set of real-life questions: support tickets, disputed cases, frequent “what if” queries. For each question mark which clause(s) count as the correct basis (1–3 fragments). This becomes your reference.

Run the same set across variants (different chunk size, overlap, model). Record results in a table rather than relying on impressions.

Evaluate retrieval separately from generation:

Was the correct clause in the top-k?
What position did it occupy?
How many “similar-but-wrong” fragments were in the results?
Are there cases where the clause wasn’t found at all?
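
These questions map onto a few numbers that are easy to compute. The sketch below assumes retrieval results are stored as ranked lists of clause ids per question; the function name evaluate_retrieval and the sample data are invented for the example.

```python
def evaluate_retrieval(results: dict[str, list[str]], reference: dict[str, set[str]], k: int = 3) -> dict:
    """Compute hit rate in top-k, mean reciprocal rank and the list of total misses."""
    if not reference:
        raise ValueError("reference set is empty")
    hits = 0
    reciprocal_ranks = []
    not_found = []
    for question, expected in reference.items():
        ranked = results.get(question, [])
        # Rank of the first correct clause, counting from 1.
        rank = next((i for i, clause in enumerate(ranked, start=1) if clause in expected), None)
        if rank is not None and rank <= k:
            hits += 1
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
        if rank is None:
            not_found.append(question)
    total = len(reference)
    return {
        "hit_rate_at_k": hits / total,
        "mrr": sum(reciprocal_ranks) / total,
        "not_found": not_found,
    }


# Made-up reference clauses and search results for four questions.
reference = {"q1": {"3.2.2"}, "q2": {"5.1"}, "q3": {"7.4"}, "q4": {"2.1"}}
results = {
    "q1": ["3.2.1", "3.2.2", "4.1"],
    "q2": ["5.1", "5.2"],
    "q3": ["1.1", "1.2", "1.3", "7.4"],
    "q4": ["9.9"],
}
report = evaluate_retrieval(results, reference, k=3)
print(report)
```

Running the same function for every chunking or model variant gives the comparison table directly.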

Separately, evaluate generated answers with simple metrics: accuracy (doesn’t contradict the regulation), completeness (important conditions included), and “overconfident tone” (categorical language when the supporting text is weak).

Example: a question about returns and exceptions. If the system consistently returns the general “Returns” section instead of the subpoint “Exceptions for opened packaging,” the issue is likely chunking/context, not model intelligence. Cases like this show most clearly whether changes to chunking and embeddings improve real accuracy.

Step-by-step: how to tune chunking and embeddings for RAG

Reliable tuning starts not with picking the "strongest model" but with a clear pipeline: consistent preprocessing, stable chunks, verifiable search and a short improvement cycle. Doing this step by step reduces hallucinations because the assistant more often sees the whole clause.

Prepare the text so structure is preserved. Keep headings, numbering, subpoints, section titles, versions and dates. If the document refers to “see clause 3.2,” leave that text intact, otherwise links break.
Choose a chunking strategy and metadata. For regulations it’s usually best to split by semantic blocks: section, subpoint, paragraph. Add metadata for document, section, clause number, version, date, and owner.
Compute embeddings and build the index. Check that chunks from different documents don’t become indistinguishable just because they share identical headings.
Configure context retrieval: how many chunks to take and when to "not answer." Start with a small top-k and a similarity threshold. Add reranking if close clauses are often confused.
Run tests and iterate. Use 20–30 representative questions (e.g., “who approves leave” or “response time to a request”) and check: is the correct clause found, are caveats preserved, are versions being mixed?

If retrieval accuracy improves after these changes, answers usually get better even without swapping models.

Difficult spots: PDF, tables, numbering and versions

Many RAG errors stem not from the model but from how a regulation was converted to text. This is especially clear with PDFs, tables and tightly numbered documents.

Scanned PDFs after OCR often lose structure: words run together, breaks disappear, columns swap. As a result embeddings see gibberish and fail to hit the right clause. At minimum run OCR that preserves paragraphs and manually check a few pages with tables and lists.

Numbering and subpoints (e.g., 2.3.1) matter as much as text. If you remove numbers during chunking or don’t attach a section heading, meaning is lost. A good practice is to keep the clause number and headings as part of the chunk text, not only in metadata.

Tables have their own problem: queries often mention a column heading, while the answer sits in a cell. If the headings aren’t in the same fragment as the cell, retrieval won’t find the row.

Tables: how to prepare them so they are found

Common approaches:

Turn each row into a short “passport”: headings + values + section context.
Duplicate column headers in every row (this increases text but improves hits).
Extract key fields into separate phrases if cells are very short.
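
The first two approaches can be combined in one small helper. This is a sketch: the function name table_rows_to_passports, the output format and the sample table values are assumptions made for illustration.

```python
def table_rows_to_passports(headers: list[str], rows: list[list[str]], context: str) -> list[str]:
    """Turn each table row into a standalone line: section context plus header: value pairs."""
    passports = []
    for row in rows:
        if len(row) != len(headers):
            raise ValueError("row length does not match the number of headers")
        pairs = "; ".join(f"{header}: {value}" for header, value in zip(headers, row))
        passports.append(f"{context}. {pairs}")
    return passports


# Hypothetical table from a travel policy.
headers = ["Destination", "Rate", "Approver"]
rows = [
    ["Domestic", "100", "Department head"],
    ["International", "250", "Finance director"],
]
passports = table_rows_to_passports(headers, rows, "Business travel, daily allowance")
for line in passports:
    print(line)
```

Each line can then be indexed as its own chunk, so a query that mentions a column name still lands on the right row.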

Versioning is critical: answers based on outdated editions look like hallucinations. Store date, version number and status (active/archived) in metadata and by default filter to active documents.
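
A default filter on active editions can run before any similarity ranking. The sketch below assumes each chunk carries a meta dictionary with status, effective_from and department fields; these field names and the sample chunks are hypothetical.

```python
from datetime import date
from typing import Optional


def filter_active(chunks: list[dict], on_date: date, department: Optional[str] = None) -> list[dict]:
    """Keep chunks from active editions that are already in force on `on_date`."""
    selected = []
    for chunk in chunks:
        meta = chunk["meta"]
        if meta["status"] != "active":
            continue
        if meta["effective_from"] > on_date:
            continue
        if department is not None and meta["department"] != department:
            continue
        selected.append(chunk)
    return selected


# Made-up chunks: two editions of the same clause plus an unrelated policy.
chunks = [
    {"text": "3.2.1 Approval within 5 days.",
     "meta": {"status": "archived", "effective_from": date(2022, 1, 1), "department": "procurement"}},
    {"text": "3.2.1 Approval within 3 days.",
     "meta": {"status": "active", "effective_from": date(2024, 3, 1), "department": "procurement"}},
    {"text": "2.1 Leave requests go to HR.",
     "meta": {"status": "active", "effective_from": date(2023, 6, 1), "department": "hr"}},
]

candidates = filter_active(chunks, on_date=date(2024, 6, 1), department="procurement")
print([c["text"] for c in candidates])
```

Most vector stores offer metadata filtering of their own; the point of the sketch is only that the archived edition never reaches the ranking step.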

Short commands in instructions ("Click", "Open", "Check") mix easily if a chunk captures steps from different procedures. Chunk by procedure boundaries and add the procedure name and applicability conditions to each step so a step doesn’t live without context.

Common mistakes and traps that hurt accuracy

The most frequent cause of misses in RAG is not a "bad model" but how the text is presented. Poor chunking causes embedding search to return lexically similar fragments that are not the clause the answer should rely on.

Typical traps:

Chunked by a fixed character count and cut the meaning: heading in one chunk, subpoints in another.
Overlap too large: the system gets nearly identical chunks and the model blends neighboring wording.
Merged documents into one chunk or lost metadata: the answer may come from a different policy.
No refusal rule: when nothing fits, the system still invents a confident answer.
Tested only on “pretty” questions: users write differently, with abbreviations and colloquial phrasing, and retrieval misses.

Simple example: an instruction has a “Business travel” section and a subpoint “Daily allowance.” If you cut off the heading, the chunk with allowance numbers looks like a random table and retrieval pulls similar numbers from “Representation expenses.”

The quick fixes mirror the traps: chunk by structure (heading plus subpoints), keep overlap moderate, store metadata and use it for filtering, add a “no source, no answer” rule, and test on real chat and ticket formulations.

Quick checklist before pilot

Before the pilot, agree on rules that make the system find the right clause and honestly say when the textual basis is missing. For instructions and regulations, an answer without a link to a specific clause almost always becomes a guess.

Check basics once, then change settings only through tests:

Chunking follows document structure: sections, points, subpoints.
Each chunk includes a minimum for verification: clause number, heading and a short context.
Text is cleaned identically during indexing and query processing.
Search is configured for precision: chosen top-k, a similarity threshold and a clear “no basis found” mode.
You have a test set (15–30 questions) that you run after every change.

Practical tip: take 5 typical user questions and see whether the clause you would quote manually appears in the top-3 results. If not, postpone the pilot until chunking or thresholds are fixed.

Practical example: one regulation, two chunking approaches

Simple scenario: an employee asks via chat how to approve a purchase of a certain amount and whether exceptions exist (urgent repairs or sole-source suppliers). The regulation has a general rule at the start and an exceptions subpoint later.

Before tuning, the system usually fell back on the general rule: it treated the general paragraph about limits and approval as the only relevant source. The exception, located elsewhere in the document, didn’t appear in the results, so the model began to invent details: who approves, deadlines, what documents to attach.

What changed in chunking:

split the document by subpoints while preserving numbering (e.g., 3.2.1, 3.2.2)
pulled definitions and exceptions into separate chunks, even if short
added a short header in each chunk with the section name so context stays clear

The check was pragmatic: before generation we verified whether the exception subpoint was in the top-3 embedding results for queries like “purchase above limit but urgent.”

User-visible result: answers now more often rely on specific clauses and their wording rather than on paraphrase “from memory.” And if an exception wasn’t found, that showed up at retrieval time instead of after a confident, wrong reply.

Next steps: turn settings into a working process

To keep chunking and embeddings from being “notebook tweaks,” start with a narrow real use case. Take 1–2 priority documents and one question scenario that is asked most often: “what to do in an incident,” “who approves,” “what are the deadlines.”

Build a small pilot and agree in advance what success looks like. Define unacceptable errors (e.g., mixed-up deadlines or safety requirements) and how you will catch them before launch.

A 2–3 week plan can be simple:

Choose one regulation and one question type (employee FAQs or support answers).
Collect a test set of 30–50 questions with reference clauses.
Fix the metrics up front: hit rate for the correct clause in top-k, share of “don’t know” answers, number of false citations.
Assign a quality owner: who approves changes to chunking, embeddings and retrieval.

Then design life after the pilot: where the index will live, how to update versions, what to do when a regulation changes. A good rule is that a document update should automatically trigger reindexing and a short regression run on the test set.
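
Detecting which documents need reindexing can be as simple as comparing content hashes. This is a minimal sketch that assumes documents are available as plain text keyed by an id; the function names and sample ids are made up, and the actual reindexing and regression run would be triggered by the caller.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def documents_to_reindex(current: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or whose text changed since indexing."""
    return sorted(
        doc_id
        for doc_id, text in current.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )


# Hypothetical state: policy_a was edited, policy_c is new.
indexed_hashes = {
    "policy_a": content_hash("Approval within 5 days."),
    "policy_b": content_hash("Leave requests go to HR."),
}
current = {
    "policy_a": "Approval within 3 days.",
    "policy_b": "Leave requests go to HR.",
    "policy_c": "Incidents are reported within 1 hour.",
}
changed = documents_to_reindex(current, indexed_hashes)
print(changed)
```

A non-empty result is the signal to rebuild the affected chunks and rerun the test set before the new index goes live.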

If you need a turnkey project, it’s often easier to involve a system integrator familiar with corporate needs: access, audit, data storage and operations.

For on-prem scenarios you can rely on infrastructure and support from GSE.kz (gse.kz): servers and workstations for RAG deployment, system integration services and 24/7 support so the pilot doesn’t depend on a single engineer and won’t break during updates.