Why measure accuracy instead of relying on impressions

When you set up data extraction from contracts and emails, there is almost always a feeling of "it seems to work." A few successful examples are reassuring, but that's a trap. Documents vary too much, and errors often appear where you didn't check.

Usually it doesn't fail "in general" but in details: the system confuses similar fields, skips parts of text, or merges values from different lines. Emails are complicated by signatures, reply chains and quoted text. Contracts bring tables, line breaks and clauses with exceptions.

The most costly mistakes are almost always tied to critical fields: amounts and currencies, dates and terms (especially phrases like 'not later than' or 'within N days'), parties and identifiers (who is the contractor, who is the client, where is the BIN), subject and delivery terms, and attachments and specifications where the key information is often "hidden."

There are also tricky situations where ambiguity is built into the document itself. Conditions may be moved to an appendix, a footnote, or a stamped scan where some text is obscured. In an email, a date may appear in a quoted earlier message while the date that currently applies is stated higher up in a short sentence.

You need quality assessment not for pretty percentages but to understand risk. If the system finds amounts correctly in 49 out of 50 contracts, that sounds great—until you discover the one mistake was in the largest contract and caused a wrong approval.

Metrics give you a foothold: what exactly fails, how often, and how much manual checking you need at the start to avoid an expensive mistake.

First agree what counts as a "correct" result

Any evaluation starts not with formulas but with agreements. If the team has different ideas of what "correct" means, even a good system will look "bad" and the numbers will be disputed.

Fix which fields you extract and set priorities. Otherwise it's easy to waste time on rare fields and notice critical errors late.

A practical approach is to split fields into three groups:

Key (affect payment, deadlines, legal risk): amount, currency, date or term, contract number, parties (legal name, BIN).
Important (for search and reporting): contract subject, address, contact person.
Secondary (nice to have but not blocking): notes, internal codes, additional details.

Next, define the "correct value" for each field. For example, for amount it's the number together with currency; for a party, the official legal name, not an abbreviated form from an email. Often the contract has one wording, a cover letter another, and the signature a third. Choose one source as authoritative or allow several options as acceptable.

Separately, set a unified format. For dates decide whether to normalize to DD.MM.YYYY. For amounts decide whether to keep cents, how to write tenge (KZT or тг), and how to handle thousands separators. For BIN/IIN and contract numbers agree whether to preserve dashes and spaces and how to treat the symbol "№".
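As an illustration of such format rules, here is a minimal Python sketch. The function names, the list of accepted date layouts, the currency alias table and the decision to drop dashes and "№" from identifiers are all assumptions made for the example; your own agreement may differ on every point.

```python
import re
from datetime import datetime
from decimal import Decimal

# Assumed input layouts; extend the tuple to match your own documents.
DATE_FORMATS = ("%d.%m.%Y", "%Y-%m-%d", "%d/%m/%Y")
# Hypothetical alias table: every spelling maps to one currency code.
CURRENCY_ALIASES = {"kzt": "KZT", "тг": "KZT", "тенге": "KZT", "usd": "USD", "$": "USD"}


def normalize_date(raw: str) -> str:
    """Return the date as DD.MM.YYYY or raise ValueError."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%d.%m.%Y")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def normalize_amount(raw: str) -> tuple[Decimal, str]:
    """Split a string like '1 250 000,50 тг' into (amount, currency code)."""
    currency_raw = "".join(ch for ch in raw if ch.isalpha() or ch == "$").lower()
    digits = "".join(ch for ch in raw if ch in "0123456789,.").strip(",.")
    if currency_raw not in CURRENCY_ALIASES or not digits:
        raise ValueError(f"unrecognized amount: {raw!r}")
    # Assumption: a final separator followed by 1-2 digits marks the cents;
    # every other comma or dot is a thousands separator.
    match = re.fullmatch(r"([\d.,]*\d)[.,](\d{1,2})", digits)
    if match:
        whole, cents = re.sub(r"[.,]", "", match.group(1)), match.group(2)
    else:
        whole, cents = re.sub(r"[.,]", "", digits), "0"
    amount = Decimal(f"{whole}.{cents}").quantize(Decimal("0.01"))
    return amount, CURRENCY_ALIASES[currency_raw]


def normalize_identifier(raw: str) -> str:
    """Drop spaces, dashes and the '№' sign (one possible agreement)."""
    return re.sub(r"[\s\-№]", "", raw).upper()
```

Applying the same functions to both the gold standard and the system output means that a difference in formatting alone is not counted as an extraction error.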

Describe empty values in advance. "Empty" and "not in document" are different. An empty result may mean the system missed a field that is there, while "not in document" means there was nothing to find and the system correctly returned nothing.

Finally, decide how to treat partial matches. Is "Ivanov Ivan" instead of "Ivanov Ivan Ivanovich" an error or acceptable? Is a contract number without a suffix or year counted? These rules make verification fair and repeatable.
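The rules for empty values and partial matches can be written down as a small comparison function, so every reviewer applies them the same way. The sketch below is only one possible convention: the `NOT_IN_DOCUMENT` marker, the outcome labels and the `allow_prefix` switch are hypothetical names introduced for this example.

```python
# Hypothetical marker: the annotator confirmed the field is absent from the document.
NOT_IN_DOCUMENT = "NOT_IN_DOCUMENT"


def judge(expected, extracted, allow_prefix=False):
    """Classify one field of one document.

    expected  -- gold-standard value, or NOT_IN_DOCUMENT
    extracted -- system output, or None when nothing was returned
    Returns 'correct', 'correct_empty', 'wrong', 'missed' or 'spurious'.
    """
    if expected == NOT_IN_DOCUMENT:
        # Nothing to find: returning nothing is the right answer.
        return "correct_empty" if extracted is None else "spurious"
    if extracted is None:
        return "missed"
    # Compare case-insensitively and ignore repeated whitespace.
    want = " ".join(expected.lower().split())
    got = " ".join(extracted.lower().split())
    if got == want:
        return "correct"
    # Optional leniency: accept a value that is a leading part of the
    # expected one, e.g. a name without the patronymic.
    if allow_prefix and got and want.startswith(got + " "):
        return "correct"
    return "wrong"
```

With `allow_prefix=True` the value "Ivanov Ivan" is accepted for "Ivanov Ivan Ivanovich"; with the default it is counted as wrong. Whichever option the team picks, it should be set per field and kept fixed between runs.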

Precision and recall without the math — in plain terms

It's easy to get confused when evaluating extraction quality: the system "almost always guesses right" but some important facts are missing. That's why two metrics are usually considered.

Precision answers: of everything the system extracted, what share is correct. Recall answers: of everything that should have been extracted, what share was actually found. F1 is the harmonic mean of the two, a single score that is useful when you want a balance between precision and recall.

Imagine the field "contract number." You have 100 documents, and 80 of them actually contain a number. The system returned numbers in 50 documents. Of these 50 numbers, 45 are correct and 5 are wrong (for example, it picked up an email or invoice number).

Precision: 45 out of 50, or 90%. The system rarely makes errors when it decides to extract a number.
Recall: 45 out of 80, or about 56%. It missed many documents that had a number.

This scenario is common: high precision can hide low recall. That happens when the model "plays it safe" and extracts only where confident, ignoring uncertain places.
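For readers who prefer to see the arithmetic, the contract-number example can be reproduced in a few lines of Python. This is a sketch under one assumption: every document that contains a number but did not get a correct one counts as a miss, which is how the 45 out of 80 figure is obtained. The function name is illustrative.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Numbers from the contract-number example.
documents_with_number = 80
returned = 50
correct = 45

tp = correct                          # 45 correct numbers
fp = returned - correct               # 5 wrong numbers
fn = documents_with_number - correct  # 35 documents without a correct number

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.1%} recall={recall:.1%} f1={f1:.1%}")
```

F1 comes out at roughly 69%, well below the 90% precision, which is exactly the gap that a precision-only report would hide.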

What matters at the start depends on risk. If a wrong value is worse than a miss (for example, payment amount), aim for high precision first. If missing values is worse (for example, you must find every contract with auto-renewal), prioritize recall.

Calculate metrics by field: number, date, amount, BIN/IIN, term, counterparty. Use an overall summary only as a dashboard, and don't hide critical fields inside a single "average" number.

How to build a control document set

The control set is your reference point. It should be small but diverse so you quickly understand where the system fails and why.

A realistic minimum for the first measurement

For a start, 30–50 documents are usually enough. If you have many types (contracts, emails, attachments), you can take 60–80, but don't try to cover everything at once. Make sure each key type appears at least 5–10 times, otherwise metrics will be volatile.

Pick documents not by "most beautiful" but by "how they appear in real life." It helps to list a few categories and collect a few from each: different contract templates (standard and non-standard), documents from different years (before and after template changes), different departments or branches, different sources (incoming and outgoing emails, contracts from suppliers), and different formats (DOCX, text PDF, scans).

Be sure to include hard cases

If you don't include challenging files, the test will look great and production will break. Keep at least 20–30% difficult documents: poor scans, phone photos, stamps over text, tables, handwritten notes, low resolution. It's unpleasant but honest.

Track sets and origin

It's convenient to have three buckets from the start:

Starter set: for the first measurement and tuning.
Extended set: to add new types and rare cases.
Regression set: a small set (10–20 documents) you run after changes to ensure you didn't break something that used to work.

One small but time-saving habit: mark the source of each file in the filename or a separate table. Minimal fields: type (contract, amendment, email, attachment), year, department, format (scan or text PDF). Then you'll quickly see, for example, that errors are common in attachments or scans from a specific year.
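Once the source is recorded, grouping errors by it takes only a few lines. The sketch below assumes a hypothetical tracking table with one row per file and a fixed number of checked fields per document; the rows are made-up illustration data, not measurements.

```python
from collections import defaultdict

# Hypothetical tracking table: one row per control-set file.
# "errors" is the number of wrong or missed fields found during review.
documents = [
    {"file": "c-001.pdf", "type": "contract", "year": 2021, "format": "scan", "errors": 3},
    {"file": "c-002.pdf", "type": "contract", "year": 2024, "format": "text_pdf", "errors": 0},
    {"file": "a-003.pdf", "type": "attachment", "year": 2021, "format": "scan", "errors": 2},
    {"file": "e-004.pdf", "type": "email", "year": 2024, "format": "text_pdf", "errors": 1},
]


def error_rate_by(rows, key, fields_per_doc=5):
    """Share of checked fields that were wrong, grouped by one metadata column."""
    totals = defaultdict(lambda: [0, 0])  # group value -> [errors, fields checked]
    for row in rows:
        totals[row[key]][0] += row["errors"]
        totals[row[key]][1] += fields_per_doc
    return {group: errors / checked for group, (errors, checked) in totals.items()}


by_format = error_rate_by(documents, "format")
by_year = error_rate_by(documents, "year")
```

The same call with `"type"` or `"year"` as the key shows whether problems cluster in attachments or in documents from a particular period.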

Gold-standard annotation: lock the rules beforehand

If you want evaluation to avoid arguments, first record annotation rules. The gold standard is not an "ideal document" but an agreed answer to what value counts as correct and why.

Create a short annotator guide of 1–2 pages. It should describe not only what to find but where it usually is (email header, contract preamble, payment section, signatures). This reduces variance and speeds up work.

Fix at least:

Field format: how to record dates (DD.MM.YYYY), amounts (with or without currency), numbers (with or without prefixes like №).
Ambiguous cases: what to do if there are multiple dates (signing date vs effective date), multiple amounts (net vs gross), or corrections and amendments.
Source of truth: what takes precedence in conflicts — body text, table, appendix, signature, stamp.
Values 'not found' and 'not applicable': when to mark 'not found' (field expected but absent) and when 'not applicable' (field doesn't exist for this document type).

To check consistency, have two people annotate the same subset and compare differences. If differences are large, the problem is usually the rules, not the people.
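A simple way to compare two annotators is the share of documents on which they recorded the same value, counted per field. The sketch below assumes each annotation is stored as a dictionary of the form `{document id: {field: value}}`; the function name and data layout are illustrative.

```python
def agreement_by_field(annotator_a: dict, annotator_b: dict) -> dict:
    """Per-field share of shared documents where both annotators agree."""
    matches: dict = {}
    totals: dict = {}
    # Only documents annotated by both people are compared.
    for doc_id in annotator_a.keys() & annotator_b.keys():
        for field, value_a in annotator_a[doc_id].items():
            totals[field] = totals.get(field, 0) + 1
            if annotator_b[doc_id].get(field) == value_a:
                matches[field] = matches.get(field, 0) + 1
    return {field: matches.get(field, 0) / count for field, count in totals.items()}


# Made-up example: the annotators disagree on which date to record.
first = {
    "doc1": {"date": "01.02.2025", "amount": "100000 KZT"},
    "doc2": {"date": "15.03.2025", "amount": "250000 KZT"},
}
second = {
    "doc1": {"date": "01.02.2025", "amount": "100000 KZT"},
    "doc2": {"date": "20.03.2025", "amount": "250000 KZT"},
}
agreement = agreement_by_field(first, second)
```

A field with low agreement is a signal to revisit its rule in the annotator guide before measuring the system against it.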

Store the gold standard in a simple table: rows are documents, columns are fields. Add a comment column to record exceptions (for example: 'amount is in an appendix', 'amount only written in words in the email'). This prevents repeated debates and helps understand why the system failed.

A small example: a contract has 'Contract Date' in the header and 'Signing Date' near signatures, while an email shows both an incoming date and an event date. If the rule is not fixed, one annotator will pick the first date and another the second, causing metrics to fluctuate even if the system hasn't changed.

Step-by-step: how to run the first quality cycle

The first cycle is to get honest numbers and understand what exactly breaks.

Start with a small control set (e.g., 30–50 files) that includes different document types: contracts with attachments, scans, PDFs from mail, emails with signatures and stamps.

Five steps of one cycle

Run the documents through the full pipeline as in real life: upload, OCR (if needed), field extraction, and saving results.
For each document compare the result with the gold standard for each field. Compare by 'field-value', not 'the whole document.'
Record TP/FP/FN per field: TP means the value was extracted correctly, FP means a value was extracted but is wrong, FN means the value is in the gold standard but was not found.
Summarize in a single table: rows are fields, columns are TP/FP/FN, precision/recall and a short comment.
After fixes, rerun on the same set. Otherwise you won't know whether the improvement came from your changes or from a different set of documents.
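The counting and summary steps can be sketched in Python as follows. The data layout (`{document id: {field: value or None}}`) and the function names are assumptions for the example. One convention is also assumed and should be agreed in advance: a wrong value counts as both an FP (something incorrect was returned) and an FN (the correct value was not found).

```python
from collections import Counter


def count_outcomes(gold: dict, predicted: dict) -> dict:
    """Count TP/FP/FN per field over all documents in the gold standard."""
    counts: dict = {}
    for doc_id, gold_fields in gold.items():
        for field, expected in gold_fields.items():
            got = predicted.get(doc_id, {}).get(field)
            tally = counts.setdefault(field, Counter())
            if got is not None and got == expected:
                tally["TP"] += 1
                continue
            if got is not None:
                tally["FP"] += 1  # returned a value that is wrong or should not exist
            if expected is not None:
                tally["FN"] += 1  # the correct value was not found
    return counts


def summary(counts: dict) -> list:
    """One row per field: (field, TP, FP, FN, precision, recall)."""
    rows = []
    for field, tally in sorted(counts.items()):
        tp, fp, fn = tally["TP"], tally["FP"], tally["FN"]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append((field, tp, fp, fn, round(precision, 2), round(recall, 2)))
    return rows


# Made-up data for three documents; None means "no value".
gold = {
    "d1": {"number": "A-1", "amount": "100"},
    "d2": {"number": "B-2", "amount": None},
    "d3": {"number": "C-3", "amount": "300"},
}
predicted = {
    "d1": {"number": "A-1", "amount": "100"},
    "d2": {"number": None, "amount": "50"},
    "d3": {"number": "X-9", "amount": "300"},
}
table = summary(count_outcomes(gold, predicted))
```

Values should be normalized to the agreed format before comparison, otherwise formatting differences will be counted as extraction errors.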

After the first count, analyze errors by cause. That is more important than the raw numbers: you'll know where to fix things.

Common causes fall into several groups: OCR issues (misread digits, missing lines, Cyrillic/Latin confusion), extraction logic errors (unhandled wording variants), format peculiarities (tables, columns, skewed scans, different templates) and human factors (gold standard annotated with inconsistent rules, typos in the "correct" answer).
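Some of these causes can be flagged automatically. As one example, Cyrillic/Latin confusion in OCR output can be spotted by looking for tokens that mix both scripts. The sketch below uses Python's standard `unicodedata` module; the function names are illustrative, and a mixed token is only a hint for review, not proof of an OCR error.

```python
import unicodedata


def scripts_in(token: str) -> set:
    """Return which of the two scripts the letters of a token belong to."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            scripts.add("cyrillic")
        elif name.startswith("LATIN"):
            scripts.add("latin")
    return scripts


def mixed_script_tokens(text: str) -> list:
    """List whitespace-separated tokens that contain both Cyrillic and Latin letters."""
    return [token for token in text.split() if len(scripts_in(token)) > 1]


# The first letter of the number is Cyrillic "К" (U+041A), not Latin "K".
sample = "Contract \u041aZ-2025/0145 signed"
suspicious = mixed_script_tokens(sample)
```

Running such a check on both the OCR text and the gold standard also helps catch typos in the "correct" answers.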

A practical example: in emails 'Out. No' is sometimes written 'Исх№', 'Исх No' or 'Outgoing'. That will create both FP and FN for the 'number' field. In the report mark it as a 'writing variant' rather than a 'bad model.'
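Once writing variants are known, they can be collected in one pattern instead of being fixed one by one. The regular expression below is a hypothetical sketch that covers only the spellings mentioned above and assumes the number starts with a digit; real mail will need more variants.

```python
import re

# Assumed variants of the label: "Outgoing", "Out.", "Исх" or "Исх.",
# optionally followed by "№", "No", "No." or "#".
OUTGOING_NUMBER = re.compile(
    r"\b(?:outgoing|out\.?|исх\.?)\s*(?:№|no\.?|#)?\s*(\d[\w\-/]*)",
    re.IGNORECASE,
)


def extract_outgoing_number(text: str):
    """Return the outgoing number as a string, or None if no variant matches."""
    match = OUTGOING_NUMBER.search(text)
    return match.group(1) if match else None


examples = ["Исх№ 123/45 от 01.02.2025", "Исх No 77-A", "Outgoing 5501", "Out. No 12"]
found = [extract_outgoing_number(line) for line in examples]
```

Each new variant found during error analysis can be added to the pattern and covered by an example, so the fix is checked on the next rerun.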

How to organize manual verification at the start

At the start, automation almost always misses small details: date format, an extra space, a swapped counterparty in an email. Manual verification is needed to quickly see where the system fails most and to tune the rules before the pilot.

The team can be small but roles should be separated: one person checks fields against the document and sets status, a process owner (legal, finance, procurement) approves rules for disputed cases, and a technical specialist or analyst collects errors and updates templates and reports.

To prevent verification from becoming an endless stream, work in batches. For example, 50–100 documents per day with a clear readiness criterion: 'document checked, critical fields confirmed or marked as absent.' It's helpful to predefine statuses: 'OK', 'Needs clarification', 'Not extracted', 'Doubtful.'

Log disputed cases once in a decision journal: field, text fragment, decision, reason, date, who approved. That way the same case won't come back every week.

Double-check only where mistakes are costly: amounts, dates, effective periods, BIN/IIN, bank details, contract number. Other fields can be spot-checked.

You can reduce load by prioritizing. Start with 100% checking of critical fields and 10–20% sampling for other fields. Also prioritize emails and contracts with 'red flags': many attachments, poor scan quality, non-standard templates, multiple counterparties in one email.
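The "100% for critical fields, sample the rest" rule can be made reproducible with a seeded random choice. In the sketch below the field names, the default 15% rate and the function name are assumptions; the seed is tied to the document ID so the same document always gets the same selection.

```python
import random

# Assumed names for the fields that are always checked by hand.
CRITICAL_FIELDS = {
    "amount", "date", "effective_term", "bin_iin", "bank_details", "contract_number",
}


def fields_to_review(doc_id: str, fields: list, sample_rate: float = 0.15, seed: int = 0) -> list:
    """Pick the fields a reviewer must check for one document.

    Critical fields are always selected; every other field is selected
    with probability `sample_rate`.
    """
    rng = random.Random(f"{seed}:{doc_id}")  # deterministic per document
    selected = []
    for field in fields:
        if field in CRITICAL_FIELDS or rng.random() < sample_rate:
            selected.append(field)
    return selected


all_fields = ["amount", "date", "contract_number", "address", "contact_person", "notes"]
review_list = fields_to_review("contract-0145", all_fields)
```

Documents with "red flags" can simply be passed with `sample_rate=1.0` so that every field is reviewed.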

Errors that make metrics misleading

Metrics can look "good" while users still suffer. Often the problem is not extraction itself but how you evaluate results and collect data for review.

The trickiest error is mixing versions and sources. A test may contain a file after manual edits while production receives a stamped scan with marks. Formally it's "the same contract" but for extraction they're different inputs. If you don't record the source (mail, EDM, scan, template) metrics will be optimistic.

Another common issue is changing annotation rules on the fly. Today 'contract number' includes prefix and year; tomorrow only digits. If the gold standard isn't rebuilt, the system will appear to 'fail' where you simply redefined the right answer.

A third problem is looking only at the overall average. You can have 95% average but fail completely on one field (e.g., BIN/IIN or amount with VAT), making exports unusable. It's better to keep metrics per field and per document type.

Don't be fooled by pretty precision while ignoring misses. If the system extracts only what it is confident about, errors are rare but many required fields are missing (FN). In practice this leads to manual lookup and delays.

And the classic bias — testing on good documents. If you check only recent contracts from one template while production contains old versions and emails with embedded tables, real quality will be lower. Keep the set diverse: years, templates, scan quality and unusual emails.

Short checklist before pilot

Before the pilot, make sure you evaluate quality consistently and don't compare apples to oranges.

First, confirm the team agrees on what is extracted. For each field specify format and an example: not just 'contract number' but 'string like KZ-2025/0145'; not just 'date' but 'DD.MM.YYYY'.

Then check basic readiness:

Field list fixed: format, allowed variants, example values.
Control set collected: includes typical documents and a portion of hard cases (scans, poor quality, non-standard templates, long email chains).
Rules for empty and not-applicable values agreed.
Precision and recall calculated separately for key fields.
Error log maintained: what was wrong, why it happened, and planned fixes.

If you have time for only one extra check, do a manual mini-review of 20–30 documents. Look not only at percentages but at real misses. Often numbers look decent but critical fields (amount or term) break due to a couple of typical phrasings.

Example scenario: contracts and emails in an organization

Imagine a typical flow: a shared mailbox and ECM receive emails from counterparties, contract scans and attachments. Some documents are PDF, some Word, some phone photos. All go into one processing queue.

The goal is simple: extract document date, number, counterparty, amount and effective term. These five fields are a good place to start: they are debatable, easy to compare, and quickly show where the system fails.

A control set of 50–100 documents is usually assembled to reflect real life. For example: 20–30 contracts of different types (supply, services, amendments), 15–25 emails where amounts or terms are mentioned in text, 10–20 poor-quality scans or non-standard templates, and 5–10 cases with multiple amounts or dates in one document (advance, final, attachments).

Manual verification is then spread over 1–2 weeks to avoid overloading staff and rushed annotations. A common scheme: one person does initial checks, a second confirms disputed cases and the unified rules, and the process owner reviews a short error report every two days to decide what to fix first.

After the first count conclusions are usually consistent. Dates and numbers are extracted well from standard contracts, but counterparties get confused by roles ('Contractor', 'Supplier') and grammatical cases. Amounts often fail on documents with multiple currencies or tables, and effective terms fail on emails where they are described in words ('until the end of the quarter'). These observations give a clear plan: refine annotation rules, add hard examples to the control set, and keep mandatory manual checks only for high-risk fields.

Next steps: from pilot to steady operation

After the pilot, make quality control a regular routine rather than an occasional check by eye. Start simple: decide which fields truly affect money, risks and deadlines, and which errors you can tolerate.

For critical fields (amount, BIN/IIN, dates, contract number, effective term, bank details) set target quality thresholds. Define them separately: some fields require high recall, others high precision. Targets should be clear to the business, not only to the team tuning extraction.
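Targets are easiest to enforce when they are written down as data and checked automatically. In the sketch below the field names and the threshold values are placeholders, not recommendations; each field lists only the metric that matters most for it.

```python
# Placeholder targets: replace the values with the ones agreed with the business.
TARGETS = {
    "amount": {"precision": 0.99},     # a wrong amount is worse than a miss
    "auto_renewal": {"recall": 0.98},  # every such contract must be found
}


def unmet_targets(measured: dict, targets: dict) -> list:
    """Return (field, metric, measured value, required minimum) for each failure."""
    failures = []
    for field, goals in targets.items():
        for metric, minimum in goals.items():
            value = measured.get(field, {}).get(metric, 0.0)
            if value < minimum:
                failures.append((field, metric, value, minimum))
    return failures


# Made-up measurement results for illustration.
measured = {
    "amount": {"precision": 0.995, "recall": 0.80},
    "auto_renewal": {"precision": 0.90, "recall": 0.95},
}
failures = unmet_targets(measured, TARGETS)
```

In this made-up case the amount field passes despite modest recall, because only precision was set as its target, while auto-renewal fails on recall.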

Create a regression set: a small but "nasty" collection that includes typical contracts, poor scans, free-form emails, rare templates, cases with corrections and amendments. After any change (OCR, rules, model, templates) rerun the set so quality doesn't degrade unnoticed.
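A regression run boils down to comparing the new per-field metrics with a stored baseline. The sketch below assumes both are kept as `{field: {metric: value}}` dictionaries; the function name and the 0.01 tolerance are illustrative choices.

```python
def regressions(baseline: dict, current: dict, tolerance: float = 0.01) -> list:
    """List metrics that dropped by more than `tolerance` compared with the baseline."""
    dropped = []
    for field, base_metrics in baseline.items():
        for metric, base_value in base_metrics.items():
            # A field or metric missing from the new run counts as zero.
            new_value = current.get(field, {}).get(metric, 0.0)
            if base_value - new_value > tolerance:
                dropped.append((field, metric, base_value, new_value))
    return dropped


# Made-up numbers: recall for dates fell after a change.
baseline = {"date": {"precision": 0.96, "recall": 0.90}}
after_change = {"date": {"precision": 0.96, "recall": 0.80}}
problems = regressions(baseline, after_change)
```

An empty result means the change did not degrade any tracked metric on the regression set; a non-empty one names the field to investigate before release.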

It helps to separate problems into two layers: text recognition quality and extraction quality. If OCR consistently misreads digits or drops lines, extraction won't improve even with perfect rules. If text is recognized well but the date is taken from the wrong place, adjust extraction logic.

To stabilize operations decide organizationally: who owns metrics and how often they are checked, how you store documents and access with confidentiality and audit in mind, which fields always go through manual verification and when it can be removed, how to handle unclear cases, and what to do if quality drops after updates.

If the pilot shows value, the next step is often infrastructure: where to run processing, how to store data, how to ensure stability. In these tasks GSE.kz (gse.kz) can help as a system integrator: pick servers and workstations for the load, build infrastructure for document processing and provide 24/7 support so the solution works reliably in real flow, not only on a test folder.