At which stages does data quality most often degrade in a DWH?

Most failures occur at intermediate steps: staging, normalization, deduplication, joins with reference data and metric calculations. A small error there can quickly become a wrong metric across all marts.

What symptoms commonly indicate a data problem in the warehouse?

Typical signs are mismatched reports between marts, sudden drops or spikes in metrics, more duplicates in customers or contracts, missing data after a load, and an increase in values like `unknown`. If these symptoms repeat, it’s usually a data-quality issue rather than random noise.

Which types of data suffer the most and why?

Reference tables break joins and groupings, transactions cause volume and amount gaps or spikes, and master data creates duplicate customers and fragmented histories. These three data types most often affect KPIs and user trust.

What quality checks are needed in almost any DWH?

To start, completeness, uniqueness, referential consistency, allowed values and formats, and timeliness are usually sufficient. They catch most failures before a mart is published and are easy to explain to the business.

How to formulate a validation rule so it doesn’t need to be rewritten?

Describe the rule in plain language first so it’s understandable without SQL, then specify the object, period, formula and threshold. Define severity and the action on violation — otherwise the check will be a checkbox and won’t help during an incident.

How do technical checks differ from business quality checks?

Technical checks answer whether the load ran correctly — types, keys, duplicates, integrity. Business checks verify whether the data makes sense for reporting: value ranges, logical relations and expected shares by status.

How to distinguish a metric anomaly from a real data error?

An error is usually a clear rule violation, e.g. a required field becoming `NULL` or a missing reference. An anomaly is a suspicious deviation (sharp spike or drop) that may be a real business event; treat anomalies as signals for investigation rather than automatic blocking.

How to set alert thresholds so they aren’t noisy?

Static thresholds work for things with fixed bounds, like allowed share of `NULL` or negative sums. For volumes and aggregates, dynamic thresholds are better: compare with recent history and account for seasonality, otherwise alerts will either scream or stay silent.

Where is it best to put quality checks: at input, in transformations, or in marts?

Place checks at three points: immediately after extraction from the source, after major transformations, and before publishing the mart. That way you locate the failure faster and fix it cheaper, before users see incorrect reports.

Who should be responsible for quality and what if the problem is infrastructure-related?

At minimum, you need a data owner who approves rules and priorities, and an engineer who implements checks and reprocessing, plus a steward for definitions and reference data. For critical contours, monitoring and DWH stability are essential; locally this is often handled with on-prem infrastructure and 24/7 support, for example rack servers S200 Series and service by GSE.kz.

Data Quality Control in DWH: rules, alerts, roles

Where data quality usually breaks in a DWH

Data quality is especially critical in the warehouse: the DWH becomes the "single source of truth" for reports, marts and models. Even if sources are mostly fine, errors amplify in the DWH. Data is merged, cleaned and used to compute metrics there, and a small issue turns into an incorrect KPI for many consumers.

Problems most often show up not as an obvious error but as symptoms:

reports from different marts start to diverge;
yesterday’s metric "dips" without a clear reason;
duplicates appear in customers, contracts or documents;
part of the data suddenly disappears after a load;
the share of "other/unknown" grows in breakdowns.

Why are these noticed late? Because failures often happen at intermediate steps (staging, normalization, deduplication, calculations), while checks are run on the final published mart. For example, the business sees a drop in sales in the daily report, but the technical cause is a changed currency code in a reference table that prevented some transactions from being recalculated.

Three types of data suffer most: reference data (statuses, branches, products), transactions (payments, shipments, requests) and master data (customers, counterparties, contracts). Reference data breaks joins and groupings, transactions cause gaps and spikes, and master data creates duplicates and "double" customers.

It’s important to distinguish a real quality problem from normal variability. Seasonality, promotions or delayed loads legitimately change numbers. A problem is a violation of expectations: unnatural spikes, impossible values, sudden changes in shares or inconsistencies across layers. Data quality control in the DWH should target those cases.

Which quality checks are needed in most warehouses

Start data quality control in the DWH with basic checks that catch most issues already at load time. They don’t require complex statistics and are understandable to the business: what broke and where to look.

Basic checks that give the most value

Usually five groups are enough:

Completeness. Mandatory fields must not be empty and the share of NULL should not suddenly grow (for example, "TIN filled in at least 99% of rows").
Uniqueness. Keys and identifiers must not be duplicated (e.g. an order should not appear twice in the fact table).
Consistency. Relationships between tables must match (e.g. every sale should reference an existing product and customer).
Allowed values and format. Types, lengths, patterns and ranges (e.g. date not in the future, amount non-negative, region code present in the reference table).
Timeliness. The load must arrive on time and the mart should be updated within expected lag (e.g. "data no older than 24 hours").

Agree in advance what counts as an error and what is a warning. An empty phone number is often acceptable, while an empty contract identifier may be a reason to stop the whole batch.

How it looks in a real example

Imagine a daily payments mart. One day the number of rows in the fact table jumps by 40% and the share of empty customer_id becomes 15%. A completeness check immediately highlights the issue, and a consistency check shows that new records don’t match the customer dimension. A common cause is a changed source export format or a new category of operations that wasn’t mapped.

Keep checks simple and measurable: percent, count, range, presence of a relation. Then they’re easy to automate and easy to explain to the data owner and the loading team.

How to formulate validation rules without extra complexity

It’s useful to separate business rules from technical rules.

Technical rules answer "did the load run correctly?" (types, formats, keys, duplicates). Business rules answer "do the data make sense for reporting?" (value boundaries, logical relations, expected shares). This separation keeps responsibilities clear: engineers fix pipeline failures, and data owners specify meanings and acceptable deviations.

To prevent quality control turning into endless debate, each rule should have a severity level and an expected reaction. Three levels are usually enough:

Blocking. Stop the load or mart publication.
Warning. Don’t stop the load, but mark the release as risky.
Informational. Record the fact for monitoring, no immediate action.

A rule should be understandable without SQL. First describe it in plain words (what is considered a violation and why it matters), then formalize: where to check (table/layer), what condition constitutes a violation, the time window and the normal threshold.

Example: "In the sales mart, daily order count must not drop more than 30% relative to the 14-day average." Formalization: check on mart layer, compare daily_count with avg_14d, threshold 0.7. Decide actions in advance: block publication on a blocking violation, or publish with a label and notify the owner for a warning.

Store rules in a catalog rather than in people’s heads. Minimum fields: owner, object (table/mart), frequency, description, check formula, threshold, severity and action on violation. Then rules are easy to review, hand over and keep context.

Step-by-step: how to implement quality control in loads

To avoid making data quality an endless project, start small and embed checks directly in the loads. It’s important that checks run automatically, not as manual reconciliations.

Practical order of actions:

Identify critical data sets: which marts and reports suffer first from errors and who consumes them. Often these are daily marts for finance, sales or regulatory reporting.
Record baseline metrics and current state. Minimum: row counts, share of empty values in key fields, uniqueness of business key, date and amount ranges. Write down the "as is" so you can show progress and avoid arguing based on impressions.
Embed checks at three points: after extraction from the source, after transformations and before loading into the mart. Input checks catch missing exports, mid-pipeline checks reveal bad joins and filters, pre-mart checks catch business-rule violations.
Define pipeline reaction to failures. For critical rules it’s better to stop the load or send data to quarantine. For less critical rules mark the batch as "suspicious" and temporarily show the report with a warning or partial degradation (for example, without new data for the day).
Introduce regular reviews. Daily — a short review of red and yellow signals; every 2–4 weeks — review thresholds and rules so they don’t become stale.

Agree on the check result format in advance: status (ok/warn/fail), number of affected records and a short reason. That simplifies automation and speeds up discussion.

Example: you load a daily sales mart. First check that all stores were present in the source (no region "dropped"), then that the share of unknown SKUs didn’t grow after joining the product reference, and finally before the mart that total sales are not negative and dates fall into the expected window. This set catches both technical failures and real business anomalies.

Anomaly alerts: what to monitor and how to set thresholds

Support when an alert can’t wait

We’ll arrange support and 24/7 maintenance for critical reporting contours.

An anomaly is not always an error. An error is a clear rule breach (a required field became NULL or a reference lookup failed). An anomaly is a suspicious shift — a sharp peak, a drop to zero, or an unusual share of missing values. Sometimes it’s a real business event, sometimes a quiet pipeline failure.

Build alerts around simple signals that are easy to explain to data owners and quick to check.

Commonly effective metrics:

row volume by table or partition (day/hour);
unique key counts and share of duplicates;
share of NULL in critical fields;
distributions (e.g. share of values by status);
key mart aggregates (revenue, order count, active customers).

Set thresholds in two ways. Static bounds fit items with clear limits: "NULL share no more than 1%" or "volume not less than 10,000 rows". Dynamic thresholds work better for metrics that evolve: compare with history (e.g. average for the last 14 days) and alert if deviation exceeds X percent or falls outside a normal corridor.

Seasonality and calendar effects matter, otherwise alerts will be noisy. For daily data, compare to the same weekday; for monthly closings keep separate baselines for month-start, month-end and holidays.

A good alert is not "something strange", but a short incident card:

what deviated and by how much;
where it happened (layer, table, partition);
when it started and which period is affected;
possible causes (if known) and quick checks;
who to notify: data owner and the loader owner.

Roles: who is responsible for quality and who fixes it

Checks alone don’t improve warehouse quality. You need people who decide on rules, respond to alerts and take fixes to completion. Otherwise quality control becomes a stream of notifications that no one owns.

Who does what

Even in a small team four roles are usually enough (often combined):

Data owner (business owner) decides what is an error and what is an acceptable exception. They set priorities: what to fix today and what can be postponed without business risk.
Data steward maintains unified definitions and reference data: how we define "customer", what a "order" is, which statuses are final. They keep the rule list and ensure rules don’t contradict.
DWH engineer implements checks and technical actions: how metrics are computed, how to mark suspicious records, quarantine and reprocessing.
Analyst or data consumer validates the impact on reports. They check that after a fix numbers are logical and that changes didn’t break common marts.

Example: an alert shows a spike in cancellations in the daily sales mart. The engineer finds a new status in the source. The data steward clarifies its meaning and updates the reference. The data owner decides whether to treat those operations as cancellations or classify them separately. The analyst verifies how this affects KPIs.

How to assign responsibility so issues aren’t "nobody’s"

A simple RACI-style agreement and clear SLAs work well:

assign an owner for each key dataset and mart (who approves rules and priorities);
record SLA for alert reaction: e.g. critical within 1 hour, medium within a day;
define communication channel and incident format: what to report, who is added, where history is stored;
agree who can change rules and reference data and how approvals work;
define a closing criterion: metrics returned to normal, reports verified, cause documented.

This makes responsibilities clear and fixes more predictable.

Fix process: from alert to incident closure

When a quality alert fires, speed and order matter. If you start investigating without an initial assessment, the team can waste hours in the wrong place while marts continue to show wrong numbers.

First estimate scope: which period is affected (single load, a day, a week), which marts and metrics depend on the problematic table, and which users might see the bad data. This helps set priority and decide whether to stop publication.

Next, quickly localize the layer where the error originated. Usually it’s one of three zones: source (incorrect or incomplete incoming data), transformation (logic error in calculations, joins, filters), or mart publication (partial load, duplicates, wrong partitioning).

Typical process:

Initial assessment: confirm the alert, list affected marts and periods, estimate reporting impact.
Diagnosis: find where it broke (source, transform or mart), collect minimal facts (table, load batch, what changed).
Temporary workaround: mark data as "do not trust", hide the metric, apply a date filter, or postpone publication. The goal is to prevent users from accepting wrong numbers.
Root cause and prevention: document what led to the failure and what changes prevent recurrence (a validation rule, an input check, load order, reference-data control).
Reprocessing and verification: backfill the needed period, reconcile control totals (sums, row counts, key metrics) and close the incident.

Example: an alert shows a drop in "issued invoices" in the daily mart. A quick check reveals finance reports have already been updated. Temporary mitigation — hide the metric for the current date and add a delay note. The cause may be a late source export or a new filter in transformations that excluded some statuses. After fixing, reprocess the day and add a check that catches such a drop earlier with a clear message: what fell, relative to what, and where to look for root cause.

Close incidents only after quality is confirmed (reconciled totals) and a preventive action is recorded. Otherwise alerts become endless fires.

Typical mistakes when building quality control

Server for DWH without guesswork

We’ll help select a GSE S200 Series rack server for loads, data marts and alerts.

Choose a server

The most common mistake is trying to check "everything" from day one. You end up with many overlapping rules that the team can’t maintain. Better to start with a small set tied to real risks: money, regulation, key reports, SLAs.

Second problem — alerts without a specific owner. When a notification goes to "everyone in the chat", it becomes nobody’s, and soon people stop reading it. Each incident type needs one responsible investigator and one business contact to confirm it’s a real issue.

Another trap is thresholds without historical basis. If you pick "+/- 10%" at random, alerts will scream during seasonality, holidays or one-off campaigns. If the threshold is too wide, you’ll miss real anomalies.

Often checks are placed only on marts when the error is already in reports. That’s OK to start, but risky: fixes become costlier and trust erodes. Move the minimal set closer to the source and pipeline where the error first appears.

Finally, many teams don’t keep a change log. Without it, no one remembers why a rule was relaxed, who approved an exception, or when to revisit it.

What helps in practice:

prioritize rules by criticality and cost of error;
assign an alert owner and a data owner to confirm issues;
base thresholds on history (at least 2–4 weeks) and account for seasonality;
run checks at the input and along the pipeline, not only on the final mart;
record rule changes and exceptions with date and approver.

Example: daily sales drop by 30%. If the check is only on the mart, business already sees the collapse. If you have an early check on incoming transaction volume and a separate check on the share of unlinked customers, you quickly determine whether it’s a load failure, a reference-data issue or a true business event.

Example scenario: anomaly in the daily mart and root-cause analysis

A sales mart updates every morning. On Monday the quality monitoring flags a 40% drop in transactions compared to the 14-day average. The integration team deployed a change to the source export the day before. This is a typical scenario where quality control should act fast and clearly.

Investigation starts simply: compare row volumes at each step. If the drop is already visible in staging, the problem is upstream of the mart. Then check breakdowns: the drop is almost entirely in one region. The cause surfaces: a new region code appeared in the branch reference that isn’t in the mapping table. Some transactions fail the join and are dropped.

A second symptom may appear: the number of unique customers in the customer report increases. It turns out the source started sending a new business key, while the DWH builds the key using the old rule. That creates duplicates and fragments customer history.

To avoid getting lost, record the data owner’s decision:

if the new region code is valid — update the reference and mapping rules;
if the customer key change is agreed — change the transformation and rebuild history;
if source changes were not approved — request a rollback or a correction at the source.

Then the team doesn’t just "fix" but documents the result: create an incident, add a validation rule (e.g. share of transactions without a branch no more than 0.5%), refine alert thresholds and record preventive measures. That way a similar failure will be detected earlier and cause less impact on reporting.

Short checklist: what to verify before going to production

DWH infrastructure project estimate

We’ll calculate configuration and delivery of equipment for your warehouse and quality-control contours.

Request estimate

Before release, ensure quality control doesn’t rely on the memory of a few people. If the on-call changes tomorrow or the source behaves differently, the system should still catch the problem and indicate next steps.

Check common things that are forgotten under deadlines:

There’s a list of critical tables and marts (those affecting reports, SLAs, regulatory extracts). Each has an assigned data owner and contact for operational questions.
For each critical dataset, the main rules are chosen (usually 5–10): completeness, key uniqueness, allowed ranges, reference consistency, absence of unexpected empty values.
Alerts are configured and tested in practice: where they go, who is on duty, expected reaction times and what to do at night or on weekends.
Actions on failure are predefined: stop the load, send data to quarantine or publish the mart with a clear risk label (and who sees that label).
Regular reviews are planned: a short weekly incident and trend review, plus a monthly threshold and rule review.

A useful pre-launch test: simulate a real failure (e.g. a missing file or a sharp drop in rows) and follow the entire process. Who got the alert? Who made the decision? Where was the cause recorded? How was it verified the issue won’t recur? If any step is unclear, fix it before production.

Next steps: how to get started and what to prepare for the team

Start with a pilot on 1–2 marts that are actively used for reporting and decision-making. That shows quickly which checks add value and where the process breaks in practice. Early agreement with data owners is crucial: which fields are critical, what error levels are acceptable and how fast an incident must be closed.

To avoid drowning in details, prepare a small set of artifacts maintainable without a separate team:

a list of 15–30 baseline rules (completeness, uniqueness, ranges, references, freshness);
an incident template (what broke, where visible, when started, who to call);
a history of alerts (to see recurring causes);
an owners table: who confirms issues and who fixes sources.

Decide where checks "live". Some are best in the load pipeline (prevent garbage from progressing), some in the DWH (integrity and relationships), and some in marts (business-important checks). A good rule: the closer you catch a problem to the source, the cheaper it is to fix.

Also consider infrastructure: checks consume compute and alert history needs storage. If monitoring fails, you lose signals and learn about problems too late.

If you need to strengthen the DWH and monitoring platform, system integration and on-prem infrastructure can help. For example, GSE.kz (gse.kz) as a technology vendor and system integrator in Kazakhstan supplies S200 Series rack servers and offers 24/7 support, which is useful for critical contours where quality control must not depend on a single person or service.