Why can’t we just update the embedding model and keep the old index?

Because new vectors would then be compared against an index built from old ones. When the model changes, the geometry of the vector space changes: similar documents move to different locations and the old index ranks them incorrectly. As a result, relevance drops even if the service is technically up.

What exactly should be versioned when updating embeddings?

Version not only the model weights but everything that affects compatibility: tokenizer, vector dimensionality, data type, normalization and search metric. It’s convenient to keep a single “version card” so any vector and index are unambiguously tied to the same settings.

Which incompatibilities most often break a migration?

The most frequent and painful failure is a mismatch in vector dimensionality, which prevents the index from accepting data or makes it behave incorrectly. The second common issue is a mismatch in metric and normalization (for example, previously using cosine with normalization, but now something else), which can sharply change quality without an obvious error.

How does blue-green differ from canary in practice?

Blue-green runs two fully separate environments and switches traffic with a single setting, so rollback is fast. Canary gradually exposes a small share of requests to the new version to catch issues early without affecting everyone.

Why use shadow mode and when is it better than canary?

Shadow runs real queries through the new version "in the background" while users keep receiving answers from the old version. It’s useful when you want to evaluate quality on live data and detect pipeline issues without risking user workflows. It is the better choice over canary when even a small share of users must not see changed results.

How to update vector search via parallel indexes without downtime?

Keep two indexes side by side: the old one serves users while the new one is built on new embeddings in the background. Switch only after verifying coverage and quality, and don’t delete the old index right away so rollback takes minutes.

What is dual write and why will the new index fall behind without it?

Dual write means that new and updated documents are written to both indexes using the same id contract and deletion rules. Without it the new index lags, because data keeps changing while it is being built, and comparisons between the two indexes become unfair.

What minimal quality checks should be done before switching traffic?

Start with 50–200 real queries from logs and add several “problem” examples like short phrases, typos and language mixing. Compare whether the expected document appears in top-1 or top-5, how often there are empty results, and whether response time has worsened.

Which metrics and thresholds should be used as stop signals in a canary?

Predefine simple stop rules tied to users: growth in empty results, rise in errors and timeouts, noticeable latency degradation, worsening clicks or more reformulations. If a threshold triggers, freeze rollout and return traffic to the old environment with a single command.

What mistakes are most often made when updating embeddings without downtime?

The most common mistakes are mixing embeddings of different versions in the same collection and changing metric or normalization “on the fly” without recording the version. Another is a rollback plan that exists only on paper: if returning requires manually fixing several services, it will almost always be delayed under stress.

Updating embeddings without downtime: blue-green

Goal: update the model and embeddings with no downtime

When you change the embedding model or recompute vector representations, the change affects the whole chain, not just one component. Vector search starts comparing new vectors against an old index (or vice versa), and results become unpredictable. Even if the service is technically up, users quickly notice: “it used to find it, now it doesn’t.”

Downtime is especially dangerous where search and suggestions are part of daily work. This can be internal policy search, service desk tickets, finding similar documents, or assistant responses in chat. If such a tool is unavailable even for an hour, people revert to manual search, duplicate requests, support load rises, and trust in the system falls.

Typical symptoms of a careless model or embedding swap:

Relevance drops: the top results are "similar by words" rather than by meaning.
Empty answers appear: the system returns nothing where it used to find results.
Latencies increase: the index warms up, caches are flushed, requests slow down.
Result distributions change dramatically: identical queries return a "different world" without a clear reason.

An update touches several layers: model version, text preprocessing pipeline, embedding recomputation process, vector index and the serving layer (for example with access filters and mixed ranking: vector and keyword).

The problem is not only that recomputation may take hours or days. Often incompatibility breaks things: the old index may not accept new vectors, and the new model may "see" text differently. So the goal looks like this: keep the service available, achieve predictable quality, and have a fast rollback if something goes wrong after switching.

Organizations with strict continuity requirements (e.g., government or banks, often deployed on on-prem hardware) cannot simply "upgrade at night and see what happens." You need strategies that let you update model and embeddings in parallel with the running system and verify effects before all users see changes.

Component map: what to version

To update without downtime, it helps to sketch a simple flow map first. The error is often not in the model itself but in one piece changing while neighboring pieces stay old. The system then starts answering strangely.

Typical request path: source data (documents and queries) -> embedding model -> vector index -> nearest-neighbor search -> (sometimes) additional reranking -> user response. It’s important to version not only the model but everything that affects compatibility between steps.

What to version explicitly:

Embedding model (weights, name/hash of the version) and operating mode (e.g., whether vector normalization is enabled).
Tokenizer or vocabulary (even a small change can shift representations).
Embedding build parameters: dimensionality, data type (float32/float16), pooling, postprocessing (L2 normalization).
Search settings: distance metric (cosine/dot/L2), index parameters (e.g., number of clusters/graph links, efSearch), filters and deduplication rules.

In practice it’s convenient to keep a single “version card” alongside the model and index: e.g. embedding_model=v3, tokenizer=v3, dim=768, metric=cosine, normalize=true, index_build=v3.1. Then any embedding and any index are unambiguously tied to the set of settings.
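
As an illustration, the version card can be kept as a small immutable record that is compared before any vector is written to an index. This is a minimal sketch: the class name VersionCard and the method incompatibilities are hypothetical, and the field names simply mirror the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VersionCard:
    """Settings that must match between an embedding and the index it is written to."""
    embedding_model: str
    tokenizer: str
    dim: int
    metric: str       # "cosine", "dot" or "l2"
    normalize: bool
    index_build: str

    def incompatibilities(self, other: "VersionCard") -> List[str]:
        """Return the names of compatibility-critical fields that differ."""
        critical = ("embedding_model", "tokenizer", "dim", "metric", "normalize")
        return [name for name in critical if getattr(self, name) != getattr(other, name)]


# Hypothetical cards for the old and the new chain.
v3 = VersionCard("v3", "v3", 768, "cosine", True, "v3.1")
v4 = VersionCard("v4", "v4", 1024, "cosine", True, "v4.0")

print(v3.incompatibilities(v3))  # same card: nothing differs
print(v3.incompatibilities(v4))  # lists the fields that block mixing v3 and v4
```

A writer can refuse any vector whose card reports a non-empty list against the target index, which turns a silent mismatch into a loud one.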

Critical dependencies that most often break migrations:

Vector dimensionality. If the new model yields 1024 instead of 768, the old index won’t accept those vectors.
Metric and normalization. Cosine similarity ignores vector length, while dot product does not, so switching between them without adjusting normalization changes rankings.
Storage format. Moving to float16 or quantization can speed search but changes quality and distance distributions.

The riskiest operations are those that are hard to roll back quickly. Reindexing takes time and money, while switching traffic instantly affects everyone. Therefore "dangerous" steps should be designed so the old chain (model + embeddings + index + search parameters) can continue to run in parallel and take traffic back at any moment.

Deployment patterns for zero downtime: blue-green, canary, shadow

The main rule is: don’t force staff to be the first to test a new version during business hours. Three approaches are commonly used. They differ in how you route traffic to the new version and how fast you can go back.

Blue-green

Blue-green uses two identical environments: “blue” (current) and “green” (new). You stand up the new model, new embeddings and everything around them in a separate environment, run checks, then flip traffic with one switch (router, load balancer, or a feature flag).

Pros: fast rollback (flip back), clear responsibility (only one version is active at a time). Cons: more expensive in resources since you must maintain two full copies.

Canary and shadow

Canary gradually introduces the new version by traffic share. Start with 1–5% of requests, then increase until 100%. Errors appear early and risk is distributed.
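
One common way to implement the traffic share is deterministic bucketing by user id, so the same person always lands on the same version while the share grows. The sketch below assumes string user ids; the function names, the salt value and the index names are illustrative, not part of any particular router.

```python
import hashlib


def in_canary(user_id: str, canary_percent: float, salt: str = "embeddings-v2") -> bool:
    """Deterministically decide whether a user falls into the canary share."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket 0..9999
    return bucket < canary_percent * 100                  # 5% -> buckets 0..499


def pick_index(user_id: str, canary_percent: float) -> str:
    """Route canary users to the new index, everyone else to the old one."""
    return "search-index-v2" if in_canary(user_id, canary_percent) else "search-index-v1"


users = [f"user-{i}" for i in range(10_000)]
share = sum(in_canary(u, 5) for u in users) / len(users)
print(f"canary share at 5%: {share:.3f}")
```

Because the bucket depends only on the salt and the user id, raising the share from 5% to 20% keeps the original canary users on the new version, and setting it to 0 returns everyone to the old one.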

Shadow is a “background” check: the new version receives a copy of requests and records answers, but users keep getting responses from the old version. This lets you collect statistics on live data without affecting user experience.

Typical choice:

Blue-green — when you need fast rollback and can afford double resources.
Canary — when you prefer gradual risk reduction and can monitor metrics at each step.
Shadow — when you want to validate quality on real queries without changing user responses.
Canary + shadow — when you first compare in the shadow and then cautiously route a portion of traffic.

Practical example: for internal knowledge search, the new model may look better on ideal tests, but staff type short queries with typos in reality. Shadow helps reveal that the new version misses such queries more often without breaking work.

To make any approach safe, prepare three things in advance:

A clear switch: a single parameter that controls where traffic goes.
Fast rollback: a tested command or button, without manual edits.
A stop threshold: a simple rule to halt rollout (e.g., increased errors or drop in clicks on results).
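
The first two items can be sketched as one small object that holds the active environment and remembers the previous one. The class TrafficSwitch is hypothetical; in a real deployment the same role is usually played by a load balancer setting or a feature flag.

```python
from typing import Optional


class TrafficSwitch:
    """Single parameter that decides which environment serves traffic."""

    def __init__(self, active: str) -> None:
        self.active = active
        self.previous: Optional[str] = None

    def switch_to(self, target: str) -> None:
        """Point traffic at target and remember where it came from."""
        if target != self.active:
            self.previous, self.active = self.active, target

    def rollback(self) -> None:
        """Return traffic to the previous environment in one step."""
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.active, self.previous = self.previous, self.active


switch = TrafficSwitch("blue")
switch.switch_to("green")
print(switch.active)   # green
switch.rollback()
print(switch.active)   # blue
```

Rollback here is the same operation as the switch, just in the opposite direction, which is what keeps it fast under stress.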

Parallel indexes: how to migrate vector search

The safest way to update vector search is to keep two indexes side by side. The old index continues serving users while the new one is built in the background on new embeddings. This reduces the risk of breaking search for staff.

The idea is simple: while traffic goes to index v1, you build v2 and bring it to the same data coverage. Switch only when v2 demonstrates comparable or better quality.

Dual write: how to not lose new data

While the new index is being built, data keeps changing: documents are added, cards updated, new tickets appear. To prevent v2 from lagging, enable dual write: every new or updated object receives an embedding and is written to both indexes.

A single "write contract" matters here: identical ids, identical deletion rules and identical text normalization. That greatly simplifies verification.
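
A minimal sketch of such a write contract is shown below. InMemoryIndex stands in for a real vector index, and embed_v1 and embed_v2 are toy placeholders for the two models; all of these names are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


class InMemoryIndex:
    """Stand-in for a real vector index: stores vectors by document id."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.vectors: Dict[str, Vector] = {}

    def upsert(self, doc_id: str, vector: Vector) -> None:
        self.vectors[doc_id] = vector

    def delete(self, doc_id: str) -> None:
        self.vectors.pop(doc_id, None)


def normalize_text(text: str) -> str:
    """One shared text normalization for both versions."""
    return " ".join(text.lower().split())


class DualWriter:
    """Applies every write and delete to both indexes under the same id."""

    def __init__(self, targets: List[Tuple[InMemoryIndex, Callable[[str], Vector]]]) -> None:
        self.targets = targets  # (index, embedding function) per version

    def upsert(self, doc_id: str, text: str) -> None:
        clean = normalize_text(text)
        for index, embed in self.targets:
            index.upsert(doc_id, embed(clean))

    def delete(self, doc_id: str) -> None:
        for index, _ in self.targets:
            index.delete(doc_id)


# Toy embedding functions with different dimensionality (placeholders for real models).
def embed_v1(text: str) -> Vector:
    return [float(len(text))]


def embed_v2(text: str) -> Vector:
    return [float(len(text)), float(text.count(" "))]


index_v1, index_v2 = InMemoryIndex("search-index-v1"), InMemoryIndex("search-index-v2")
writer = DualWriter([(index_v1, embed_v1), (index_v2, embed_v2)])
writer.upsert("doc-1", "Password   Reset policy")
writer.upsert("doc-2", "Procurement report")
writer.delete("doc-1")
print(sorted(index_v1.vectors), sorted(index_v2.vectors))
```

The sketch leaves out failure handling: a production version would need retries or a queue so that a failed write to one index does not leave the two out of sync.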

Backfill: loading history and tracking progress

After enabling dual write, backfill history into v2: recompute all old documents with the new model and insert them into the new index. Track not only percent complete but also gaps.

Practical minimum checks:

Counter of processed objects and remaining tail
Share of errors by type (timeouts, empty text, wrong encoding)
Spot checks: do top documents have ids in v2?
Indexing time and backfill speed
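
The first three checks can be combined into one report built from plain id sets. This is a sketch under the assumption that you can list document ids from the source store and from the v2 index; the function name backfill_report and the error labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def backfill_report(source_ids: Iterable[str],
                    indexed_ids: Iterable[str],
                    errors: Dict[str, str]) -> dict:
    """Compare the source of truth with the new index and summarize the gaps."""
    source, indexed = set(source_ids), set(indexed_ids)
    missing = source - indexed     # remaining tail: not yet in v2
    orphaned = indexed - source    # in v2 but deleted from the source
    coverage = len(source & indexed) / len(source) if source else 1.0
    return {
        "coverage": coverage,
        "missing": sorted(missing),
        "orphaned": sorted(orphaned),
        "errors_by_type": dict(Counter(errors.values())),
    }


report = backfill_report(
    source_ids=["a", "b", "c", "d"],
    indexed_ids=["a", "b", "x"],
    errors={"c": "timeout", "d": "empty_text"},
)
print(report)
```

Tracking the missing and orphaned ids separately is what exposes gaps that a single percent-complete counter hides.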

To avoid confusion, set naming rules and stick to them: e.g. search-index-v1, search-index-v2, plus build date or release number in metadata. Then switching and rollback are unambiguous.

Step-by-step migration plan: from preparation to switch

It’s critical to separate old and new versions in advance and agree what counts as success. Then switching becomes a controlled operation.

1) Prepare the new version as a separate environment

The model, text preprocessing pipeline and vector format should share a single version. This protects against a situation where the model is updated but tokenization or normalization stays old and quality "drifts" for unclear reasons.

Decide in advance which data you will recompute (all documents or only active ones) and how you will tag objects: document_id, embedding version, recompute timestamp.

2) Build the new index in parallel and check consistency

Bring up the new index next to prod without touching production. First run embeddings recomputation in the background, then build the index and check basics: document count, share of empty/corrupted vectors, metadata matches.

A simple practical test: take 50–100 typical staff queries (knowledge base, tickets, policies) and ensure the new index finds the needed documents rather than random ones.

A 4-step workflow:

Lock pipeline version and prepare a test query set (manual and from logs).
Recompute embeddings and build the new index in a separate environment.
Send queries to old and new environments in parallel (shadow) or to a small user share (canary).
Switch main traffic but keep time and mechanism for a fast rollback.

3) Run shadow/canary and collect signals

Shadow is handy when you don’t want to change user responses but need to compare quality and metrics in the background. Canary is useful when you need to observe real behavior: clicks, time-to-document, number of reformulations.
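
A shadow comparison can be sketched as a wrapper that always returns the old results and only logs what the new chain would have answered. The functions search_old and search_new are placeholders for the two search backends, and overlap of top-k ids is just one possible comparison signal.

```python
from typing import Callable, Dict, List


def overlap_at_k(old_ids: List[str], new_ids: List[str], k: int = 5) -> float:
    """Share of top-k document ids that both versions returned."""
    old_top, new_top = set(old_ids[:k]), set(new_ids[:k])
    if not old_top and not new_top:
        return 1.0
    return len(old_top & new_top) / max(len(old_top), len(new_top))


def shadow_search(query: str,
                  search_old: Callable[[str], List[str]],
                  search_new: Callable[[str], List[str]],
                  log: List[Dict],
                  k: int = 5) -> List[str]:
    """Serve the old results; record how the new version would have answered."""
    old_ids = search_old(query)
    try:
        new_ids = search_new(query)
        record = {"query": query, "overlap": overlap_at_k(old_ids, new_ids, k),
                  "new_empty": not new_ids, "error": None}
    except Exception as exc:  # a failure in the shadow path must never reach the user
        record = {"query": query, "overlap": None, "new_empty": None, "error": repr(exc)}
    log.append(record)
    return old_ids


# Hypothetical backends returning ranked document ids.
shadow_log: List[Dict] = []
answer = shadow_search("password reset",
                       lambda q: ["d1", "d2", "d3"],
                       lambda q: ["d2", "d9", "d1"],
                       shadow_log, k=3)
print(answer, shadow_log[0]["overlap"])
```

In this sketch the shadow call runs inline for simplicity; in production it would normally run asynchronously so that it adds no latency to the user-facing request.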

4) Switch and keep a rollback window

Flip traffic quickly, but do not delete the old index immediately. Keep a clear "kill switch" for return: configuration flag, versioned routing or separate endpoint. This reduces the risk of disrupting work if a rare but important case appears after migration.

Quality control: what to measure before and after migration

Migrations more often fail due to silent quality degradation than a hard outage. Therefore record a baseline before the update and compare after.

Start with a set of test queries that reflect real staff tasks. Not "pretty" examples, but what people actually type: short phrases, typos, mixed languages, internal names. Practice: gather 50–200 queries from logs and add 10–20 manually selected "painful" cases.

Offline checks: quality without users

Offline tests help understand what changed due to the model and index. For each test query, record expected documents (or at least acceptable options) and compare old vs new.

Common measures:

Share of queries where the needed document is in top-1 or top-5 (recall@k).
Average rank of the correct result.
Stability: how often answers "jump" for the same query.
Share of empty results and share of irrelevant top results.
Search time and time-to-first-result.
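
The first two measures are easy to compute from a table of expected documents. The sketch below uses made-up queries and document ids; "hit rate at k" is the share of queries with at least one expected document in the top k, which matches the top-1/top-5 measure above when each query has a single expected document.

```python
from typing import Dict, List, Optional, Set


def hit_rate_at_k(results: Dict[str, List[str]], expected: Dict[str, Set[str]], k: int) -> float:
    """Share of queries with at least one expected document in the top k."""
    hits = sum(1 for query, docs in expected.items()
               if set(results.get(query, [])[:k]) & docs)
    return hits / len(expected)


def mean_rank(results: Dict[str, List[str]], expected: Dict[str, Set[str]]) -> Optional[float]:
    """Average 1-based rank of the first expected document, over queries that found one."""
    ranks = []
    for query, docs in expected.items():
        for position, doc_id in enumerate(results.get(query, []), start=1):
            if doc_id in docs:
                ranks.append(position)
                break
    return sum(ranks) / len(ranks) if ranks else None


# Illustrative data: expected documents and ranked results from each index.
expected = {"password reset": {"kb-12"}, "procurement report": {"fin-7"}, "vpn acces": {"kb-40"}}
old = {"password reset": ["kb-3", "kb-12", "kb-9"], "procurement report": ["fin-7", "fin-2"], "vpn acces": []}
new = {"password reset": ["kb-12", "kb-3"], "procurement report": ["fin-2", "fin-7"], "vpn acces": ["kb-40"]}

for name, results in (("old", old), ("new", new)):
    print(name, hit_rate_at_k(results, expected, 1), hit_rate_at_k(results, expected, 5),
          mean_rank(results, expected))
```

Note that mean rank ignores queries where nothing was found, so it should always be read together with the hit rate and the share of empty results.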

Manually review 20–30 queries: people quickly spot semantic substitutions that metrics sometimes miss.

Online signals: what users perceive

After enabling the new version, watch behavior, not only accuracy. If people find it harder to get answers, it shows in clicks and time.

Monitor clicks on results, time-to-first-click or document open, number of reformulations, share of queries without clicks, and support tickets about search. Define thresholds in advance. Example: stop rollout if recall@5 drops by >2–3 percentage points, share of no-click queries rises >5% vs baseline, or time-to-result worsens >10–15%. Such rules remove the "it feels worse" debate.
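
Such thresholds can be written down as a small function that returns the list of triggered stop rules. This is a sketch with assumed metric names; it uses the lower bound of each example range and treats the no-click and time thresholds as relative changes against the baseline, which is one possible reading of the example.

```python
from typing import Dict, List


def relative_rise(baseline: float, current: float) -> float:
    """Relative increase of current over baseline (0.10 means +10%)."""
    if baseline <= 0:
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline


def stop_reasons(baseline: Dict[str, float], current: Dict[str, float],
                 max_recall_drop_pp: float = 2.0,
                 max_no_click_rise: float = 0.05,
                 max_time_rise: float = 0.10) -> List[str]:
    """Return the names of metrics that breached their stop threshold."""
    reasons = []
    recall_drop_pp = (baseline["recall_at_5"] - current["recall_at_5"]) * 100
    if recall_drop_pp > max_recall_drop_pp:
        reasons.append("recall_at_5")
    if relative_rise(baseline["no_click_share"], current["no_click_share"]) > max_no_click_rise:
        reasons.append("no_click_share")
    if relative_rise(baseline["time_to_result_s"], current["time_to_result_s"]) > max_time_rise:
        reasons.append("time_to_result_s")
    return reasons


# Illustrative numbers, not measurements.
baseline = {"recall_at_5": 0.82, "no_click_share": 0.20, "time_to_result_s": 10.0}
canary = {"recall_at_5": 0.81, "no_click_share": 0.22, "time_to_result_s": 10.5}
print(stop_reasons(baseline, canary))  # a non-empty list means: freeze the rollout
```

Returning the names of the breached metrics, rather than a bare yes or no, gives the on-call person the reason for the stop along with the decision.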

Observability: metrics and alerts to catch degradation

The biggest risk is not a crash but silent degradation: results get worse, searches return empty more often, people spend more time and switch to manual workarounds. Prepare observability before switching traffic.

What to measure: from infrastructure to user behavior

Collect metrics so you can compare old and new side by side: same dashboards, same time windows, separate labels for model and index version.

Minimum set:

Technical metrics: p95 latency on key endpoints, share of 5xx and timeouts, CPU/GPU load, memory, processing queues.
Vector search metrics: search and query-prep time, index size and growth rate, percent of empty results, share of queries with very low score.
Online quality signals: CTR on suggestions, share of reformulations, average number of attempts before a click, abandonment rate.
Business signals: support tickets, chat complaints, rise in manual operations.

Agree on a baseline: without it alerts will either be silent or noisy.

Alerts: what should fire automatically

A good alert answers “do we need to wake the on-call now?” Keep them simple and tied to user impact.

Rules that often save migrations:

p95 latency increased by X% vs blue in the last 10–15 minutes.
Percent of empty results exceeded threshold and held for Y minutes.
5xx or timeouts exceeded threshold and retries increased.
Sudden memory usage spike or disk exhaustion on index nodes.
A business-stop: surge in tickets or cancellations beyond normal.
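
The "exceeded threshold and held for Y minutes" pattern can be sketched as a check for a run of consecutive breaching samples. The sketch assumes one sample per minute in time order; the function name and the sample values are illustrative.

```python
from typing import List, Tuple


def sustained_breach(samples: List[Tuple[int, float]], threshold: float, hold_minutes: int) -> bool:
    """True if the value stayed above threshold for hold_minutes consecutive samples.

    samples: (minute, value) pairs in time order, assumed to be one per minute.
    """
    run = 0
    for _, value in samples:
        run = run + 1 if value > threshold else 0
        if run >= hold_minutes:
            return True
    return False


# Share of empty results per minute on the new version (made-up values).
empty_share = [(0, 0.01), (1, 0.06), (2, 0.07), (3, 0.02), (4, 0.06), (5, 0.06), (6, 0.08)]
print(sustained_breach(empty_share, threshold=0.05, hold_minutes=3))
print(sustained_breach(empty_share, threshold=0.05, hold_minutes=4))
```

Requiring the breach to hold for several minutes filters out single spikes, which keeps the alert tied to lasting user impact rather than noise.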

Include context in the alert: model version, index version, region, traffic share and recent actions (switch, warm-up, rebuild).

Example: after enabling green on 20% traffic p95 latency didn’t change but empty results rose from 1% to 6%. This often points to an embedding pipeline issue (different normalization, wrong language, wrong encoder) or search filters. Such an alert should trigger: stop ramping, return users to blue or temporarily route part of queries to the old index.

Observability only works if someone acts on it. Assign metric owners for the migration window and prepare a short playbook: what to check first and who rolls back if degradation is confirmed.

Common mistakes and traps when updating embeddings

Small mismatches that only show after switching are frequent failure causes. When the goal is zero downtime, these errors hit staff hard: search stops finding needed documents, suggestions get odd, support load rises.

Technical mismatches that break search

First trap — vector dimensionality mismatch. The model may move from 768 to 1024 dims while index or client code still expects the old format. Best case: it fails loudly; worst case: it silently returns garbage.

Second — mismatch between metric and normalization. If you used cosine with normalized vectors before, but the new version switches to dot or L2 without adjusting parameters, quality can drop sharply without an obvious error. Different vector DBs also interpret “cosine” differently: sometimes it’s a separate metric, sometimes dot product + mandatory normalization.

Third — mixing embeddings of different versions in one index. While tempting, this makes nearest neighbors random. Similar problems occur when some documents are reindexed with the new model and others remain old in the same collection.
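
The metric and normalization trap is easy to reproduce with two-dimensional toy vectors. In the sketch below the vectors and document names are invented: the same query ranks documents differently under dot product and cosine, and the two agree again once vectors are L2-normalized.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: Vector, b: Vector) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def best_match(query: Vector, docs: Dict[str, Vector],
               score: Callable[[Vector, Vector], float]) -> str:
    """Return the id of the document with the highest score for the query."""
    return max(docs, key=lambda doc_id: score(query, docs[doc_id]))


query = [1.0, 0.0]
docs = {"short_exact": [1.0, 0.0], "long_offtopic": [3.0, 4.0]}
normalized_docs = {doc_id: l2_normalize(v) for doc_id, v in docs.items()}

print(best_match(query, docs, dot))             # the longer vector wins on raw dot product
print(best_match(query, docs, cosine))          # cosine ignores length
print(best_match(query, normalized_docs, dot))  # dot on normalized vectors agrees with cosine
```

Nothing fails in the first case: the search simply returns a different top result, which is why this mismatch tends to show up as a quality drop rather than an error.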

Planning and control mistakes

Underestimating reindexing time almost guarantees problems at peak load. Reindexing stresses CPU, disk and network and creates queues in the pipeline. If production queries run concurrently, you may see both slower responses and worse results, because timeouts cut searches short.

Rollback plans are often formal: “we’ll revert if needed.” You need concrete stop signals and actions. Examples:

quality fell below threshold (clicks, manual top-5 review)
p95 search latency doubled and held for 15 minutes
share of empty results or errors exceeded X%
load on DB/queue breached safe limits

Another trap — not checking clients that cache results or hold long connections. After switching some users may still operate on the old version for minutes, producing mixed behavior that’s hard to diagnose.

Practical example: you update search over internal instructions. The new model is better for long regulations but worse on short part numbers and product names. Staff will copy queries, open more pages and ask colleagues more often. You’ll notice this in behavior before average score metrics change.

To avoid these traps, lock compatibility (dimensionality, metric, normalization), don’t mix versions in one index, plan reindexing as a separate load and keep a quick rollback with clear stop thresholds.

Short checklist before switching

Before the final flip, ensure you control not only the model but everything around it. Most failures come from small things: different tokenizer, index built with wrong params, limits not ready, no fast rollback.

What must be ready before switching

Confirm versions are recorded and clear to all participants. In the release card (or a simple doc) list: model version, tokenizer version, text preprocessing scheme and index version. If embedding params changed (dimensionality, normalization, language), record that too.

Ensure the new index is fully built and validated on control data. This means more than "index built without errors": search results on a selected query set should look expected. Ideally have 50–200 reference queries and documents that reflect real staff work.

Verify the safe launch mode: canary or shadow. In canary a share of traffic goes to the new chain; in shadow the new chain gets query copies without affecting answers. In both cases a dashboard with clear metrics should let you spot a problem within 5 minutes.

What’s required for rollback

Rollback must be as simple as the switch. If returning requires "manually rebuilding the index" or "rolling back several services in the right order," it becomes a long operation.

Traffic switch is ready (quickly revert to the old version without a deploy)
Old index preserved and accessible (not deleted or overwritten)
Load limits set for new environment (QPS, timeouts, queue sizes)
Rollback criteria defined (error increase, quality drop, latency spike)

Test real staff scenarios

Last step — run real workflows: e.g., an employee searches last year’s contract by title, an engineer looks up a specific equipment instruction, an accountant searches for a template phrase. Test the exact sequences people use every day, not just "search in general."

If you operate in sectors requiring high reliability and 24/7 support (government, healthcare, finance), make this checklist mandatory.

Example scenario and next steps

Scenario: corporate search over documents and tickets

An internal portal lets staff search policies, contracts, instructions and support tickets. Search is embedding-based: the user types a query, the system finds semantically similar fragments and shows results. The team wants to move to a new model because the current one handles short queries like “password reset” or “procurement report” poorly.

Goal — perform the update so the service stays available and quality doesn’t collapse for typical queries.

Low-risk migration path

A practical route: parallel index + shadow, then canary. Old and new environments live side by side and are compared on real queries without affecting users until you’re confident.

Typical plan:

Bring up the new index alongside the old and start background recomputation of documents, versioning model and settings.
Enable shadow: the old index serves users while the new one receives the same queries in the background and logs results and metrics.
Pick 20–50 typical queries (by role/department) and record an "oracle": what counts as a good answer and which documents should be on top.
Run a canary: 1–5% traffic (or one user group) gets results from the new index while others remain on the old one.
After stable metrics, increase share and switch main traffic, keeping a fast rollback.

Pause between steps for checks. Speed matters less than quality control.

Catching problems early

Often global metrics look fine while relevance drops for several critical scenarios. For example, the new index may surface plausible-looking but wrong documents, or it may be worse on queries for support guidance. Shadow logs and manual runs of control queries before widening canary usually save the release.

The most telling signals: falling quality on typical queries, more clicks on irrelevant results, more reformulations and fewer "successful" sessions (finding needed info in 1–2 attempts).

Next steps

To make these updates repeatable:

Enforce mandatory versioning: model, embedding params, index schema and the control query set.
Automate reindexing and switching so rollback takes minutes, not hours.
Agree on canary stop conditions (which metrics and by how much they may worsen).
Allocate resources for parallel indexes and recomputation load in advance.

If the update is limited by infrastructure (resources for two environments, reliable storage, monitoring, 24/7 support), consider hiring a systems integrator. In Kazakhstan this is part of what GSE.kz (gse.kz) does: they have experience supplying locally produced S200 servers and building infrastructure for enterprise IT systems where downtime is unacceptable.