Apr 17, 2025·8 min

Zero-downtime database migration: replication, switchover, rollback

Zero-downtime database migration plan: choosing replication, preparation, performance tests, safe switchover and a clear rollback plan.

Goal and success criteria: what counts as downtime

Shutting down a database usually hurts the business more than you think. Sales and requests are lost, employees can’t work with CRM/ERP, tills and warehouses freeze, and customers see errors. Even if the main application restarts quickly, reports, accounting exchanges, payment integrations and external services often break. Before migration, agree on what exactly counts as downtime and what you’re willing to tolerate.

“Zero-downtime database migration” in practice does not mean “nothing changes,” but rather “users continue to perform critical operations, and the risks of data loss and recovery time are limited in advance.” Small degradations may be acceptable: reports run slower, background jobs pause temporarily, data marts update with delay. But it’s best to list what must never be turned off (for example, payment acceptance, order checkout, patient records, authentication).

Before starting the migration, list the systems affected: the main application and API, reports and BI, integrations and queues (for example, 1C exchanges, ESB, external partners), background jobs, admin panels and monitoring.

Describe success criteria using RPO and RTO in plain language:

  • RPO (Recovery Point Objective) — how much data you can afford to lose in the worst case. For example: “no more than 30 seconds of changes” or “no data loss allowed.”
  • RTO (Recovery Time Objective) — how quickly the system must return to normal operation after a switchover or failure. For example: “all key operations available within 5 minutes.”

If these figures are agreed with the process owner and IT, it becomes easier to choose a replication method and verification windows, and to honestly assess whether a switchover can be done without noticeable impact on users.
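
As a rough illustration, agreed targets can be written down as data and checked against measurements. This is a minimal sketch: the class and function names are hypothetical, and it assumes replication lag is an acceptable stand-in for worst-case data loss under asynchronous replication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTargets:
    """Agreed targets; field names are illustrative, not a standard."""
    rpo_seconds: float  # maximum tolerated data loss, in seconds of changes
    rto_seconds: float  # maximum tolerated time until key operations work


def meets_targets(targets, replication_lag_seconds, switchover_seconds):
    """True when the observed lag and switchover time both fit the targets."""
    return (replication_lag_seconds <= targets.rpo_seconds
            and switchover_seconds <= targets.rto_seconds)


# The example figures from the text: 30 seconds of changes, 5 minutes to recover.
targets = RecoveryTargets(rpo_seconds=30, rto_seconds=5 * 60)
print(meets_targets(targets, replication_lag_seconds=4, switchover_seconds=180))
print(meets_targets(targets, replication_lag_seconds=45, switchover_seconds=180))
```

The second call fails only because the 45-second lag exceeds the 30-second RPO, which is the kind of result that should block a switchover rather than be discussed on the day.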

Inventory: what to collect before starting

A zero-downtime database migration starts not with replication but with exact baseline data. If you don’t know what you’re moving and how it’s used, you cannot estimate risks or choose the switchover method.

Collect a list of databases and their parameters: engine and version, on-disk size, number of tables, large objects (indexes, partitions, LOBs), and growth rates for the last 3–6 months. Note separately what grows fastest: data, indexes, transaction logs, temp files.

Record the load profile in numbers, not generalities. Usually you need peak RPS/QPS, read/write ratio, the heaviest queries, average latency and the 95th percentile. If a system runs 24/7, you may find there is no “quiet” time, which affects both the switchover moment and replication settings.
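
To show why the 95th percentile matters alongside the average, here is a small sketch that summarizes raw latency samples. The sample values are invented, and the nearest-rank percentile is one common definition among several, so monitoring tools may report slightly different numbers.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latencies(samples_ms):
    """Summarize raw latency samples (milliseconds) for the load profile."""
    return {
        "avg_ms": mean(samples_ms),
        "p95_ms": percentile(samples_ms, 95),
        "max_ms": max(samples_ms),
    }


# Hypothetical samples: mostly fast queries plus a slow tail.
samples = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]
print(summarize_latencies(samples))
```

With these samples the average is about 126 ms while p95 is 900 ms: the average alone would hide the slow tail that users actually notice.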

Next — dependencies. A database rarely stands alone: applications, background workers, queues, ETL, reports, and external integrations. It’s useful to create a short registry so you don’t forget an invisible consumer that will break on migration day.

A minimum document should include:

  • databases and versions, sizes, growth, disk requirements;
  • load peaks and critical queries;
  • dependencies and connection points (connection strings, service accounts);
  • backups: types, schedule, storage location, recovery time;
  • constraints: change windows, access, regulatory requirements and audit.
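
The dependency part of such a document can be kept as a small machine-checkable registry. The sketch below is an assumption about how one might structure it; every name and entry is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    name: str
    kind: str           # e.g. "application", "report", "integration", "scheduler"
    connection: str     # where its connection string or DNS name is configured
    owner: str = ""     # person or team to contact on switchover day
    writes: bool = False  # True if it modifies data in the database


def missing_owners(registry):
    """Names of dependencies nobody is responsible for yet."""
    return [d.name for d in registry if not d.owner.strip()]


def writers(registry):
    """Names of dependencies that must be paused before writes are moved."""
    return [d.name for d in registry if d.writes]


# Hypothetical registry entries.
registry = [
    Dependency("crm-api", "application", "env var DB_URL", "backend team", True),
    Dependency("hourly-sales-report", "report", "BI data source settings"),
    Dependency("accounting-exchange", "integration", "exchange config", "finance IT", True),
]

print("No owner:", missing_owners(registry))
print("Writers:", writers(registry))
```

Listing writers separately is a deliberate choice: they are the consumers that can cause double writes or silent writes to the old database during the switch.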

Separately, check current failure points and whether backups are actually usable. A simple test: pick one database, restore it to a separate test environment and measure how long it takes.

Finally, clarify organizational constraints: who grants production access, who approves changes, how long certificate issuance or port openings take. These small details often shift schedules.

Choosing the approach: replication and switchover strategy

To achieve a zero-downtime migration, first choose the replication type and understand how the switchover will occur. Mistakes here often lead to long write freezes or unexpected incompatibilities.

Physical replication copies the database at the file and transaction-log level. It’s usually simpler and faster, suitable for large volumes and standard high-availability scenarios, but often requires similar DBMS versions and the same storage type.

Logical replication applies changes at the table/operation level. It’s more flexible for version upgrades, partial migrations or schema changes, but harder to configure and more sensitive to edge cases (triggers, sequences, DDL).

Replication can be uni-directional or bi-directional. Uni-directional is simpler and safer for migration: one source of truth and one target copy. Bi-directional helps when you need to write to both sides temporarily, but it greatly complicates conflict handling. A separate pattern is reading from replicas: the application reads from replicas but writes to the primary. This reduces load on the source, but during migration you must account for replication lag.

Before choosing an approach, answer practical questions:

  • Do you need to change DBMS version or storage type (local disks, SAN, different SSD classes)?
  • Are there heavy writes (batches, reports) that spike the transaction log?
  • Is read lag on replicas (seconds) acceptable for users?
  • Can you ensure identical configuration, including encodings, collation and timezone settings?

Network often becomes a hidden bottleneck. Stable replication requires predictable latency, enough bandwidth and encrypted traffic. If the move is between sites, measure RTT in advance, check MTU, set priorities and ensure the maintenance window isn’t eaten up by retries.
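
A back-of-the-envelope estimate helps to see whether the link is a bottleneck before any replication is configured. The sketch below is simple arithmetic under stated assumptions: the function name is hypothetical, and the 0.7 efficiency factor is a guess that should be replaced with a measured value.

```python
def initial_sync_hours(db_size_gb, link_mbit_s, efficiency=0.7):
    """Estimate hours to copy db_size_gb over a link of link_mbit_s.

    efficiency is the assumed usable share of the link after protocol
    overhead, encryption and competing traffic.
    """
    if db_size_gb < 0 or link_mbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, bandwidth > 0, efficiency in (0, 1]")
    size_megabits = db_size_gb * 1000 * 8  # decimal gigabytes to megabits
    seconds = size_megabits / (link_mbit_s * efficiency)
    return seconds / 3600


# Hypothetical figures: a 2 TB database over a 1 Gbit/s inter-site link.
print(f"{initial_sync_hours(2000, 1000):.1f} h")
```

Under these assumptions the initial copy takes about 6.3 hours, so it cannot fit into a short maintenance window and has to run in the background while production keeps working.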

Assess switchover complexity by details:

  • how many applications and integrations hold direct DB connections, and can their parameters be changed quickly;
  • whether background jobs and queues must not run twice;
  • how quickly you can stop writes and wait for the replica to catch up (RPO close to zero);
  • whether you can do a trial switchover in a test environment with similar load.

In practice: if you deploy a new site and also upgrade the DBMS version, teams often choose logical replication or a hybrid approach. If versions match and speed matters, physical replication usually gives the shortest, clearest switchover.

Step-by-step migration plan: from prep to switchover readiness

Start with the target environment. It must handle the same load as the current DB: CPU, memory, disk (IOPS and latency), network, backup and monitoring. Also check access rights: application accounts, secret access, firewall rules, admin roles and who has permission to perform the final switchover.

Next, set up replication and perform the initial synchronization. Decide in advance which mode is needed: asynchronous (usually easier over distance or under high load) or synchronous (lower risk of data loss but higher network requirements). During the initial copy, watch for filling disks with logs or degrading primary performance.
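
The risk of filling disks with logs during the initial copy can be estimated in advance. This sketch assumes the worst case, in which the source cannot recycle any transaction logs until the copy finishes; the function names and the 20% reserve are illustrative assumptions, not DBMS settings.

```python
def log_backlog_gb(write_rate_mb_s, copy_hours):
    """Logs the source must retain if none are recycled during the copy."""
    return write_rate_mb_s * 3600 * copy_hours / 1000


def disk_is_safe(free_gb, backlog_gb, reserve_fraction=0.2):
    """True if the backlog fits while keeping a reserve of free space."""
    return backlog_gb <= free_gb * (1 - reserve_fraction)


# Hypothetical figures: 5 MB/s of log generation during a 6-hour copy.
backlog = log_backlog_gb(write_rate_mb_s=5, copy_hours=6)
print(backlog, disk_is_safe(free_gb=150, backlog_gb=backlog))
```

Here the backlog is 108 GB, which fits into 150 GB of free space with the reserve but would not fit into 120 GB.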

When replication catches up, move to parallel runs. The ideal is to let the new DB take a “shadow” load: move some reads, replay typical operations through a test harness, or, if possible, duplicate writes at the application level. The goal is to catch index, locking, memory and response-time issues.

Before switchover readiness, close consistency questions. Check schema, extension versions, collation/timezone settings, all necessary jobs, triggers and permissions. For data, use checksums on key tables and selective business metric comparisons (for example, stock balances, order statuses, account balances).
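
One way to compare key tables is an order-independent checksum over their rows. The following is a minimal sketch, not a replacement for DBMS-specific comparison tools: it assumes rows are fetched as tuples and that both drivers return the same Python types for the same values (for example, not `1` on one side and `Decimal('1')` on the other).

```python
import hashlib


def table_checksum(rows):
    """Return (row_count, hex_digest) that does not depend on row order.

    Each row is hashed separately, the row hashes are sorted, and the
    sorted sequence is hashed again, so duplicates still affect the result.
    """
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        for row in rows
    )
    combined = hashlib.sha256()
    for row_hash in row_hashes:
        combined.update(row_hash)
    return len(row_hashes), combined.hexdigest()


# Hypothetical rows from an "orders" table on the source and the target.
source_rows = [(1, "paid", 100), (2, "new", 250), (3, "shipped", 75)]
target_rows = [(3, "shipped", 75), (1, "paid", 100), (2, "new", 250)]
print(table_checksum(source_rows) == table_checksum(target_rows))
```

For large tables it is usually more practical to compute checksums per key range or per partition, so that a mismatch points to a small slice of data rather than to the whole table.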

By this stage you should have a short role plan:

  • who runs the command channel and records event times;
  • who is responsible for replication and lag;
  • who checks the application and key scenarios;
  • who decides go/no-go and announces the switchover start;
  • who monitors metrics and errors.

If the infrastructure runs 24/7, agree on a “heightened attention window” and a contact list (including support and the product owner). This reduces the risk of delays on the day of the switch.

Pre-switchover tests: functionality and performance

Before the switchover you must prove two things: the new database behaves like the old one, and the system won’t become slower. With a zero-downtime goal there is almost no time to investigate problems after the switch.

Functional tests: what must pass

Start with the most frequent and highest-risk operations. Test what actually hits the business: user login, creating and modifying key documents, closing a period, printing/exporting, background jobs.

A good practice is to run the same sequence of actions on the old DB and on the replica, compare results (numbers, statuses, amounts) and record discrepancies. If you have reports, verify at least 2–3 reference reports on the same data slice.

Performance tests: metrics and thresholds

Measure performance with clear metrics, not impressions. Agree on thresholds that would block the switchover:

  • response time for 5–10 critical queries and operations;
  • p95 response time during peak (for example, from application logs);
  • CPU and disk load on the target DB server;
  • growth of queues, locks or waits (if you measure them);
  • network throughput between app and DB.

Also verify replication lag. If lag grows under load, you risk data loss on failure or during a hard switchover.
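
Blocking thresholds are easier to enforce when the check is mechanical. A minimal sketch, assuming every metric is “lower is better”; the metric names and limits are hypothetical examples.

```python
def evaluate_thresholds(measured, limits):
    """Return (metric, measured_value, limit) for each breached limit.

    A metric with no measurement counts as a breach: an unmeasured
    threshold should block the switchover just like a failed one.
    """
    breaches = []
    for metric, limit in limits.items():
        value = measured.get(metric)
        if value is None or value > limit:
            breaches.append((metric, value, limit))
    return breaches


# Hypothetical limits and measurements.
limits = {"login_p95_ms": 2000, "report_p95_ms": 30000, "replication_lag_s": 5}
measured = {"login_p95_ms": 850, "report_p95_ms": 41000}

for metric, value, limit in evaluate_thresholds(measured, limits):
    print(f"BLOCKER {metric}: measured={value} limit={limit}")
```

In this example the report is too slow and replication lag was never measured, so both appear as blockers.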

Run a short fault-tolerance test: simulate a node failure (or service stop) and check how quickly replication recovers and what the application sees. Preferably do this in a test environment that resembles production.

Finally, record results and sign readiness:

  • list of tests and actual numbers;
  • agreed thresholds and their status;
  • current replication lag and estimated RPO/RTO;
  • known limitations (what was not tested and why);
  • the decision “ok to switch” and who owns it.

Switchover without downtime: procedure and control

The switchover is a brief moment when the main workload starts running on the new database. To avoid surprises, lock down success criteria and what level of change is acceptable during the switch.

Most important is to introduce a controlled freeze. This is usually not a full ban but a managed window: temporarily stop risky operations (bulk loads, heavy reports, schema migrations), freeze application versions and forbid manual admin edits.

Switchover sequence

The logic is simple: protect writes first, then move reads, and only then restore background work.

  • Put applications into a controlled mode: stop background jobs, queues and integrations that write to the DB.
  • Let replication catch up to zero lag and record the moment (time, LSN/GTID, log number — whatever your DBMS uses).
  • Move writes to the new DB (or promote the new primary) and verify new transactions appear only there.
  • Switch reads: read-only services, reports, cache warming, then user queries.
  • Re-enable background jobs one by one, starting with the safest, and monitor load increases.
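
The “wait for zero lag” step in the sequence above can be sketched as a polling loop with a timeout. How lag is actually read depends on the DBMS, so `get_lag_seconds` is a hypothetical callable you would supply; the default timeout and interval are placeholders.

```python
import time


def wait_for_catch_up(get_lag_seconds, max_lag=0.0, timeout=120.0,
                      poll_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll replication lag until it is <= max_lag or the timeout expires.

    Returns True if the replica caught up, False on timeout. A False result
    means the switchover should not proceed to moving writes.
    """
    deadline = clock() + timeout
    while True:
        if get_lag_seconds() <= max_lag:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


# Simulated lag readings instead of a real database query.
lags = iter([3.2, 1.1, 0.0])
print(wait_for_catch_up(lambda: next(lags), poll_interval=0))
```

The timeout matters as much as the loop: if writes are already paused and the replica does not catch up in time, the safer choice is usually to resume writes on the source and investigate rather than keep users waiting.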

Connection reconfiguration should be predictable: DNS/virtual IP, a single connection string, identical accounts and permissions. If you use connection pools, set short TTLs and ensure the app reconnects without manual restarts.
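
Reconnecting without a manual restart usually comes down to retrying with a growing delay. The sketch below is driver-agnostic: `connect` is any hypothetical callable that raises `ConnectionError` on failure, and real drivers may raise their own exception types instead.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call connect() up to `attempts` times with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Simulated endpoint that fails twice while DNS or the virtual IP moves.
attempts_seen = []


def flaky_connect():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise ConnectionError("old endpoint closed the connection")
    return "connected"


print(connect_with_retry(flaky_connect, base_delay=0))
```

Many connection pools and drivers already provide similar behavior through configuration, so check those settings first and treat this only as an illustration of the expected behavior.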

First-hours monitoring

In the first 1–2 hours check metrics every 15 minutes and log them in the shift journal.

  • Application errors: increase in 5xx, timeouts, transaction or lock errors.
  • DB performance: p95/p99 latencies, queue lengths, CPU/IO, lock contention.
  • Data integrity: counts of key entities, freshness of critical tables, correctness of background processing.
  • Replication/backups: status of any reverse replication (if enabled), backup readiness.

Consider the migration complete when load is stable, errors have returned to baseline, key business operations succeed, and the source is either marked as a standby or clearly isolated from writes.

Rollback plan: what to do if something goes wrong

Rollback is not a failure but a pre-agreed way to quickly return the system to stability. A good rollback plan works only when triggers, an owner and a clear path back are defined in advance.

Define rollback triggers before the switchover; don’t decide them on the fly. Typical triggers are measurable: growth of application errors, exceeded latency thresholds, data divergence between source and replica, inability to complete a key operation (login, payment, order creation), or failed backups on the new side.
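
Measurable triggers can be written down as explicit rules before the switchover. This is a sketch under assumptions: the metric names are hypothetical and the “3x baseline” factor is an arbitrary example, not a recommendation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class RollbackTrigger:
    name: str
    fired: Callable[[Dict[str, float]], bool]


# Example rules mirroring the trigger types listed in the text.
TRIGGERS = [
    RollbackTrigger("error rate above 3x baseline",
                    lambda m: m["error_rate"] > 3 * m["baseline_error_rate"]),
    RollbackTrigger("p95 latency above limit",
                    lambda m: m["p95_ms"] > m["p95_limit_ms"]),
    RollbackTrigger("key operation failing",
                    lambda m: m["failed_key_operations"] > 0),
    RollbackTrigger("data divergence detected",
                    lambda m: m["mismatched_tables"] > 0),
]


def fired_triggers(metrics, triggers=TRIGGERS):
    """Names of all triggers that fire for the current metrics."""
    return [t.name for t in triggers if t.fired(metrics)]


# Hypothetical metrics 20 minutes after the switch.
metrics = {
    "error_rate": 0.08, "baseline_error_rate": 0.01,
    "p95_ms": 1400, "p95_limit_ms": 2000,
    "failed_key_operations": 0, "mismatched_tables": 0,
}
print(fired_triggers(metrics))
```

The code only reports which triggers fired; the rollback decision itself still belongs to the person named in the plan.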

The reverse switchover should be as carefully planned as the main one. Usually it means switching the write endpoint back (app config, load balancer, DNS, a feature flag) and immediately blocking writes to the new DB to avoid double writes.

If rollback is necessary, the main question is what to do with changes already written to the new DB. Basic options:

  • temporarily freeze critical operations (e.g. document creation) to stop divergence growth;
  • save changes from the new DB into a separate journal (table, export, message queue) for later replay;
  • record a point-in-time/LSN and preserve logs so events can be restored carefully later.

One responsible person (the on-duty shift lead) should make the rollback decision. Record time, reason, current switchover point, what has been done and what the team is forbidden to do (for example, run schema migrations).

After rollback, perform a correctness check. At minimum: the app writes to the old DB again, key user scenarios pass, background jobs run, metrics return to normal, and changes during the switch window are either absent or stored in a journal for later transfer.

Common mistakes and pitfalls in DB migrations

Even with replication configured and a seemingly simple switchover, problems often arise around the database. In zero-downtime moves, a small oversight or “we’ll check later” frequently becomes an incident.

Common mistakes:

  • Ignoring dependencies: queues, schedulers, reports, external integrations, cache, file stores. The result: “the app works” but payments fail or reports don’t build.
  • Switching without measuring replication lag. If lag is 30–60 seconds at the moment of switching, you lose recent writes or get inconsistencies.
  • Testing informally but not under peak load. The new server may be fast in the morning but time out at 11:00 under real load.
  • Checking access too late: application user rights, subnet access, certificates, firewall rules, DNS, encryption settings. Last-hour fixes are the riskiest.
  • Not assigning owners or communication procedures. When things go wrong, it’s unclear who stops the switchover, who rolls back, and who notifies the business.

A good way to catch these traps is a short D-day rehearsal in a test environment: measure replication lag, run typical user operations and simulate a peak (parallel queries, reports, nightly jobs). Even if infrastructure is supplied by a vendor, you must still verify the network, access and the real load profile.

To reduce risk, follow simple rules: list dependencies and owners, record replication metrics and an acceptable maximum before switching, run a load test on the target environment, and verify access and certificates several days in advance, not on the day.

Short checklist for switchover day

On switchover day time goes fast and errors usually come from misunderstandings rather than technology. This checklist helps ensure a calm zero-downtime migration.

First, confirm business expectations. RPO and RTO should be concrete numbers and agreed in writing: how much data can be lost and how long the system can run in switch mode.

Then check the organizational side: every dependency must have an owner and a way to contact them. Often a “minor” report that hits the old DB hourly breaks the whole plan.

Before the day, mark these items as done:

  • RPO and RTO agreed, maintenance window and communication rules set;
  • dependency list (apps, integrations, reports, schedulers) compiled, owners assigned and available on switchover day;
  • monitoring and alerts enabled in advance on both sides: replication lag, errors, load, free space, response time;
  • performance tests run on the target side, thresholds fixed (for example, “login within 2 seconds”, “report within 30 seconds”);
  • rollback plan documented and rehearsed in a test environment.

A final check is a short dry run of the scenario for 10–15 minutes: the team reads the sequence and control points aloud and confirms who verifies that the replica has caught up, who switches the app, who watches post-switch metrics and who communicates with users.

Example scenario: migrating a 24/7 system

Imagine a company whose CRM and billing operate round-the-clock. Sales and support requests arrive during the day, and accounting operations and payments peak in the evening. You can’t stop the DB even for 15 minutes: errors in integrations and growing job queues follow.

The team chooses replication: it stands up a new DB on the target site and configures continuous change streaming from the current DB. Parallel testing runs while users keep working on the old system. The switchover is scheduled overnight when load is minimal but the team is on call.

How to test the copy without affecting users

Run tests against the new DB copy (or a read-only replica) to avoid extra load on the production database. Use a fresh data slice and run typical sequences: find a client, create a deal, issue an invoice, process payment, export reports.

To reduce surprises:

  • run load jobs in low-activity windows;
  • check the heaviest reports and background jobs;
  • compare results of key queries: response time and lock counts;
  • test integrations: payment gateway, telephony, email, accounting exchanges.

Decision to switch based on metrics

Switch decisions are data-driven: for example, replication lag is within a few seconds, critical operation response times are no worse than current, no errors in logs, and a night load test shows capacity headroom in CPU, disk and network.

Just before switching, pause writes briefly (usually minutes): temporarily block data-changing operations, wait for replica catch-up and record control metrics (row counts in key tables, checksums, queue statuses).

Rollback is equally concrete: if errors increase or speed drops sharply, point the application back to the old DB and stop writes to the new one. With good preparation this may take 10–20 minutes: revert connection configs, ensure users are writing to the old DB again, and record which data made it into the new system.

Next steps: preparing the project and infrastructure

A zero-downtime migration often hinges on coordinated actions. The business owner must confirm which operations are allowed on switchover day (for example, temporarily banning bulk loads) and which metrics constitute success.

Form a working group and agree rules of communication during the work: one channel, one decision owner, record changes as they occur. Typically you need a DBA, infrastructure engineer, developers, ops/support and a business owner.

Prepare a single shared document that lives before and after the switchover: minute-by-minute schedule, control points, “ok to switch” criteria, metrics (replication lag, error counts, p95 response time) and a rollback plan with thresholds. If rollback is harder than switching, the plan is still unfinished.

Do the first run as a rehearsal in a test environment: go through the full scenario including changing app configs, verifying permissions, warming caches and returning to the original state.

Choose infrastructure for target load and projected growth over 12–24 months: CPU, memory, disks sized for IOPS and latency, network, separate channels for replication, backup and restore. If you’re building a new site for critical services, it’s often convenient to source both hardware and integration from the same provider, for example via GSE.kz (gse.kz), which in Kazakhstan supplies servers and provides 24/7 integration and support services.

FAQ

What does “zero-downtime database migration” actually mean in practice?

By “zero-downtime” people usually mean that users can continue to perform critical operations and that acceptable data loss and recovery time are agreed in advance. Minor degradations like slower reports or a delay in data marts are sometimes acceptable if negotiated beforehand.

How do we agree what counts as downtime and what can be temporarily degraded?

Start by listing critical operations that must not be stopped and, separately, what can be temporarily limited. Then record these as success criteria together with the business owner so there’s no debate on the day of the switchover about whether the system is “working.”

How can I explain RPO and RTO simply to the team and the business?

RPO is how much data you are willing to lose in the worst case, e.g. “no more than 30 seconds of changes.” RTO is how quickly the service must be back to normal after a switchover or failure, e.g. “key operations available within 5 minutes.” These two numbers directly affect the replication choice and how strict the switchover must be.

What data should we collect before migration so the plan doesn’t fail?

At minimum, understand what you are moving and how it’s used: DBMS versions, sizes and growth, load profile, dependencies, backups and constraints on changes. Missing even one hidden consumer like a report or background job usually breaks the plan on the migration day.

When should we choose physical replication and when logical?

Physical replication is usually faster and simpler when versions and environment are similar and you need fast catch-up for large volumes. Logical replication is preferable when changing DBMS versions, moving only parts of data, or changing schema—it’s more flexible but requires more careful handling of triggers, sequences and DDL.

Why does the network often become the main limitation for replication?

The network affects replication lag and switchover predictability: unstable RTT and packet loss turn into lag and retransmits. Before starting, measure latency and bandwidth, check MTU, and ensure encryption and security rules won’t unexpectedly throttle traffic.

What tests should we run before switchover to avoid risking production?

To prove readiness, compare results of key operations on the old and target systems and set numeric thresholds for response times and errors. Also test replication lag under load: “everything is fast in the morning” doesn’t guarantee stability at peak.

What is the safe sequence of actions for a zero-downtime switchover?

Usually you introduce a controlled freeze of risky operations, wait for zero lag and record a checkpoint, switch writes to the new database and then switch reads and background jobs. The most common failure is background jobs or integrations writing to both sides or silently continuing to write to the old DB.

When should we roll back and how to avoid making things worse?

Rollback triggers must be measurable and agreed in advance, e.g. growth of errors, exceeded latency thresholds, data divergence or inability to perform a key operation. On rollback, quickly return the write endpoint and immediately block writes to the new DB to avoid double writes and complex reconciliation.

Who should we contact to prepare infrastructure and run a turnkey migration?

If you also build a new site, you usually need servers, storage, network, monitoring and support, plus people to rehearse and execute the minute-by-minute plan. In Kazakhstan such projects are often handled by a system integrator that can supply servers, assist with infrastructure and provide 24/7 support, for example GSE.kz.