What goes wrong at platform boundaries

Performance issues almost never live in a single place. The user sees a simple picture: the system takes a long time to open, reports build slower than usual, everything is sluggish in the morning. But behind one complaint there are often several causes: the server lacks resources, the virtual environment has redistributed load, and the DBMS is waiting for disk or memory at the same time.

Because of this, disputes usually arise between teams. Some say the hardware is fine. Others don't see problems on the virtualization host. DBAs show that queries have become heavier or the database lacks IOPS. Each side may be right on its own, but there is still no overall picture.

Then the search for the cause quickly turns into a search for blame. While each team looks only at its own charts and logs, the conversation goes in circles. The business doesn't care exactly where the bottleneck is — it cares about one fact: the service is slow and resolution times are growing.

Most often such disputes drag on for three reasons: teams use different metrics, there was no agreed verification procedure in advance, and there is no shared standard for what counts as acceptable performance. Without those agreements, today people suspect the storage, tomorrow the hypervisor, then database settings. Each new check wastes time while the system remains under load.

This happens most often where servers, virtualization, the DBMS, and application software are managed by different contractors. That's why a vague phrase like "the integrator is responsible" is not enough. Agree on the steps: who confirms degradation, who collects data, and who decides what to check next.

This is particularly important for large deployments in the public sector, banks, education, and healthcare. In such projects infrastructure has multiple layers and downtime is costly. If rules aren't set in advance, even strong teams will argue longer than it takes to fix the root cause.

Which roles are needed from day one

If a project includes servers, virtualization, and a DBMS, responsibility for performance cannot be postponed. Otherwise, at the first slowdown each team will see only its own area and the business will get a long dispute instead of a resolution.

It's better to name roles by function, not by company names. Even if one contractor manages some infrastructure and another handles the software, responsibility boundaries should be clear without caveats.

At minimum you need five roles:

owner of servers and physical resources — responsible for CPUs, memory, disks, and network at the physical infrastructure level;
virtualization owner — responsible for the hypervisor, resource allocation, redundancy, and workload migration;
DBMS owner — responsible for database settings, indexes, query plans, and maintenance jobs;
business owner — confirms what performance is considered normal and which operations are critical;
incident coordinator — gathers facts, assigns steps, and prevents the discussion from fragmenting into separate threads.

The last role is especially important. Without it even strong specialists stall: the server team looks at the host, the virtualization admin checks limits and oversubscription, the DBA sees heavy queries, and the business repeats the same thing: "everything is slow in the morning." You need one person who assembles the whole picture.

It's important to record in advance who makes decisions and who only provides data. A DBA can suggest changing DBMS settings, but if that affects resilience or redundancy, the platform owner must approve the change. If a system integrator or server platform vendor participates in the project, that doesn't change the main rule: each layer must have a named person with a defined response time.

You can spot a mature project by a simple sign. To the question "who checks this first?" there is a fast, clear answer. Not "vendor," not "IT department," not "contractor," but a specific role, decision domain, and escalation order.

What to agree on before deployment

Before launch it's not enough to say the system should be fast. You must define in advance what is normal and what is a problem. Otherwise at the first failure each side will rely only on its own view.

Start by recording the usual load: how many users work concurrently, which operations they perform most often, how much data flows per day, when reports are generated, when exchanges occur, and when backups run. For servers, virtualization, and DBMS these are not details but the basis for resource planning.

Also describe slow performance as the user sees it. Not "low performance," but clear signs: a record opens in more than 3 seconds, a report takes longer than a minute, login takes over 20 seconds. Then the dispute will be about facts, not impressions.
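The signs above can be written down as data so that every team checks the same numbers. Below is a minimal sketch in Python. The operation names and the is_slow helper are hypothetical, and the thresholds are simply the example values from the paragraph above, not recommendations.

```python
# Thresholds in seconds, taken from the example values in the text.
# Operation names are hypothetical labels used for illustration.
SLOW_THRESHOLDS_S = {
    "open_record": 3.0,
    "build_report": 60.0,
    "login": 20.0,
}


def is_slow(operation: str, duration_s: float) -> bool:
    """Return True when a measured duration exceeds the agreed threshold."""
    return duration_s > SLOW_THRESHOLDS_S[operation]


if __name__ == "__main__":
    print(is_slow("open_record", 4.2))  # above the 3-second threshold
    print(is_slow("login", 12.0))       # within the 20-second threshold
```

Keeping the thresholds in one shared structure means a dispute about whether the system is slow can be settled by a measurement rather than by an impression.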

Specify peak hours and seasonal spikes separately. Morning logins, end-of-month processing, enrollment periods, tax season, mass document exports — all change the load. If such peaks aren't considered in advance, it's easy later to blame hardware, the hypervisor, or the database when the real issue was the usage scenario.

Another important item is a list of changes that cannot be made without agreement. Performance is affected not only by replacing servers. Quiet changes often create problems too: redistributing CPU and RAM between VMs, changing storage, altering DBMS parameters, updates, enabling new security checks, moving nightly jobs into business hours.

It's better to collect all this in a short working document. It should record baseline and peak load, measurable signs of slow performance, high-load windows, a list of changes requiring approval, response times, and the incident participant list.

For example, suppose the system runs on a server platform in a virtual environment and has a separate DB team. Without such a document, even a typical morning slowdown easily turns into a long exchange of messages. It's far more useful to agree in advance not only on architecture but also on rules for handling load, changes, and failures.

How to divide responsibility step by step

When a project includes servers, virtualization, DBMS, and a business application, disputes follow a familiar pattern: the service is already slow and each participant sees only its own layer. To avoid this, the verification chain must be defined in advance.

First you need a single overall service diagram. Not separate diagrams from different teams, but one unified map: user request, application, database, virtual machine, storage, network, and physical server. Even a simple one-page diagram often immediately reveals dependencies and weak points.

After that each layer should have an owner. Not an abstract "infrastructure," but a concrete team that takes the task, checks its part, and gives a clear answer.

The verification order should also be agreed beforehand:

First, confirm the degradation itself using common metrics, not chat complaints.
Then identify the area where latency is growing: application, database, VM, network, or disk.
After that each owner checks their layer using a pre-agreed set of indicators.
If the cause is not in their zone, they pass verified data and conclusions to the next team — not just "it's fine here."
The coordinator assembles the results into one picture and decides what to check next.
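One way to enforce the hand-off rule from the steps above is to require every owner to return a structured finding instead of a free-text reply. The sketch below is only an assumption about how that could look: LayerFinding, remaining_layers, and the layer names in CHECK_ORDER are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical check order; in practice it follows the project's service diagram.
CHECK_ORDER = ["application", "database", "vm", "network", "disk"]


@dataclass
class LayerFinding:
    layer: str
    owner: str
    conclusion: str
    evidence: list = field(default_factory=list)  # verified data behind the conclusion


def remaining_layers(findings):
    """Return layers that still need a check, in the agreed order.

    A finding without evidence is rejected, so a bare "it's fine here"
    cannot close a layer.
    """
    for finding in findings:
        if not finding.evidence:
            raise ValueError(f"{finding.layer}: conclusion has no supporting data")
    checked = {finding.layer for finding in findings}
    return [layer for layer in CHECK_ORDER if layer not in checked]
```

The coordinator can call remaining_layers after each reply to see what is still open, while the decision about what to check next stays with a person.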

Common metrics and thresholds are best kept in one document. These usually include response time, CPU usage, disk latency, memory consumption, execution time of key SQL queries, waits, and acceptable degradation windows. It's important not only to list these indicators but to agree in advance what is considered normal for the project.

This is especially important when infrastructure is composed of multiple layers and contractors. In real projects problems rarely live in a single place. Without a common diagram and a single verification procedure the person who responded last is often blamed.

Which metrics to include in the common document

If a team seriously wants to agree on performance ownership, the document must rely on measurable indicators. Otherwise at the first slowdown the dispute will restart: is the problem in the server, VM, network, or database?

The clearest level is user response time. Record how long is acceptable for login, opening a record, performing a common operation, and building a report. It's important to capture not only averages but also upper limits during peak hours.
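To see why averages alone are not enough, compare the mean with a high percentile over the peak window. This is a minimal sketch using only the Python standard library. The sample values are invented for illustration, and nearest-rank is just one of several common percentile definitions.

```python
import math
from statistics import mean


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Invented login times in seconds during a peak hour (illustration only).
login_times_s = [2.0, 2.5, 3.0, 2.0, 2.5, 3.0, 2.0, 2.5, 25.0, 30.0]

if __name__ == "__main__":
    print("mean:", mean(login_times_s))                         # pulled up only moderately
    print("p95: ", percentile_nearest_rank(login_times_s, 95))  # exposes the slow logins
```

In this invented sample most logins are fast, so the average looks tolerable, while the 95th percentile shows what the unluckiest users actually experienced.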

The next level is platform resources. For physical nodes and VMs, record CPU load, memory headroom, limit hits, and durations of such states. These data quickly show whether the system has enough compute power or if the problem lies elsewhere.

Describe disk subsystem and network latencies separately. Failures at platform boundaries often hide here: the database looks overloaded when actually it's waiting on disk or losing time due to network delays between nodes.

For virtualization it's useful to record VM parameters: number of vCPUs, RAM size, disk type and size, overcommit rules, redundancy, limits, and priorities. Without this it's impossible to discuss DBMS behavior fairly, because the same database on two VMs with different constraints will behave differently.

You don't need hundreds of DBMS indicators. A short set that all parties understand is enough: execution time of key queries, number of locks, queue lengths, buffer cache utilization, number of long-running operations, and windows for maintenance jobs.

For each metric it's useful to record five things: where it's measured, what tool collects it, what value is considered normal, who is responsible for checking it, and during which period the measurement is valid. If the system must handle mass morning logins, the document should describe that scenario specifically, not an average daily load.
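The five attributes listed above map naturally onto a small record type. The following is a sketch, not a prescribed schema: MetricSpec and its field names are hypothetical, and the values in the example entry are placeholders rather than recommended thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    name: str
    measured_at: str    # where it is measured
    tool: str           # what collects it
    normal_max: float   # upper bound of the value considered normal
    unit: str
    owner: str          # who is responsible for checking it
    valid_window: str   # period during which the measurement is valid

    def within_normal(self, value: float) -> bool:
        """Return True when a measured value is within the agreed norm."""
        return value <= self.normal_max


# Placeholder entry for illustration; every value here is an assumption.
disk_latency = MetricSpec(
    name="disk read latency",
    measured_at="database VM, data volume",
    tool="monitoring agent (placeholder)",
    normal_max=10.0,
    unit="ms",
    owner="storage team",
    valid_window="weekdays 09:00-11:00",
)
```

Tying the valid window to the entry keeps the morning-login scenario separate from the daily average.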

Example: the system is slow in the mornings

A typical scenario looks like this: from 9:00 to 11:00 employees log in en masse, open records, run searches, and print documents. Support gets identical complaints: pages load slowly, reports hang, sometimes a session stalls for minutes.

At first each team sees only its fragment. The virtualization team says the server is available. The DBA reports that the database didn't crash. Storage specialists notice increased load but can't immediately prove that it blocks users.

When metrics are viewed together the picture clarifies. The VM hosting the database hits resource limits during peak hours — it lacks guaranteed CPU and memory at the moment of the highest incoming traffic. At the same time storage latency increases, so each read and write takes longer. Against this background the DBMS starts queuing queries.

A schedule check reveals another cause: at 9:15 a heavy scheduled report starts automatically. It reads a large amount of data and competes for the same resources as normal user operations.

So the problem isn't in one point. It's a chain: morning user peak, VM limits, increased storage latency, and an inopportunely scheduled report.

Only a joint investigation helps. If the teams had looked separately, each would have found only part of the picture. The solution was also collective: the report was moved to a different time, VM limits and guarantees were reviewed, and acceptable storage latencies for peak hours were agreed and documented. Complaints stopped not because someone "fixed their area," but because the whole chain was addressed.
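The joint view in this example can be as simple as lining up threshold breaches from all layers on one timeline. The sketch below assumes each team exports its breaches as (time, layer) pairs. The records are invented to mirror the scenario, and overlapping_breaches is a hypothetical helper.

```python
from collections import defaultdict

# Invented breach records: (time "HH:MM", layer). Illustration only.
breaches = [
    ("09:05", "vm_cpu_limit"),
    ("09:15", "vm_cpu_limit"),
    ("09:15", "storage_latency"),
    ("09:15", "scheduled_report"),
    ("09:20", "storage_latency"),
    ("09:20", "db_query_queue"),
]


def overlapping_breaches(records, min_layers=2):
    """Group breaches by time and keep the moments where several layers breach together."""
    by_time = defaultdict(set)
    for time, layer in records:
        by_time[time].add(layer)
    return {
        time: sorted(layers)
        for time, layers in sorted(by_time.items())
        if len(layers) >= min_layers
    }


if __name__ == "__main__":
    for time, layers in overlapping_breaches(breaches).items():
        print(time, layers)
```

No single record in this list proves anything. The overlap at the same minute is what points to a chain rather than a single faulty layer.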

Common mistakes and traps

The most frequent mistake is simple: each team quickly says "everything is fine here." The virtualization admin doesn't see an overloaded host. The DB team shows queries are written correctly. The server vendor checks hardware and finds no faults. Yet for the user the system remains slow.

This is how responsibility for performance is often lost. The problem lives between layers: in resource distribution, I/O queues, memory settings, storage response times, or the schedule of background tasks.

Equally dangerous is not taking baseline measurements before launch. If nobody recorded how long login takes, how fast a report builds, and what morning load is normal, there's nothing to compare to later. Any postmortem becomes a dispute of opinions.
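With a recorded baseline, the comparison becomes mechanical. This is a minimal sketch in which the scenario names, the sample timings, and the 20 percent tolerance are all assumptions for illustration.

```python
def regression_report(baseline_s, current_s, max_increase_pct=20.0):
    """Return scenarios whose response time grew by more than the allowed percentage."""
    report = {}
    for scenario, base in baseline_s.items():
        now = current_s.get(scenario)
        if now is None:
            continue  # scenario was not measured this time
        increase_pct = (now - base) / base * 100
        if increase_pct > max_increase_pct:
            report[scenario] = round(increase_pct, 1)
    return report


# Invented timings in seconds (illustration only).
baseline = {"login": 4.0, "open_record": 1.0, "build_report": 40.0}
current = {"login": 10.0, "open_record": 1.1, "build_report": 44.0}

if __name__ == "__main__":
    print(regression_report(baseline, current))
```

Without the baseline dictionary there is nothing to pass as the first argument, which is exactly the situation the paragraph above warns about.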

Another trap is changing several parameters at once. If one day you increase VM memory, rebuild database indexes, and move some data to another storage, the system may indeed behave differently. But it will be impossible to know which change helped.
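A change log checked against measurement times makes this trap visible. The sketch below flags any interval between two consecutive measurements that contains more than one change. The dates and the ambiguous_intervals helper are hypothetical, and the change descriptions are borrowed from the paragraph above.

```python
from datetime import datetime


def ambiguous_intervals(measurements, changes):
    """Return (start, end, change_names) for intervals that contain more than one change."""
    result = []
    ordered = sorted(measurements)
    for start, end in zip(ordered, ordered[1:]):
        inside = [name for when, name in changes if start < when <= end]
        if len(inside) > 1:
            result.append((start, end, inside))
    return result


# Invented dates (illustration only).
measurements = [
    datetime(2024, 1, 8, 9, 0),
    datetime(2024, 1, 9, 9, 0),
    datetime(2024, 1, 10, 9, 0),
]
changes = [
    (datetime(2024, 1, 8, 20, 0), "increase VM memory"),
    (datetime(2024, 1, 8, 22, 0), "rebuild database indexes"),
    (datetime(2024, 1, 9, 21, 0), "move data to another storage"),
]
```

An interval returned by this function is one where the effect of each change cannot be separated from the others.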

People also often forget who has the final decision. When there's no person or group to approve the incident conclusion, the discussion drags for weeks. One contractor proposes adding CPU, another suggests DBMS tuning, a third advises waiting for a platform update.

The most expensive failures come from a few typical mistakes: lack of baseline metrics, checking layers separately, changing multiple settings at once, vague acceptance criteria, and absence of a decision-maker.

Acceptance suffers too. Phrases like "it should be fast" or "without delays" don't help in a disputed meeting. Only numbers help: how many seconds to open a form, how many users the system must handle, and what increase in response time is acceptable.

A simple test is useful: can a new person open the document and immediately understand who checks the server, who checks virtualization, who checks the DBMS, and who decides in a dispute? If not, the problem was built in before launch.

Short checklist before launch

Before launch it's important to check not only that the system works, but also that you will later be able to find the cause of a slowdown quickly. A short list is enough:

one incident owner is assigned to gather facts and maintain overall status;
a list of metrics and thresholds with specific numbers (not vague phrases) is agreed;
baseline measurements for typical scenarios are taken before go-live;
escalation order across layers is documented;
a re-check scenario after each significant change is defined.

This checklist usually fits on one page but saves many hours during the first serious incident. If the system starts slowing every morning at shift start, the team will already have a common reference: who leads the incident, which numbers to look at, and in what order to check virtualization and the DBMS.

What to do next

The next step is simple: move the conversation from general words to concrete agreements. If the team already understands that a problem can arise between the server, virtualization, and DBMS, write down who checks what first, second, and third.

Often a short meeting of 30 to 40 minutes is enough. Invite only those who truly affect the outcome: the system owner, virtualization admin, DBMS specialist, server engineer, and an application representative. There is one goal: remove vague language and agree on a course of action.

After the meeting compile a single working document. Not a long regulation of dozens of pages, but a clear table: which metrics are basic, who collects them, where they are stored, who starts incident analysis, when other teams join, and who confirms the problem is resolved.

It's also useful to plan a load test before go-live. It helps reveal weak points in advance while it's still easy to change settings, resources, or service placement. Equally important is to agree who captures CPU, memory, disk latency, queue, and response time metrics for the database and application at the moment of failure. If this step isn't assigned in advance, the first hours after a problem are lost along with the most important evidence.

A good reference looks like this: at 9:00 the system slowed, by 9:10 metrics were gathered, by 9:20 the primary check owner was identified, and by 10:00 there was a common decision on the next action. Such an order sharply reduces disputes between teams.
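The reference timeline above can be turned into a check that runs after each incident. In this sketch the offsets come from the times in the paragraph, while the milestone names, the sample incident, and the late_milestones helper are hypothetical.

```python
from datetime import datetime, timedelta

# Target offsets from incident start, derived from the reference timeline in the text.
TARGETS = {
    "metrics_gathered": timedelta(minutes=10),
    "check_owner_identified": timedelta(minutes=20),
    "next_action_decided": timedelta(minutes=60),
}


def late_milestones(started, reached):
    """Return milestones that were missed or reached later than the agreed offset."""
    late = []
    for name, limit in TARGETS.items():
        when = reached.get(name)
        if when is None or when - started > limit:
            late.append(name)
    return late


# Invented incident record (illustration only).
started = datetime(2024, 1, 8, 9, 0)
reached = {
    "metrics_gathered": datetime(2024, 1, 8, 9, 8),
    "check_owner_identified": datetime(2024, 1, 8, 9, 35),
}
```

A review of which milestones slipped shows where the agreed procedure needs tightening, without turning the review into a search for blame.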

If a project includes servers, integration, and several technological layers, involve a party that sees the whole picture. In projects involving GSE.kz this is especially appropriate during preparation: as a server hardware vendor and system integrator, the company can help split responsibilities between infrastructure, virtualization, and software. Then responsibility boundaries will be clear before launch.

The earlier this is done, the lower the chance that the first serious load will turn into a search for blame instead of a quick root-cause analysis.