Event-driven integration between systems: queues and idempotency
Event-driven system integration: how to choose a message queue, make processing idempotent, and handle redelivery calmly.

What practical issues arise with events
Event-driven integration between systems looks simple on a diagram: one system did something and sent an event, another received and processed it.
In reality, an unpleasant fact almost always appears: the same event can arrive twice, sometimes three times. For message queues and distributed systems this is normal.
Duplicates happen not because someone "did it badly", but because the system tries to be reliable. The sender may not receive delivery confirmation due to a short network glitch. The broker may redeliver after a restart. The consumer might crash after processing but before committing the result, and when it restarts it will see the message again. This is why delivery guarantees are usually stated as "at least once" rather than "exactly once".
Unless the handler guards against it, a duplicate event becomes a duplicate action. The most painful areas are money (double charge or refund), inventory (double deduction), statuses (an order jumps stages twice) and notifications (two identical emails or SMS).
Poor integration becomes obvious in operations: numbers don’t match between systems, manual fixes appear in spreadsheets, staff start double-checking every operation, and investigations take hours. Sometimes it’s masked by "it sometimes fixes itself" because a later event partially corrects the picture.
Predictable integration starts with accepting duplicates. Build processing so that a repeated event doesn’t break data. Then the system behaves the same regardless of delivery count: either the action happens once, or repeats are safely ignored and the final state is correct.
Events vs requests: the simple difference
A request is when one system asks another directly: "Give me an answer now." For example, a payment service asks the orders service: "Does this order exist?" and waits for a response.
An event is when a system reports a fact: "Something happened to me." For example: "Order created", "Payment succeeded", "Request status changed." No one must respond immediately, and there may be multiple recipients.
A useful analogy: a request is like a phone call asking a question, an event is like a message in a group chat. Requests need an immediate reply. Events prioritize delivering information to all interested parties.
Events fit well where state changes and several systems need to be notified without tight coupling. This includes status changes, data updates, notifications and audit.
In an event-driven pattern there are two roles. The producer publishes an event and "announces" rather than "asks". The consumer receives the event and decides what to do: update its database, send a notification, trigger a calculation.
Events are almost always asynchronous: the producer doesn’t need to wait for all consumers to finish. It places the event on a queue and continues. This improves resilience to load and failures, but it comes with trade-offs: events can arrive late, out of order and sometimes more than once.
How a message queue works and what it gives you
Message queues decouple systems so they don’t depend on each other at every moment. Instead of a direct "do it now" call, one system sends an event (message) to a queue and another picks it up when it’s ready.
A queue acts as a buffer. If the consumer is temporarily unavailable (maintenance, crash, network issues), events are not lost immediately but accumulate waiting to be processed. This matters when the sender must not fail just because the receiver is paused.
Queues also help with load spikes. For example, at 9:00 many users create requests and the sender produces a surge of events. Without a queue the receiver can be overwhelmed. With a queue the spike becomes a steady stream: the consumer processes at its sustainable rate while the queue holds the rest.
When a consumer successfully handles a message it sends an acknowledgment (ack). This signals the queue: "all good, you can remove the message." If ack doesn’t arrive (consumer crashed, hung, timed out), the queue usually tries to deliver the message again. The main takeaway: a queue increases reliability but doesn’t remove the need to prepare for redelivery.
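The ack-and-redelivery cycle can be illustrated with a toy in-memory queue. This is only a sketch: `AtLeastOnceQueue` and its methods are hypothetical names, and `redeliver_unacked` stands in for whatever timeout mechanism a real broker uses.

```python
from collections import deque


class AtLeastOnceQueue:
    """Toy in-memory queue: a message is removed only after an explicit ack."""

    def __init__(self):
        self._ready = deque()   # messages waiting for delivery
        self._in_flight = {}    # delivered but not yet acknowledged

    def publish(self, msg_id, body):
        self._ready.append((msg_id, body))

    def receive(self):
        """Hand out the next message, or None if nothing is ready."""
        if not self._ready:
            return None
        msg_id, body = self._ready.popleft()
        self._in_flight[msg_id] = body
        return msg_id, body

    def ack(self, msg_id):
        """Consumer confirms success; the message is gone for good."""
        self._in_flight.pop(msg_id, None)

    def redeliver_unacked(self):
        """Stand-in for a timeout: unacked messages return to the queue."""
        for msg_id, body in self._in_flight.items():
            self._ready.append((msg_id, body))
        self._in_flight.clear()


queue = AtLeastOnceQueue()
queue.publish("evt-1", "order created")

first = queue.receive()     # the consumer gets the message...
# ...and "crashes" here without calling ack()
queue.redeliver_unacked()   # the timeout fires
second = queue.receive()    # the same message is delivered again
queue.ack(second[0])
third = queue.receive()     # nothing is left after the ack
```

The consumer sees `evt-1` twice even though nothing in the queue malfunctioned, which is exactly the case the handler has to tolerate.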
A common myth is that a queue always preserves perfect order. In practice ordering can be broken by parallel processing across consumers, redelivery after failures, or specifics of a distributed queue. It’s better to design handlers to be resilient to duplicates and to "slightly out-of-order" deliveries.
Delivery guarantees and why duplicates are inevitable
Queues most often provide an "at least once" guarantee. It’s an honest compromise: the system tries not to lose events, even if that means delivering them multiple times.
Duplicates don’t mean the queue "broke" — the real world is noisy. A consumer may have processed an event but not confirmed it to the queue. The queue sees no confirmation and sends it again.
Common causes of redelivery: network timeouts, consumer restarts, retries by the queue or application, long processing times (message visibility timeout expired), concurrent processing by multiple service instances.
Why not aim for "exactly once"? Because that requires a shared source of truth about what’s already processed, and it must change together with the business data in one reliable operation. Once processing touches a database, external services, files or multiple transactions, risk appears: one side may have succeeded while another hasn’t. Under load and failures this becomes expensive and complex.
Therefore, for business events the usual choice is "we don’t lose events, but duplicates can occur." The responsibility then shifts to the consumer code: a repeat should be a normal case, not an emergency.
Idempotency: what it is and where it belongs
Idempotency means a simple thing: performing the same action multiple times yields the same end result and causes no extra side effects. For events this is critical because redelivery happens regularly: retries, network issues, consumer restarts.
Two ideas to distinguish:
- Idempotent operation: a business action that can be safely repeated (for example, "set request status to PAID").
- Idempotent event handler: code that processes an event so that repeated receipt of the same event doesn’t corrupt data or trigger expensive actions twice.
Non-idempotency shows up fastest in payments and logistics: the same payment event handled twice leads to double charging; "order confirmed" repeated causes double shipment; bonuses added repeatedly because the handler "adds" rather than sets state.
It’s usually easier to ensure idempotency on the consumer side. Publishers may be external systems (or several), while consumers control their own data and side effects.
Quick self-check for a handler:
- Does the event have a unique identifier, and do you record processing of it?
- Do you bring the object to the required state rather than "add again"?
- Are irreversible actions (charge, ship, send email) separated and protected from repeats?
If repeating the same event doesn’t change the result, operations become noticeably calmer.
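The difference between "adding again" and "bringing to the required state" is easy to see in a minimal sketch. The dictionaries and field names below are illustrative assumptions, not a prescribed event format.

```python
def add_bonus(account, event):
    """Non-idempotent: every delivery adds the points again."""
    account["bonus"] += event["points"]


def set_status(order, event):
    """Idempotent: the order is brought to the target state."""
    order["status"] = event["status"]


account = {"bonus": 0}
order = {"status": "NEW"}
bonus_event = {"event_id": "evt-1", "points": 100}
paid_event = {"event_id": "evt-2", "status": "PAID"}

for _ in range(2):  # simulate both events being delivered twice
    add_bonus(account, bonus_event)
    set_status(order, paid_event)

print(account["bonus"])  # 200: the duplicate doubled the bonus
print(order["status"])   # PAID: the duplicate changed nothing
```

An additive handler like `add_bonus` is not wrong in itself, but it needs a duplicate check by `event_id` in front of it.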
Basic techniques to guard against duplicates
Duplicates appear even with a well-configured queue: the broker may redeliver after a network glitch, timeout or consumer restart. So protection is usually needed on the consumer side, not a hope that "there will be no repeats."
The foundation is a unique event identifier (event_id). It’s usually assigned by the source system, because it creates the event and can guarantee uniqueness within its domain. If the consumer generates event_id, you only protect against your own repeats, not against external redelivery of the same message.
Next you need a registry of processed events: a database table, a key-value store or a separate collection. Principle: before executing an action check whether you’ve already seen this event_id. If yes — acknowledge and do nothing. If not — do the work and record that it was processed.
To make this predictable, store minimal context: event_id, first processing time, status/result, entity key (e.g. order_id, ticket_id), attempt number (if relevant).
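A registry of this kind might look like the sketch below. It uses SQLite only to stay self-contained; the table name `processed_events`, its columns and the helper `register_event` are assumptions for illustration. The check and the record are combined into one insert guarded by the primary key, so two concurrent attempts cannot both win.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE processed_events (
        event_id     TEXT PRIMARY KEY,  -- uniqueness is enforced by the database
        entity_key   TEXT NOT NULL,     -- e.g. order_id or ticket_id
        processed_at TEXT NOT NULL,     -- first processing time (UTC, ISO 8601)
        result       TEXT NOT NULL
    )
    """
)


def register_event(conn, event_id, entity_key, result="success"):
    """Return True if the event is new, False if it was already recorded."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events VALUES (?, ?, ?, ?)",
            (event_id, entity_key, now, result),
        )
    return cur.rowcount == 1  # 0 rows changed means the event_id already existed


first = register_event(conn, "evt-42", "order-7")
second = register_event(conn, "evt-42", "order-7")  # redelivery
rows = conn.execute("SELECT COUNT(*) FROM processed_events").fetchone()[0]
```

Here `first` is `True` and `second` is `False`, so the handler knows to acknowledge the second delivery and do nothing else.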
A separate topic is external calls to neighboring systems. If the consumer creates a payment, request or user via an API, use an idempotency key. Send it with the external request so a repeated call won’t create a second object but will return the same result. Often that key is event_id or a pair "event type + entity id."
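The sketch below shows the idea with a fake in-memory service. `FakePaymentApi` and `create_payment` are hypothetical stand-ins; a real external API will have its own way of accepting an idempotency key, often a request header.

```python
class FakePaymentApi:
    """Stand-in for an external service that honors idempotency keys."""

    def __init__(self):
        self._by_key = {}
        self.created = 0

    def create_payment(self, idempotency_key, order_id, amount):
        # A repeated key returns the stored result instead of a new payment.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        self.created += 1
        payment = {
            "payment_id": f"pay-{self.created}",
            "order_id": order_id,
            "amount": amount,
        }
        self._by_key[idempotency_key] = payment
        return payment


def handle_invoice_paid(api, event):
    # The key is derived from the event, so a redelivered event reuses it.
    key = f'{event["type"]}:{event["event_id"]}'
    return api.create_payment(key, event["order_id"], event["amount"])


api = FakePaymentApi()
event = {"event_id": "evt-9", "type": "invoice_paid",
         "order_id": "order-7", "amount": 1500}
first = handle_invoice_paid(api, event)
second = handle_invoice_paid(api, event)  # redelivery of the same event
```

Both calls return the same payment and `api.created` stays at 1, because the key is computed from the event rather than generated per call.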
Example: an "Invoice paid" event arrives twice. Without checks you might grant access twice. With a processed-events registry you detect the same event_id and skip the second action, keeping the system correct.
Step by step: making the consumer resilient to repeats
Redelivery is a normal part of queue life. If the consumer isn’t prepared, the same business operation can run twice: money withdrawn, access granted twice, a request created twice. The solution is resilient processing, not banning redelivery.
Five steps that work in most cases
First, agree that every message carries a unique event identifier. This can be event_id (UUID) and, if needed, an entity key (e.g. order_id) and event type.
Then choose where to store the processing mark. Trustworthy places are the consumer’s database (a processed-events table) or a store accessible to the same service.
Next, perform processing and commit the result as one atomic action. A practical approach is a single transaction: (1) check whether event_id was seen, (2) apply the business change, (3) record the event as processed. Then on redelivery you quickly detect a duplicate and finish safely.
After that, configure retries so they help rather than harm. Limit the number and time window of retries, and move "poison" messages to a separate flow for analysis rather than letting them cycle forever.
Finally, agree on rules for evolving event schemas. Add fields so old consumers can ignore them, and explicitly include a version (or at least type and change date). This reduces the risk that redelivered old events break newer code.
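The third step, doing the check, the business change and the processed mark in one transaction, can be sketched as follows. SQLite and the table names `orders` and `processed_events` are assumptions chosen to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY, order_id TEXT NOT NULL);
    INSERT INTO orders VALUES ('order-7', 'NEW');
    """
)


def handle_event(conn, event):
    """Apply the event once; return 'processed' or 'duplicate'."""
    with conn:  # one transaction: commit on success, rollback on any exception
        seen = conn.execute(
            "SELECT 1 FROM processed_events WHERE event_id = ?",
            (event["event_id"],),
        ).fetchone()
        if seen:
            return "duplicate"
        conn.execute(
            "UPDATE orders SET status = ? WHERE order_id = ?",
            (event["status"], event["order_id"]),
        )
        conn.execute(
            "INSERT INTO processed_events VALUES (?, ?)",
            (event["event_id"], event["order_id"]),
        )
    return "processed"


event = {"event_id": "evt-1", "order_id": "order-7", "status": "PAID"}
first = handle_event(conn, event)
second = handle_event(conn, event)  # redelivery
status = conn.execute(
    "SELECT status FROM orders WHERE order_id = 'order-7'"
).fetchone()[0]
marks = conn.execute("SELECT COUNT(*) FROM processed_events").fetchone()[0]
```

If the process dies between the update and the insert, the transaction rolls back and the redelivered event starts from a clean state. The primary key on `event_id` also acts as a backstop if two instances race on the same event: the losing insert fails and takes its update with it.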
A realistic example: request status and double processing
Imagine a government organization: one system holds a procurement request, accounting runs entries in another. When a request changes status, integration sends an event to the queue.
A critical moment is the "Paid" status. The accounting consumer receives the event and creates an entry. But the queue may deliver the same event again (for example, after a confirmation timeout). If the consumer lacks protection the result is simple: duplicate ledger entries, extra reconciliation work, manual fixes.
Deduplication usually solves this without magic: each event carries a unique event_id and a domain key (for example request_id + status). Accounting checks before creating an entry whether it already processed that event_id or whether payment for that request already exists.
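The domain-key check can be pushed into the database with a unique constraint, as in this sketch. The `ledger_entries` table and its columns are hypothetical; the point is that a second event for the same request and status cannot create a second entry even if it carries a different event_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ledger_entries (
        entry_id   INTEGER PRIMARY KEY,
        request_id TEXT NOT NULL,
        status     TEXT NOT NULL,
        amount     INTEGER NOT NULL,
        event_id   TEXT NOT NULL,
        UNIQUE (request_id, status)  -- domain key: one entry per request and status
    )
    """
)


def create_entry(conn, event):
    """Return True if an entry was created, False if one already existed."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO ledger_entries (request_id, status, amount, event_id) "
                "VALUES (?, ?, ?, ?)",
                (event["request_id"], event["status"],
                 event["amount"], event["event_id"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # logical duplicate: this payment is already booked


paid = {"event_id": "evt-1", "request_id": "req-15", "status": "PAID", "amount": 90000}
resent = {"event_id": "evt-2", "request_id": "req-15", "status": "PAID", "amount": 90000}
created_first = create_entry(conn, paid)
created_second = create_entry(conn, resent)  # new event_id, same payment
entries = conn.execute("SELECT COUNT(*) FROM ledger_entries").fetchone()[0]
```

This complements the event_id registry rather than replacing it: the registry catches redelivery of the same message, the constraint catches the same business fact arriving under a different identifier.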
For this to work predictably, agree on rules in advance:
- who generates event_id and guarantees uniqueness
- which key determines "a unique payment" (request, invoice, payment)
- where the processed-events journal is stored and for how long
- what to do if an event arrives before data
- who manages incident resolution
If an event comes before necessary data, it’s safer not to "guess." Either delay processing and retry later, or fetch the missing data explicitly. Create the entry only when conditions are met and duplicates are filtered out.
Retries, errors and observability without unnecessary complexity
Retries help survive transient failures: a short network drop, DB overload, or temporary unavailability of an external service. But they hurt when the error is permanent. For example, an event with invalid data keeps failing and the service keeps trying, turning the queue into a repository of "dead" messages.
Simple rule: use retries for temporary problems, and for permanent issues fail fast and send the message for analysis.
Dead-letter and delayed processing
Common mechanisms are delayed retries (requeue with backoff: e.g. 1 minute, 5 minutes, 30 minutes) and a dead-letter queue where a message goes after N failed attempts so it doesn’t block the main flow.
This is especially important where a single stuck event shouldn’t hold up the whole stream, like status synchronization of requests or orders.
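A minimal sketch of both mechanisms together is shown below. `PermanentError`, `process_with_retries` and the list used as a dead-letter queue are illustrative assumptions. Waiting in-process is also a simplification: brokers that support delayed requeue usually handle the waiting themselves.

```python
class PermanentError(Exception):
    """Hypothetical marker for errors that retrying cannot fix (e.g. invalid data)."""


BACKOFF_SECONDS = [60, 300, 1800]  # 1 minute, 5 minutes, 30 minutes


def process_with_retries(event, handler, dead_letter, sleep):
    """Run handler; retry transient failures with backoff, else dead-letter."""
    max_attempts = len(BACKOFF_SECONDS) + 1
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return "success"
        except PermanentError as exc:
            dead_letter.append((event, f"permanent: {exc}"))
            return "dead-letter"  # fail fast, no retries
        except Exception as exc:  # treated as transient in this sketch
            if attempt == max_attempts:
                dead_letter.append((event, f"gave up after {attempt} attempts: {exc}"))
                return "dead-letter"
            sleep(BACKOFF_SECONDS[attempt - 1])


delays, dlq = [], []
calls = {"n": 0}


def flaky(event):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary network drop")


def invalid(event):
    raise PermanentError("missing order_id")


# delays.append replaces time.sleep so the example runs instantly
result_flaky = process_with_retries({"event_id": "evt-1"}, flaky, dlq, delays.append)
result_invalid = process_with_retries({"event_id": "evt-2"}, invalid, dlq, delays.append)
```

The flaky handler succeeds on the third attempt after waits of 60 and 300 seconds, while the invalid event goes straight to the dead-letter list without consuming any retries.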
What to log and which metrics to watch
Logs should be minimal but useful: event_id, entity key (order_id, ticket_id, user_id), attempt number and error cause, and processing result (success, skipped as duplicate, sent to dead-letter).
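One way to keep such logs consistent is to emit a single structured line per processing attempt. The helper below is a sketch; the logger name and the JSON layout are assumptions, not a required format.

```python
import json
import logging

logger = logging.getLogger("event_consumer")


def log_event_result(event_id, entity_key, attempt, result, error=None):
    """Emit one JSON line per processing attempt and return it."""
    record = {
        "event_id": event_id,
        "entity_key": entity_key,
        "attempt": attempt,
        "result": result,  # "success", "duplicate" or "dead-letter"
        "error": error,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


line = log_event_result("evt-1", "order-7", 2, "duplicate")
```

With every line carrying the event_id and the entity key, a search by either one reconstructs the full history of an event across attempts.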
The metrics that usually suffice: consumer lag, retry share, error rate and dead-letter size.
To determine whether the problem is with the producer or a particular consumer: if the same event_id fails the same way in all consumers, the source data is likely invalid. If only one service fails or failures depend on load, the consumer (code, resources, external dependencies) is more likely at fault.
Common mistakes and pitfalls in event integration
The most painful problems are often about expectations rather than "which queue to choose." People bring habits from synchronous APIs and are surprised when events behave differently.
A frequent pitfall is assuming reliable ordering. Under load messages may be delayed, overtake each other after retries, or land in different partitions. If logic depends on strict sequence you will eventually get incorrect state.
Another mistake is believing the queue eliminates duplicates. Even with good settings repeats are possible due to timeouts, duplicate sends, consumer crashes and network issues. The right approach is to accept redelivery as normal and design handlers to preserve data integrity.
Side effects performed before marking an event processed often break the system: you charge money, create a record or send an email, then fail to save the "processed" mark. On retry you’ll get a double action.
Ignoring versioning is another trap. Fields change, are added or renamed, and an old consumer may crash or silently misinterpret data. Agree on compatibility rules and include an explicit event version.
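A consumer can make these compatibility rules explicit in its parsing code. The sketch below assumes a hypothetical `version` field and a hypothetical `channel` field added in version 2; it ignores unknown fields and refuses versions it does not understand instead of guessing.

```python
SUPPORTED_VERSIONS = {1, 2}


def parse_order_event(raw):
    """Read only the fields this consumer knows; ignore anything extra."""
    version = raw.get("version", 1)  # assumption: events without a version are v1
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported event version: {version}")
    return {
        "event_id": raw["event_id"],
        "order_id": raw["order_id"],
        "status": raw["status"],
        # hypothetical field added in v2; v1 events fall back to a default
        "channel": raw.get("channel", "unknown"),
    }


old = parse_order_event({"event_id": "evt-1", "order_id": "order-7", "status": "PAID"})
new = parse_order_event({
    "version": 2, "event_id": "evt-2", "order_id": "order-7",
    "status": "PAID", "channel": "web", "extra_field": "ignored",
})
```

Rejecting an unknown version loudly is a deliberate choice here: a failed event can go to analysis, while a silently misread one corrupts data.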
Finally, many don’t plan recovery after incidents. Minimum set: a clear replay mechanism to reprocess events without doubling effects, a journal of processed messages or idempotency keys, manual reconciliation and correction procedures, metrics and alerts for rising duplicates and retries, and clear ownership for decisions about replays.
Short pre-release checklist
Before releasing event integration, go through basic checks. They don’t make the system perfect but greatly reduce the risk of night incidents when the queue grows and data "drifts" between systems.
- Each message has a unique event identifier and a clear entity key (e.g. order_id, user_id).
- Handlers survive retries safely: duplicate delivery yields the same final state as a single processing.
- There is a reliable place to mark "this event has been processed" (processed event_id table, unique index on target record, or equivalent).
- Retries are configured deliberately: how many attempts, what delays, and what happens next (dead-letter queue or manual procedure).
- Logs and metrics let you quickly answer "which event failed and why", and there is a manual data-correction procedure.
Sanity check: take a real scenario (e.g. status change) and run three tests: redelivery, an event delayed by a few minutes, and a consumer crash mid-processing. If data reconciles after those tests, you’re close to a confident launch.
Next steps: how to approach rollout and support
To keep event integration from turning into endless debugging, start not from the broker but from a clear picture: which business decisions actually depend on events and what an error costs. Begin with business scenarios (order created, invoice paid, request canceled) and only then discuss topics, queues and event formats.
A practical plan helps: map systems and events (who publishes and who consumes), define guarantees (what’s acceptable and what’s not), decide where processing state is stored, prepare load and failure tests, and set support (monitoring, alerts, manual intervention rules and update procedures).
Also plan history storage. If you need to answer "why the status changed", a single current value isn’t enough. You need a trace: event identifier, time, processing result, error and attempt count.
When integration infrastructure is built for large organizations (public sector, finance, healthcare), the question often comes down to reliable hardware, transparent support and clear responsibilities. In Kazakhstan this is often handled by system integrators and vendors like GSE.kz (gse.kz), who offer integration services along with GSE S200 Series servers and 24/7 technical support with a country-wide service network.
FAQ
Why can the same event arrive twice or three times?
Because most brokers and queues provide an "at least once" delivery guarantee. If a consumer processed a message but didn’t manage to send an `ack` (or the `ack` was lost due to the network), the broker treats delivery as unconfirmed and sends the event again. This is a normal reliability mechanism, not necessarily an integration bug.
When is it better to use events and when to use requests (API)?
Use a request when you need an answer "right now" and cannot continue without it — for example, checking whether an order exists before charging. Use an event when a system announces a state change that multiple recipients should learn about without tight time coupling. Practical rule: if you are "asking" — use a request; if you are "notifying" — publish an event.
What fields should an event contain to handle duplicates more easily?
At minimum — a unique `event_id` assigned by the source system, and a business entity key such as `order_id`, `request_id` or `payment_id`. The `event_id` helps filter repeated delivery of the same message; the entity key helps avoid logical duplicates when the same action can come from different events. If the consumer generates `event_id`, it won’t protect against external repeated delivery.
Where should the information that an event has already been processed be stored?
Typically a table or collection of "processed events" in the consumer’s database. Store `event_id`, the entity key, timestamps and the processing result so that on repeat delivery you can quickly see the event was already handled and finish safely. It’s important that the record of processing be written reliably, otherwise a repeat will trigger the business action again.
Should processing the event and writing the "processed" mark be done in a single transaction?
In most cases, yes. Within one transaction, check that `event_id` hasn’t been seen, apply the business change, and record the event as processed. Then if a crash happens mid-way you avoid a situation where the action was applied but the processing mark wasn’t saved. If a transaction is impossible, at least protect irreversible steps (money withdrawal, shipment, email) with a separate idempotency key.
How to avoid double-creating a payment/request when calling an external API from the handler?
Send an idempotency key with the external request so a repeated call doesn’t create a second object or trigger a second operation. In practice, the key is often `event_id` or a combination like "event type + entity id" if that better matches business logic. Repeated events will still cause repeated requests, but the external system will return the same result instead of creating a duplicate.
Can you rely on event ordering in the queue?
Don’t build logic assuming perfect ordering. Under load and with parallel consumers, messages can arrive late, overtake each other after retries, or be delivered to different partitions. It’s more reliable to converge the object to a target state and check freshness by version/time/status than to rely on a strict sequential chain.
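A freshness check by version might look like the sketch below. It assumes the producer attaches a monotonically increasing `entity_version` to each event for a given order, which is a hypothetical field; a timestamp or a status rank could play the same role.

```python
def apply_status(order, event):
    """Apply the event only if it is newer than what the order already reflects."""
    if event["entity_version"] <= order["entity_version"]:
        return False  # stale or duplicate: keep the current state
    order["status"] = event["status"]
    order["entity_version"] = event["entity_version"]
    return True


order = {"order_id": "order-7", "status": "NEW", "entity_version": 1}
shipped = {"order_id": "order-7", "status": "SHIPPED", "entity_version": 3}
paid = {"order_id": "order-7", "status": "PAID", "entity_version": 2}

applied_shipped = apply_status(order, shipped)  # arrives first
applied_paid = apply_status(order, paid)        # arrives late, out of order
```

The late `PAID` event is skipped and the order stays `SHIPPED`. The same comparison also makes the handler idempotent, since a redelivered event never has a higher version than the state it already produced.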
How to configure retries so you don’t end up with an endless error loop?
Retries are useful for transient issues like short network outages or temporary DB overload. If an error is permanent (bad data, incompatible schema), endless retries only fill the queue and block normal messages. In such cases, limit the number of attempts and move problematic events to a dead-letter queue for investigation.
What to do if an event arrives before the required data is present in the database?
It is safer not to "guess" missing data or perform irreversible actions on the basis of a guess. Usually you either delay processing until the required data appears (with a retry) or make an extra request for the missing information and proceed only when conditions are met. The key is that retries remain safe and don’t create duplicates thanks to idempotency.
What should be logged and which metrics to watch to quickly find integration problems?
Log what helps find a problematic `event_id` quickly: the entity key (`order_id`, `ticket_id`, `user_id`), attempt number and error cause, and the processing result (success, skipped as duplicate, sent to dead-letter). Useful metrics are consumer lag, retry rate, error rate and dead-letter queue size. These let you distinguish transient outages from systemic issues and save hours of investigation when numbers don’t match between systems.