Logs and Audit for Incident Investigation: a Preparation Plan
Logs and audit for incident investigation: which logs to collect, where to store them, how to set up correlation and reduce analysis time to hours.

Why logs and auditing are needed specifically for investigations
Without logs, an investigation quickly becomes a debate of opinions: someone is sure it was a user, someone else blames the network, and there's no time to check either claim. Logs provide a factual basis and let you reconstruct the chain of events, much as camera footage does for a physical space.
A good log answers simple questions: who performed the action, what happened, where it happened, when, and how it ended (success, failure, error). If any of these elements are missing, the picture falls apart.
It is important to distinguish three layers of data:
- Operational logs show the state of systems and immediate errors.
- Audit records capture significant actions and changes (logins, permissions, admin operations) and are especially important for investigations and compliance requirements.
- Telemetry describes behavior (for example, unusual process launches and network connections) and helps find hidden attacks.
Investigations almost always involve several teams. Security formulates hypotheses and indicators of compromise, IT and admins check the infrastructure, the service desk gathers facts from users, and the business owner clarifies what may have been affected. For example, a suspicious login to the accounting system usually requires authentication logs, folder access audit, and mail events.
Success is measured not by the number of logs but by time and repeatability:
- MTTD (mean time to detect): how long it takes to notice an incident.
- MTTR (mean time to respond): how long confirmation, analysis and response take.
- Completeness: whether there is enough data to confidently say what happened.
- Repeatability: whether the same method can be used to analyze the next incident.
If these metrics improve, collection and correlation really work, and investigations take hours, not weeks.
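As a rough illustration of the first two metrics, the sketch below averages detection and response times over a list of incidents. The incident records and their field names (`started`, `detected`, `resolved`) are hypothetical; in practice they would come from a ticketing system or incident register.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the activity started, when it was
# detected, and when analysis and response were completed (all in UTC).
incidents = [
    {"started": datetime(2024, 3, 1, 2, 0),
     "detected": datetime(2024, 3, 1, 6, 0),
     "resolved": datetime(2024, 3, 1, 14, 0)},
    {"started": datetime(2024, 3, 9, 10, 0),
     "detected": datetime(2024, 3, 9, 12, 0),
     "resolved": datetime(2024, 3, 9, 16, 0)},
]

def mean_hours(records, start_key, end_key):
    """Average interval between two timestamps, in hours."""
    return mean((r[end_key] - r[start_key]).total_seconds() / 3600 for r in records)

mttd_hours = mean_hours(incidents, "started", "detected")   # time to detect
mttr_hours = mean_hours(incidents, "detected", "resolved")  # time to respond
```

Tracking these two numbers per quarter is usually enough to see whether collection and correlation changes are paying off.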
What to log: a minimal set by source
Even a "mandatory minimum" noticeably speeds up investigations. The goal is simple: use logs to reconstruct the chain of actions from the first login to data access. Above all, you need to see who logged in, from where, what was changed, and which resources were accessed.
Start with authentication and access management. You need successful and failed logins, MFA events (request, success, failure), account lockouts, password resets, role and group membership changes. Always record the user identifier, source (IP, device), authentication method and reason for failure.
Network logs often reveal what isn't visible on the host. Useful sources include DNS (which domains were requested), proxy or web gateway (which resources were opened), VPN (connections and assigned addresses), and firewalls (allowed and denied connections). If you have NetFlow or an equivalent, it helps you quickly understand where data was sent and how much, even without decrypting traffic.
On endpoints, collect OS and security system events, software installs and removals, external media connections, and EDR/antivirus events (detections, quarantines, exclusions). If available, add process launches. Logs from applications that matter to users, such as a VPN client or crypto provider, are also valuable.
Mail and collaboration tools are often an entry point. You need mailbox logins, attachment opens and downloads, sends and forwards, and rule creation (especially auto-forwarding).
For critical systems (DB, ERP/CRM, file stores) record authentication, access to sensitive tables and folders, bulk exports, permission changes and admin operations.
To avoid drowning in volume, check the minimum for each source: who acted, when, from where, what exactly happened and what was accessed.
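One way to make this per-source check routine is a small completeness test over sample events. The sketch below is illustrative only: the field names (`user`, `timestamp`, `src_ip`, `action`, `object`) are assumptions that stand in for "who, when, from where, what happened, what was accessed".

```python
# Assumed field names for the five questions; adjust to your own schema.
REQUIRED_FIELDS = ("user", "timestamp", "src_ip", "action", "object")

def missing_fields(event: dict) -> list[str]:
    """Return required fields that are absent or empty in an event."""
    return [f for f in REQUIRED_FIELDS if not event.get(f)]

def completeness(events: list[dict]) -> float:
    """Share of events that carry every required field (0.0 for an empty sample)."""
    if not events:
        return 0.0
    complete = sum(1 for e in events if not missing_fields(e))
    return complete / len(events)
```

Running `completeness` on a sample from each source gives a quick signal of which sources produce events too poor to investigate with.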
Example: an employee "suddenly" logged in at night via VPN. The combination of VPN + mailbox login + rule creation + mass file reads on a share will reveal the whole picture in hours, even if the PC has already been "cleaned up."
Time and format: so events line up in one timeline
In an investigation, the most important thing is to quickly build an accurate timeline. If one server is 3 minutes fast, a workstation is set to a different timezone, and cloud logs are written in UTC without a clear marker, correlation becomes guesswork.
The first rule is unified time synchronization. Configure NTP on all nodes: servers, workstations, network equipment, hypervisors, and security systems. Verify that synchronization is not just enabled but actually working (no network blocks and correct time sources).
Store time unambiguously. A practical approach is to record the event time in UTC and also store the system timezone (or offset) so local context can be restored if needed. Where possible, add milliseconds: when passwords are being tried or bulk requests occur, a one-second difference is already critical.
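A minimal sketch of this approach, assuming events arrive with timezone-aware local times: it stores the UTC timestamp with millisecond precision and keeps the original offset in minutes. The output field names `timestamp` and `tz_offset_min` are illustrative.

```python
from datetime import datetime, timezone, timedelta

def to_record_time(local_dt: datetime) -> dict:
    """Convert a timezone-aware local time to a UTC timestamp plus the original offset."""
    if local_dt.tzinfo is None:
        # A naive time cannot be placed on a shared timeline reliably.
        raise ValueError("naive datetime: the source timezone is unknown")
    utc = local_dt.astimezone(timezone.utc)
    offset_min = int(local_dt.utcoffset().total_seconds() // 60)
    return {
        "timestamp": utc.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "tz_offset_min": offset_min,
    }

# Example: an event recorded on a host whose clock is set to UTC+5.
local_tz = timezone(timedelta(hours=5))
rec = to_record_time(datetime(2024, 5, 14, 2, 30, 15, 250000, tzinfo=local_tz))
```

Rejecting naive timestamps at intake is a deliberate choice here: guessing the timezone later is exactly the kind of ambiguity that breaks correlation.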
To make events from different sources fit into a single line, agree in advance on basic fields:
- timestamp (UTC and precision)
- user (login, UPN, SID or other stable identifier)
- host (name and role, e.g., workstation or server)
- src_ip (source) and, if available, dst_ip (destination)
- action and result (what was done and how it ended)
Then enable normalization. The same action should be named the same way even if different systems write it differently. For example, Windows writes "Logon", a VPN writes "Authentication", and an application writes "Sign-in". In the store, it's better to reduce these to a single event type so that correlation rules don't multiply.
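The sketch below shows one possible shape for such normalization. The source labels (`windows`, `vpn`, `app`), the raw field names (`event`, `time`, `ip`, `status`) and the common type `login` are assumptions for illustration, not the format of any particular product.

```python
# Hypothetical mapping from (source, native event name) to one common type.
EVENT_TYPE_MAP = {
    ("windows", "Logon"): "login",
    ("vpn", "Authentication"): "login",
    ("app", "Sign-in"): "login",
}

def normalize(source: str, raw: dict) -> dict:
    """Map a source-specific record to the shared field set."""
    event_type = EVENT_TYPE_MAP.get((source, raw.get("event")), "unknown")
    return {
        "timestamp": raw.get("time"),
        "user": (raw.get("user") or "").lower(),  # one spelling per account
        "host": raw.get("host"),
        "src_ip": raw.get("ip"),
        "action": event_type,
        "result": raw.get("status", "unknown"),
        "source": source,                          # keep provenance for later checks
    }
```

Events that map to `unknown` are worth counting separately: a growing share usually means a source changed its format or a new event type needs a mapping.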
One more basic risk is loss in transit. This most often happens during load peaks or connection drops. Local buffering on agents, retries on failure, queue monitoring and priority for critical logs help. Regularly check delivery metrics: how many events arrived and how many were dropped.
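To make the buffering and delivery-metric ideas concrete, here is a simplified sketch of an agent-side sender. The class name `BufferedSender` and the `send` callable are hypothetical; real agents add persistence to disk, backoff and prioritization, which are omitted here.

```python
from collections import deque

class BufferedSender:
    """Keeps events in a bounded local queue and retries delivery on failure."""

    def __init__(self, send, max_buffer=1000):
        self.send = send              # callable that raises ConnectionError on failure
        self.buffer = deque()
        self.max_buffer = max_buffer
        self.delivered = 0            # delivery metric: events handed to the collector
        self.dropped = 0              # delivery metric: events lost to buffer overflow

    def submit(self, event):
        if len(self.buffer) >= self.max_buffer:
            self.buffer.popleft()     # the oldest event is lost; count it
            self.dropped += 1
        self.buffer.append(event)
        self.flush()

    def flush(self):
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except ConnectionError:
                return                # keep events buffered, retry on the next flush
            self.buffer.popleft()
            self.delivered += 1
```

Exporting `delivered` and `dropped` per source is what makes the "how many arrived and how many were dropped" check possible.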
Where to store logs: practical placement options
The problem is usually not a lack of events but that they are scattered across servers and workstations and some have already been overwritten. For investigations you almost always need centralized collection: a single source of truth, unified access rules and clear retention.
The most practical start is a dedicated server (or pair of servers) for the collector and storage in a protected network segment. Ideally this is a zone where many systems send events but ordinary admins cannot quietly delete or modify records.
On-prem, hybrid or cloud
On-prem is justified if there are requirements to keep data within the organization or it’s important to retain full control of access. Hybrid is often more convenient: "hot" logs (recent days) are stored locally for fast searches, while the archive goes to a second, cheaper storage. Full cloud makes sense when there's no team to support hardware and you need quick scaling, but check data and access constraints in advance.
Performance and resilience
Size capacity based on the stream: how many sources, how often they write events, average record size and required retention. Add headroom for incident-time peaks.
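The sizing arithmetic can be sketched as a single function. All input values below are placeholders rather than recommendations, and the 30% headroom default is an assumption; compression and indexing overhead are ignored.

```python
def storage_gb(sources: int, events_per_sec_per_source: float,
               avg_event_bytes: int, retention_days: int,
               peak_headroom: float = 0.3) -> float:
    """Estimate raw storage in GB for the given stream and retention period."""
    bytes_per_day = sources * events_per_sec_per_source * avg_event_bytes * 86_400
    total_bytes = bytes_per_day * retention_days * (1 + peak_headroom)
    return total_bytes / 1e9

# Placeholder inputs: 200 sources, 5 events/s each, 500 bytes per event, 90 days.
estimate = storage_gb(sources=200, events_per_sec_per_source=5,
                      avg_event_bytes=500, retention_days=90)
```

With these sample inputs the estimate comes to roughly 5 TB, which shows how quickly a modest per-source rate adds up over a 90-day horizon.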
To avoid a single collector failure undoing an investigation, design at minimum: buffering at sources, a backup intake node, separate archive storage, regular delivery and read checks, and restricted admin access to the store plus an audit trail of actions.
Retention periods, access and protection against tampering
If logs are gone or can't be trusted, the investigation becomes speculation. It's important to define in advance how long different log types are retained, who has access, and how to prevent tampering.
Retention is convenient to set in three horizons. Short: for quick investigations (detailed application logs and network events). Medium: for incidents that weren't noticed immediately (OS audit, AD, mail, VPN). Long: for rare but serious cases and audits (critical security logs, permission changes, key business systems). A common mistake is keeping everything equally long: expensive and not very useful.
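The three horizons can be expressed as a small lookup table. The day counts and log type names below are placeholders, not recommendations; the actual values should come from your own policy and regulatory requirements.

```python
from datetime import datetime, timedelta

# Placeholder horizons in days; replace with values from your retention policy.
RETENTION_DAYS = {"short": 30, "medium": 180, "long": 365}

# Hypothetical assignment of log types to horizons, following the grouping above.
LOG_HORIZON = {
    "app_debug": "short", "network_flow": "short",
    "os_audit": "medium", "ad": "medium", "mail": "medium", "vpn": "medium",
    "permission_change": "long", "security_critical": "long",
}

def is_expired(log_type: str, event_time: datetime, now: datetime) -> bool:
    """True if an event of this type is older than its retention horizon."""
    horizon = LOG_HORIZON.get(log_type, "long")  # unknown types are kept longest
    return now - event_time > timedelta(days=RETENTION_DAYS[horizon])
```

Defaulting unknown log types to the longest horizon is a conservative choice: it costs storage, but avoids silently deleting a source nobody classified.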
Immutability of logs is critical. If an attacker gets admin access, they often try to "clean" traces first. Minimum measures:
- write to a centralized store with deletion/overwrite prohibited (WORM/immutable policies)
- a separate account for log intake without read rights
- integrity control (hashes, signatures, periodic verification)
- scheduled backups of logs, separate from the main system
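For the integrity-control item, one possible technique (chosen here for illustration, not prescribed by the list above) is a hash chain: each record's hash also covers the previous hash, so editing or removing a record breaks verification from that point on. The function names and the all-zero seed are assumptions.

```python
import hashlib
import json

def chain_records(records, seed="0" * 64):
    """Attach to each record a SHA-256 hash that also covers the previous hash."""
    prev = seed
    chained = []
    for rec in records:
        payload = json.dumps(rec, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(prev.encode("ascii") + payload).hexdigest()
        chained.append({"record": rec, "hash": digest})
        prev = digest
    return chained

def first_broken_index(chained, seed="0" * 64):
    """Return the index of the first record whose hash does not verify, or None."""
    prev = seed
    for i, item in enumerate(chained):
        payload = json.dumps(item["record"], sort_keys=True).encode("utf-8")
        if hashlib.sha256(prev.encode("ascii") + payload).hexdigest() != item["hash"]:
            return i
        prev = item["hash"]
    return None
```

A chain like this only detects tampering; it does not prevent it. It complements, rather than replaces, immutable storage and separate backups, and the latest hash should be kept somewhere the log administrators cannot modify.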
Separate access by roles: some read and investigate, others change collection settings, a third group administers storage. No one should combine all rights at once. For organizations with stricter requirements (government, finance, healthcare) this is often mandatory for internal control.
Encryption matters not only "at rest" but also in transit and in key management. Define in process documents where keys are stored, who rotates them, how access is revoked when someone leaves, and what to do if compromise is suspected.
Document retention and access rules in writing: in an information security policy or operations regulation. Specify retention periods, responsible persons, access issuance procedures and the format of extracts for investigations.
How to set up log collection: a 2–4 week step-by-step plan
For logs to actually help, the most important thing is not "to install a collector" but to agree on sources, responsibilities and checks. Below is a plan that usually fits into 2–4 weeks even in heterogeneous infrastructures.
Work plan by week
Week 1. List systems that almost always participate in incidents: AD/IAM, mail, VPN, firewalls, servers, critical applications. Assign an owner to each source: who enables auditing, who changes settings, who confirms data correctness.
Week 2. Enable required audit policies and logging levels. Start with the mandatory minimum: logins, access failures, permission changes, creation of new accounts, admin actions. This yields value without a flood of noise.
Week 3. Configure sending to the central store. Use an agent where possible (it’s usually more reliable), and syslog where agents aren’t available. Fix the field format (time, host, user, IP, event) and a single timezone.
Week 4. Run a test period and prepare an incident response procedure. For systems with higher requirements add log tamper protection and segregated access from the start.
Before declaring the work done, perform a quality check with a real scenario: one employee logs into VPN, changes a password, gets an access denial, and an admin creates a test account. You should see the end-to-end chain of events in minutes.
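The scenario check can be partly automated: after running the test actions, confirm that the expected events appear in the central store in the expected order. The sketch assumes normalized events with `timestamp` and `action` fields; the action names used in the example call are hypothetical.

```python
def chain_present(events, expected_actions):
    """True if the expected actions occur in time order (other events may sit in between)."""
    ordered = iter(sorted(events, key=lambda e: e["timestamp"]))
    # Each any() consumes the shared iterator up to the first match, so the
    # next expected action is only searched for in later events.
    return all(any(e["action"] == a for e in ordered) for a in expected_actions)
```

For the scenario above, a call might look like `chain_present(events, ["vpn_login", "password_change", "access_denied", "account_created"])`. If the admin's account creation is not tied to a fixed position in the sequence, check it separately.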
Formalize the process in one document: who watches alerts, how quickly they respond, where to escalate and what is considered an incident.
Correlation and alerts: what to set up first
Correlation is needed to turn dozens of disparate records into a coherent story: who, where and what they did. If centralized collection exists, the next step is simple rules that catch common attacks and errors without tons of false positives.
Start with correlations that are almost always useful and easy to explain to the business:
- multiple failed logins from one address or for one account
- successful login after a series of failures, especially if followed by mail or file access
- login from a new device or an unusual location for that user
- privilege escalation (addition to admin groups, granting rights on critical systems)
- disabling protections or clearing logs on a workstation or server
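As an example of how simple such rules can be, the sketch below implements the second item: a successful login preceded by a series of failures for the same account. It assumes normalized events with `timestamp`, `user`, `action` and `result` fields; the threshold of 5 failures and the 10-minute window are illustrative starting values, not recommendations.

```python
from datetime import datetime, timedelta

def success_after_failures(events: list[dict], threshold: int = 5,
                           window: timedelta = timedelta(minutes=10)
                           ) -> list[tuple[str, datetime]]:
    """Return (user, time) for each success preceded by >= threshold recent failures."""
    failures: dict[str, list[datetime]] = {}  # user -> recent failure times
    alerts = []
    for e in sorted(events, key=lambda e: e["timestamp"]):
        if e["action"] != "login":
            continue
        user, ts = e["user"], e["timestamp"]
        # Keep only failures that are still inside the window.
        recent = [t for t in failures.get(user, []) if ts - t <= window]
        if e["result"] == "failure":
            recent.append(ts)
            failures[user] = recent
        elif e["result"] == "success":
            if len(recent) >= threshold:
                alerts.append((user, ts))
            failures[user] = []  # a success resets the series
    return alerts
```

A SIEM would express the same logic in its own rule language; the value of writing it out is that the rule stays easy to explain and to test against recorded events.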
Then move to "chains." Example: a user opened an email, then a nested attachment ran, a new network process appeared and started accessing a file store with sensitive data. Separately these events may look normal, but together they give a clear signal.
To make alerts more accurate, add enrichment: an asset inventory (what the server is and where it’s located), system criticality, a list of admin accounts, device ownership by department. The same event on an accounting PC and on a test machine should have different priority.
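A minimal sketch of enrichment, assuming a small in-memory inventory. The host names, the admin account, the criticality labels and the P1 to P3 priority scale are all hypothetical; in practice the inventory would be loaded from a CMDB or an exported spreadsheet.

```python
# Hypothetical asset inventory and admin account list.
ASSETS = {
    "acc-pc-07": {"role": "workstation", "department": "accounting", "criticality": "high"},
    "test-vm-02": {"role": "server", "department": "qa", "criticality": "low"},
}
ADMIN_ACCOUNTS = {"adm.example"}

def enrich(alert: dict) -> dict:
    """Attach asset context to an alert and derive a priority from it."""
    asset = ASSETS.get(alert["host"], {"criticality": "unknown"})
    is_admin = alert["user"] in ADMIN_ACCOUNTS
    if asset["criticality"] == "high" or is_admin:
        priority = "P1"
    elif asset["criticality"] == "unknown":
        priority = "P2"   # hosts missing from the inventory deserve a look
    else:
        priority = "P3"
    return {**alert, "asset": asset, "is_admin": is_admin, "priority": priority}
```

Treating hosts that are missing from the inventory as medium priority, rather than low, is a design choice: an unknown device is itself a finding.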
Tune thresholds gradually. Start with soft thresholds and collect statistics for 1–2 weeks, then tighten. If there are too many alerts, reduce noise with exceptions (known scanners, admin servers, maintenance windows) rather than disabling rules.
When a rule fires, the alert should immediately include context: who and where, what triggered, a chain of key events before and after (about 5–20 lines), asset importance and system owners, and a clear first action (check MFA, reset password, isolate host).
How to conduct an investigation: a simple algorithm
Investigation is easiest when done as a sequential fact-check. The goal is to quickly understand what happened, when it started, what was affected and which data support the conclusions.
First collect the "anchor" fields you will use to stitch events together: user (login, UPN, email), device (hostname, agent ID), IP and geography, process (name, path, command line), URL or domain, and, if available, file hash. Write these values down and extend the list as you go.
A 60–90 minute algorithm
Then follow one path:
- record the initial indicator (alert, complaint, anomaly) and the exact time with timezone
- build a timeline: events before and after until you see the attacker’s last action or the moment of containment
- compare with the normal behavior of the user and the device (typical login hours, usual IPs, standard apps)
- identify confirmed facts and mark where data are missing
- preserve artifacts: events, screenshots, hashes, list of affected accounts and devices
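The timeline step can be sketched as a filter over normalized events: keep everything that matches any known anchor value and falls within a window around the initial indicator. The function name, the two-hour default windows and the shape of `anchors` (field name mapped to a set of values) are assumptions for illustration.

```python
from datetime import datetime, timedelta

def build_timeline(events: list[dict], anchors: dict[str, set],
                   center: datetime,
                   before: timedelta = timedelta(hours=2),
                   after: timedelta = timedelta(hours=2)) -> list[dict]:
    """Select events that match any anchor value inside the window, sorted by time."""
    start, end = center - before, center + after

    def matches(event: dict) -> bool:
        return any(event.get(field) in values for field, values in anchors.items())

    selected = [e for e in events if start <= e["timestamp"] <= end and matches(e)]
    return sorted(selected, key=lambda e: e["timestamp"])
```

As new anchors turn up (a second IP, another account), add them to `anchors` and rebuild; widening the window is the usual next step if the first action in the timeline is clearly not the beginning.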
Example: you see a suspicious mailbox login at night. In the timeline you find an email with a link, a click, a login from a new IP, creation of a forwarding rule, and attachment downloads. Normal behavior shows the user usually works daytime and from another city.
What to record in the conclusion
In conclusions separate facts and hypotheses. Support facts with concrete fields (time, user, IP, action). Note gaps: which logs are missing and what that prevented you from proving. If legal issues are possible, prepare a package in advance: timeline, list of artifacts, who and when exported logs and where they were stored unchanged.
Incident example: phishing, suspicious login and data exfiltration
An employee received an email with an "invoice" and opened the attachment. Within 10–15 minutes the mailbox was accessed from a new device, and someone began viewing and downloading files from the corporate store. An hour later colleagues noticed strange emails sent "from" the employee.
To investigate these cases quickly, events from different sources must form a single timeline. Usually four groups of data resolve it: mail, proxy and DNS, endpoint events, and login and file access logs.
How compromise is confirmed
The picture becomes clear when several signals align:
- mail: attachment opened or link clicked, then a forwarding rule created or settings changed
- DNS/proxy: requests to the domain from the mail and subsequent connections to rare external addresses
- endpoint: process launched from the downloads folder, a new scheduled task, or unusual PowerShell activity
- logins: successful authentication from an unusual place, new device or at an unusual time
How to narrow the scope and reach conclusions
Narrow the investigation along four axes: time (first minutes after opening), node (the specific PC), account (who logged in and from where), and accessed objects (which files were read or downloaded). If data are missing, that is itself a finding: often DNS logs, file access audit or mail rule change records are absent. Add these sources, set a single timezone and store raw events, and a similar incident next time will take hours rather than weeks.
Common mistakes and traps in logging
The most frequent problem is spending money but not getting an answer to the main question: what happened and when. Mistakes are usually simple and repeat across projects.
The first trap is collecting too much without a purpose. When everything is fed into the system, important events drown in noise and storage/processing costs rise. Define in advance which event types are needed for investigations (logins, permission changes, admin utilities run, data access) and filter out the rest.
The opposite extreme is too few logs or "poor" events. Records may exist but lack key fields: who, from where, what exactly, the result and what object the action refers to. Then investigation becomes conjecture due to lack of context.
The third trap is poor time synchronization (NTP). If servers drift by a couple of minutes, the attack chain won't line up: it becomes hard to confidently link an email, a file launch, a login and resource access.
The fourth mistake is storing logs on the same system where an incident can occur. If an attacker gains access to the server, they often try to delete logs first. Critical logs should be sent to a separate store or network segment with restricted access.
Another common problem is lack of clear ownership. Alerts are configured but it’s unclear who looks at them, who confirms incidents, who collects evidence and what the response times are. At minimum, assign: an owner for each log source, an on-call for alerts, and an escalation procedure.
Short readiness checklist for investigations
When an incident happens, time is often spent not on finding the attack but on figuring out where traces exist. This checklist helps quickly assess readiness.
Source coverage. Events consistently arrive from all critical systems: domain controllers and IAM, mail, VPN, EDR/antivirus, network devices (firewall, proxy), application servers, databases, cloud services (if present). For on-prem infrastructures, ensure collection doesn't break after updates or agent changes.
Time quality and linkability. Timestamps match (synchronization, unified timezone approach), and events carry identifiers that let you stitch them together: user, host, IP, session or request ID.
Search and access. An investigator finds "who, from where, to where and when" within a minute: search by user, IP, host and time range; results narrow quickly with filters; access to logs is provisioned in advance, not "during the fire."
Storage and retention. Retention periods by log type are clear, there is enough space, and alerts fire when storage is filling up. Critical logs are protected from tampering and deletion, at least by access restriction and change control.
Correlations and escalation. There are 5–10 working rules with clear alerts and a defined path: who reviews, in what timeframes, and who receives escalation.
Next steps: how to lock in the process and maintain the infrastructure
A working logging pipeline is a process, not a one-off project. A reliable approach is to start with a pilot on 1–2 critical services where downtime costs most (for example, mail and VPN, or domain and file server). In the pilot you quickly see which events are missing, where collection breaks, and what a "normal" investigation chain looks like.
Then codify rules in a regulation: not only what to collect but who is responsible. Otherwise auditing will be enabled ad hoc.
Minimal regulation that saves time
The document should answer five questions:
- who enables and checks auditing on sources
- who is responsible for storage, retention and backups
- who and how gets access to logs (roles, requests, timeframes)
- what is considered an incident and when to raise priority
- how often collection is checked (daily or weekly)
Also plan for growth. Event volume almost always increases after a pilot, so include headroom for disks and CPU and redundancy so the log store does not become a single point of failure.
If you are building or upgrading an on-prem infrastructure for event collection, it can be easier to rely on ready-made server platforms and integrations. For example, GSE.kz (gse.kz), a manufacturer and system integrator in Kazakhstan, supplies S200 Series servers and assists with designing and supporting IT environments, including 24/7 operation.
Arrange support for the logging pipeline itself as a separate item. A typical situation: collection was set up, an agent stopped sending events a month later, and the next investigation again has nothing to work with. With clear ownership, regular checks and a defined escalation channel, investigations start taking hours, not weeks.
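The "agent silently stopped sending" case is easy to catch with a regular check of when each source was last heard from. The sketch below assumes you can obtain a mapping of source name to the time of its most recent event; the one-hour default is a placeholder and should differ for chatty and quiet sources.

```python
from datetime import datetime, timedelta

def silent_sources(last_seen: dict[str, datetime], now: datetime,
                   max_silence: timedelta = timedelta(hours=1)) -> list[str]:
    """Return sources whose most recent event is older than the allowed silence."""
    return sorted(src for src, ts in last_seen.items() if now - ts > max_silence)
```

Run on a schedule and routed to the source owner, a check like this turns a month-long blind spot into a same-day ticket.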