Sandbox for malware file analysis: how to set up and evaluate
How to choose a sandbox for malware analysis, measure its impact, configure integrations (mail, EDR, SIEM) and reduce false positives.

Why you need a sandbox and where it helps
Antivirus and EDR catch many threats, but they have a weakness: new or well‑obfuscated files can look “clean” by signature and reputation. A sandbox fills that gap. It runs an attachment or installer in an isolated environment and watches what the file actually does.
The value appears when behavior matters more than the file’s “name.” What’s dangerous is usually not that a document opened or a process started, but the chain of actions: attempting to download a second stage, changing system settings, persisting to autostart, stealing credentials, contacting suspicious domains, or encrypting files.
There are also gray areas: Excel macros, administrative scripts, corporate agents. These aren’t always attacks, but they require context and rules, otherwise you’ll quickly drown in alerts.
Dangerous files arrive through several channels: email (invoices, acceptance certificates, résumés, password‑protected archives), messengers (“quick look at this file”, forwarded APK/EXE files), the web (downloads of “drivers” and “updates”, cloud attachments), removable media (contractor USB drives) and shared folders (when colleagues drop in a “useful utility”).
It’s important to set realistic expectations. A sandbox doesn’t give 100% detection: some malware can detect a virtual environment and stay dormant. It also shouldn’t block everything by itself. A realistic goal is to shrink the window between the appearance of a suspicious file and a clear decision (allow, isolate, investigate), especially on mail and the web gateway.
A practical example: in a large organization a file named “contract.docm” passes basic checks but the sandbox shows it tries to download a loader and change autostart keys. That’s a reason to stop internal distribution and investigate who sent the document and why.
How file analysis in a sandbox works, in simple terms
A malware analysis sandbox is an isolated “room” where you can safely test a suspicious file. Instead of guessing by name or hash, the system tries to observe what the file actually does.
Checks usually run in two layers.
First, static analysis. The sandbox looks at the file “from the outside”: structure, signatures, packers, suspicious strings, embedded macros, signs of code hiding. This is fast and cheap in resources, but attackers often bypass this layer.
Second, and more important, behavioral execution in an isolated environment. The file is “detonated”: it runs as it would on a normal machine, but inside a controlled VM where all actions are recorded. The sandbox watches whether the object tries to change system settings, spawn processes, encrypt files, inject into memory, download from the network or contact a command‑and‑control server.
A verdict is produced, typically one of: malicious, suspicious, benign, unknown.
The “unknown” status appears more often than desired. A file may stay quiet if it detects it’s in a sandbox or if it wasn’t given the conditions it needs.
Why can the same file produce different results in different sandboxes? Because details of the profile matter: Windows version, Office version, plugins, user rights, network access, system language. For example, a malicious document may activate only on a specific Office version with macros enabled. Without that, it will appear “clean.” Therefore, profiles should resemble your real endpoints rather than a generic build.
Choosing a solution: what to compare besides vendor name
When selecting a sandbox, people often focus on “who has higher detection.” That’s a weak guide. More important is how the solution fits your processes, where it will sit, and what data you’re allowed to send for analysis.
Cloud versus on‑prem is the first choice to make, and it starts with your policies and regulatory requirements. Cloud usually launches faster and scales easily, but you send file samples and telemetry outside your perimeter. On‑prem gives more control over data and retention but requires resources, updates and staff to operate it.
Compare real practice, not marketing. For Check Point SandBlast, FortiSandbox and Kaspersky Sandbox, it makes sense to compare the same things on the same file set: identical document, archive and executable types, identical execution conditions and the same wait time for results.
In reality, decisions are often driven by factors other than percentages:
- Integrations: mail gateway, web gateway, EDR, SIEM, SOAR, file stores.
- Clear reports: what exactly happened, which actions the file performed, why the verdict was given.
- API and automation: can you retrieve verdicts, artifacts and statuses without manual work.
- Frequency and transparency of model and signature updates.
- Performance: how many objects per hour, queue behavior and peak handling.
SOC teams usually need artifacts that are easy to correlate: IOCs (domains, URLs, IPs), hashes, process names, registry keys, file paths. If reports look pretty but don’t offer convenient exports, the team will quickly ignore results.
Also check storage requirements: how long to keep samples and reports, where they can be stored and who can access them. These rules sometimes immediately rule out options. In integration projects (for example, with GSE.kz) such an audit is typically done before a pilot to avoid rearchitecting after initial findings.
Measuring value: metrics and basic comparisons
A sandbox brings value only if changes are visible in numbers, not just “feelings.” First decide what you want to reduce: incident count, investigation time, or manual review workload.
Capture a baseline over 4–8 weeks before enabling the sandbox. Otherwise it will be hard to tell what actually improved versus what just became more visible.
Baseline: what to measure before deployment
Collect several metrics that SOC, IT and mail already track:
- How many files and attachments per week go to manual review.
- Share of investigations where the cause was a file or attachment.
- Average time from alert to first analyst action (MTTA).
- Average time to close an incident (MTTR).
- How often the same file was checked repeatedly (by hash or name).
After launch compare not only alert counts but alert quality. A good sandbox may increase event counts because it finds things that used to slip through.
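The time-based baseline metrics can be computed from exported incident timestamps. The following is a minimal sketch, assuming hypothetical records with `alert`, `first_action` and `closed` fields in ISO 8601 format; your ticketing or SIEM export will likely use different names.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (field names are assumptions, not a real export format).
incidents = [
    {"alert": "2024-03-01T09:00", "first_action": "2024-03-01T09:20", "closed": "2024-03-01T13:00"},
    {"alert": "2024-03-02T14:00", "first_action": "2024-03-02T14:40", "closed": "2024-03-02T16:00"},
]

def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60

def baseline(records: list[dict]) -> dict:
    """Mean time to first analyst action (MTTA) and to closure (MTTR), in minutes."""
    return {
        "mtta_min": mean(minutes_between(r["alert"], r["first_action"]) for r in records),
        "mttr_min": mean(minutes_between(r["alert"], r["closed"]) for r in records),
    }

print(baseline(incidents))
```

Running the same function on pre-deployment and post-deployment exports gives a like-for-like comparison, provided both periods use the same definition of "first action" and "closed".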
Benefit metrics: quality and efficiency
For quality, track confirmed verdicts (true positives) and how many remain “unknown.” The fewer unknowns and the more repeatable the verdicts for identical samples, the easier it is to build automated actions.
For efficiency, measure time to containment: minutes from file arrival to blocking on mail, EDR or proxy. Also calculate analyst hours saved. For example, if 30 checks per day took 10 minutes each and 20 of them are now closed automatically by sandbox verdicts, the team regains 200 minutes, or about 3.3 hours, per day.
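The hours-saved estimate is simple arithmetic and can be kept as a small helper so that it is recalculated from real numbers each month. This is a sketch; the function and parameter names are illustrative.

```python
def analyst_hours_saved(auto_closed_per_day: int, minutes_per_check: float) -> float:
    """Analyst hours regained per day when checks are closed by sandbox verdicts."""
    return auto_closed_per_day * minutes_per_check / 60

# The example from the text: 20 of 30 daily checks are closed automatically,
# each of which used to take 10 minutes (20 * 10 = 200 minutes).
saved = analyst_hours_saved(auto_closed_per_day=20, minutes_per_check=10)
print(f"{saved:.1f} hours per day")
```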
A common misinterpretation: an increase in alerts doesn’t always mean things got worse. If MTTR falls, manual review share decreases and confirmed verdict share rises, coverage improved. Noise is then reduced by tuning and processing rules.
Where to connect the sandbox: integration map inside the company
A sandbox delivers maximum value when attached to the points where files actually enter the company. Don’t try to “check everything” on day one: you will hit delays and a stream of borderline alerts.
1) Mail and web: main entry points
The mail gateway usually gives the fastest wins. Send attachments that commonly carry risk: Office documents with macros, PDFs, executables, scripts and archives (including nested archives).
Decide separately how to handle password‑protected archives: block or quarantine them, or require the password via a secure channel. Otherwise the sandbox can’t inspect the contents.
Be cautious with web downloads. If you inspect every download synchronously, you may break business flows (drivers, updates, large distributions). A common approach is to block clearly dangerous types (EXE, MSI, scripts, archives) and scan the rest asynchronously: the file has already reached the endpoint by the time the verdict arrives, but its execution can still be restricted.
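The web policy described above could be expressed as a small decision function. This is a sketch only: the extension list and the action names are assumptions for illustration, and a real gateway would also check content type rather than trusting the file name.

```python
import os
from enum import Enum

class WebAction(Enum):
    BLOCK = "block"
    ALLOW_AND_SCAN_ASYNC = "allow_and_scan_async"

# Assumed list of "clearly dangerous" extensions; adjust to your own policy.
BLOCKED_EXTENSIONS = {".exe", ".msi", ".js", ".vbs", ".ps1", ".bat", ".zip", ".rar", ".7z"}

def web_download_action(filename: str) -> WebAction:
    """Block dangerous types outright; let the rest through and scan them asynchronously."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in BLOCKED_EXTENSIONS:
        return WebAction.BLOCK
    return WebAction.ALLOW_AND_SCAN_ASYNC

print(web_download_action("Setup.EXE"))
print(web_download_action("price-list.pdf"))
```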
2) EDR/XDR, SIEM and automation: making the verdict actionable
Plan exchanges so results don’t sit in a separate console.
EDR/XDR typically receives hashes, IOCs (domains, IPs, paths, process names) and the final verdict with confidence level. In return, it’s useful to get host context: who downloaded the file, where it resides, whether it executed.
In the SIEM, normalize fields; otherwise the same event arrives in different shapes and turns into noise. Minimum fields: SHA256/MD5, file name, source (mail/web/EDR), user, host, verdict (malicious/suspicious/benign), processing status (submitted/in progress/error), confidence and timestamps.
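One way to enforce such a schema is to map every vendor payload onto a single record type before it reaches the SIEM. The sketch below assumes a hypothetical raw payload with keys such as `hash`, `channel`, `state`, `score` (0 to 100) and `ts` (Unix time); none of these come from a specific product.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class SandboxEvent:
    sha256: str
    file_name: str
    source: str        # "mail", "web" or "edr"
    user: str
    host: str
    verdict: str       # "malicious", "suspicious" or "benign"
    status: str        # "submitted", "in_progress" or "error"
    confidence: float  # assumed scale: 0.0 to 1.0
    timestamp: str     # ISO 8601, UTC

def normalize(raw: dict) -> SandboxEvent:
    """Map a hypothetical vendor payload onto the common schema."""
    return SandboxEvent(
        sha256=raw["hash"].lower(),
        file_name=raw.get("filename", ""),
        source=raw.get("channel", "unknown").lower(),
        user=raw.get("user", ""),
        host=raw.get("host", ""),
        verdict=raw["verdict"].lower(),
        status=raw.get("state", "submitted").lower(),
        confidence=float(raw.get("score", 0)) / 100,  # assumed vendor score 0-100
        timestamp=datetime.fromtimestamp(raw["ts"], tz=timezone.utc).isoformat(),
    )

raw_event = {"hash": "AB12", "filename": "invoice.zip", "channel": "Mail",
             "user": "a.accountant", "host": "PC-017", "verdict": "Malicious",
             "state": "in_progress", "score": 90, "ts": 0}
print(asdict(normalize(raw_event)))
```

Lowercasing hashes, sources and verdicts at this stage makes later correlation and deduplication a plain equality check.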
In SOAR and ticketing automate only safe actions: auto‑enrichment, ticket creation, tagging as “suspicious”, temporary quarantine of the mail. Aggressive steps (hash blocking, host isolation) are best enabled only at high confidence and after the pilot.
Example flow: an accountant receives an archive labeled “invoice.” The message goes to quarantine, the archive is sent to the sandbox, SIEM marks status in progress, and SOAR creates a ticket. If the verdict is malicious with high confidence, EDR blocks execution by hash and scans for the file on other PCs. If suspicious, the ticket goes to an analyst without automatic blocking.
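The decision step of this flow could be sketched as a function that maps a verdict to follow-up actions. The action names and the 0.9 threshold are assumptions, and the handling of benign and unknown results is not described in the flow above, so it is marked as such in the code.

```python
HIGH_CONFIDENCE = 0.9  # assumed threshold on a 0.0-1.0 scale

def follow_up_actions(verdict: str, confidence: float) -> list[str]:
    """Return follow-up actions once the sandbox verdict arrives.

    The message is assumed to be in quarantine already, with a ticket open.
    """
    if verdict == "malicious" and confidence >= HIGH_CONFIDENCE:
        return ["edr_block_hash", "edr_search_other_hosts"]
    if verdict == "benign":
        # Assumption: the example flow does not cover benign results.
        return ["release_from_quarantine", "close_ticket"]
    # Suspicious, unknown or low-confidence malicious: no automatic blocking.
    return ["assign_ticket_to_analyst"]

print(follow_up_actions("malicious", 0.97))
print(follow_up_actions("suspicious", 0.60))
```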
Step‑by‑step deployment: from pilot to production
Start not with installation but with a map of file flows. Where do risky files typically arrive: mail, web, messengers, file shares, USBs, document portals? Enable checks at those points; otherwise the sandbox will analyze things that aren’t your real risk.
Then define clear rules for what to send. Usually you filter by type (office documents, archives, executables), size and suspicion level (external mail, a file EDR flags as unknown). This shortens the queue and gets verdicts to users faster.
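A submission rule of this kind might look like the sketch below. The extension list, the 50 MB limit and the requirement that a file be both a risky type and from a suspicious origin are assumptions chosen for illustration; your own rule may combine the criteria differently.

```python
import os
from dataclasses import dataclass

# Assumed values; tune them to your own traffic and sandbox limits.
RISKY_EXTENSIONS = {".doc", ".docm", ".xls", ".xlsm", ".pdf", ".zip", ".rar", ".7z", ".exe", ".msi"}
MAX_SIZE_BYTES = 50 * 1024 * 1024

@dataclass
class FileMeta:
    name: str
    size_bytes: int
    external_sender: bool = False  # arrived in mail from outside the company
    edr_unknown: bool = False      # EDR has no verdict for this file

def should_submit(f: FileMeta) -> bool:
    """Send a file to the sandbox only if it is small enough, risky by type and of suspicious origin."""
    if f.size_bytes > MAX_SIZE_BYTES:
        return False
    risky_type = os.path.splitext(f.name)[1].lower() in RISKY_EXTENSIONS
    suspicious_origin = f.external_sender or f.edr_unknown
    return risky_type and suspicious_origin

print(should_submit(FileMeta("invoice.DOCM", 120_000, external_sender=True)))
```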
A separate item is realistic execution environments. If your company uses Windows 11 and a specific Office version but the sandbox emulates an old OS and different browser, you’ll get misses and false alerts. In practice keep 2–3 profiles: a standard office PC, an accounting workstation with required plugins, and a separate profile for web download tests.
Before the pilot, agree on the action for each verdict status to avoid disputes:
- what to block immediately vs what goes to quarantine;
- what to allow under observation (e.g., borderline macros);
- when manual review is needed (SOC or on‑call security);
- how quickly to respond to business reports of false positives;
- who can grant a temporary exception and how it’s documented.
Enable logging and verify delivery of events to SIEM and EDR. You need not only “verdict: malicious” but context: file source, user, hash, action chain. Practically, this saves hours during triage.
Run the pilot on a limited group: a single branch or a high‑risk department (finance, procurement). After 2–3 weeks expand in phases, reviewing submission rules and actions after each step. If an integrator runs the deployment, this staged approach helps surface conflicts with mail, proxy and SIEM early, before they affect the whole organization.
How to avoid drowning in false positives: tuning and discipline
False positives arise not because the product is bad, but because legitimate activity can resemble malware. Legitimate PowerShell scripts, office macros, admin utilities (remote management tools), in‑house installers and updaters may spawn processes, change the registry and contact the network. To a sandbox this looks suspicious.
The key idea: separate “suspicious” from “certainly malicious.” For medium‑confidence events, start with observation, not blocking. It is often better to quarantine the file, limit execution to a subset of users, or request confirmation from the app owner.
To prevent exceptions from becoming holes, set clear rules for creating and reviewing them. Scope each exception as narrowly as possible:
- by vendor: a trusted signed publisher;
- by hash: a specific version, which must be updated with each release;
- by path: only for controlled folders;
- by user group: for example, admin tools only for IT.
Every exception should also expire: give it a review date and an owner.
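The exception rules described above could be modeled so that an expired exception simply stops matching. This is a sketch under stated assumptions: the `kind` values double as keys of a hypothetical file-attribute dictionary, and path exceptions are compared by exact match for brevity.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ExceptionRule:
    kind: str        # assumed keys: "publisher", "sha256", "path" or "user_group"
    value: str
    owner: str       # person or role accountable for the exception
    review_by: date  # the exception stops applying after this date

    def is_active(self, today: date) -> bool:
        return today <= self.review_by

def applies(rule: ExceptionRule, file_attrs: dict, today: date) -> bool:
    """An exception applies only while it is active and its attribute matches exactly."""
    return rule.is_active(today) and file_attrs.get(rule.kind) == rule.value

rule = ExceptionRule(kind="publisher", value="Example Accounting LLC",
                     owner="it-apps", review_by=date(2024, 6, 30))
attrs = {"publisher": "Example Accounting LLC", "sha256": "ab12"}
print(applies(rule, attrs, date(2024, 6, 1)))
print(applies(rule, attrs, date(2024, 7, 1)))
```

Making expiry part of the matching logic means a forgotten exception fails closed: alerts for that file reappear and force a review.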
Be disciplined about thresholds and alert prioritization. Show analysts high‑confidence, high‑impact events first (e.g., execution from mail or temp folders, attempts to persist). Put the rest in a separate queue for daily review.
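Splitting alerts into an urgent queue and a daily-review queue could be sketched as follows. The 0.8 threshold and the alert fields (`launched_from`, `persistence_attempt`) are assumptions, not fields of any particular product.

```python
URGENT_CONFIDENCE = 0.8  # assumed threshold on a 0.0-1.0 scale

def is_urgent(alert: dict) -> bool:
    """High confidence combined with a high-impact indicator."""
    high_impact = alert["launched_from"] in {"mail", "temp"} or alert["persistence_attempt"]
    return alert["confidence"] >= URGENT_CONFIDENCE and high_impact

def split_queues(alerts: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (urgent alerts sorted by confidence, alerts for daily review)."""
    urgent = sorted((a for a in alerts if is_urgent(a)),
                    key=lambda a: a["confidence"], reverse=True)
    daily = [a for a in alerts if not is_urgent(a)]
    return urgent, daily

alerts = [
    {"id": 1, "confidence": 0.85, "launched_from": "program_files", "persistence_attempt": True},
    {"id": 2, "confidence": 0.60, "launched_from": "temp", "persistence_attempt": True},
    {"id": 3, "confidence": 0.95, "launched_from": "mail", "persistence_attempt": False},
    {"id": 4, "confidence": 0.90, "launched_from": "program_files", "persistence_attempt": False},
]
urgent, daily = split_queues(alerts)
print([a["id"] for a in urgent], [a["id"] for a in daily])
```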
Practical example: accounting uses Excel with macros and IT runs scripts for driver installs. Initially the sandbox generates many alerts. Identify two safe business apps, add exceptions by publisher and user group, and keep macros in “suspicious but not blocked” mode with notifications. Result: less noise while real malicious activity still stands out.
Common deployment mistakes and how to avoid them
Disappointment usually stems from how the sandbox was integrated, not detection quality. Mistakes repeat and are fixed with settings and agreements.
Mistake 1: enabling blocking immediately
If you enable strict blocking by sandbox verdict on day one you risk halting normal processes: mailings, finance files, software updates, partner document exchange. Start in observation mode: collect verdicts, see what triggers, and only then enable blocking for pre‑agreed categories.
Mistake 2: no owner for exceptions and no rule review
Exceptions are inevitable, but without an owner they grow uncontrolled. Minimum practice: assign an exceptions owner (a role, not “everyone”), set an exception lifetime (30–90 days) and review monthly what to remove or replace with more precise rules.
Mistake 3: one policy for all departments
Different teams have different risks and delay tolerance. Finance gets archives and macros, devs get tools and scripts, call centers need fast attachment opening. Create 2–3 policy profiles (strict, normal, permissive) and map them to user groups and channels (mail, web, file share).
Mistake 4: integrating with SIEM without normalization and deduplication
If events stream into SIEM “as is” you’ll get many duplicate alerts: the same attachment may trigger multiple records from different sources. Agree on field schema (hash, file name, source, user, verdict, confidence) and deduplicate by hash and time so analysts see one incident, not noise.
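Deduplication by hash and time could be sketched as below: events with the same hash that arrive within a window of the first sighting are merged into one incident. The one-hour window and the event fields are assumptions.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)  # assumed deduplication window

def deduplicate(events: list[dict]) -> list[dict]:
    """Merge events with the same hash seen within WINDOW of the incident's first event."""
    incidents = []
    latest_by_hash = {}  # sha256 -> most recent incident for that hash
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        incident = latest_by_hash.get(ev["sha256"])
        if incident is not None and ev["timestamp"] - incident["first_seen"] <= WINDOW:
            incident["sources"].add(ev["source"])
            incident["count"] += 1
        else:
            incident = {"sha256": ev["sha256"], "first_seen": ev["timestamp"],
                        "sources": {ev["source"]}, "count": 1}
            latest_by_hash[ev["sha256"]] = incident
            incidents.append(incident)
    return incidents

t0 = datetime(2024, 3, 1, 9, 0)
events = [
    {"sha256": "ab12", "source": "mail", "timestamp": t0},
    {"sha256": "ab12", "source": "edr", "timestamp": t0 + timedelta(minutes=5)},
    {"sha256": "ab12", "source": "web", "timestamp": t0 + timedelta(hours=2)},
]
for incident in deduplicate(events):
    print(incident["sha256"], incident["count"], sorted(incident["sources"]))
```

Keeping the set of sources on the merged incident preserves the fact that the same file was seen on several channels, which is itself a useful signal.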
Mistake 5: expecting the sandbox to solve everything
A sandbox signals risk but doesn’t replace the response process. Define who decides actions for malicious vs suspicious, how to request files from users, and how to roll back false blocks.
A simple approach: if an attachment is suspicious, open a ticket and ask the sender to confirm purpose. Block only on repeats, poor reputation, or an additional signal (for example from EDR). This lowers risk and avoids paralysis.
Short checklist before launch and for the first week
Before starting, ensure files actually reach analysis and verdicts go to decision makers. Otherwise the sandbox will run but you won’t see benefits.
Before launch (a day or two before)
Run through this list and record results in a short doc (who checked what and when):
- Ensure all file submission points are configured and not truncating flows: mail, web gateway, proxy, file shares, EDR. Check size limits, archive handling, password‑protected attachments and uncommon types (scripts, office macros, ISOs).
- Open sample reports and verify they include what investigators need: IOCs, process tree, network calls, registry and file changes. If reports are pretty but empty of substance, they’re hard to act on.
- Verify where verdicts land and who sees them: mail queue, EDR console, SIEM, on‑call team. Alerts must not go into an unread “general inbox.”
- Review auto‑actions and leave only safe ones: mail quarantine, tagging in EDR, incident creation. Enable host isolation and user blocking later when trust in verdict quality grows.
- Predefine how exceptions are recorded: approver, duration, reason and cancellation procedure.
First week (to avoid drowning)
Hold a short daily 15–20 minute ritual: review top‑5 findings, confirm actions taken and mark false positives. If the sandbox repeatedly flags an internal accounting installer, create a temporary exception for the specific hash or signature and request a clean build from the app owner.
At week’s end produce a mini‑report: how many files were analyzed, how many verdicts reached mail/EDR/SIEM, how many false positives and recurring causes. Use it to tune integrations and decide which auto‑actions can be safely expanded.
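The weekly mini-report can be generated from the same event records rather than assembled by hand. A minimal sketch, assuming hypothetical per-file records with `delivered_to`, `false_positive` and `fp_cause` fields:

```python
from collections import Counter

def weekly_report(records: list[dict]) -> dict:
    """Summarize one week of sandbox results."""
    false_positives = [r for r in records if r.get("false_positive")]
    return {
        "analyzed": len(records),
        "verdicts_delivered": sum(1 for r in records if r.get("delivered_to")),
        "false_positives": len(false_positives),
        "top_fp_causes": Counter(r["fp_cause"] for r in false_positives).most_common(3),
    }

# Hypothetical sample data for illustration only.
records = [
    {"sha256": "a1", "delivered_to": ["siem", "edr"], "false_positive": False},
    {"sha256": "b2", "delivered_to": ["siem"], "false_positive": True, "fp_cause": "accounting macro"},
    {"sha256": "c3", "delivered_to": [], "false_positive": True, "fp_cause": "accounting macro"},
    {"sha256": "d4", "delivered_to": ["mail"], "false_positive": True, "fp_cause": "internal installer"},
]
print(weekly_report(records))
```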
Example scenario: reducing risk without overblocking
Imagine a company that receives dozens of external emails daily with attachments: invoices, acceptance certificates, requests, archives. Mail filters are in place, but occasionally files that look normal hide macros or loaders.
To avoid disruption, the pilot connects the sandbox selectively rather than to the entire stream. The logic: check what actually carries risk and leave routine traffic untouched.
How it works in practice
The sandbox is connected to the mail gateway or EDR with limits. Send for analysis attachments from new domains (previously unseen senders) and rare or risky file types (archives containing executables, macro‑enabled documents). The rest flows through existing checks without delay.
When an attachment looks suspicious it isn’t silently blocked: it goes to quarantine and the user receives a clear notice explaining what happened, where the file is, and what to do if it’s urgently needed (e.g., request a security check).
What the analyst does and why false positives drop
In week one an analyst triages recurring cases: accounting templates with macros, specific archives from a contractor, internal utilities. Don’t “allow everything”; add precise exceptions by sender, hash, certificate chain or file fingerprint only when safe and verified.
After a month two effects usually emerge. First, fewer repeat manual investigations: recurring files no longer surface as new incidents. Second, risk statistics become clearer: how many attachments were analyzed, how many were truly malicious, how many quarantines were validated. This approach scales easily when rules and exception governance are in place.
Next steps: organizing a pilot and cementing results
Start by gathering requirements, otherwise the pilot will turn into a debate. Document where you want to catch malware (mail, web, EDR, file shares), which integrations are available and what data restrictions apply (personal data, classified info, bans on sending samples to external clouds). Specify expectations for response time: acceptable verdict wait times and handling of files stuck in queue.
Plan the pilot for 2–4 weeks and agree in advance how you measure success. Capture baseline metrics: incident counts, investigation times and common gaps.
Roles and responsibilities
A frequent cause of failure is no single owner. Assign owners and escalation rules:
- Security (InfoSec): blocking policy, data permissions, final risk decisions.
- IT: integrations, mail and proxy gateways, routing and access.
- SOC: daily triage, feedback on false positives, triage rules.
- Business app owner: exceptions, critical processes, what must not be broken.
- Operations: monitoring, backups, updates and incident recovery.
Infrastructure for an on‑prem sandbox
If on‑prem is required, assess compute and storage needs: peak file throughput, parallel analyses, artifact/report retention and redundancy. A common bottleneck is not licensing but the analysis queue and lack of disk for dumps.
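A back-of-the-envelope sizing check can show early whether the analysis queue or the disk is the bottleneck. The formulas below are a rough sketch, and every input value is an assumption to be replaced with figures from your pilot.

```python
def hourly_capacity(parallel_vms: int, avg_analysis_minutes: float) -> float:
    """Files the sandbox can finish per hour with all analysis VMs busy."""
    return parallel_vms * 60 / avg_analysis_minutes

def retention_storage_gb(files_per_day: int, avg_artifacts_mb: float, retention_days: int) -> float:
    """Disk needed to keep reports and dumps for the whole retention period."""
    return files_per_day * avg_artifacts_mb * retention_days / 1024

# Assumed inputs for illustration.
capacity = hourly_capacity(parallel_vms=8, avg_analysis_minutes=5)
peak_files_per_hour = 120
backlog_per_hour = max(0, peak_files_per_hour - capacity)
storage = retention_storage_gb(files_per_day=2000, avg_artifacts_mb=50, retention_days=90)

print(f"capacity: {capacity:.0f} files/hour, backlog growth at peak: {backlog_per_hour:.0f} files/hour")
print(f"storage for retention: {storage:.0f} GB")
```

If the backlog figure is above zero, the queue grows for as long as the peak lasts, so either add analysis VMs or tighten the submission rules.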
To cement results, record pilot metrics (share of confirmed threats, time to verdict, manual work reduction, number and quality of exceptions) and formalize a procedure: who adds exceptions, when rules are reviewed (monthly), and how to react to mass false positive events.
If you need an integrator, choose one that covers the full cycle: server sizing and deployment plus integration and support. In Kazakhstan some tasks are conveniently handled by GSE.kz: as a systems integrator and hardware vendor they can supply on‑prem infrastructure and help with integrations (mail, SIEM, EDR) as a single project.
FAQ
Why do we need a sandbox if we already have antivirus and EDR?
A sandbox is needed to observe file behavior, not just a signature or reputation. It runs a suspicious object in isolation and shows whether it tries to download a second‑stage payload, persist in the system, change the registry, contact suspicious domains or encrypt data.
How does a sandbox check a file in simple terms?
Analysis usually has two steps: a quick static inspection (structure, macros, packers, suspicious strings) and then execution in an isolated environment with action recording. The output is a verdict like “malicious/suspicious/benign/unknown” plus technical details you can use in an investigation.
Where should you connect a sandbox first: mail, web or EDR?
Email usually gives the fastest benefit because many risky attachments arrive there. Web downloads are the next most valuable channel, but you must avoid breaking business processes with delays, so start with risky file types and clear rules.
What to choose: cloud sandbox or on‑prem?
Cloud is easier to start and scale, but you send samples and telemetry outside your perimeter, which may be forbidden by policy or regulation. On‑prem gives better control over data and retention but requires servers, storage and operational staff.
What to look for in a sandbox besides brand and detection percentages?
Look beyond raw detection rates. Check how the solution fits your processes: integrations with mail, proxy, EDR, SIEM and ticketing; whether reports are actionable and whether verdicts and artifacts are accessible via API. If a report is pretty but you can’t extract IOCs and context, teams will start ignoring it.
Which metrics actually show sandbox value?
Start with a baseline for 4–8 weeks: how many files are checked manually, how many incidents involve attachments, current MTTA and MTTR. After deployment, measure time to isolation/block, SOC hours saved, and the share of files left as “unknown.”
Why does the same file sometimes have different results in different sandboxes?
Differences happen when execution profiles differ between sandboxes (and from your endpoints) or when the sample detects the VM and stays quiet. Use 2–3 realistic profiles (OS version, Office, permissions, language, network access) and reasonable timeouts so results match real life.
How to avoid drowning in false positives after launch?
Don’t try to inspect everything at once and avoid enabling hard blocking on day one. Start with quarantine and manual review for borderline verdicts, then add precise exceptions and prioritization so analysts see high‑confidence, high‑impact events first.
What to do with password‑protected archives and unusual attachments?
A sandbox can’t open a password‑protected archive without the password. Treat such attachments specially: quarantine them and request the password through an agreed secure channel, otherwise you’ll have a blind spot for a common malware delivery method.
How to run a sandbox pilot and lock in results?
Plan a 2–4 week pilot and agree in advance the actions for “malicious” and “suspicious” verdicts so decisions aren’t argued in the moment. If on‑prem is required, assess compute and storage needs as well as integrations with mail, SIEM and EDR; GSE.kz can help in Kazakhstan with infrastructure and deployment as a unified project.