Archival document storage: ECM, WORM and tests before procurement
Archival document storage: how to split responsibilities between ECM and WORM storage and which tests to run before procurement — performance, immutability and audit.

Why plan the archive in advance and who is responsible
Archival document storage usually fails not at the moment of "dropping the file in", but later: when after a year or five you need to quickly find and deliver a document, prove it wasn't changed, and show a clear action trail. If this isn’t planned from the start, the archive becomes a "just in case" folder you can't trust.
Three problems typically surface: search is slow or returns irrelevant results; delivery SLAs are missed (especially during inspections or court requests); and "immutability" rests on promises rather than verifiable mechanisms. In government bodies and large companies this quickly creates risks: disputes over documents, auditor findings, fines, and leaks due to excessive access rights.
From the beginning it’s important to split roles between ECM and storage. ECM is responsible for meaning and usability: cards, metadata, workflows, rights, versions, lifecycle and search. Storage is responsible for the "hardware" side: where and how content is kept, how WORM is enforced, how copies and recovery are done, and what delivery-speed guarantees exist.
Before tendering or procurement it's useful to answer several questions on paper: which documents are kept and for how long, who can read and export them; what metrics matter (search time, delivery time, concurrent requests, volumes); what counts as immutability (delete prohibition, no overwrite, metadata protection, retention); what audit looks like (who logs what, where logs are stored, can they be tampered with); and how acceptance testing will proceed (scenarios, tests, pass/fail criteria).
When roles and metrics are clear, conversations with vendors become substantive: it's easier to compare solutions and demand tests instead of relying on presentations.
ECM, WORM and search: a simple responsibility map
For an archive to work for years, responsibilities must be split in advance: what ECM does, what storage does, and where WORM is ensured. Otherwise a gray zone appears in the project—and risks hide in that gray zone.
ECM treats the document as an object: accept the file, assign a type, fill the card, link to a case, start approval routes, grant access. This is the layer of processes and rules for people.
Storage handles the long life of the file: physical placement, copies, integrity checks and immutability mode. If WORM is required, the storage layer (or a dedicated WORM layer) must guarantee the file cannot be overwritten, deleted before its time, or silently replaced.
Search is almost always split. ECM searches metadata well (number, date, counterparty, document type) and structure. Full-text search (OCR index of scans) can be in ECM, a separate search engine, or partially in storage. This should be written down: where the index is built and who is responsible for its freshness.
A handy responsibility map for vendor meetings:
- ECM: cards, workflows, roles and rights, business rules, delivering documents to users.
- Storage: physical placement, copies and replication, WORM, integrity control.
- Search: metadata search in ECM; the content index in whichever component actually runs full-text queries.
- Audit: user events in ECM, content-level events (write, lock, modification attempts) at storage.
Gaps usually appear in three places: who performs OCR and maintains the index; where WORM is enabled (ECM promises it while storage hasn’t turned it on); and where audit is stored so it can't be "cleaned" together with the problem document.
Requirements to fix before choosing a vendor
To prevent the archive turning into a list of promises, write requirements so they can be verified in a pilot. Describe not "the system in general" but concrete documents, users and operations: what is uploaded, how people search, how documents are delivered, who signs, who inspects, who may delete (and who may never delete).
Tie functional requirements to retention rules. Specify formats (PDF/A, TIFF, Office, email), retention periods by document type, classification rules, roles and rights (operator, archivist, security, auditor). Clarify what counts as "delivery" (view, file export, print, export with signature/stamp) and which events must be logged.
Non-functional requirements must be measurable. Instead of "searches quickly", set numbers: field search within 3 s on a 50M-card database; opening a found 20 MB document within 5 s; RTO of 2 hours and RPO of 15 minutes; a maintenance window of no more than 4 hours per month; growth of 2 TB per month over 5 years.
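Numeric targets like these are easier to reuse in a pilot when they are kept as data rather than prose. The sketch below is one possible way to record them in Python and to work out what the growth figure means in total; the class and field names are illustrative assumptions, and the values are simply the example numbers from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NonFunctionalTargets:
    """Example targets from the text; replace with your own agreed figures."""
    field_search_seconds: float = 3.0        # field search on a 50M-card database
    open_document_seconds: float = 5.0       # opening a 20 MB file
    rto_hours: float = 2.0
    rpo_minutes: float = 15.0
    maintenance_hours_per_month: float = 4.0
    growth_tb_per_month: float = 2.0
    planning_years: int = 5


def projected_growth_tb(targets: NonFunctionalTargets) -> float:
    """Volume added over the planning horizon, before copies and replicas."""
    return targets.growth_tb_per_month * 12 * targets.planning_years


targets = NonFunctionalTargets()
print(projected_growth_tb(targets))  # 120.0
```

The 120 TB figure covers new content only; copies, replicas and index storage have to be added on top when sizing the storage layer.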
For audit specify mandatory fields: who did what, when, from where (IP/workstation), which document, and how long logs are retained. For integrations list sources and scenarios: scanning, email, ERP, HR, EDI.
Useful rule: every requirement should become a test. A handy template: "given, when, then." Example: given 10 concurrent users, when they search by number and date, then 95% of queries complete within the target time and all actions appear in the audit report.
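The "given, when, then" example can be written down as an executable check. This is a minimal sketch, assuming the latencies and the audit report have already been collected during the pilot; the function name, the identifiers and the sample values are hypothetical.

```python
def meets_requirement(latencies_s, performed_ids, audited_ids, target_s):
    """Then-clause: at least 95% of queries within target and every action audited."""
    if not latencies_s:
        return False  # no measurements means the requirement is unproven
    within_target = sum(1 for t in latencies_s if t <= target_s)
    # Integer comparison avoids floating-point surprises at exactly 95%.
    fast_enough = within_target * 100 >= 95 * len(latencies_s)
    fully_audited = set(performed_ids) <= set(audited_ids)
    return fast_enough and fully_audited


# Given: 10 concurrent users, two searches each (made-up values for illustration).
latencies = [1.2] * 19 + [4.1]
performed = [f"q{i}" for i in range(20)]
audited = list(performed)

# When they search by number and date, then the check below must hold.
print(meets_requirement(latencies, performed, audited, target_s=3.0))  # True
```

Keeping the speed condition and the audit condition in one check reflects the wording of the requirement: a fast run with a missing audit entry still fails.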
Immutability (WORM): what exactly to protect
WORM is useful when the archive must prove a fact: the document existed in this form at that date and couldn't be retroactively edited. So describe in advance which objects and actions fall under prohibition.
Typically WORM applies to final, legally significant versions: contracts, invoices, HR documents, medical records, audit results, protocols. Usually overwrite and deletion are prohibited, and retention is enforced so no one can remove the document before the retention end—even an admin.
Agree what counts as a change. Sometimes the system locks the file content but quietly allows edits to the card: author, date, tags, links. For an archive this is critical.
Ensure protection covers:
- file content (byte-for-byte)
- key metadata (date, number, counterparty)
- electronic signatures and timestamps (if used)
- versions and history (past versions cannot be overwritten)
Corrections should be recorded as a new version or a new document explicitly linked to the original: "correction to", "annuls", "replaces". The old object remains immutable and available for inspection, while users see which version is current.
Formulate acceptance criteria as checks. Example: an attempt to delete a document before the retention end must be denied; an attempt to replace the file or change key metadata must be blocked and logged; after an admin change and a service restart, the object remains unchanged and readable.
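To show what such checks look like when automated, here is a toy model in Python. It is only a sketch: `ToyWormStore` is an in-memory stand-in invented for this example, not a vendor API, and a real pilot would issue the same attempts against the actual ECM and storage interfaces.

```python
from dataclasses import dataclass, field
from datetime import date


class RetentionError(Exception):
    """Raised when an operation would violate the retention rules."""


@dataclass
class ToyWormStore:
    """In-memory stand-in for a WORM layer (hypothetical, not a vendor API)."""
    objects: dict = field(default_factory=dict)  # object_id -> (content, retain_until)
    audit: list = field(default_factory=list)    # (action, object_id, result)

    def put(self, object_id, content, retain_until):
        if object_id in self.objects:
            self.audit.append(("put", object_id, "denied"))
            raise RetentionError(f"{object_id}: overwrite is not allowed")
        self.objects[object_id] = (content, retain_until)
        self.audit.append(("put", object_id, "ok"))

    def delete(self, object_id, today):
        _, retain_until = self.objects[object_id]
        if today < retain_until:
            self.audit.append(("delete", object_id, "denied"))
            raise RetentionError(f"{object_id}: retained until {retain_until}")
        del self.objects[object_id]
        self.audit.append(("delete", object_id, "ok"))


def was_denied(action):
    """Return True if the action raised RetentionError, False if it went through."""
    try:
        action()
    except RetentionError:
        return True
    return False


def run_acceptance_checks(store, today):
    store.put("contract-001", b"final version", date(2035, 1, 1))
    results = {
        "overwrite_denied": was_denied(
            lambda: store.put("contract-001", b"edited", date(2035, 1, 1))),
        "early_delete_denied": was_denied(
            lambda: store.delete("contract-001", today)),
    }
    results["content_unchanged"] = store.objects["contract-001"][0] == b"final version"
    denied = [event for event in store.audit if event[2] == "denied"]
    results["attempts_logged"] = len(denied) == 2
    return results


print(run_acceptance_checks(ToyWormStore(), today=date(2025, 6, 1)))
```

Each key in the result maps to one acceptance criterion, so a failed pilot run points directly at the rule that was broken.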
How to prepare a testbed and load scenarios
A testbed shows archive behavior in real conditions: when accounting bulk-loads documents, lawyers search for a 2017 contract, and inspectors require a package within an hour. The closer the test is to reality, the fewer surprises after procurement.
Start with a test dataset that resembles your archive in format, age and "dirt": scans of varying quality, long filenames, mixed languages. Keep a set that can be moved between solutions unchanged. Check type diversity (text and scanned PDFs, DOCX/XLSX, images, occasional ZIP), sizes (hundreds of KB to hundreds of MB), time span (at least 5–10 years), roles and access levels, and metadata where some records are complete and others have gaps or errors.
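Before running any load, it helps to confirm that the test dataset really has the diversity described above. The following sketch profiles a folder by file type and size band; the bucket limits and the function names are assumptions chosen for illustration, and the demo files are created in a temporary directory only so that the example runs on its own.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Upper limits in bytes with their labels; anything larger falls in the last band.
SIZE_BUCKETS = ((1_000_000, "<1 MB"), (100_000_000, "1-100 MB"))


def size_bucket(size_bytes):
    for limit, label in SIZE_BUCKETS:
        if size_bytes < limit:
            return label
    return ">=100 MB"


def profile_dataset(root):
    """Count files under root by lower-cased extension and by size band."""
    by_type, by_size = Counter(), Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_type[path.suffix.lower() or "(none)"] += 1
            by_size[size_bucket(path.stat().st_size)] += 1
    return by_type, by_size


# Demo on a throwaway folder; point profile_dataset at the real pilot set instead.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "scan.PDF").write_bytes(b"x" * 10)
    (Path(root) / "contract.pdf").write_bytes(b"x" * 20)
    (Path(root) / "report.docx").write_bytes(b"x" * 30)
    by_type, by_size = profile_dataset(root)

print(dict(by_type), dict(by_size))
```

A profile with a single extension or a single size band is an early sign that the pilot will look better than production.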
Describe load profiles to repeat under the same conditions (data volume, workstations, network, time window). Usually five scenarios are enough: bulk ingest with indexing; "normal day" (field search, open card, download); 15–30 minute peak; long run 2–4 hours; failures (service restart, temporary network loss, retry delivery).
Keep metrics simple: search time, delivery time, error rate and stability (does speed degrade over an hour?). For comparison between vendors a table of medians and "90% of requests faster than…" is sufficient.
Save artifacts so results can be defended at acceptance: ECM and storage logs, CPU/RAM/disk/network metrics, test reports, screenshots of settings, and a record of test conditions (who, where, when, dataset). This is especially helpful if someone else built the testbed.
Checking search and delivery speed: step by step
Speed matters not only for convenience. Users, auditors and security care how fast a document is found and the original delivered, especially under different access rights. Fix test conditions in advance: volume, file types, user count and identical queries.
Use a realistic baseline: scans (PDF), office files, emails, contracts. Metadata should be realistic: number, date, counterparty, department, status, retention.
A convenient verification process uses p50 and p95 (the median and the 95th percentile, which shows near-worst-case behavior):
- Upload a batch with metadata and record time until search is fully available. Measure not only ingest time but the moment the record appears in results and opens without errors.
- Test two search types: by fields and by text (if OCR/full-text exists). Use the same 10–20 queries and measure time from click to results.
- Measure delivery: open card, show preview, download original. Repeat for different roles (employee, manager, archivist): rights checks can slow delivery.
- Apply parallel load and watch p95 changes, timeouts and preview failures.
- Repeat key measurements after service restarts and reindexing to see warm vs cold differences.
Present results in a table: query, role, concurrent users, search time, preview time, original download time. That makes acceptance criteria straightforward.
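The measurement steps above can be sketched as a small harness that applies parallel load and produces one table row per step. This is an illustration under stated assumptions: `fake_search` is a placeholder for a real ECM call, threads stand in for concurrent users, and the percentile uses the nearest-rank definition, which may differ slightly from what a monitoring tool reports.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


def timed(operation):
    """Run the operation once and return its duration in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def run_load(operation, concurrent_users, requests_per_user):
    """Issue the requests through a pool sized to the number of simulated users."""
    total = concurrent_users * requests_per_user
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        return list(pool.map(lambda _: timed(operation), range(total)))


def summarize(step, role, concurrent_users, latencies_s):
    return {
        "step": step,
        "role": role,
        "concurrent_users": concurrent_users,
        "requests": len(latencies_s),
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
    }


def fake_search():
    time.sleep(0.01)  # placeholder for a real ECM search call


row = summarize("field search", "archivist", 5, run_load(fake_search, 5, 4))
print(row)
```

Running the same harness once right after a service restart and once after a warm-up gives the cold and warm rows for the comparison.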
WORM and tampering tests: what to try
WORM promises: write once, and you cannot silently replace or remove the record. In practice risks appear where ECM and storage meet: retention may be bypassed, admins may have "super rights", or metadata edits change a document's meaning.
Plan tests from both sides: via ECM (as normal user and as admin) and directly at storage level (as storage admin). Protection must work in both.
Five checks to perform before procurement:
- Try to change the file via ECM (replace attachment, re-upload, update version). Expected: blocked or saved as a new version; original stays unchanged.
- Try to replace an object directly in storage (overwrite, restore over). Expected: write rejected and attempt logged.
- Try to delete or bypass retention (weaken retention, set dates in the past, remove links). Expected: retention cannot be weakened without a strict procedure and roles.
- Test restore from backup: can another file be slipped under the same identifier? Expected: the exact same object is returned with immutable characteristics.
- Check integrity: checksums before and after operations and compare exports. Expected: hashes match and any change appears as a new version or is rejected.
After each test review logs. If an action is denied but leaves no trace, that’s a red flag: tampering could happen silently.
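The integrity check from the list can be done with plain checksums taken before and after each tampering attempt. Below is a minimal sketch using SHA-256 from the Python standard library; the file names are invented, and the demo deliberately rewrites a file in a temporary folder to show what a detected change looks like.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large scans do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    return {Path(p).name: sha256_of(p) for p in paths}


def changed_objects(before, after):
    """Names whose hash differs or that disappeared between the two snapshots."""
    return sorted(name for name, digest in before.items() if after.get(name) != digest)


with tempfile.TemporaryDirectory() as root:
    original = Path(root) / "contract-001.pdf"
    untouched = Path(root) / "invoice-002.pdf"
    original.write_bytes(b"final version")
    untouched.write_bytes(b"invoice")
    before = snapshot([original, untouched])
    original.write_bytes(b"silently replaced")  # simulate tampering
    after = snapshot([original, untouched])

changed = changed_objects(before, after)
print(changed)  # ['contract-001.pdf']
```

On a WORM-protected object the expected result after every test is an empty list; any name that shows up should have a matching entry in the logs.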
Audit and logging: how to ensure traces don’t vanish
If the archive is used for inspections and disputes, it’s important not only to keep files but to prove who did what. Audit usually breaks in two ways: events are incompletely logged, or logs can be quietly altered.
Events that must be logged
Ask the vendor to describe what actions are recorded and verify it in the pilot. Minimum set: logins (success/fail), password changes and locks; viewing a card and opening a file; downloads, printing, external sends (if available); metadata, rights and retention changes; admin actions (role creation, object deletion, integration settings).
Each entry should include: who (account), what was done, when (timestamp), to what (document/version ID), from where (workstation/service) and the result (success/fail).
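A simple way to verify exported audit records in a pilot is to compare each one against the list of mandatory fields. The sketch below assumes records arrive as dictionaries; the field names are hypothetical and include a timestamp, which the requirements earlier in the text ask for alongside who, what and from where.

```python
from dataclasses import dataclass, asdict, fields
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    account: str      # who
    action: str       # what was done
    object_id: str    # document or version ID
    source: str       # workstation or service
    result: str       # "success" or "fail"
    timestamp: str    # ISO 8601, taken from the unified time source


REQUIRED_FIELDS = tuple(f.name for f in fields(AuditEvent))


def missing_fields(record):
    """Return the mandatory fields that are absent or empty in an exported record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


event = AuditEvent(
    account="archivist01",
    action="download",
    object_id="contract-001/v2",
    source="ws-114",
    result="success",
    timestamp=datetime(2025, 6, 1, 9, 30, tzinfo=timezone.utc).isoformat(),
)

print(missing_fields(asdict(event)))                                   # []
print(missing_fields({"account": "archivist01", "action": "download"}))
```

Running this over a sample export quickly shows whether a vendor's "full audit" actually carries every field for every event type.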
How to check an admin cannot "clean up" traces
Best practice is role separation: the ECM admin manages users and settings but does not have quiet access to delete or edit audit. On the testbed try a scenario where an admin changes rights and then attempts to hide that change.
Checks to perform before procurement:
- Turn off auditing in settings and perform an action (there should be a record that auditing was disabled).
- Try to delete or overwrite logs using standard tools (should be blocked or logged as a separate event).
- Export logs and verify integrity markers if provided.
- Request a per-document report: full chain from creation to delivery.
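When asking about integrity markers, it helps to know what one common design looks like. The sketch below chains log entries with SHA-256 so that editing or removing an entry breaks verification. This is an assumed technique shown for illustration, not a description of how any particular product protects its logs.

```python
import hashlib
import json

GENESIS = "0" * 64  # starting value for the first entry in the chain


def chain_hash(previous_hash, entry):
    """Hash the entry together with the hash of the entry before it."""
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha256(previous_hash.encode("ascii") + payload).hexdigest()


def append_entry(log, entry):
    previous = log[-1]["hash"] if log else GENESIS
    log.append({"entry": entry, "hash": chain_hash(previous, entry)})


def verify_chain(log):
    """Recompute every hash in order; any edit or removal in the middle fails."""
    previous = GENESIS
    for record in log:
        if record["hash"] != chain_hash(previous, record["entry"]):
            return False
        previous = record["hash"]
    return True


log = []
for action in ("rights_changed", "file_opened", "file_downloaded"):
    append_entry(log, {"account": "admin01", "action": action, "object_id": "contract-001"})

# Two tampering attempts: rewrite the first entry, or drop the second one.
edited = [dict(record) for record in log]
edited[0] = {"entry": {**log[0]["entry"], "action": "file_opened"}, "hash": log[0]["hash"]}
removed = log[:1] + log[2:]

print(verify_chain(log), verify_chain(edited), verify_chain(removed))  # True False False
```

A chain like this does not detect removal of the most recent entries on its own, so the latest hash has to be stored somewhere the ECM admin cannot reach, for example on the WORM layer.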
Also check time synchronization. If clocks drift between servers and workstations, event timestamps jump. Require a unified time source and synchronization for ECM, WORM storage and infrastructure.
Typical mistakes when choosing an archival solution
The most common cause of project failure is fuzzy expectations. Archive solutions usually have two parts: ECM for rules and processes, and storage for preservation and WORM. If you collapse this into one vague requirement like "system must guarantee immutability and fast search", there will be nothing measurable at acceptance.
Another trap: assuming full-text search will remain fast at any scale. Speed depends on indexed fields, OCR quality, where the index lives, and concurrent query load. What works on 50k files may stall at 50M.
Test sets are often too small and uniform: a single PDF format, no scans, no emails, no large attachments. The pilot looks fine, but important edge cases surface after launch.
Check you don’t fall into these errors:
- requirements written in general terms with no acceptance criteria (seconds, volumes, roles, audit events)
- tests performed only in "good weather": no peaks, no network degradation, no node failures
- dataset unrepresentative of reality: few formats, few sizes, no bad scans
- WORM checked only verbally, without attempts to replace files, change metadata or erase traces
- no versioning and correction rules defined
Simple example: accounting needs to correct the "counterparty" attribute after upload, while file content must remain immutable. If you don’t agree in advance what may be changed in ECM versus what WORM blocks, you either lose manageability or break immutability requirements.
Short checklist before procurement and acceptance
To avoid disputes between vendors, fix measurable criteria and verify them in a pilot.
What to check in a pilot before signing
First agree numbers: user counts, concurrent queries, document types (scans, office files, OCR'ed PDFs), and most-searched fields. Then verify four things: search speed (by fields and text) at real volume; delivery (open, download, preview) under peak without timeouts; immutability (attempts to change, backdate or delete are blocked by rules); audit and recovery (who did what is visible, and after a failure data and logs remain consistent).
What must remain in acceptance documents
After tests keep evidence: scenario protocols (what was done, on which data, with which roles); measurement results (times, errors, conditions); WORM confirmation (what is blocked and how tested); sample audit reports and log retention rules; recovery plan (steps, timing, success criteria and responsible parties).
Practical example: an archive for an organization that faces inspections
A city hospital stores discharge summaries, protocols, patient consents and procurement acts. On a normal day it’s just an archive, but weekly there are urgent requests: a doctor needs to find a document in 2–3 minutes, and inspectors need a package quickly and proof that files haven’t been altered.
Roles were split so no one has everything. Doctors see only their department’s documents and can search and open them. The archivist handles intake, registration, retention rules and schedules. Security defines immutability policies and verifies no one can "clean up" traces. The administrator keeps availability, backups and updates, but cannot change fixed records.
A request is simple: the doctor enters the IIN or full name, visit date and document type, gets a short list, opens the needed file and views the card with its details. Search must work by metadata and text (if OCR is available), and delivery must not hang under peak load.
If a document needs correction, the old record is not overwritten. A new version or a new document is created with the reason for correction, and the audit retains the chain: who initiated, who approved, what changed, when and from which workstation.
Before go-live they run short checks: search and open speed on realistic volume (e.g., 1–3M files) with 50–100 concurrent users; attempts to change or delete a "locked" document and key metadata; audit verification (event exists, cannot be removed, export possible); recovery after failure (service restored, data intact, access unchanged).
Success is when search and delivery meet agreed thresholds, immutability works in practice, and audit shows a clear and complete action history.
Next steps: pilot, acceptance and avoiding responsibility gaps
Turn expectations into measurable checks. Not "fast" but "a search across 1M cards completes within 2 seconds." Not "reliable" but "a file cannot be deleted or replaced before retention ends; attempts are logged."
Then the usual flow: document requirements and convert them into test protocols (delivery speed, WORM, audit, recovery); request a pilot and run scenarios on representative documents and roles (user, admin, auditor); test growth over 3–5 years (volumes, file counts, concurrent users, backup windows); prepare acceptance criteria (what counts as pass, which logs and reports attach, who signs).
At acceptance clearly split responsibilities. Otherwise, during incidents it’s hard to tell whether ECM, storage or integration failed. Typical responsibility zones: ECM (rights, workflows, metadata, search, reports), storage (WORM policies, protection from deletion and tampering, preservation and replication), integrations (ingest from external systems, conversion, digital signatures/timestamps if used), support (single contact, reaction times, log collection, investigation).
If a single vendor is required to cover infrastructure, servers and integration within one scope of responsibility, projects in Kazakhstan often consider GSE.kz (gse.kz) as an option: the company manufactures servers and workstations locally and provides integration and support. This helps reduce the risk of responsibility gaps between multiple suppliers.
FAQ
Why separate responsibility between ECM and storage?
From the start, fix that ECM is responsible for meaning and user workflows: cards, metadata, rights, versions, routes and document delivery. The storage is responsible for file preservation: physical placement, copies, integrity control, WORM and recovery. If you don’t separate this, at acceptance it will be hard to prove who’s to blame for slow search, missing logs or gaps in immutability.
Which documents make sense to put into WORM and which not?
By default WORM is applied to final, legally significant versions: contracts, invoices, HR and medical records, protocols and acts. These are cases where you must prove the file existed in a specific form on a specific date. Drafts and working versions are usually kept in a normal mode so corrections remain manageable.
What exactly must be protected for immutability to be real?
Treat a change not only as overwriting a file or deletion, but also as edits of key metadata that alter the document’s meaning: number, date, counterparty, case membership and links. Also protect versions, history and any signatures/timestamps if used. A good approach is to forbid backdated edits and record corrections as a new version or a new document explicitly linked to the original.
Which WORM tests should be done before procurement?
Test both paths: via the ECM interface and directly at storage level. Try replacing an attachment, re-uploading a file, deleting before the retention period ends, weakening retention, and restoring from backup over the same identifier. In each case the system must either block the action or convert it into a correct procedure (new version) and always log the attempt.
How to fairly test search and document delivery performance?
Start with measurable figures: p50 and p95 for search and delivery times, at a fixed data volume and number of concurrent users. Test metadata search and full-text search (after OCR/indexing) separately. Split delivery into opening the card, preview and downloading the original. Repeat measurements in warm and cold states (for example after restarting services) to see differences.
What should the test document set for a pilot look like?
The minimal pilot set should resemble reality by "dirtiness": scans of varying quality, PDFs with and without searchable text, office files, emails, different sizes and ages. Include incomplete and erroneous metadata—this is how it will be in production and often breaks search. If the test set is too "perfect", the pilot will look good but surprises will appear after launch.
Where should full-text search live and who is responsible for the index?
The index can live in ECM, a separate search engine, or part of the storage platform; what matters is that responsibility is assigned to the component where it is actually built and updated. Specify who keeps the index current, the time until a new document is visible in results, and the reindexing procedure after failures. Without this, a file may be uploaded but not findable, and parties will start shifting blame.
How to understand that the system audit is truly reliable?
Logs should answer the basic questions: who, what, when, where and with what result for viewing, downloading, printing, metadata and rights changes, and admin actions. Verify that turning auditing off is itself recorded and that logs cannot be quietly deleted or overwritten. Also ensure consistent time synchronization across ECM, WORM storage and infrastructure; otherwise event chains will be misleading.
What are the most common mistakes when choosing an archival solution?
Most failures start with requirements written without numbers and tests: “fast”, “reliable”, “immutable” but no thresholds for time, volume or roles. A second common mistake is checking WORM only verbally, without actually trying to replace the file, change metadata or erase traces. Third, pilots are run on small, uniform datasets without peak loads or failure scenarios.
Is it possible to have one contractor cover archive, infrastructure and integration without losing control?
If you need a single scope covering servers, infrastructure and integration, choose a scheme with one defined responsibility zone and one set of acceptance criteria. In Kazakhstan such projects often involve GSE.kz (gse.kz) as a server manufacturer and integrator to avoid splitting disputed issues between multiple vendors. Even then, clearly document boundaries: what ECM is responsible for, what the storage is responsible for, and which tests validate WORM, performance and audit.