Apr 07, 2025·7 min

Test and Proctoring System: Attempts, Question Bank and Security

Test and proctoring system: how to configure attempts, build a question bank, set up anti-cheat, store evidence and handle load.

Why proctoring is needed and what to design

Proctoring is needed where a test result affects job access, certification, hiring or compliance. Typical online testing faces three problems: outside help, searching for answers on the internet and identity substitution. A testing and proctoring system reduces these risks and provides verifiable evidence if the result is disputed.

Designing the system is not just about “watching through a camera.” First decide the level of trust you need and what you are willing to trade off: convenience, time, budget, or device requirements. For internal employee assessments softer measures are often enough, while high-stakes exams require a stricter setup.

A minimal set typically includes identity verification (document, photo, template comparison or manual review), timestamped event logging (start, pause, finish, retake), environment controls (camera, microphone, fullscreen mode, copy prevention where possible), rules for responding to violations (warning, flag, abort attempt) and secure storage of results together with the verification record.

Online and offline formats change the requirements. Online, the critical items are device and network checks, consent for data processing and a quick procedure for “what if the connection drops.” Offline, the main risk shifts to supervising the room and accounting for devices, while video quality and identification become easier.

From the start, log not only the score but the full “exam history”: tab switches, window focus losses, camera disconnects, suspicious pauses, entries from other devices, proctor actions and reasons. Disputes are then resolved by facts, not impressions.

Roles, scenarios and requirements before development

Start with agreements, not screens: who is responsible for what, what counts as a violation and which actions the system takes automatically. This saves weeks of rework when it turns out that “pause” is forbidden or appeals must retain evidence.

Common roles: candidate (or employee), proctor, administrator, question author and reviewer. The participant takes the test and confirms identity. The proctor observes and logs events. The administrator configures exams, schedules, access and permissions. The author manages the question bank, tags, difficulty levels and versions. The reviewer grades open answers and handles appeals.

Describe scenarios as sequences of steps, not a feature list. For admission: registration, identity check, equipment check, consent to rules. At start: issuing the attempt, starting the timer, recording conditions (browser, device). During an allowed pause: what happens to the timer, whether device changes are permitted and what the proctor sees. At finish: autosave, submitting answers, result protocol. For appeals define deadlines, who sees recordings, how grades can change and who approves the final decision.

Non-functional requirements should be agreed early because they directly affect architecture and cost:

  • availability and recovery plan (what to do if a failure happens during an exam)
  • performance (time to open the test, save answers, proctoring latency)
  • simplicity (minimal steps before start, clear error messages)
  • observability (event logs, metrics, reports for investigations)

Agree with legal and security teams before development: what data and media are collected, retention periods, who has access, where recordings reside physically, how consent is captured, what counts as a violation and what sanctions are allowed. This is crucial for organizations with strict personal data rules and internal regulations.

Attempt model: rules, timers, retakes and access

The attempt model sets the tone for the whole exam. It should support fairness, reduce stress and be resilient to real failures.

Basic rule: a single attempt is suitable for final certification, multiple attempts for learning. If multiple attempts are allowed, decide beforehand which result counts: best, last or average. This affects motivation and preparation.
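The choice between best, last and average can be kept as one explicit setting rather than scattered logic. A minimal sketch, assuming scores are stored in chronological order; the function name `final_score` and the policy strings are illustrative, not taken from any particular system:

```python
from statistics import mean


def final_score(scores: list, policy: str) -> float:
    """Return the score that counts, given attempt scores in chronological order."""
    if not scores:
        raise ValueError("no completed attempts")
    if policy == "best":
        return max(scores)
    if policy == "last":
        return scores[-1]
    if policy == "average":
        return mean(scores)
    raise ValueError(f"unknown policy: {policy}")
```

For attempts scored 60, 80 and 70, the three policies give 80, 70 and 70 respectively, which is why the rule should be announced before the first attempt.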

Make retake rules clear and predictable. Common practices include limiting attempts (e.g., 2–3), setting a delay before a retake (e.g., 24 hours), allowing retakes only after training or review, a fixed closing date for all retakes and a separate rule for technical failures (a free retry).
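These rules are easy to express as a single eligibility check that also returns the reason, which helps support staff answer “why can’t I retake?”. The sketch below is hypothetical: `RetakePolicy`, `can_retake` and the default values (3 attempts, 24 hours) only mirror the examples in the text, and attempts voided for technical failures are assumed not to be counted in `used_attempts`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass
class RetakePolicy:
    max_attempts: int = 3                       # illustrative limit
    cooldown: timedelta = timedelta(hours=24)   # illustrative delay before a retake
    closes_at: Optional[datetime] = None        # fixed closing date for all retakes


def can_retake(policy: RetakePolicy, used_attempts: int,
               last_finished_at: Optional[datetime],
               now: datetime) -> Tuple[bool, str]:
    """Return (allowed, reason). used_attempts excludes attempts voided as technical failures."""
    if used_attempts >= policy.max_attempts:
        return False, "attempt limit reached"
    if policy.closes_at is not None and now >= policy.closes_at:
        return False, "retake window closed"
    if last_finished_at is not None and now - last_finished_at < policy.cooldown:
        return False, "cooldown not elapsed"
    return True, "ok"
```

The “retake only after training or review” rule is left out here; it would be one more condition of the same shape.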

Often one timer for the whole test is sufficient. Per-question timers help prevent hints and long pauses but increase stress and complicate disputes. A compromise is a global timer plus limits on breaks.

Plan progress saving for connection loss. A practical approach is to save each answer and the current timer state on the server so the participant resumes the same attempt rather than starting over. Log events like connection loss, device change and re-entry.
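One way to make resumption safe is to keep the timer on the server as a start time plus a duration, so the remaining time is always computed and never trusted from the client. This is a sketch under stated assumptions: the `Attempt` class and its method names are hypothetical, and the clock is assumed to keep running during a disconnect (an exam that grants compensated time would need an extra rule).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Attempt:
    started_at: datetime
    duration: timedelta
    answers: dict = field(default_factory=dict)   # question id -> last saved answer
    events: list = field(default_factory=list)    # (timestamp, event name)

    def remaining(self, now: datetime) -> timedelta:
        """Time left, computed from server-side data only."""
        left = self.started_at + self.duration - now
        return max(left, timedelta(0))

    def save_answer(self, question_id: str, answer: str, now: datetime) -> bool:
        """Store an answer unless the timer has expired; log rejected late saves."""
        if self.remaining(now) == timedelta(0):
            self.events.append((now, "late_save_rejected"))
            return False
        self.answers[question_id] = answer
        return True

    def reconnect(self, now: datetime) -> timedelta:
        """Log the re-entry and return the time left on the same attempt."""
        self.events.append((now, "reconnect"))
        return self.remaining(now)
```

A participant who starts a 45-minute attempt at 10:00, saves one answer and reconnects at 10:20 gets the same attempt back with 25 minutes left and the saved answer intact.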

Access policy should be manageable: time windows (e.g., 09:00–18:00), invitations by list, groups (department, branch, course) and modes like scheduled or on-demand retakes.

Question bank: structure, tags, versions and test assembly

A good question bank is a managed knowledge base, not a folder of files. It serves two goals: fairness through variability and predictability through quality rules.

Start with a small set of types: single and multiple choice, numeric or short text input, matching, and case questions where the reasoning matters. Store not only the question text but an explanation of the correct answer. This helps with appeals and error analysis.

Structure the bank by topics and tags. Topics work for curricula, tags for search and assembly (for example: “security”, “Windows”, “network basics”). Add difficulty level and versioning: when a question changes, keep the old version so past results can be interpreted correctly.

Randomization should be controllable. Shuffle question and answer order, but enforce quotas: e.g., 40% topic A, 30% topic B, 30% topic C. Tests will be different but comparable in difficulty.
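Quota-based randomization could look roughly like the sketch below. It assumes the bank is a mapping from topic to question ids and that quotas are whole percentages; `assemble_variant` and its parameters are illustrative names. The seed makes a variant repeatable within one environment, but it is a convenience only: the questions actually shown should still be stored with the attempt.

```python
import random


def assemble_variant(bank: dict, quotas_pct: dict, total: int, seed: int) -> list:
    """Draw `total` question ids from `bank` according to per-topic percentage quotas."""
    if sum(quotas_pct.values()) != 100:
        raise ValueError("quotas must sum to 100%")
    rng = random.Random(seed)  # per-attempt generator, independent of global state
    variant = []
    for topic, pct in quotas_pct.items():
        count, remainder = divmod(total * pct, 100)
        if remainder:
            raise ValueError(f"quota for {topic} is not a whole number of questions")
        if count > len(bank[topic]):
            raise ValueError(f"not enough questions in topic {topic}")
        variant.extend(rng.sample(bank[topic], count))  # no repeats within a topic
    rng.shuffle(variant)  # mix topics so order does not reveal the structure
    return variant
```

With quotas of 40/30/30 and a 10-question test, every variant contains 4, 3 and 3 questions from topics A, B and C, whichever specific questions are drawn.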

When assembling a test, separate a mandatory block (critical knowledge) and a variable part (drawn from pools by rules).

Quality control means regular mini-audits: find duplicates and near-duplicates, mark outdated items (products, regulations, dates), validate that questions measure the target skill, track questions that are too easy or failed too often, and review wording for ambiguity.

Grading and evaluation: rules and transparency

Grading is where disputes most often arise. Describe rules so participants, reviewers and admins understand them and the outcome can be confirmed by logs.

Start with a scoring scheme. Decide if all questions have equal weight, whether there is a penalty for guessing, and what a passing threshold is. Put settings in the test:

  • question weight (or block weight)
  • penalty for wrong answers (e.g., for multiple-answer tasks)
  • rounding rules (points and percentages)
  • passing threshold (and retake conditions)
  • rules for skipped questions (0 points or partial credit)

Partial credit is needed where answers have multiple parts: multiple selection, matching, numeric ranges, case steps. For cases define scoring criteria: what gains points, what deducts, and which answers are equivalent (different wording, same meaning). This reduces variability between graders.
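For multiple selection, one common scheme (among several possible ones) gives credit for each correct option chosen, subtracts for each wrong option, and never goes below zero. The sketch below assumes that scheme; `multi_select_score` is a hypothetical name and the formula is an illustration, not a recommendation for every exam.

```python
def multi_select_score(selected: set, correct: set, weight: float = 1.0) -> float:
    """Partial credit: (right picks - wrong picks) / number of correct options, floored at 0."""
    if not correct:
        raise ValueError("question has no correct options")
    hits = len(selected & correct)      # correct options the participant chose
    misses = len(selected - correct)    # wrong options the participant chose
    return weight * max(0.0, (hits - misses) / len(correct))
```

Choosing two of three correct options earns two thirds of the weight, while choosing one right and one wrong option on a two-answer question earns zero. Whatever scheme is used, it should be written into the test settings so a recalculation from logs gives the same number.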

Separate auto-graded and manual-graded items: automate everything that can be checked unambiguously by keys or formulas. Send anything requiring interpretation (essays, open cases, code without strict tests) to manual review with a rubric and mandatory comments.

To avoid endless appeals, set “anti-dispute” rules: which answer is final (the last saved before the timer expired), how to handle two correct variants, how to treat partially filled fields and how to account for technical failures (for example, by server logs rather than participant statements).

Decide what the participant sees in reports. It’s common to show status (pass/fail), overall score and time immediately. Later show topic breakdown, reviewer comments and reasons for penalties. Limit full correct answers and question wording to avoid leaking the bank. For appeals provide rule excerpts and log references without exposing unnecessary personal data.

Proctoring and anti-cheat mechanics without excessive paranoia

Anti-cheat starts with one question: what exactly do you consider a violation? For some exams prohibiting hints and copying is enough; for others you must confirm identity and rule out an off-camera helper. Decide in advance which events you can prove: not “looks like cheating” but “video + window focus log + a sudden jump in answering speed.”

Basic measures often outperform total surveillance. Controlled randomization, shuffled options, different sets per attempt, timers and simple restrictions like copy control and fullscreen exit prevention reduce the temptation to search for answers. These are inexpensive, less annoying for honest people and work in most contexts.

Enable proctoring based on risk. Risk scoring—collect signals and raise flags only when there is reason—is often better than blanket recording.

Examples of clear, provable flags:

  • frequent tab switches or loss of window focus
  • repeated connection drops at key moments
  • sudden change in answering pace (too fast on hard items)
  • mismatch between the face and the ID photo
  • suspicious sounds or a second voice (if microphone is used)
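Flags like these can be combined into a simple risk score that decides whether an attempt goes to human review. The weights, the per-signal cap and the threshold below are invented for illustration and would need tuning on pilot data; the event names are hypothetical as well.

```python
from collections import Counter

# Hypothetical weights: stronger evidence gets a higher weight.
SIGNAL_WEIGHTS = {
    "tab_switch": 1,
    "focus_loss": 1,
    "connection_drop": 2,
    "fast_answer_hard_item": 3,
    "second_voice": 5,
    "face_mismatch": 10,
}
MAX_COUNT_PER_SIGNAL = 5   # cap so one noisy signal cannot trigger review alone
REVIEW_THRESHOLD = 10      # assumed cut-off for sending an attempt to review


def risk_score(events: list) -> tuple:
    """Return (total score, per-signal points) so reviewers can see why a flag was raised."""
    counts = Counter(e for e in events if e in SIGNAL_WEIGHTS)
    breakdown = {
        signal: SIGNAL_WEIGHTS[signal] * min(n, MAX_COUNT_PER_SIGNAL)
        for signal, n in counts.items()
    }
    return sum(breakdown.values()), breakdown


def needs_review(events: list) -> bool:
    score, _ = risk_score(events)
    return score >= REVIEW_THRESHOLD
```

The cap is a deliberate design choice: twenty focus losses caused by system notifications stay below the threshold, while a single face mismatch reaches it.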

Have a review procedure, otherwise proctoring becomes a source of conflict. A short written policy helps: it states who can view recordings and logs, review deadlines (for example, 24–72 hours), the decision scale (accept, retake, void), the participant’s right to comment and appeal, evidence retention periods and access restrictions.

Example: for employee certification in a government agency you might record the screen and take occasional screenshots for everyone, but enable camera video only for final modules or high risk-score cases. Control stays strict without turning into blanket surveillance.

Storing results and evidence: logs, media, access

Trust in an exam relies not on a pretty UI but on the ability to quickly and fairly explain what happened and why the outcome is what it is. Store results so they can be verified, but avoid collecting unnecessary data.

Record context along with answers. The same answer can be correct in one question version and wrong in another, so save a snapshot of the test the participant saw.

The minimum that usually helps in disputes:

  • participant answers and final score (with breakdown for partial credit)
  • versions of questions and options the participant saw (not the current bank state)
  • attempt events: start, pauses, connection drops, tab switches, warnings
  • proctor decisions and reasons (timestamps, comments, final status)
  • technical parameters: device, browser, IP, connection quality (without trying to “track everything”)
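Storing “what the participant saw” can be as simple as copying the relevant question versions into an immutable record when the attempt is issued. A minimal sketch, assuming the bank is a dictionary keyed by question id; `QuestionSnapshot`, `freeze_variant` and the field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionSnapshot:
    question_id: str
    version: int
    options: tuple        # options in the order shown to the participant
    correct: frozenset    # answer key valid for this version


def freeze_variant(bank: dict, question_ids: list) -> tuple:
    """Copy the current version of each question into an immutable snapshot."""
    return tuple(
        QuestionSnapshot(
            question_id=qid,
            version=bank[qid]["version"],
            options=tuple(bank[qid]["options"]),
            correct=frozenset(bank[qid]["correct"]),
        )
        for qid in question_ids
    )
```

Because the snapshot holds copies, later edits to the bank (a new version, a corrected key) do not change how a past attempt is interpreted or regraded.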

Store proctoring artifacts (camera video, ID photo, screenshots, screen recording) only to the extent needed to resolve disputes. Often saving short clips around suspicious events is more useful than recording full streams.

Define retention in the exam rules: e.g., results 1–3 years for audits, media 30–180 days, extendable on participant complaint. Automate and audit deletions so it’s clear data was removed per policy, not accidentally.
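A retention rule of this kind reduces to a small date calculation that a scheduled cleanup job can call. The sketch below takes the upper ends of the example ranges (3 years for results, 180 days for media) purely as assumptions, and treats an open complaint as a hold on deletion; `delete_after` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed values: upper bounds of the example ranges, to be replaced by the exam rules.
RETENTION_DAYS = {"result": 3 * 365, "media": 180}


def delete_after(kind: str, created: date, open_complaint: bool = False) -> Optional[date]:
    """Return the date after which a record may be deleted, or None while a complaint is open."""
    if kind not in RETENTION_DAYS:
        raise ValueError(f"no retention rule for: {kind}")
    if open_complaint:
        return None  # keep the evidence until the complaint is resolved
    return created + timedelta(days=RETENTION_DAYS[kind])
```

Each deletion performed by the cleanup job should itself be logged, so an auditor can see that the data was removed under this rule.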

Restrict access: reviewers usually need answers and event logs; video is available only for appeals or internal investigations.

Keep an immutable audit: who viewed media, who changed an attempt status, who recalculated a score. A practical approach is an append-only journal that cannot be silently edited (for example, hashed entries).
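The hashed-entries idea can be sketched as a hash chain: each entry stores the hash of the previous one, so editing any earlier entry breaks verification. This is an in-memory illustration with hypothetical names (`AuditLog`, `append`, `verify`); a real journal would also record timestamps and live in durable, access-controlled storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry
FIELDS = ("actor", "action", "target", "prev")


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON form of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    def __init__(self) -> None:
        self._entries = []

    def append(self, actor: str, action: str, target: str) -> dict:
        """Add an entry linked to the hash of the previous entry."""
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {"actor": actor, "action": action, "target": target, "prev": prev}
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and check that the chain links are intact."""
        prev = GENESIS
        for entry in self._entries:
            body = {k: entry[k] for k in FIELDS}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True
```

If someone later changes who viewed a recording in an early entry, `verify()` returns `False`, because the stored hash no longer matches the entry’s content.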

Example: in a Kazakh government setting HR sees the outcome and comments, the manager sees competency scores only, and security gets video access by request for a limited time. This reduces conflicts and protects both participants and committees.

Scaling and reliability under load

Peak load usually happens when hundreds or thousands start an exam in the same slot. The first things to break are not graders but the edges: authentication, test issuance (assembling variants), page load and telemetry ingestion. If video is being written at the same time, network and storage become bottlenecks fast.

Plan for queues and concurrency limits. Critical actions (start attempt, save answers) must always succeed; heavy tasks (video processing, recognition, report compilation) can be deferred by seconds or minutes.

High-impact practices:

  • limit simultaneous attempt starts and release them in small batches
  • accept proctoring events via a queue with the option to upload later
  • cache static test parts (texts, images) so the database isn’t overloaded
  • implement graceful degradation: disable secondary checks under overload but never break attempt completion
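Releasing starts in small batches can be illustrated with a plain queue. This is a single-process sketch with hypothetical names (`StartGate`, `request_start`, `release_batch`); in production the queue would typically live in a shared store or message broker, and `release_batch` would run on a timer.

```python
from collections import deque


class StartGate:
    """Queue start requests and release them in fixed-size batches."""

    def __init__(self, batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self._waiting = deque()

    def request_start(self, participant_id: str) -> int:
        """Enqueue a participant and return their position in the queue (1-based)."""
        self._waiting.append(participant_id)
        return len(self._waiting)

    def release_batch(self) -> list:
        """Let up to batch_size participants start, in arrival order."""
        batch = []
        while self._waiting and len(batch) < self.batch_size:
            batch.append(self._waiting.popleft())
        return batch
```

Returning the queue position lets the interface show “you are number N” instead of a spinner, which keeps a short wait from looking like a failure.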

Reliability is easier when components are separated: test service, proctoring, media storage and analytics each in their own service. Video-processing failure should not prevent saving answers or finishing the exam.

Plan recovery: a client that goes offline or a lost internet connection should not burn the attempt. Store state step by step, save answers as they are entered and allow a secure reconnect with the same timer. Large organizations often maintain a standby environment and a written procedure for who decides on rescheduling a session.

Observability matters more than pretty charts. Minimum metrics: time to issue a test, API error rate, queue lengths, database load, media write speed, reconnection rate, number of stuck attempts. Alert on what affects the ability to complete exams now.

Step-by-step rollout plan: from pilot to production

Move from idea to a working system in small steps. Problems usually appear not in code but in rules, communication and support.

A working plan:

  1. Fix exam rules: who is eligible, number of attempts, what counts as a violation, timer behavior, and when attempts are void. Prepare a simple risk matrix: what can go wrong (identity substitution, hints, connection loss) and how you will respond.

  2. Prepare the question bank and formatting standards. Use a template: unambiguous wording, a single correct answer where appropriate, clear criteria for open tasks. Agree on tags (topic, difficulty, role) and versioning so you can later explain why two people saw different variants.

  3. Run a pilot with a small group and measure problems. Track not only scores but metrics: how often connections drop, which questions cause disputes, false anti-cheat triggers.

  4. Set up proctor and appeal processes. Proctors need clear guidance: what is a violation, when to warn and when to end an attempt. For appeals define deadlines, materials to include (logs, recordings, screenshots) and access roles.

  5. Before a full launch, run a load test. If 5,000 people will take an exam in one hour, test authentication, question issuance, answer saving and media recording at peak, with margin. If deploying in your own environment (e.g., in a government org) validate servers, storage and a failover plan.

After launch don’t freeze rules. The first 2–3 sessions almost always produce a list of small fixes that reduce disputes and increase trust.

Typical example: certification of employees and candidates

A company runs mandatory certification for employees on information security and internal rules, and a short entry test for candidates before interviews. Results must be comparable and retakes should not become a way to memorize answers.

Attempt model: one main attempt within a fixed access window (for example, 72 hours from assignment) and one retake only after failure and after 7 days. The timer is 45–60 minutes with autosave. Candidate rules can be stricter: one attempt, short window and mandatory identity check.

Test assembly: 40 questions from the bank with quotas by topic so each variant is balanced—for example, 15 security, 10 access and passwords, 10 data-handling rules, 5 incidents. Within quotas randomize and shuffle options, and version questions so updates don’t break past attempts.

Proctoring without excessive surveillance: start with identity check (document + selfie or short video), then selective recording based on flags. Flags include frequent tab switching, suspicious pauses or sudden answer speed spikes. This reduces storage and review load while preserving evidence when needed.

Managers and security usually want reports in several views: summary by unit (who passed/failed, retake dates), score distribution and weak topics, attempts with proctor flags and short comments, HR export (pass/fail and next retake date), and an action log (who assigned tests, changed rules, viewed recordings).

If many certifications are run and data sovereignty is required, infrastructure is often deployed in the organization’s own environment. In such projects a system integrator like GSE.kz (gse.kz) may handle server selection, deployment and 24/7 support so exams don’t fail on peak days.

Common mistakes and pitfalls at launch

The main trap is building the testing and proctoring system as a “trap for cheaters” instead of a clear exam. That increases conflicts, lowers trust and floods support with complaints.

Frequent causes of problems:

  • aggressive anti-cheat: excessive bans (any glance away, room noise, OS notifications) cause many false positives and disadvantage honest participants
  • questions without versions: wording or answer keys change and past results become inexplicable
  • no appeal procedure: unclear who reviews disputes, what counts as evidence and how recordings are stored and accessed
  • media stored “as is”: screen and camera recordings left without clear access rights, retention schedules or audit trails
  • underestimated load: no peak tests, no plan for simultaneous starts, connection drops and repeated logins

A sanity check: if you can’t explain to a participant in 2–3 minutes why a violation was recorded, the rules are too vague. And if you cannot reconstruct the exact test version (questions, order, correct answers) for a specific attempt, you will probably lose disputes about results.

A small example: candidates start at 10:00. At 10:02 many see “suspicious activity” triggered by system notifications, and admins have no rule on whether a retake is allowed. The exam’s reputation suffers even though the issue was configuration, not people.

Fixes are simple: gentle anti-cheat thresholds and clear warnings, strict question versioning, a written appeal procedure, controlled media access and retention, plus a mandatory load test before the first mass launch.

Short checklist and next steps

Check basic hygiene before launch. If these are not fixed, users will argue about rules rather than knowledge.

Pre-launch checklist

  • attempt rules: allowed attempts, intervals, what counts as a used attempt, how to handle connection loss
  • timers and access: test duration, access windows, time zones, whether pauses are allowed
  • test assembly: question bank ready, randomization enabled, comparable difficulty, question versions locked
  • roles and permissions: who creates tests, who assigns them, who views results, who can change settings
  • reports: final score, topic breakdown, reasons for failing, HR or learning export

Proctoring and data checklist

Decide up front what you record and why. Simpler, clearer rules mean fewer conflicts.

  • proctoring: which signals are violations, who decides, how a “disputed case” is documented
  • evidence: what you store (video, screenshots, logs), retention times, who can access
  • privacy: consents, notifications, data minimization, access audit
  • appeals: deadlines, who reviews and what materials are shown to the participant

Next steps: run a pilot with 30–100 people, measure failures (network, devices, browsers), review questions that appear to be answered mostly by guessing, and adjust thresholds and reports. After the pilot, run a load test and prepare fallback scenarios (queuing, re-login, resume attempt).

FAQ

When is proctoring really necessary, and when can it be skipped?

Proctoring is needed when test results affect job access, certification, hiring or internal compliance. It reduces the risk of hints, searching for answers during the exam, and identity substitution, and provides verifiable evidence if the result is disputed.

What should be designed in a testing and proctoring system besides the camera?

Start by defining the required level of trust: what you consider a violation and which risks must be provable. Then design identity checks, timestamped event logging, environment controls (camera/microphone/fullscreen where appropriate), reaction rules for violations and secure storage of results together with an audit record.

How to organize identity verification so it is reliable but not too complicated?

Usually a document plus a selfie or a short video is enough, combined with an automatic template comparison and a manual review option for edge cases. Record who confirmed identity and when, and store that information with the attempt so the decision chain can be reproduced later.

Which timers are better: for the whole test or per question?

A common approach is a single overall timer for the whole test with autosave and clear rules about what counts as the final answer. Per-question timers help prevent hints and long pauses, but they increase stress and complicate dispute resolution, so they make sense mainly for high-stakes exams.

How to set up retakes so they don't turn into memorizing answers?

Make rules clear: limit attempts, set delays between retakes, define eligibility (for example, after training), and have a specific policy for technical failures without penalty. Decide in advance whether the final result is the best attempt, the last attempt or the average, and record that in the exam settings.

What to do if internet drops or a tab closes during the exam?

Save every answer and the timer state on the server as the participant progresses so they can continue the attempt after returning. Also log connection drops, re-entries and device changes so appeals rely on logs, not verbal explanations.

How to build a question bank so tests stay comparable and don't break after updates?

Treat the question bank as a managed knowledge base: topics and tags for test assembly, difficulty, an explanation for the correct answer and strict versioning. Keep old versions when wording or keys change so past results remain explainable.

How to create different test variants while keeping equal difficulty?

Use controlled randomization with quotas by topic and difficulty so variants are different but comparable in difficulty. Also shuffle question and answer order, and keep a mandatory block of critical knowledge that everyone must pass.

Which anti-cheat mechanisms are effective without causing conflicts for honest participants?

Start with simple, verifiable signals: tab switches, loss of window focus, connection drops at key moments, identity mismatch, suspicious pauses. More important than catching everyone is having a review procedure: who views records, within what timeframe, possible outcomes and how the participant can appeal.

What must be stored for evidence, and how to survive peak load during exams?

Store not only the score and answers but also a snapshot of the test (question versions and options), the attempt event log and proctor decisions with timestamps and reasons. Keep only as much media as is sufficient for disputes, with clear retention policies and an access audit. Architect the system so video processing does not block answer saving; for some projects, a system integrator (for example, GSE.kz) deploys and supports the infrastructure in the organization's own environment.
