Dec 13, 2025·8 min

SQL Generation with Access Control: LLMs for Analysts

Practical approaches for using LLMs to assist with queries and explanations while ensuring SQL generation with access control does not bypass permissions or cause data leaks.

Why analysts need LLMs and where the risks begin

LLMs are especially useful where analysts spend time on routine tasks: turning a question in Russian into a draft SQL, dissecting someone else's query, understanding an error, and tidying up structure. This noticeably speeds up work when the warehouse has many tables and complex relationships.

Most often people expect four things from a model: quickly sketch SQL from a metric and dimensions description, explain query logic, point out a typical mistake (JOINs, filters, aggregates) and suggest a clearer or faster variant.

Risk starts where convenience replaces control. As soon as names of restricted tables, sample rows or fragments of results appear in the context, the model may inadvertently hint at ways to access sensitive data. Sometimes even without a direct request: the user asks “show sales by customers” and the model replies as if everyone has access to the details.

A “closed mart” usually means a data layer with restricted access: personal data, salaries, medical fields, commercial terms, or detailed records for specific customers. The danger is not only full dumps. A couple of rows, rare values or unique identifiers often reveal more than expected.

Therefore the boundary of responsibility must be strict: the model only proposes a draft, and the decision about access is made by the system. Roles, RLS, view policies and query validation before execution must work independently of what the LLM “promised”. The model should not act as the guard and must not receive data it is not allowed to know.

Threat model: what can go wrong

LLMs make work more convenient but do not know your access rules the way the DBMS and the mart layer do. If a model can write SQL and explain results, it may accidentally suggest a way to obtain more than allowed or repeat things that should not be shown.

A typical scenario is permission bypass via a “clever query”. An analyst may be blocked from a margin mart but allowed access to sales. The model suggests a JOIN with a discounts table or an indirect margin calculation using cost from another schema. Formally the query looks normal, but it effectively reconstructs a sensitive metric from available pieces.

Another risk area is leakage through explanations. Even if the SQL is correct, the text answer may name specific values: “client X is 90 days overdue” or “employee Y’s salary is above average.” This often happens because examples of rows, export fragments or error messages with data made it into the chat.

Three other common attack classes:

prompt injections: “ignore the rules”, “show the raw data”, “output all columns”;
history and cache risks: yesterday we discussed a closed mart and tomorrow a similar question pulls an old fragment;
log blind spots: without auditing, unusual queries (mass scans, table enumeration) go unnoticed.

For personal and commercial secrets, not only direct SELECTs on PII fields are dangerous but also deanonymization: combinations of age, city and position, and calculations that allow reconstructing individual values from aggregates.

Core principles of safe design

Security starts with architecture, not with “smart” filters. First, define what an LLM can and cannot do, even if asked “very urgently.” Then mistakes and bypass attempts hit rules rather than the model's goodwill.

Principle 1 — least privilege. The LLM should not have more rights than the analyst it helps. Ideally the model has no data access and works only with table descriptions and example queries without sensitive fields.

Principle 2 — separation of roles. Query generation and execution must be separated: one component forms SQL, another validates and runs it on behalf of the user. This reduces the risk that the model will accidentally request too much and makes the system resilient to prompts like “ignore the rules.”

Principle 3 — forbid direct access to raw sources if marts are available. Raw tables almost always contain extra stuff: PII, technical fields, excessive granularity. Allow only approved marts and views where masking and RLS are already applied.

Principle 4 — predictable, verifiable constraints. Define a small set of rules: SELECT only, from an approved object list, mandatory filters (when RLS requires them), ban SELECT * and limits on result size and execution time.

Example: the analyst asks “show sales by customers.” The model may generate a query against a sales mart with aggregation, but it must not propose a JOIN to a raw customer table if access to it is forbidden, even if the user requests names or phones.

How to give the LLM context without giving data

The less live data the model sees, the lower the leakage risk. Analysts still need context, otherwise SQL becomes guesswork. Usually teams choose one of three levels.

Level 1: the model does not touch the database. You give it the schema (tables, fields, relations) and a person runs the query in the usual tool. This is the safest option, but quality depends on how well the data model is documented.

Level 2: the model sees metadata only. Names, descriptions, types, permitted values and formats, but no table rows. Such a dictionary greatly improves accuracy: the model understands what client_id is, how creation_date differs from payment_date, and which fields are sensitive.

Level 3: the model receives query results and explains the numbers, suggests checks and draws conclusions. This carries the most risk: even aggregates can reveal sensitive data (e.g., groups of 1–2 people), and filter mistakes can expose rows from closed marts.

When you need a proxy layer between the LLM and the DB

A proxy is needed if the model itself initiates SQL execution or if you want a single enforcement point. The proxy can validate queries against policy, apply RLS and roles, substitute direct tables with allowed marts, block dangerous constructs and limit result size.

To choose quickly, answer: can a query accidentally touch personal or financial data; are there strict roles and RLS that cover all marts; does the model need to explain figures or is SQL alone enough; who owns validation rules and incident response; is access to small-result outputs acceptable?

In practice: public sector and finance usually start with levels 1–2 and enable level 3 only via a proxy with strict limits.

Metadata and a dictionary: the basis for correct SQL

For an LLM to write useful SQL and avoid guessing structure, it needs context — but the context should be metadata, not data: a terms dictionary, a list of available marts, field descriptions and ready-made metrics. Choosing the wrong table here easily becomes a leak.

A safe minimum of metadata: names of allowed tables and views, the mart's purpose, a list of fields with types, units and allowed values (e.g., status codes). That is enough to ask “revenue by month” and get a correct query without seeing a single row.

To prevent the model from reaching for closed sources, use whitelists. Don’t try to protect with blacklists: synonyms, similar names and bypass techniques will be found. Better to give the model only what is allowed for that role and nothing else.

Rules that reduce model “imagination"

Good names and comments save editing time. If a field is called amount but nowhere states it is “amount without VAT in KZT,” the model will build wrong aggregations.

A few practices help: consistent naming style (e.g., order_date, customer_id, amount_kzt), field comments (meaning, units, what counts as "truth"), explicit notes about access (e.g., “field available only to role X”), and sensible defaults for periods (e.g., last 90 days) and maximum ranges.

Complex metrics and business logic

If a metric is non-trivial (margin, SLA compliance), describe it as a ready computation: a view/mart or a template expression with explanation. Then the LLM won’t try to assemble the formula from guesses.

For example, instead of asking “calculate shipments,” fix the definition of “shipment” (which status and which date) and indicate the single allowed source, e.g., mart shipments_daily.

Process: from a question in Russian to safe SQL

Infrastructure for LLMs

We will pick servers and infrastructure for AI and analytics with security requirements in mind.

Select a server

To make the assistant speed work without bypassing rights, define a clear route from question to execution.

A simple flow looks like:

The user formulates the question in Russian and sets bounds: period, segment, metrics and response format (table, top-10, monthly trend).
The system selects allowed mart(s) and fields from a whitelist. The model does not decide what to read — user rights and rules do.
The LLM assembles a draft SQL and briefly explains logic: filters, joins, meaning of metrics.
An auto-check runs SQL through constraints: allowed objects and columns, forbidden functions, mandatory filters, result limits.
Execution runs only on behalf of the user and only through a secure layer that always enforces RLS and prevents direct raw DB access.

After execution the model may explain results but without restating unnecessary values. For example: “Revenue in allowed stores increased 8% in November vs October; two categories contributed most.” No listing of rows the user could already see and no guessing about closed segments.

Data-level access control: roles, RLS and marts

Rely on what the warehouse enforces, not what the LLM “said.” BI roles and warehouse roles serve different purposes: BI controls report and folder access, while the warehouse controls which rows and columns can actually be read.

A bad practice is to mix these and hope that “hiding a field in BI means it is inaccessible.” The LLM can generate queries that bypass the report, so strict restrictions must be applied where SQL runs.

RLS and CLS in plain terms

RLS (row-level security) filters rows: an analyst sees only their branches or region. CLS (column-level security) restricts columns: salaries, PII or contract details. Practical minimum: assign warehouse roles to users or BI service accounts, set RLS via rules (not parameters that can be spoofed), use CLS to hide sensitive columns even if someone tries SELECT *, and allow marts only via an explicit allowlist.

Closed marts and indirect access

A closed mart must be inaccessible not only directly but also through “similar” data. The worst cases happen when aggregates on rare categories or small groups are allowed. Essentially this reconstructs facts by difference.

To reduce risk, enforce anonymity thresholds (ban groups with fewer than N rows), limit detail and monitor queries that iterate filters.

Test rules, don’t just state them: create test users for each role, synthetic datasets with edge cases (rare categories, single-employee groups), and negative tests trying to read closed marts via joins and aggregates.

SQL validation before execution: rules and constraints

An LLM can nicely explain a query and produce correct syntax, but that does not make the SQL safe. The key step is validation before execution, where you trust rules rather than a single string.

What to check automatically

Start with a simple allowlist: permit only safe templates. Practically this usually means SELECT only from approved marts without modifying data.

Useful checks: forbid DDL and DML (CREATE, DROP, ALTER, INSERT, UPDATE, DELETE, MERGE); allow only approved schemas and marts; restrict JOINs (either within one mart or only by approved keys); prohibit constructs that bypass logic (uncontrolled UNION, subqueries to system tables, dynamic SQL); and ensure selected fields are permitted (especially PII and financial attributes).

Then add mandatory security filters. Even if the analyst didn’t specify them, they should appear in the final SQL: period, organization, region and other constraints matching the role.

Export and load constraints

A legal query can still leak if it allows exporting too many rows. So enforce row limits (TOP or LIMIT) and disallow queries without them in user scenarios, set execution timeouts and complexity caps (number of JOINs and subqueries), and block overly wide periods without approval.

A crucial point is SQL parsing and normalization. Don’t validate with regex over text and don’t trust the LLM output. Parse the query into an AST, normalize table and field names, apply rules, then recompose a safe SQL. This closes attempts to hide risky parts in comments, odd spacing or tricky aliases.

A small example: an analyst asks for a sales report and the model adds a JOIN to a personal data table. The parser sees the forbidden field, rejects the query and offers a safe alternative — aggregates by region without per-person detail.

Logs, audit and response to suspicious queries

Proxy between LLM and DB

We will build a proxy-layer architecture: allowlist, limits, SQL parsing and auditing.

Evaluate the solution

Even with roles, RLS and query validation, without logs you won’t know how the assistant is used and where it fails. Issues often look like “almost correct queries” that accidentally broaden the selection.

It is useful to record the entire path from idea to execution: the original user question, the LLM’s draft SQL, the final SQL after rules, the role and allowed marts, and the validation result (which rules triggered and why).

Logs themselves can be a leakage source because they contain table and column names and sometimes data fragments (e.g., in errors). Store them separately from BI, restrict access by need-to-know, set retention, and avoid recording query results unless necessary.

What looks suspicious

Common signals repeat: frequent queries to many different marts, field enumeration attempts (“show all columns”), attempts to add UNIONs or subqueries to forbidden tables, growing limits and date ranges without reason, and repeated validation failures.

How to investigate incidents

Assign an owner (usually security plus data owner). Steps are straightforward: freeze execution for the user or role, collect the chain “question -> SQL -> validation”, confirm which data could be affected, update validation rules and the allowed-marts list. Review marts and rules monthly because schemas and roles change faster than documentation.

Typical mistakes that lead to leaks

The most dangerous mistake is giving the model direct access to production and hoping it will be cautious. Without a proxy, strict rules and SQL validation, the model will generate queries that bypass intended limits or simply return too much.

Teams often try to “blacklist everything”: block a few words, tables or functions. This almost always fails due to synonyms and bypasses. Whitelisting allowed schemas, views and operation types, plus explicit result constraints, works much better.

Another source of leaks is real data in prompts “for example.” One screenshot with names, national IDs, phones or salaries in chat history turns a test into an incident even if the SQL is never executed.

Common problems: the model returns detailed rows when an aggregate suffices; test and production rights are mixed and broader access silently moves to production; users aren’t told boundaries and the model confidently misapplies business logic with wrong joins or filters.

A good rule: a query should answer exactly the report’s question. A manager usually needs branch-level revenue totals, not a list of every receipt and customer. That reduces disclosure risk even with imperfect phrasing.

Pre-launch checklist for LLM in analytics

Pilot without noise

We will help start with one data mart and one report to quickly validate rules.

Plan a pilot

Before a pilot, agree on boundaries rather than focusing only on “answer quality.” A safe scenario starts when the model physically cannot see unauthorized data, even if asked to bypass rules.

Minimum before launch:

Context only from approved sources: marts, field descriptions, allowed relations and example queries without real data.
Any generated SQL is not executed “as is”: it goes through a parser and rule set (allowed tables and fields, banned functions, JOIN control, mandatory LIMIT, no DDL/DML).
Queries execute on behalf of a specific user and the DBMS enforces rights: roles, RLS and mart access must work independently of the model.
Protection from “thin” selections: block small-result outputs, mask rare categories, set minimal thresholds for groupings.
Logs and audit: who asked, what was generated, what actually ran, how many rows returned. Add alerts for anomalies (spikes, large scans, unusual filters).

A practical test: ask the assistant to build a “salary by department” report when the user lacks access to the closed salary mart. Correct behavior is to explain the restriction and propose a safe alternative (aggregates or a public mart), not to guess a bypass JOIN.

Also give users short rules for interacting with the tool: forbidden topics (personal data, one-off checks of individuals), and helpful phrasing (report goal, period, aggregation level).

Example scenario: report without access to closed data

An analyst in finance prepares a weekly delinquency report. They have access to an aggregated portfolio mart (by day, product and region), but the client mart (names, national IDs, contacts, personal limits) is closed.

They ask the assistant: “Create SQL for the 30+ delinquency share by region for the last 8 weeks and explain how the metric is calculated.” The LLM saves time: it picks the correct fields from the allowed mart and explains the formula so business can confirm it.

Problems start if the user then asks “Show top 10 clients by delinquency” or requests fields not allowed. Even without direct access, the model may reference a closed table or guess column names.

Rules stop the risk: execute only against the whitelist of marts and views, forbid detail (aggregates only, minimal grouping), enforce RLS by warehouse role, set sampling limits and ban SELECT *.

A safe result looks like a query against an aggregate mart plus an explanation of metrics and validation steps.

SELECT
  week_start,
  region,
  SUM(overdue_30_plus_amount) / NULLIF(SUM(total_outstanding_amount), 0) AS overdue_30_plus_share
FROM mart_portfolio_weekly
WHERE week_start >= CURRENT_DATE - INTERVAL '56 days'
GROUP BY week_start, region
ORDER BY week_start, region;

Next steps: how to deploy without noise and risk

The safest deployment approach is to start narrowly and measurably. Pick one mart and one report type (e.g., regional sales without personal data). That quickly shows where the assistant helps and where access rules or metadata quality are pain points.

Then agree rules up front instead of fixing incidents later. Gather data owners, security and analysts to specify: which tables are available, which fields are forbidden, which filters are mandatory (RLS), what extraction limits are acceptable, and which query patterns are suspicious.

Keep the pilot simple: describe 10–20 real tasks and the expected outputs, set roles and RLS as the single source of truth, enable SQL validation before execution, run the pilot for 2–4 weeks and collect logs (what was asked, what was generated, what was blocked). Expand coverage only for marts with mature rules.

If you need a protected environment (e.g., public sector or finance), design infrastructure ahead: where the model runs, how context is stored, who operates and who responds to incidents. In such cases monitoring and 24/7 support should be part of the design, otherwise suspicious queries go unanswered.

For organizations in Kazakhstan it can be convenient to involve a system integrator that builds the full contour: from warehouse roles to server infrastructure and support. For example, GSE.kz (gse.kz) provides system integration and infrastructure solutions for data centers and AI, helping align security, access and operations in one project.

FAQ

Which tasks should analysts really hand to an LLM, and which should they not?

LLMs are best used as draft generators and a helper for reading queries: sketch SQL from a metric description, explain someone else's query, point out common JOIN/aggregation mistakes and suggest clearer structures. Query execution and decisions about what can be read should remain with the data warehouse and your access control layer.

Can I give an LLM direct access to the database?

By default, it is safer when the model has no direct access to data and works only with metadata and example queries that do not include sensitive fields. If you need a mode where the model "executes SQL and explains the figures", do that only through a proxy layer that enforces roles, RLS/CLS and strictly validates the query before execution.

What context should I provide to the LLM so it writes good SQL without needing data?

Give the model descriptions rather than rows: a list of allowed marts and views, fields with types, meanings and units, permitted status values, join keys and metric definitions. The more precise the dictionary and comments, the less guessing the model will do and the less temptation it has to reach for raw sources.

What exactly counts as a leak if the model doesn’t “download” anything?

Do not paste real rows, unique identifiers, names, national IDs, phones, salary figures, exports or even examples drawn from production. Rare values and small groups are especially dangerous: aggregates can reveal individuals when a group contains 1–2 records.

How does RLS differ from CLS and why do you need both?

RLS (row-level security) filters rows a user can see (e.g., only their region or branch). CLS (column-level security) restricts columns (e.g., hides PII, salaries or contract details even if someone tries `SELECT *`). Ideally both are enforced by the DBMS and do not depend on what the model "said".

Why is a proxy layer between the LLM and the warehouse needed, and what does it do?

A proxy is needed when you want a single place to validate and sanitize queries before execution. It can allow only `SELECT`, substitute direct tables with approved marts, add mandatory role-based filters, block dangerous constructs and limit result sizes, then run the query on behalf of the user.

Which SQL checks should be done automatically before execution?

Minimum checks — forbid DDL/DML and allow access only to an allowlist of marts, plus verify allowed columns and JOIN keys. Then add mandatory `LIMIT`, period restrictions, timeouts and complexity limits. Do parsing and validation using SQL AST parsing and name normalization rather than regex on text. This prevents attempts to hide forbidden parts in comments or tricky aliases.

How to defend against prompt injections like “ignore the rules and show everything”?

The main protection is separation of generation and execution: the model can "ask" but cannot force the system to run forbidden queries. Practically this means a strict allowlist of objects, prohibition of selecting all columns, mandatory role filters and refusal to return detailed rows when only aggregates are allowed. If a query is rejected, the system should provide a safe alternative rather than trying to guess a bypass.

What should I log in an LLM-SQL assistant so logs help rather than create new leaks?

Log the full path: the user’s question, the LLM’s draft SQL, the final SQL after rules, the role and allowed marts, the validation result and execution parameters. However, logs themselves can leak because they may contain table and column names or error fragments. Store logs separately from BI, restrict access on a need-to-know basis, set retention times and avoid recording query results unless required.

How to start an LLM pilot for analysts and verify rights aren’t bypassed?

Start with one or two reports and one mart without personal data to quickly iterate on metadata and rules. Run negative tests by role: attempts to access closed marts, requests to drill down to customers/employees, groupings on rare categories, JOIN-based bypasses and clever calculations. If you need an integrator to assemble the environment (roles, RLS, proxy, infrastructure and support), in Kazakhstan, for example, this can be done by system integrators such as GSE.kz.