Most “chat with your data” systems make the same mistake: they treat an LLM as if it were a database.
They shove rows into context, embed chunks, or pick “representative samples” and hope the model can infer structure, quality, and truth from anecdotes. It works just well enough to demo - and then collapses under scale, privacy constraints, or basic questions like “is this column junk?”
DataSummarizer takes the opposite approach.
It computes a deterministic statistical profile of your dataset using DuckDB, persists that profile, and optionally layers a local LLM on top to interpret those facts, propose safe read-only SQL, and guide follow-up analysis. The heavy lifting is numeric, auditable, and fast. The LLM is deliberately constrained to reasoning and narration.
The result: a deterministic statistical profile, optional local-LLM interpretation layered on top, and machine-readable tool JSON - all without sending raw rows to a model.
This builds directly on How to Analyse Large CSV Files with Local LLMs in C#, which introduced the core idea:
LLMs should generate queries, not consume data.
This article pushes that idea further: the model doesn’t just generate queries from a schema, it reasons over a measured statistical profile.
Full documentation lives in the DataSummarizer README. The tool looks simple. It isn’t. Most of the work is in what isn’t sent to the model.
Note: This isn’t 1.0 yet. That’s intentional. The interface is stable; the edges are still being sharpened.
When teams bolt an LLM onto data analysis, they usually do one of three things: shove rows into the context window, embed chunks for retrieval, or hand-pick “representative samples”.
Even with 200k-token context windows, this is the wrong abstraction.
LLMs are not designed to compute aggregates, detect skew, or reason reliably about distributional properties by reading rows. They hallucinate because they’re being asked to do database work.
The correct split is still:
LLM reasons. Database computes.
But you can go one step further by changing what the LLM reasons over.
Instead of giving the model data, give it a profile.
A profile is a compact, deterministic summary of a dataset’s shape: row and column counts, per-column null rates, distributional statistics, and alerts for constant fields, leakage risks, and suspicious identifiers.
This profile becomes the interface between reasoning and computation.
The model can now interpret those facts, propose safe read-only SQL, and guide follow-up analysis - without ever seeing raw rows.
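To ground this, here’s a sketch of the kind of DuckDB query that can produce those per-column facts. It uses the DuckDB.NET.Data ADO.NET provider; the file and column names are placeholders, and this illustrates the approach rather than DataSummarizer’s actual code.

```csharp
// Sketch only - the kind of query that builds a column profile.
// Assumes the DuckDB.NET.Data package; 'mydata.csv' and 'amount'
// are placeholders, not anything DataSummarizer prescribes.
using System;
using DuckDB.NET.Data;

using var conn = new DuckDBConnection("Data Source=:memory:");
conn.Open();

using var cmd = conn.CreateCommand();
// DuckDB scans the file directly; no rows are ever handed to an LLM.
cmd.CommandText = """
    SELECT count(*)                               AS row_count,
           1.0 - count(amount) / count(*)::DOUBLE AS null_rate,
           count(DISTINCT amount)                 AS distinct_values,
           min(amount)                            AS min_value,
           max(amount)                            AS max_value,
           avg(amount)                            AS mean,
           skewness(amount)                       AS skew
    FROM read_csv_auto('mydata.csv');
    """;

using var reader = cmd.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"nulls: {reader["null_rate"]}, skew: {reader["skew"]}");
```

Run per column, results like these roll up into the profile JSON; the model only ever sees the aggregates.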
```mermaid
flowchart LR
    A[Data file] --> B[DuckDB profiles]
    B --> C[Statistical Profile JSON]
    C --> D[LLM reasoning step]
    D --> E[SQL/tool calls]
    E --> F[DuckDB executes locally]
    F --> G[Aggregate results]
    G --> H[LLM synthesis]
    style B stroke:#333,stroke-width:4px
    style D stroke:#333,stroke-width:4px
    style F stroke:#333,stroke-width:4px
```
This preserves the familiar LLM → SQL → DuckDB loop, but anchors it in facts.
A profile answers the boring-but-urgent questions immediately: How many rows? Which columns are mostly null? Is anything constant, or obviously an identifier? Is this column junk?
These are the questions you normally discover 30 minutes into spreadsheet archaeology.
The profile gives you those answers in seconds.
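As a sketch of what that triage can look like in code - note that the per-column `Columns`/`NullRate` fields here are my assumption about the JSON shape, not the documented schema; check the README for the real one:

```csharp
// Sketch: answer "is this column junk?" from the profile, not the data.
// "Columns", "Name" and "NullRate" are assumed field names for
// illustration; see the DataSummarizer README for the actual schema.
using System;
using System.IO;
using System.Text.Json;

using var doc = JsonDocument.Parse(File.ReadAllText("profile.json"));

foreach (var col in doc.RootElement
                       .GetProperty("Profile")
                       .GetProperty("Columns")
                       .EnumerateArray())
{
    var nullRate = col.GetProperty("NullRate").GetDouble();
    if (nullRate > 0.10) // same >10% threshold the summary line reports
        Console.WriteLine(
            $"Review {col.GetProperty("Name").GetString()}: {nullRate:P0} null");
}
```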
If you do enable the LLM, the profile is what stops it from being performative.
With profile-only context, the model can interpret real statistics, propose safe read-only SQL, and suggest follow-up analysis instead of inventing numbers.
You don’t need a bigger model.
You need better evidence.
I turned this into a CLI so I could run it on arbitrary files - including in cron and CI - without hand-writing analysis every time.
Windows: drag a file onto datasummarizer.exe.

CLI:

```bash
datasummarizer mydata.csv
datasummarizer tool -f mydata.csv > profile.json
```
That profile.json is the contract. The defaults:
- Ollama endpoint: http://localhost:11434 (configurable)
- Default model: qwen2.5-coder:7b (override with --model)
- Default registry DB: .datasummarizer.vss.duckdb
- SQL safety constraints: COPY, ATTACH, INSTALL, CREATE, DROP, INSERT, UPDATE, DELETE, and unsafe PRAGMA statements are blocked

The defaults are intentionally conservative. You can loosen them - but you have to opt in.
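For a sense of what such a gate looks like, here’s a deliberately crude sketch - not DataSummarizer’s actual validator (the real tool is finer-grained; it blocks only *unsafe* PRAGMAs, for instance):

```csharp
// Crude sketch of a read-only SQL gate: one statement, no mutating
// keywords. The real checks are finer-grained; this just shows the
// shape of the idea.
using System;
using System.Linq;
using System.Text.RegularExpressions;

static bool IsReadOnly(string sql)
{
    // Reject multi-statement input outright.
    if (sql.TrimEnd().TrimEnd(';').Contains(';'))
        return false;

    string[] banned =
    {
        "COPY", "ATTACH", "INSTALL", "CREATE", "DROP",
        "INSERT", "UPDATE", "DELETE", "PRAGMA"
    };
    return !banned.Any(kw =>
        Regex.IsMatch(sql, $@"\b{kw}\b", RegexOptions.IgnoreCase));
}

Console.WriteLine(IsReadOnly("SELECT avg(age) FROM t")); // True
Console.WriteLine(IsReadOnly("DROP TABLE patients"));    // False
```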
```bash
datasummarizer -f patients.csv --no-llm --fast
```

Output (974 patient records with PII):

```text
── Summary ────────────────────────────────────────────────
974 rows, 20 columns. 4 columns have >10% nulls. 8 warnings.
```
Alerts flag leakage risks, high-null columns, constant fields, and suspicious identifiers - all computed by DuckDB.
No guessing. No hallucination.
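These checks are cheap. Here’s the sort of DuckDB query that can back a “suspicious identifier” alert - the column name, file, and threshold are placeholders, not the tool’s actual check:

```csharp
// Sketch: a column whose distinct count ~= row count is probably an
// identifier - exactly the kind of leakage risk worth flagging.
// 'patient_id' and the 0.99 threshold are illustrative choices.
using System;
using DuckDB.NET.Data;

using var conn = new DuckDBConnection("Data Source=:memory:");
conn.Open();

using var cmd = conn.CreateCommand();
cmd.CommandText = """
    SELECT count(DISTINCT patient_id)::DOUBLE / count(*) AS uniqueness
    FROM read_csv_auto('patients.csv');
    """;

if ((double)cmd.ExecuteScalar()! > 0.99)
    Console.WriteLine("Alert: patient_id looks like a unique identifier");
```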
```bash
datasummarizer tool -f patients.csv --store > profile.json
```

Abridged output:
```json
{
  "Profile": {
    "RowCount": 974,
    "ColumnCount": 20,
    "ExecutiveSummary": "974 rows, 20 columns. 18 alerts."
  },
  "Metadata": {
    "ProfileId": "0ae8dcc4d79b",
    "SchemaHash": "44d9ad8af68c1c62"
  }
}
```
This is designed to be diffed, stored, and audited.
Once profiles exist, drift becomes cheap:

```bash
datasummarizer tool -f daily_export.csv --auto-drift --store
```
It’s boring infrastructure - which is exactly what you want.
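A sketch of how that slots into a pipeline: the Metadata.SchemaHash field matches the abridged output above, but the gating logic, file paths, and exit-code convention here are mine, not part of the tool.

```csharp
// Sketch: fail a cron/CI job on schema drift using two stored profiles.
// Metadata.SchemaHash follows the abridged output above; the paths and
// exit-code convention are illustrative.
using System;
using System.IO;
using System.Text.Json;

static string SchemaHash(string path)
{
    using var doc = JsonDocument.Parse(File.ReadAllText(path));
    return doc.RootElement.GetProperty("Metadata")
              .GetProperty("SchemaHash").GetString()!;
}

if (SchemaHash("profiles/yesterday.json") != SchemaHash("profiles/today.json"))
{
    Console.Error.WriteLine("Schema drift detected; failing the job.");
    Environment.Exit(1); // non-zero exit fails the cron/CI step
}
```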
Once you have the statistical shape, you can do something genuinely useful: generate synthetic datasets that match the profile’s distributions without containing a single real record.
This is ideal for demos, CI, support repros, and shareable samples.
The fidelity report quantifies how close the synthetic data is - rather than hand-waving “realistic”.
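To make “quantifies” concrete, here’s one metric such a report might compute - the gap between real and synthetic means, measured in standard deviations. The metric choice, column, and file names are illustrative, not what the tool actually reports.

```csharp
// Sketch: one possible fidelity metric - how far the synthetic mean
// drifts from the real mean, in units of the real standard deviation.
// 'amount', the file names, and the metric itself are illustrative.
using System;
using DuckDB.NET.Data;

using var conn = new DuckDBConnection("Data Source=:memory:");
conn.Open();

using var cmd = conn.CreateCommand();
cmd.CommandText = """
    WITH src AS (SELECT avg(amount) AS m, stddev(amount) AS s
                 FROM read_csv_auto('real.csv')),
         syn AS (SELECT avg(amount) AS m
                 FROM read_csv_auto('synthetic.csv'))
    SELECT abs(src.m - syn.m) / nullif(src.s, 0) FROM src, syn;
    """;

Console.WriteLine($"Mean gap: {cmd.ExecuteScalar()} standard deviations");
```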
Running with --no-llm produces fully auditable outputs suitable for CI or regulated environments.

The LLM never invents facts. It reacts to them.
Repo: https://github.com/scottgal/mostlylucidweb/tree/main/Mostlylucid.DataSummarizer
Requirements: a local Ollama install is only needed for the optional LLM features (default model qwen2.5-coder:7b).
Related: How to Analyse Large CSV Files with Local LLMs in C#
© 2026 Scott Galloway — Unlicense — All content and source code on this site is free to use, copy, modify, and sell.