
C# Data Analysis DuckDB LLM

DataSummarizer: Fast Local Data Profiling

Monday, 22 December 2025

Most “chat with your data” systems make the same mistake: they treat an LLM as if it were a database.

They shove rows into context, embed chunks, or pick “representative samples” and hope the model can infer structure, quality, and truth from anecdotes. It works just well enough to demo - and then collapses under scale, privacy constraints, or basic questions like “is this column junk?”

DataSummarizer takes the opposite approach.

It computes a deterministic statistical profile of your dataset using DuckDB, persists that profile, and optionally layers a local LLM on top to interpret those facts, propose safe read-only SQL, and guide follow-up analysis. The heavy lifting is numeric, auditable, and fast. The LLM is deliberately constrained to reasoning and narration.

The result is:

  • fast, private profiling
  • trustworthy automation (tool JSON)
  • built-in drift detection
  • and something most systems can’t do at all: PII-free synthetic datasets that preserve statistical shape

All without sending raw rows to a model.

This builds directly on How to Analyse Large CSV Files with Local LLMs in C#, which introduced the core idea:

LLMs should generate queries, not consume data.

This article pushes that idea further:

  • don’t feed the LLM rows
  • don’t even feed it “samples” unless you must
  • feed it statistical shape
  • let DuckDB compute facts
  • let the LLM reason over those facts - and ask for more


Full documentation lives in the DataSummarizer README. The tool looks simple. It isn’t. Most of the work is in what isn’t sent to the model.

Note: This isn’t 1.0 yet. That’s intentional. The interface is stable; the edges are still being sharpened.


The Problem With “Chat With Your Data”

When teams bolt an LLM onto data analysis, they usually do one of three things:

  1. shove rows into context (falls over fast, leaks data)
  2. embed chunks and retrieve them (still row-level, still leaky)
  3. pass “representative samples” (often unrepresentative, still risky)

Even with 200k-token context windows, this is the wrong abstraction.

LLMs are not designed to compute aggregates, detect skew, or reason reliably about distributional properties by reading rows. They hallucinate because they’re being asked to do database work.

The correct split is still:

LLM reasons. Database computes.

But you can go one step further by changing what the LLM reasons over.


The Key Upgrade: Statistics as the Interface

Instead of giving the model data, give it a profile.

A profile is a compact, deterministic summary of a dataset’s shape:

  • schema + inferred types
  • null rate, uniqueness, cardinality
  • min / max / quantiles, skew, outliers
  • top values for safe categoricals
  • PII risk signals (regex + classifier)
  • time-series structure (span, gaps, granularity)
  • optional drift deltas vs a baseline

This profile becomes the interface between reasoning and computation.
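
As a concrete sketch, a per-column slice of that profile might be modeled like this (the field names here are illustrative, not DataSummarizer’s actual schema):

// Illustrative shape only - not the tool's actual schema.
public record ColumnProfile(
    string Name,
    string InferredType,
    double NullRate,                    // fraction of nulls
    long DistinctCount,                 // cardinality
    double? Min, double? Max,           // numeric columns only
    double[]? Quantiles,                // e.g. p25 / p50 / p75 / p95
    IReadOnlyList<string>? TopValues,   // safe categoricals only
    bool PiiSuspected);                 // regex + classifier signal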

The model can now:

  • decide what’s interesting
  • propose sensible follow-up queries
  • interpret results
  • detect drift

…without ever seeing raw rows.

Architecture

flowchart LR
    A[Data file] --> B[DuckDB profiles]
    B --> C[Statistical Profile JSON]
    C --> D[LLM reasoning step]
    D --> E[SQL/tool calls]
    E --> F[DuckDB executes locally]
    F --> G[Aggregate results]
    G --> H[LLM synthesis]

    style B stroke:#333,stroke-width:4px
    style D stroke:#333,stroke-width:4px
    style F stroke:#333,stroke-width:4px

This preserves the familiar LLM → SQL → DuckDB loop, but anchors it in facts:

  • profiling is deterministic
  • the LLM operates on evidence, not anecdotes
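
One turn of that loop, sketched in C# against Ollama’s /api/generate endpoint and the DuckDB.NET ADO.NET driver (prompting, validation, and error handling elided; treat the payloads as assumptions, not the tool’s internals):

using System.Net.Http.Json;
using System.Text.Json;
using DuckDB.NET.Data;

// 1. Send the profile - never rows - to a local model; ask for one query.
var profileJson = File.ReadAllText("profile.json");
using var http = new HttpClient();
var reply = await http.PostAsJsonAsync("http://localhost:11434/api/generate", new
{
    model = "qwen2.5-coder:7b",
    prompt = $"Given this profile, propose ONE read-only DuckDB query:\n{profileJson}",
    stream = false
});
var sql = (await reply.Content.ReadFromJsonAsync<JsonElement>())
    .GetProperty("response").GetString()!;

// 2. Execute locally; only small aggregates ever reach the model.
using var conn = new DuckDBConnection("Data Source=:memory:");
conn.Open();
using var cmd = conn.CreateCommand();
cmd.CommandText = sql;   // the real tool validates this first (see the safety constraints below)
using var reader = cmd.ExecuteReader();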

Why the Profile Is Useful (Even Without an LLM)

A profile answers the boring-but-urgent questions immediately:

  • Which columns are junk (all-null, constant, near-constant)?
  • Which columns are leakage risks (near-unique identifiers)?
  • Which columns are high-null or outlier-heavy?
  • What are the dominant categories?
  • Is this time series contiguous or full of gaps?
  • Are distributions skewed or zero-inflated?

These are the questions you normally discover 30 minutes into spreadsheet archaeology.

The profile gives you them in seconds.
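
The junk-column checks, for instance, reduce to aggregates DuckDB computes in a single pass. A sketch (illustrative query with a placeholder column name, not the tool’s exact SQL):

using DuckDB.NET.Data;

using var conn = new DuckDBConnection("Data Source=:memory:");
conn.Open();
using var cmd = conn.CreateCommand();
cmd.CommandText = """
    SELECT
        count(*) - count(col)                 AS null_count,     -- high-null?
        count(DISTINCT col)                   AS distinct_count, -- constant if 1
        count(DISTINCT col) * 1.0 / count(*)  AS unique_ratio    -- leakage risk near 1.0
    FROM read_csv_auto('mydata.csv');
    """;
using var reader = cmd.ExecuteReader();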


Why the Profile Makes the LLM Actually Useful

If you do enable the LLM, the profile is what stops it being performative.

With profile-only context, the model can:

  • focus on high-entropy or high-skew columns
  • suggest sensible group-bys (low-cardinality categoricals)
  • avoid garbage SQL (no grouping on 95%-unique columns)
  • notice distributional drift
  • ask for targeted follow-up statistics instead of requesting more data

You don’t need a bigger model.

You need better evidence.


The Tool: DataSummarizer

I turned this into a CLI so I could run it on arbitrary files - including in cron and CI - without hand-writing analysis every time.

GitHub release

Quick start

Windows: drag a file onto datasummarizer.exe

CLI:

datasummarizer mydata.csv

Tool mode (JSON output)

datasummarizer tool -f mydata.csv > profile.json

That profile.json is the contract:

  • consumed by agents
  • stored for drift detection
  • used for synthetic data generation

Operational Defaults (Explicit)

  • Ollama endpoint: http://localhost:11434 (configurable)

  • Default model: qwen2.5-coder:7b (override with --model)

  • Default registry DB: .datasummarizer.vss.duckdb

  • SQL safety constraints:

    • max 20 rows returned to the LLM
    • forbidden: COPY, ATTACH, INSTALL, CREATE, DROP, INSERT, UPDATE, DELETE, unsafe PRAGMA

The defaults are intentionally conservative. You can loosen them - but you have to opt in.
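
As a sketch, the keyword denylist above can be enforced in a few lines (naive pattern-matching for illustration only; real validation should parse the statement, and the tool blocks unsafe PRAGMAs rather than all of them):

using System.Linq;
using System.Text.RegularExpressions;

// Naive read-only guard: reject any statement containing a write/DDL keyword.
static bool IsReadOnly(string sql)
{
    string[] forbidden = { "COPY", "ATTACH", "INSTALL", "CREATE", "DROP",
                           "INSERT", "UPDATE", "DELETE", "PRAGMA" };
    return !forbidden.Any(kw =>
        Regex.IsMatch(sql, $@"\b{kw}\b", RegexOptions.IgnoreCase));
}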


Deterministic Profiling (Example)

datasummarizer -f patients.csv --no-llm --fast

Output (974 patient records with PII):

── Summary ────────────────────────────────────────────────
974 rows, 20 columns. 4 columns have >10% nulls. 8 warnings.

Alerts flag leakage risks, high-null columns, constant fields, and suspicious identifiers - all computed by DuckDB.

No guessing. No hallucination.


Tool JSON for Automation

datasummarizer tool -f patients.csv --store > profile.json

Abridged output:

{
  "Profile": {
    "RowCount": 974,
    "ColumnCount": 20,
    "ExecutiveSummary": "974 rows, 20 columns. 18 alerts."
  },
  "Metadata": {
    "ProfileId": "0ae8dcc4d79b",
    "SchemaHash": "44d9ad8af68c1c62"
  }
}

This is designed to be diffed, stored, and audited.
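
Consuming it from an agent or CI step is a plain JSON round-trip. A minimal sketch against the abridged shape above:

using System.Text.Json;

// Pull out the identifiers that later runs key drift baselines on.
var doc = JsonDocument.Parse(File.ReadAllText("profile.json"));
var meta = doc.RootElement.GetProperty("Metadata");
Console.WriteLine($"ProfileId:  {meta.GetProperty("ProfileId").GetString()}");
Console.WriteLine($"SchemaHash: {meta.GetProperty("SchemaHash").GetString()}");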


Automatic Drift Detection (Cron-Friendly)

Once profiles exist, drift becomes cheap:

datasummarizer tool -f daily_export.csv --auto-drift --store
  • KS distance for numeric columns
  • Jensen–Shannon divergence for categoricals
  • schema-fingerprinted baselines

It’s boring infrastructure - which is exactly what you want.
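
Neither metric is exotic. Jensen–Shannon divergence over two category-frequency maps, for example, is a few lines (a sketch; DataSummarizer’s own implementation may differ):

using System;
using System.Collections.Generic;
using System.Linq;

// Jensen-Shannon divergence between two categorical distributions given as
// value -> probability maps. Symmetric and bounded above by ln(2).
static double JensenShannon(Dictionary<string, double> p, Dictionary<string, double> q)
{
    var keys = p.Keys.Union(q.Keys).ToArray();
    double Mid(string k) => 0.5 * (p.GetValueOrDefault(k) + q.GetValueOrDefault(k));
    double Kl(Dictionary<string, double> a) => keys.Sum(k =>
    {
        var pa = a.GetValueOrDefault(k);
        return pa > 0 ? pa * Math.Log(pa / Mid(k)) : 0.0;
    });
    return 0.5 * Kl(p) + 0.5 * Kl(q);
}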


The Killer Feature: Synthetic Clones From Shape

Once you have statistical shape, you can do something genuinely useful:

Generate synthetic datasets that:

  • contain no original values
  • contain no PII
  • preserve distribution, cardinality, and sparsity effects

This is ideal for demos, CI, support repros, and shareable samples.

The fidelity report quantifies how close the synthetic data is - rather than hand-waving “realistic”.
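
For a numeric column, “shape without values” can be as simple as inverse-transform sampling from the profiled quantiles. A toy sketch (the real generator is richer than this):

using System;

// Draw a synthetic value from a piecewise-linear CDF built from quantiles.
// probs ascending in [0,1]; values are the matching quantile values,
// e.g. probs = {0, .25, .5, .75, 1}, values = {min, p25, median, p75, max}.
static double SampleFromQuantiles(double[] probs, double[] values, Random rng)
{
    var u = rng.NextDouble();
    var i = Array.BinarySearch(probs, u);
    if (i >= 0) return values[i];
    i = ~i;                          // index of first prob greater than u
    if (i == 0) return values[0];
    if (i == probs.Length) return values[^1];
    var t = (u - probs[i - 1]) / (probs[i] - probs[i - 1]);
    return values[i - 1] + t * (values[i] - values[i - 1]);
}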


Trust Model

  • Deterministic: all stats, quantiles, drift scores, and alerts are computed by DuckDB.
  • Optional LLM: narration, reasoning, and SQL suggestions only.
  • Hermetic mode: --no-llm produces fully auditable outputs suitable for CI or regulated environments.

The LLM never invents facts. It reacts to them.


Get It

Repo: https://github.com/scottgal/mostlylucidweb/tree/main/Mostlylucid.DataSummarizer

Requirements:

  • .NET 10
  • DuckDB (embedded)
  • optional: Ollama

