You open a new CSV. 10,000 rows, 50 columns. What's in here? Which columns are junk? Where are the nulls? What's skewed?
DataSummarizer answers those questions in under a second — no cloud upload, no manual exploration, just deterministic stats + optional AI insights.
On Windows: Drag a CSV/Excel/Parquet file onto datasummarizer.exe and it analyzes instantly.
On Linux/macOS, or from any command line:
datasummarizer mydata.csv
It detects the file and profiles it automatically (no -f flag needed).
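The same command works for the other supported formats (file names here are placeholders):
# No flags needed - the format is detected automatically
datasummarizer mydata.parquet
datasummarizer mydata.xlsx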
Or run with no args for interactive mode:
datasummarizer
It prompts you for a file path:
Welcome to DataSummarizer!
DuckDB-powered data profiling - analyze CSV, Excel, Parquet, JSON files
Enter path to data file: _
Why this is great: Zero friction. Drag-and-drop, or just pass the filename. No flags to remember, no docs to read first.
If you prefer commands:
datasummarizer -f pii-test.csv --no-llm --fast
Actual output (5 rows × 4 columns in <1 second):
── Summary ─────────────────────────────────────────────────────────
This dataset contains 5 rows and 4 columns. Column breakdown: 4
categorical. No major data quality issues detected.
╭────────┬─────────────┬───────┬────────┬───────────────────────╮
│ Column │ Type │ Nulls │ Unique │ Stats │
├────────┼─────────────┼───────┼────────┼───────────────────────┤
│ Name │ Categorical │ 0.0% │ 4 │ top: Alice Brown │
│ Email │ Categorical │ 0.0% │ 4 │ top: alice@domain.net │
│ Phone │ Categorical │ 0.0% │ 5 │ top: 555-789-0123 │
│ SSN │ Categorical │ 0.0% │ 5 │ top: 222-33-4444 │
╰────────┴─────────────┴───────┴────────┴───────────────────────╯
── Alerts ──────────────────────────────────────────────────────────
- Phone: 100.0% unique - possibly an ID column
- SSN: 100.0% unique - possibly an ID column
- Name: ⚠ Potential PersonName detected (30% confidence). Risk level: Medium
- Email: ⚠ Potential Email detected (100% confidence). Risk level: High
- Phone: ⚠ Potential PhoneNumber detected (100% confidence). Risk level: High
- SSN: ⚠ Potential SSN detected (100% confidence). Risk level: Critical
What you get instantly: inferred column types, null percentages, unique counts, top values, and PII/ID alerts - all deterministic, computed by DuckDB, not guessed by an LLM.
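Every number in that table is an ordinary SQL aggregate, so you can sanity-check any of them in the DuckDB CLI. The query below is our own illustration, not DataSummarizer's internal code:
# Reproduce the Email column's stats by hand
duckdb -c "SELECT count(*) - count(Email) AS nulls,
                  count(DISTINCT Email)   AS uniques,
                  mode(Email)             AS top_value
           FROM read_csv_auto('pii-test.csv')"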
Compare two datasets to understand distributional differences:
datasummarizer segment --segment-a pii-test.csv --segment-b timeseries-weekly.csv
Actual output:
{
"SegmentAName": "pii-test.csv",
"SegmentBName": "timeseries-weekly.csv",
"SegmentARowCount": 5,
"SegmentBRowCount": 365,
"Similarity": 0,
"OverallDistance": 1,
"AnomalyScoreA": 0.28,
"AnomalyScoreB": 0.018,
"Insights": [
"Segments are substantially different (<50% similarity)",
"Segment sizes differ by +7200.0% (5 vs 365 rows)"
]
}
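Because that output is plain JSON, it composes with jq. A minimal sketch of a pipeline gate (the 0.5 threshold and file names are arbitrary):
# Fail the step if two exports are less than 50% similar
datasummarizer segment --segment-a old.csv --segment-b new.csv \
  | jq -e '.Similarity >= 0.5' > /dev/null \
  || { echo "segments diverged" >&2; exit 1; }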
Use cases range from comparing this week's export against last week's to checking that a sample still resembles the full table.
Monitor data changes without manual baseline management:
datasummarizer tool -f daily_export.csv --auto-drift --store
What happens:
For the first run, drift is null (no baseline exists yet):
{
"Success": true,
"Profile": { "RowCount": 5, "ColumnCount": 4 },
"Drift": null
}
On subsequent runs, you get drift metrics comparing the new profile against the stored baseline.
Run it in a cron job:
# Daily at 2am
0 2 * * * datasummarizer tool -f /data/daily_export.csv --auto-drift --store > /logs/drift.json
No manual baseline management - it automatically picks the right baseline based on schema fingerprint.
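To turn that into an alert, test the Drift field from the JSON shown above (a sketch; the mail command is just a placeholder notifier):
# Alert when drift is reported (Drift is null on the baseline run)
datasummarizer tool -f /data/daily_export.csv --auto-drift --store > /tmp/drift.json
jq -e '.Drift != null' /tmp/drift.json > /dev/null \
  && mail -s "Data drift in daily_export.csv" you@example.com < /tmp/drift.json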
DataSummarizer keeps a profile store (local DuckDB) to track profiles over time.
List all stored profiles:
datasummarizer store list
Actual output:
╭──────────────┬──────────────────┬────────┬──────┬──────────┬─────────────────╮
│ ID │ File │ Rows │ Cols │ Schema │ Stored │
├──────────────┼──────────────────┼────────┼──────┼──────────┼─────────────────┤
│ 74e6b186cfad │ pii-test.csv │ 5 │ 4 │ 26240c83 │ 2025-12-20 │
│ │ │ │ │ │ 12:26 │
│ a8edaed514a8 │ pii-test.csv │ 5 │ 4 │ 26240c83 │ 2025-12-20 │
│ │ │ │ │ │ 01:45 │
╰──────────────┴──────────────────┴────────┴──────┴──────────┴─────────────────╯
Total: 2 profile(s)
Interactive menu (requires an interactive terminal):
datasummarizer store
Other store commands:
# Show statistics
datasummarizer store stats
# Prune old profiles (keep 5 per schema)
datasummarizer store prune --keep 5
# Clear all profiles
datasummarizer store clear
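These compose with the nightly job from earlier; for example, pruning right after each stored profile keeps the store bounded (the retention of 30 is an arbitrary choice):
# Profile, store, then keep roughly a month of history per schema
datasummarizer tool -f /data/daily_export.csv --auto-drift --store > /logs/drift.json
datasummarizer store prune --keep 30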
Deterministic (computed facts): row and column counts, null percentages, unique counts, and the per-column stats shown in the tables above - all computed by DuckDB.
Heuristic (fast approximations): PII detection with confidence scores, ID-column flags, and the similarity/anomaly scores in segment comparisons.
LLM-generated (optional): narrative insights and generated SQL, always layered on top of the computed facts.
flowchart LR
F[CSV/Excel/Parquet/JSON] --> D[DuckDB<br/>computes stats]
D --> P[Profile<br/>facts + alerts]
P --> R[Report]
P -.-> L[Optional LLM<br/>narrate or SQL]
L -.-> R
style D stroke:#333,stroke-width:4px
style L stroke:#333,stroke-dasharray: 5 5
Why this order matters: LLMs can't reliably compute aggregates from raw rows. We compute facts first, then optionally narrate.
Key design principles: facts first, narration last. The deterministic / heuristic / LLM split above is the contract - every number comes from DuckDB, never from the model. And for SQL safety (when the LLM is enabled), the model only proposes queries; DuckDB executes them locally, so answers stay grounded in the actual data.
Built on .NET 10, with out-of-core analytics via DuckDB:
| Dataset | Rows | Columns | Time (--fast --no-llm) |
|---|---|---|---|
| Small PII | 5 | 4 | <1 second |
| Time-series | 365 | 8 | <1 second |
| Bank churn (typical) | 10,000 | 13 | ~1 second |
| Sales (larger) | 100,000 | 14 | ~2 seconds |
| Wide table | 50,000 | 200 | ~8 seconds (with --max-columns 50) |
Memory: DuckDB handles files larger than RAM using out-of-core processing.
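The timings above are easy to reproduce (file names are placeholders; your numbers will vary with hardware and file contents):
# Time a run with the same flags used in the table
time datasummarizer -f sales.csv --fast --no-llm
# For wide tables, cap the profiled columns as in the last row
time datasummarizer -f wide.csv --fast --no-llm --max-columns 50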
DataSummarizer supports three output formats:
1. Human-readable (default): the rich console report shown throughout this post.
2. JSON (tool mode):
datasummarizer tool -f data.csv > profile.json
3. Markdown/HTML:
datasummarizer validate --source a.csv --target b.csv --format markdown
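For scripting, the JSON form pipes cleanly into other tools; the field names here come from the auto-drift example above:
# Pull a single fact out of the JSON profile
datasummarizer tool -f data.csv | jq '.Profile.RowCount'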
Repository: github.com/scottgal/mostlylucidweb/tree/main/Mostlylucid.DataSummarizer
Requirements: .NET 10.
Full documentation: See the README for comprehensive command reference, all options, and advanced features.
© 2025 Scott Galloway — Unlicense — All content and source code on this site is free to use, copy, modify, and sell.