CLI tool to summarize datasets for LLM context injection.

abguven/data-summarizer-llm

📊 Data Summarizer for LLMs

Generate compact, context-rich dataset summaries optimized for Large Language Models (Gemini, ChatGPT, Claude).

Python Polars Docker License

🧐 Why this tool?

When working with LLMs (like Gemini or Claude), you often need to give the model context about your data without uploading the entire 100MB CSV file, which would burn tokens and fill the context window.

This tool reads your datasets (CSV, Excel, JSON, Parquet) and generates a lightweight Markdown summary containing:

  • ✅ Column names & Types
  • ✅ Missing values percentage
  • ✅ Unique value counts
  • ✅ ASCII distributions for numeric columns (▂▃▅█)
  • ✅ Sample values
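
To make the ASCII distribution item concrete, here is a minimal sketch of how a numeric column could be turned into a row of block characters. It is an illustration only, not the tool's actual code: the function name `ascii_distribution`, the glyph ramp `BLOCKS`, and the default bin count are all assumptions.

```python
from typing import Optional, Sequence

# Assumed glyph ramp, from empty bin to fullest bin.
BLOCKS = " ▂▃▅█"


def ascii_distribution(values: Sequence[Optional[float]], bins: int = 10) -> str:
    """Return one block character per equal-width histogram bin."""
    data = [v for v in values if v is not None]
    if not data:
        return ""
    lo, hi = min(data), max(data)
    if lo == hi:
        # Constant column: everything falls into a single bin.
        return BLOCKS[-1]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in data:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    top = len(BLOCKS) - 1
    # Empty bins stay blank; non-empty bins get at least the smallest block.
    return "".join(
        BLOCKS[0] if c == 0 else BLOCKS[max(1, round(c / peak * top))]
        for c in counts
    )


print(ascii_distribution([1, 2, 2, 3, 3, 3, 4, 4, 5, 9], bins=5))
```

Scaling every bin against the fullest one keeps the output at a fixed width regardless of how many rows the column has, which is what makes this kind of summary cheap in tokens.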

You can then simply copy-paste or attach this Markdown summary to your LLM prompt.
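
As a rough picture of what such a summary contains, the sketch below computes the per-column statistics listed above and renders them as a Markdown table. It uses only the standard library and plain dictionaries so it stays self-contained; the tool itself is built on Polars, so treat `summarize_columns`, its table layout, and the sample data as hypothetical.

```python
from typing import Any


def summarize_columns(rows: list[dict[str, Any]], n_samples: int = 3) -> str:
    """Render a Markdown table with one line per column."""
    lines = [
        "| Column | Type | Missing % | Unique | Samples |",
        "|---|---|---|---|---|",
    ]
    if not rows:
        return "\n".join(lines)
    for name in rows[0]:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        missing_pct = 100 * (len(values) - len(present)) / len(values)
        # The Python type of the first non-missing value stands in for a dtype.
        dtype = type(present[0]).__name__ if present else "unknown"
        # dict.fromkeys drops duplicates while keeping first-seen order.
        distinct = list(dict.fromkeys(present))
        samples = ", ".join(str(v) for v in distinct[:n_samples])
        lines.append(
            f"| {name} | {dtype} | {missing_pct:.1f} | {len(distinct)} | {samples} |"
        )
    return "\n".join(lines)


# Made-up example data, for illustration only.
rows = [
    {"city": "Paris", "temp": 21.5},
    {"city": "Lyon", "temp": None},
    {"city": "Paris", "temp": 19.0},
    {"city": "Nice", "temp": 24.0},
]
print(summarize_columns(rows))
```

The resulting table is a few lines long however large the dataset is, so it can be pasted into a prompt as-is.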


🚀 Quick Start (Docker)

No Python installation required. Just use Docker.

1. Structure

Create two folders on your computer:

my_project/
├── input/   <-- Put your CSV/Excel files here
└── output/  <-- Summaries will appear here

2. Run

Windows (PowerShell):

docker run --rm `
  -v "${PWD}/input:/app/data/input" `
  -v "${PWD}/output:/app/data/output" `
  abguven/data-summarizer:latest

Linux / Mac:

docker run --rm \
  -v "$(pwd)/input:/app/data/input" \
  -v "$(pwd)/output:/app/data/output" \
  abguven/data-summarizer:latest

🛠️ Features

  • Blazing Fast: Built on top of Polars (Rust-based DataFrame library).
  • Format Support: .csv, .parquet, .json, .xlsx, .xls.
  • Robust Excel: Includes a fallback mechanism (FastExcel -> Xlsx2csv) to handle complex or older Excel files.
  • Privacy First: Runs entirely locally in a container. No data leaves your machine.
  • Batch Processing: Analyzes all files in the input directory at once.
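
The batch-processing and Excel-fallback features can be sketched in a few lines. This is a guess at the general shape, not the project's implementation: `find_datasets` and `read_with_fallback` are hypothetical helpers, and the readers are passed in as plain callables so the example runs without any third-party dependency. With Polars, the two readers might be `pl.read_excel(path, engine="calamine")` followed by `pl.read_excel(path, engine="xlsx2csv")`, assuming a Polars release that accepts those engine names.

```python
from pathlib import Path
from typing import Any, Callable, Iterable

# Extensions listed under "Format Support".
SUPPORTED = {".csv", ".parquet", ".json", ".xlsx", ".xls"}


def find_datasets(input_dir: Path) -> list[Path]:
    """Return the supported files in input_dir, sorted by path."""
    return sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def read_with_fallback(path: Path, readers: Iterable[Callable[[Path], Any]]) -> Any:
    """Try each reader in order and return the first successful result."""
    errors = []
    for reader in readers:
        try:
            return reader(path)
        except Exception as exc:  # broad on purpose: any failure triggers the next reader
            errors.append(f"{getattr(reader, '__name__', reader)}: {exc}")
    raise RuntimeError(f"All readers failed for {path}: " + "; ".join(errors))
```

Collecting the error from each failed reader means that when every engine fails, the final message shows why each one gave up instead of only the last exception.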

📦 Installation (For Developers)

If you want to modify the code or run it without Docker:

# Clone the repo
git clone https://github.com/abguven/data-summarizer-llm.git
cd data-summarizer-llm

# Install dependencies
pip install -r requirements.txt

# Run
python src/summarize_dataset.py

(Note: when running locally without Docker, you'll need to adjust the input/output paths in the script.)
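
One way to avoid editing paths by hand is to read them from environment variables and fall back to the container paths used in the Docker commands above. The sketch below is only a suggestion: the variable names `SUMMARIZER_INPUT_DIR` and `SUMMARIZER_OUTPUT_DIR` and the helper `resolve_dirs` are hypothetical and are not part of the current script.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_dirs(env: Mapping[str, str] = os.environ) -> tuple[Path, Path]:
    """Return (input_dir, output_dir), preferring environment overrides."""
    # Defaults mirror the mount points from the docker run commands.
    input_dir = Path(env.get("SUMMARIZER_INPUT_DIR", "/app/data/input"))
    output_dir = Path(env.get("SUMMARIZER_OUTPUT_DIR", "/app/data/output"))
    return input_dir, output_dir


input_dir, output_dir = resolve_dirs()
print(f"Reading from {input_dir}, writing to {output_dir}")
```

With something like this in place, a local run could set both variables to `./input` and `./output` while the Docker image keeps working unchanged.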


🤝 Contributing

Feel free to open issues or submit PRs if you want to add support for SQL databases or more advanced statistics!


Created by abguven for Data Engineering workflows.
