Using MCP as a bridge between LLMs and large structured datasets. Unlike static notebooks, MCP enables dynamic, natural-language querying: the LLM invokes MCP tools to access and query data about 3,685× larger than Claude's context window.
WildChat dataset: https://huggingface.co/datasets/allenai/WildChat-1M (paper: https://arxiv.org/abs/2405.01470)
- 1M conversations = 737M tokens
- Claude's context limit = 200K tokens (ChatGPT = ~128K)
- Dataset is 3,685× larger than Claude's context window
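The ratio follows directly from the two token counts quoted above. A quick check, using only those figures:

```python
# Figures quoted above: dataset size and Claude's context limit, in tokens.
DATASET_TOKENS = 737_000_000
CONTEXT_TOKENS = 200_000

ratio = DATASET_TOKENS / CONTEXT_TOKENS
print(f"Dataset is {ratio:,.0f}x larger than the context window")
```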
The notebooks cover static analysis, the traditional data science work: creating visualizations, computing distributions, and answering predefined questions. The MCP server adds dynamic interaction, where an LLM explores the dataset conversationally through on-demand queries.
DuckDB:
- One-time setup: load Parquet -> DuckDB database file
- Notebook and MCP queries run against the DuckDB file directly
- Benefits:
  - Zero-config embedded database
  - Handles 1GB-100GB datasets efficiently
  - SQL interface without server overhead

Spark is designed for distributed clusters, for when data exceeds single-machine RAM. WildChat does not meet this threshold.
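The one-time Parquet -> DuckDB step could look roughly like the sketch below. The table name `conversations`, the helper names, and the example paths are assumptions for illustration, not names taken from this project; `duckdb` is imported lazily so the SQL builder can be used on its own.

```python
def build_load_sql(parquet_glob: str, table: str = "conversations") -> str:
    """Return the SQL that materializes the Parquet files as a DuckDB table."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    escaped = parquet_glob.replace("'", "''")  # escape quotes for the SQL literal
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM read_parquet('{escaped}')"
    )


def load_parquet(db_path: str, parquet_glob: str, table: str = "conversations") -> int:
    """One-time setup: write the DuckDB file and return the loaded row count."""
    import duckdb  # lazy import; requires `pip install duckdb`

    con = duckdb.connect(db_path)  # creates the file if it does not exist
    try:
        con.execute(build_load_sql(parquet_glob, table))
        return con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    finally:
        con.close()


# Hypothetical usage (paths are placeholders):
# load_parquet("wildchat.duckdb", "data/*.parquet")
```

After this runs once, both the notebooks and the MCP server can open the same database file instead of re-reading the Parquet files.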
| Without MCP | With MCP |
|---|---|
| LLMs hallucinate about unseen data | Tools execute actual database queries |
| Static notebooks = fixed questions only | Dynamic, ad-hoc exploration |
| Context window limits what's queryable | Access datasets 1000× larger than context |
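To make "tools execute actual database queries" concrete, here is a minimal sketch of an MCP tool that runs read-only SQL against the DuckDB file. It assumes the official MCP Python SDK (`FastMCP`) and the `duckdb` package; the tool name `run_sql`, the `max_rows` cap, and the read-only guard are illustrative choices, not necessarily how `mcp_server.py` is written.

```python
import os
import re

# Crude allow-list: a statement must start with SELECT or WITH.
_READ_ONLY = re.compile(r"^\s*(select|with)\b", re.IGNORECASE)


def is_read_only(sql: str) -> bool:
    """Allow only a single statement that starts with SELECT or WITH."""
    body = sql.strip().rstrip(";")
    return ";" not in body and bool(_READ_ONLY.match(body))


def build_server():
    """Create the MCP server; imports are lazy so the guard works standalone."""
    import duckdb
    from mcp.server.fastmcp import FastMCP

    db_path = os.environ["WILDCHAT_DB_PATH"]
    server = FastMCP("wildchat-analytics")

    @server.tool()
    def run_sql(query: str, max_rows: int = 50) -> list[dict]:
        """Run a read-only SQL query against the WildChat DuckDB file."""
        if not is_read_only(query):
            raise ValueError("only single SELECT/WITH statements are allowed")
        con = duckdb.connect(db_path, read_only=True)
        try:
            cur = con.execute(query)
            cols = [d[0] for d in cur.description]
            # Cap the rows so a result never floods the context window.
            return [dict(zip(cols, row)) for row in cur.fetchmany(max_rows)]
        finally:
            con.close()

    return server


# To serve over stdio: build_server().run()
```

Capping the returned rows matters here: the dataset is far larger than the context window, so the tool should return aggregates or small samples rather than raw dumps.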
MCP vs. RAG
| Feature | RAG | MCP |
|---|---|---|
| Returns | Text passages | Structured data (numbers, counts) |
| Accuracy | LLM estimates from text | Precise SQL results |
| Use Case | "What does the doc say?" | "What's the average/count/trend?" |
Example:
- RAG: Retrieves text, estimates "~50-60 Fellows"
- MCP: `SELECT COUNT(*) WHERE year=2020` → Exactly 57
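The MCP side of the example can be reproduced end to end with a toy table. This sketch uses Python's built-in `sqlite3` as a stand-in for DuckDB so it runs without extra dependencies; the `fellows` table, its columns, and all rows are synthetic and exist only to show that a count query returns an exact number rather than an estimate.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fellows (name TEXT, year INTEGER)")

# Synthetic rows: 57 for 2020 (mirroring the example) plus 40 for 2019.
rows = [(f"fellow_{i}", 2020) for i in range(57)]
rows += [(f"fellow_{i}", 2019) for i in range(57, 97)]
con.executemany("INSERT INTO fellows VALUES (?, ?)", rows)

(count,) = con.execute(
    "SELECT COUNT(*) FROM fellows WHERE year = ?", (2020,)
).fetchone()
print(count)  # prints 57
con.close()
```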
Setup:
- Run the setup script: `source setup.sh`
- Run the MCP server: `python mcp_server.py`
- Update the Claude Desktop config. If using a venv:
{
  "mcpServers": {
    "wildchat-analytics": {
      "command": "/Users/yourname/path/to/project/.venv/bin/python",
      "args": ["/Users/yourname/path/to/project/mcp_server.py"],
      "env": {
        "WILDCHAT_DB_PATH": "/path/to/dot.db"
      }
    }
  }
}
- Restart Claude Desktop and reopen it
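If the config file already lists other MCP servers, the new entry has to be merged in rather than pasted over them. A small sketch of that merge, assuming the config is a JSON object with a top-level `mcpServers` key as shown above; `add_server` is a hypothetical helper, and the paths are the same placeholders used in the config.

```python
import json


def add_server(config: dict, name: str, python_bin: str, script: str, db_path: str) -> dict:
    """Return a copy of the config with one MCP server entry added or replaced."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # keep existing servers
    servers[name] = {
        "command": python_bin,
        "args": [script],
        "env": {"WILDCHAT_DB_PATH": db_path},
    }
    merged["mcpServers"] = servers
    return merged


example = add_server(
    {},
    "wildchat-analytics",
    "/Users/yourname/path/to/project/.venv/bin/python",
    "/Users/yourname/path/to/project/mcp_server.py",
    "/path/to/dot.db",
)
print(json.dumps(example, indent=2))
```

Pointing `command` at the venv's interpreter means the server starts with the project's installed dependencies, without needing to activate the venv first.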