mcp interface over 1M row convo dataset to make exact queries through natural language rather than sql

Kyle-Zhou/wildchat-mcp

wildchat-mcp

Value Prop

MCP acts as a bridge between LLMs and large structured datasets. Unlike static notebooks, MCP enables dynamic, natural-language querying: the LLM invokes MCP tools to access and query data roughly 3,685× larger than Claude's context window.

WildChat Dataset: https://huggingface.co/datasets/allenai/WildChat-1M https://arxiv.org/abs/2405.01470

  • 1M conversations = 737M tokens
  • Claude's context limit = 200K tokens (ChatGPT = ~128K)
  • Dataset is 3,685× larger than Claude's context window
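
The ratio follows directly from the figures in the list above. A quick check in Python, using only those numbers (the ChatGPT limit is the approximate value quoted above):

```python
# Figures quoted in the list above.
dataset_tokens = 737_000_000   # WildChat-1M, about 737M tokens
claude_context = 200_000       # Claude's context limit
chatgpt_context = 128_000      # approximate ChatGPT limit

# How many context windows would be needed to hold the whole dataset.
claude_ratio = dataset_tokens / claude_context
chatgpt_ratio = dataset_tokens / chatgpt_context

print(f"vs. Claude:  {claude_ratio:,.0f}x")   # vs. Claude:  3,685x
print(f"vs. ChatGPT: {chatgpt_ratio:,.0f}x")
```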

The notebooks cover traditional, static data science: creating visualizations, computing distributions, and answering pre-defined questions. The MCP server adds dynamic interaction, where an LLM explores the dataset conversationally through on-demand queries.

DuckDB:

  1. One-time setup: load the parquet files into a DuckDB database file
  2. Notebook and MCP queries both run against the DuckDB file directly
  • Benefits:
    • Zero-config embedded database
    • Handles 1GB-100GB datasets efficiently
    • SQL interface without server overhead

Spark is designed for distributed clusters, for when data exceeds single-machine RAM. WildChat does not meet this threshold.
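
The one-time setup step could look roughly like the sketch below. It is not the repository's actual setup script: the table name `conversations`, the helper names, and the paths in the usage note are assumptions made for illustration. It relies on DuckDB's `read_parquet` table function and the `duckdb.connect` call from the `duckdb` Python package.

```python
from pathlib import Path


def build_load_sql(parquet_glob: str, table: str = "conversations") -> str:
    """Return SQL that materializes parquet files as a DuckDB table.

    The table name is a hypothetical default, not taken from the repo.
    """
    # Escape single quotes so the glob is a valid SQL string literal.
    safe_glob = parquet_glob.replace("'", "''")
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM read_parquet('{safe_glob}')"
    )


def load_parquet(db_path: str, parquet_glob: str, table: str = "conversations") -> int:
    """One-time setup: load parquet into a DuckDB file and return the row count."""
    import duckdb  # third-party dependency: pip install duckdb

    Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    con = duckdb.connect(db_path)  # creates the database file if it is missing
    try:
        con.execute(build_load_sql(parquet_glob, table))
        return con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    finally:
        con.close()
```

A call such as `load_parquet("data/wildchat.duckdb", "data/*.parquet")` (both paths hypothetical) would then produce the single file that notebooks and the MCP server query.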

| Without MCP | With MCP |
| --- | --- |
| LLMs hallucinate about unseen data | Tools execute actual database queries |
| Static notebooks = fixed questions only | Dynamic, ad-hoc exploration |
| Context window limits what's queryable | Access datasets 1000× larger than context |

MCP vs. RAG

| Feature | RAG | MCP |
| --- | --- | --- |
| Returns | Text passages | Structured data (numbers, counts) |
| Accuracy | LLM estimates from text | Precise SQL results |
| Use Case | "What does the doc say?" | "What's the average/count/trend?" |

Example:

  • RAG: Retrieves text, estimates "~50-60 Fellows"
  • MCP: SELECT COUNT(*) WHERE year=2020 → Exactly 57
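
To make the contrast concrete, here is a minimal sketch of a tool-style function that returns an exact count as structured data. It uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies. The `fellows` table, its rows, and the `count_rows` helper are all invented for this illustration and are not part of the repository.

```python
import sqlite3


def count_rows(con: sqlite3.Connection, table: str, column: str, value) -> dict:
    """Tool-style helper: run an exact COUNT and return structured data."""
    # Identifiers cannot be bound as parameters, so accept only simple names.
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} = ?"
    (count,) = con.execute(sql, (value,)).fetchone()
    return {"sql": sql, "count": count}


# Synthetic rows, invented purely for this illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fellows (name TEXT, year INTEGER)")
con.executemany(
    "INSERT INTO fellows VALUES (?, ?)",
    [("a", 2019), ("b", 2020), ("c", 2020), ("d", 2021)],
)

result = count_rows(con, "fellows", "year", 2020)
print(result["count"])  # 2
```

The LLM receives the number the database computed rather than estimating one from retrieved text.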

Setup

Load the dataset:

source setup.sh

Running/connecting MCP

  • Run the MCP server: python mcp_server.py
  • Update the Claude Desktop config. If using a venv:
  "mcpServers": {
    "wildchat-analytics": {
      "command": "/Users/yourname/path/to/project/.venv/bin/python",
      "args": ["/Users/yourname/path/to/project/mcp_server.py"],
      "env": {
        "WILDCHAT_DB_PATH": "/path/to/dot.db"
      }
    }
  }
  • Restart and reopen Claude Desktop
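
The `env` entry in the config passes `WILDCHAT_DB_PATH` to the server process. How `mcp_server.py` actually reads it is not shown here; one plausible sketch, where the fallback file name and the `resolve_db_path` helper are assumptions:

```python
import os
from pathlib import Path

# Hypothetical fallback; the repository may use a different default or none.
DEFAULT_DB_PATH = "wildchat.duckdb"


def resolve_db_path(env=None) -> Path:
    """Return the database path from WILDCHAT_DB_PATH, or the fallback."""
    env = os.environ if env is None else env
    raw = env.get("WILDCHAT_DB_PATH") or DEFAULT_DB_PATH
    return Path(raw).expanduser()


# Simulate the environment Claude Desktop would pass to the server.
print(resolve_db_path({"WILDCHAT_DB_PATH": "/path/to/dot.db"}))
```

Accepting the environment as a parameter keeps the lookup easy to test without modifying the real process environment.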
