
Echo

Echo is an AI-powered data analyst that lets you query CSV datasets using plain English. Upload any CSV file, ask questions about your data, and get instant SQL-backed answers with auto-generated charts, tables, and KPI cards — streamed live as Echo thinks.

Live at: https://echo.ryanxiao.com

Tech Stack

  • Frontend --- React, TypeScript, Vite, Tailwind CSS, Chart.js, Lucide icons
  • Backend --- Python, FastAPI, DuckDB (in-process analytics), Anthropic Claude (Sonnet + Haiku)
  • Deployment --- Docker (multi-stage), Google Cloud Run, Artifact Registry, Secret Manager

Features

  • SSE streaming --- Pipeline stages and explanation text stream live word-by-word with a typewriter effect
  • Multi-table datasets --- Upload multiple CSVs into one dataset via the upload modal; cross-table relationships are auto-detected
  • Session-scoped uploads --- Uploaded datasets are tied to the browser session and deleted when the tab closes
  • Schema viewer --- Expand the Context panel to see all tables and columns for the active dataset
  • CSV export --- Download any result table as a CSV file
  • Thumbs up/down feedback --- Rate each response; negative feedback is stored for quality tracking
  • Staggered result animations --- Charts, SQL, and follow-up suggestions fade in sequentially after a response
  • Context hints --- Add business annotations injected into every query (e.g. "price is in BRL not USD")
  • Preloaded demo dataset --- Olist Brazilian E-Commerce dataset available immediately without uploading

Query Pipeline

When a user submits a question, the backend processes it through a multi-stage pipeline streamed live via SSE.

Stage 1: Schema Resolution

The system loads the dataset's full schema from the SQLite metadata database --- column names, data types, semantic types (id, metric, date, categorical), and AI-generated column descriptions. Only tables relevant to the question are selected using keyword matching against table and column names.

Files: semantic_layer.py (get_schema_for_prompt), sql_generator.py (_select_relevant_tables)
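The keyword-matching idea can be sketched as simple overlap between question words and table/column names. This is a toy stand-in for `_select_relevant_tables`, not the project's exact matching logic:

```python
def select_relevant_tables(question, schema):
    """Pick tables whose name or columns overlap the question's keywords.

    `schema` maps table name -> list of column names (a stand-in for the
    SQLite metadata lookup described above).
    """
    words = {w.strip("?,.").lower() for w in question.split()}
    selected = []
    for table, columns in schema.items():
        terms = {table.lower()} | {c.lower() for c in columns}
        # substring match in either direction, skipping short stopwords
        if any(any(w in t or t in w for t in terms) for w in words if len(w) > 3):
            selected.append(table)
    return selected or list(schema)  # fall back to the full schema
```

Falling back to the full schema when nothing matches keeps vague questions answerable at the cost of a larger prompt.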

Stage 2: Conversation Context Detection

The system checks if the question is a follow-up to a prior turn. The last 3 conversation turns (question + SQL + result columns) are sent to Claude Haiku, which determines if the new question contains anaphoric references ("show me that by month"), comparative language ("instead of", "compared to"), or modification language ("now filter", "sort by").

If detected as a follow-up, prior context is injected into the SQL generation prompt so Claude can build on or modify previous queries.

Files: context_manager.py (is_follow_up, build_context_block)
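The injected context can be pictured as a plain-text rendering of the recent turns. The output shape below is hypothetical, chosen only to illustrate what `build_context_block` might hand to the SQL prompt:

```python
def build_context_block(turns):
    """Render recent turns (question + SQL + result columns) as prompt context."""
    lines = []
    for i, turn in enumerate(turns[-3:], 1):  # sliding window of 3 turns
        lines.append(f"Turn {i} question: {turn['question']}")
        lines.append(f"Turn {i} SQL: {turn['sql']}")
        lines.append(f"Turn {i} columns: {', '.join(turn['columns'])}")
    return "\n".join(lines)
```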

Stage 3: Ambiguity Detection

For non-follow-up questions, the question and a summary of available tables/columns are sent to Claude Haiku to determine if the question has multiple valid SQL interpretations. If ambiguous, the pipeline short-circuits and returns 2-3 clarifying options to the user instead of guessing.

Files: ambiguity.py (check_ambiguity)

Stage 4: Query Planning (Multi-Step)

Claude Haiku classifies whether the question requires multiple sequential SQL queries (e.g., "Compare this month's revenue to last month's"). If yes, Claude Sonnet decomposes it into 2-4 ordered steps, each producing a temp table that feeds into the next. Each step is individually generated and executed with self-healing retries.

If planning fails, the pipeline falls back to single-query mode.

Files: query_planner.py (should_decompose, build_plan, execute_plan)
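The temp-table chaining works against any DB-API connection; the sketch below uses sqlite3 purely for illustration (Echo runs these steps against DuckDB, with each step's SQL generated by the LLM):

```python
import sqlite3

def execute_plan(conn, steps):
    """Run ordered plan steps; step N's output becomes temp table step_N,
    which later steps can reference."""
    for i, sql in enumerate(steps, 1):
        conn.execute(f"CREATE TEMP TABLE step_{i} AS {sql}")
    return conn.execute(f"SELECT * FROM step_{len(steps)}").fetchall()
```

For example, a two-step plan where the second query reads the first step's temp table:

```python
conn = sqlite3.connect(":memory:")
rows = execute_plan(conn, ["SELECT 1 AS n", "SELECT n + 1 AS n FROM step_1"])
```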

Stage 5: SQL Generation (Streaming)

Claude Sonnet generates a DuckDB-compatible SQL query. The prompt includes the filtered schema with column descriptions, sample data rows from each table, detected table relationships (join hints with confidence levels), user-added business context annotations, and conversation history if applicable.

The explanation text streams token-by-token to the frontend via SSE as Claude generates it. The LLM returns a structured JSON response with the SQL, explanation, and a list of assumptions.

Files: sql_generator.py (generate_sql, generate_sql_streaming), query.py (stream_query)
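The wire format for that streaming is plain Server-Sent Events. A minimal frame helper looks like the following (the event name here is illustrative, not necessarily what Echo emits):

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Format one SSE frame; a FastAPI StreamingResponse with
    media_type='text/event-stream' yields frames shaped like this."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

Each frame ends with a blank line, which is how the browser's EventSource (or a custom fetch reader) knows the event is complete.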

Stage 6: Static Validation

Before execution, the generated SQL is parsed using sqlglot and validated against the schema. The validator checks that all referenced tables exist, all referenced columns exist in their respective tables, table aliases are resolved correctly, and SELECT aliases used in ORDER BY / GROUP BY are not flagged as missing columns. If validation fails, the error is collected and the pipeline loops back to Stage 5 with correction context.

Files: validator.py (validate_sql)
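A toy version of the table-existence check can be written with a regex; this is only a stand-in for the real validator, which parses with sqlglot and also resolves aliases, columns, and CTEs:

```python
import re

def validate_tables(sql, schema):
    """Check that tables named after FROM/JOIN exist in the schema.

    `schema` is a set of known table names. Real parsing is needed for
    quoted identifiers, subqueries, and aliases; a regex cannot handle those.
    """
    referenced = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)", sql, re.I)
    return [f"unknown table: {t}" for t in referenced if t not in schema]
```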

Stage 7: Execution

The validated SQL is executed against DuckDB via the session's in-memory connection (which attaches the dataset's persistent DuckDB file as read-only). Execution time is measured. Decimal values are cast to float for JSON serialization.

Files: executor.py (execute_sql)
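The Decimal-to-float step is a small but necessary shim, since Python's json module cannot serialize Decimal. A minimal sketch (the real executor also captures timing):

```python
from decimal import Decimal

def to_json_safe(rows):
    """Cast Decimal cells (DuckDB's exact numerics) to float for JSON output."""
    return [
        [float(v) if isinstance(v, Decimal) else v for v in row]
        for row in rows
    ]
```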

Stage 8: Self-Healing Loop

If validation or execution fails, the pipeline retries up to 3 times. On each retry, Claude Sonnet receives the failed SQL and all accumulated error messages, and generates a corrected query. The retry prompt appends error context to the original prompt so the LLM can learn from its mistakes.

Files: sql_generator.py (generate_sql_with_correction), query.py (stream_query loop)
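The shape of that loop can be sketched with the generation and execution steps passed in as callables (the real code threads these through Claude and DuckDB):

```python
def run_with_retries(generate, execute, max_retries=3):
    """Self-healing loop: each retry regenerates SQL with all accumulated errors."""
    errors = []
    for attempt in range(1 + max_retries):
        sql = generate(errors)  # errors list is empty on the first attempt
        try:
            return sql, execute(sql)
        except Exception as e:
            errors.append(str(e))
    raise RuntimeError("all retries exhausted: " + "; ".join(errors))
```

Accumulating every error message, rather than only the latest one, is what lets the correction prompt steer away from mistakes the LLM already made.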

Stage 9: Failure Clarification

If all retries are exhausted, instead of returning a generic error, Claude Haiku analyzes the question and the accumulated errors to generate 2-3 targeted clarifying questions that help the user rephrase their query. This converts hard failures into a conversational UX.

Files: failure_clarifier.py (generate_failure_clarification)

Stage 10: Follow-Up Generation

On success, Claude Haiku examines the original question, result columns, and the first 3 rows of data to suggest 3 follow-up questions. Each suggestion explores a different analytical angle: a filter, a comparison, a trend, or a drill-down.

Files: followup.py (generate_follow_ups)

Stage 11: Complexity Scoring

The final SQL is parsed with sqlglot to compute a complexity score based on table count, aggregation count, presence of subqueries/CTEs, and retry count. This is shown in the UI as a badge.

Files: query.py (_compute_complexity)
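A simplified scoring heuristic conveys the idea; this regex-based version is a stand-in for the sqlglot parse, and the weights are illustrative rather than the ones `_compute_complexity` uses:

```python
import re

def compute_complexity(sql, retry_count=0):
    """Score a query by table count, aggregation count, nesting, and retries."""
    s = sql.upper()
    tables = len(re.findall(r"\b(?:FROM|JOIN)\b", s))
    aggregations = len(re.findall(r"\b(?:SUM|AVG|COUNT|MIN|MAX)\s*\(", s))
    nesting = len(re.findall(r"\bWITH\b|\(\s*SELECT", s))  # CTEs and subqueries
    return tables + aggregations + 2 * nesting + retry_count
```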

Stage 12: Turn Persistence

The question, final SQL, and result column schema are saved to the conversation_turns table. A sliding window of the last 3 turns is maintained per session for follow-up detection in future queries.

Files: context_manager.py (save_turn)
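The sliding window can be sketched with an in-memory store; the real implementation persists turns to the conversation_turns table in SQLite:

```python
from collections import deque

class TurnStore:
    """Keep the last N turns per session for follow-up detection."""

    def __init__(self, window=3):
        self.window = window
        self.sessions = {}

    def save_turn(self, session_id, question, sql, columns):
        # deque(maxlen=...) silently drops the oldest turn past the window
        turns = self.sessions.setdefault(session_id, deque(maxlen=self.window))
        turns.append({"question": question, "sql": sql, "columns": columns})

    def recent(self, session_id):
        return list(self.sessions.get(session_id, []))
```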


Data Ingestion Pipeline

When a CSV is uploaded (or preloaded at startup), it goes through its own pipeline:

  1. CSV to DuckDB --- The CSV is loaded into a persistent DuckDB file using read_csv_auto with automatic type inference
  2. Type refinement --- VARCHAR columns are sampled to detect dates, booleans, and high-null-rate columns that DuckDB's auto-detection missed
  3. Schema registration --- Table and column metadata are stored in SQLite for fast schema lookups
  4. Semantic annotation --- Claude Haiku examines each table's schema + sample data and generates a natural language description and semantic type (id, foreign_key, date, metric, categorical, text, boolean) for every column
  5. Relationship detection --- Columns ending in _id are cross-matched across tables; value overlap is sampled to assign join confidence (high/medium/low) and join type (INNER/LEFT)

Multi-CSV upload: the upload modal supports selecting multiple CSV files at once. The first file creates the dataset; subsequent files are appended as additional tables to the same DuckDB file, with relationship detection running across all tables.
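Step 5 above can be sketched as pairwise overlap between sampled *_id columns; the thresholds here are illustrative, not the project's actual cutoffs, and the sketch omits join-type selection:

```python
def detect_relationships(tables):
    """Cross-match *_id columns across tables and score join confidence.

    `tables` maps table name -> {column name: set of sampled values}.
    """
    candidates = [
        (table, col, values)
        for table, columns in tables.items()
        for col, values in columns.items()
        if col.endswith("_id")
    ]
    relationships = []
    for i, (t1, c1, v1) in enumerate(candidates):
        for t2, c2, v2 in candidates[i + 1:]:
            if t1 == t2 or c1 != c2 or not v1:
                continue
            overlap = len(v1 & v2) / len(v1 | v2)  # Jaccard overlap of samples
            if overlap > 0:
                conf = "high" if overlap > 0.8 else "medium" if overlap > 0.3 else "low"
                relationships.append((t1, t2, c1, conf))
    return relationships
```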


Running Locally

Backend:

cd backend
python -m venv .venv
.venv\Scripts\pip install -r requirements.txt
.venv\Scripts\uvicorn app.main:app --reload

(The paths above are for Windows; on macOS/Linux use .venv/bin/pip and .venv/bin/uvicorn.)

Frontend:

cd frontend
npm install
npm run dev

Open http://localhost:5173

Deploying to GCP

gcloud builds submit --tag us-central1-docker.pkg.dev/PROJECT_ID/echo-repo/echo:latest .

gcloud run deploy echo \
  --image us-central1-docker.pkg.dev/PROJECT_ID/echo-repo/echo:latest \
  --platform managed --region us-central1 --port 8080 \
  --allow-unauthenticated \
  --set-secrets ANTHROPIC_API_KEY=ANTHROPIC_API_KEY:latest \
  --memory 2Gi --cpu 2 --timeout 300 --concurrency 80 \
  --min-instances 1 --max-instances 1

Environment Variables

  • ANTHROPIC_API_KEY --- Your Anthropic API key
  • DATA_DIR --- Path to preloaded CSV files (default: ../data)
  • PRELOADED_DIR --- Path to persistent DuckDB files (default: ../preloaded)
  • METADATA_DB --- Path to SQLite metadata database (default: ../metadata.db)
