
Echo

Echo is an AI-powered data analyst that lets you query CSV datasets using plain English. Upload any CSV file, ask questions about your data, and get instant SQL-backed answers with auto-generated charts, tables, and KPI cards — streamed live as Echo thinks.

Live at: https://echo.ryanxiao.com

Tech Stack

  • Frontend --- React, TypeScript, Vite, Tailwind CSS, Chart.js, Lucide icons
  • Backend --- Python, FastAPI, DuckDB (in-process analytics), Anthropic Claude (Sonnet + Haiku)
  • Deployment --- Docker (multi-stage), Google Cloud Run, Artifact Registry, Secret Manager

Features

  • SSE streaming --- Pipeline stages and explanation text stream live word-by-word with a typewriter effect
  • Multi-table datasets --- Upload multiple CSVs into one dataset via the upload modal; cross-table relationships are auto-detected
  • Session-scoped uploads --- Uploaded datasets are tied to the browser session and deleted when the tab closes
  • Schema viewer --- Expand the Context panel to see all tables and columns for the active dataset
  • CSV export --- Download any result table as a CSV file
  • Thumbs up/down feedback --- Rate each response; negative feedback is stored for quality tracking
  • Staggered result animations --- Charts, SQL, and follow-up suggestions fade in sequentially after a response
  • Context hints --- Add business annotations injected into every query (e.g. "price is in BRL not USD")
  • Preloaded demo dataset --- Olist Brazilian E-Commerce dataset available immediately without uploading

Query Pipeline

When a user submits a question, the backend processes it through a multi-stage pipeline streamed live via SSE.

Stage 1: Schema Resolution

The system loads the dataset's full schema from the SQLite metadata database --- column names, data types, semantic types (id, metric, date, categorical), and AI-generated column descriptions. Only tables relevant to the question are selected using keyword matching against table and column names.

Files: semantic_layer.py (get_schema_for_prompt), sql_generator.py (_select_relevant_tables)
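The keyword-matching idea can be sketched as simple overlap between question words and table/column names. This is a toy stand-in for `_select_relevant_tables`, not the project's exact matching logic:

```python
def select_relevant_tables(question, schema):
    """Pick tables whose name or columns overlap the question's keywords.

    `schema` maps table name -> list of column names (a stand-in for the
    SQLite metadata lookup described above).
    """
    words = {w.strip("?,.").lower() for w in question.split()}
    selected = []
    for table, columns in schema.items():
        terms = {table.lower()} | {c.lower() for c in columns}
        # substring match in either direction, skipping short stopwords
        if any(any(w in t or t in w for t in terms) for w in words if len(w) > 3):
            selected.append(table)
    return selected or list(schema)  # fall back to the full schema
```

Falling back to the full schema when nothing matches keeps vague questions answerable at the cost of a larger prompt.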

Stage 2: Conversation Context Detection

The system checks if the question is a follow-up to a prior turn. The last 3 conversation turns (question + SQL + result columns) are sent to Claude Haiku, which determines if the new question contains anaphoric references ("show me that by month"), comparative language ("instead of", "compared to"), or modification language ("now filter", "sort by").

If detected as a follow-up, prior context is injected into the SQL generation prompt so Claude can build on or modify previous queries.

Files: context_manager.py (is_follow_up, build_context_block)
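The injected context can be pictured as a plain-text rendering of the recent turns. The output shape below is hypothetical, chosen only to illustrate what `build_context_block` might hand to the SQL prompt:

```python
def build_context_block(turns):
    """Render recent turns (question + SQL + result columns) as prompt context."""
    lines = []
    for i, turn in enumerate(turns[-3:], 1):  # sliding window of 3 turns
        lines.append(f"Turn {i} question: {turn['question']}")
        lines.append(f"Turn {i} SQL: {turn['sql']}")
        lines.append(f"Turn {i} columns: {', '.join(turn['columns'])}")
    return "\n".join(lines)
```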

Stage 3: Ambiguity Detection

For non-follow-up questions, the question and a summary of available tables/columns are sent to Claude Haiku to determine if the question has multiple valid SQL interpretations. If ambiguous, the pipeline short-circuits and returns 2-3 clarifying options to the user instead of guessing.

Files: ambiguity.py (check_ambiguity)

Stage 4: Query Planning (Multi-Step)

Claude Haiku classifies whether the question requires multiple sequential SQL queries (e.g., "Compare this month's revenue to last month's"). If yes, Claude Sonnet decomposes it into 2-4 ordered steps, each producing a temp table that feeds into the next. Each step is individually generated and executed with self-healing retries.

If planning fails, the pipeline falls back to single-query mode.

Files: query_planner.py (should_decompose, build_plan, execute_plan)
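The temp-table chaining works against any DB-API connection; the sketch below uses sqlite3 purely for illustration (Echo runs these steps against DuckDB, with each step's SQL generated by the LLM):

```python
import sqlite3

def execute_plan(conn, steps):
    """Run ordered plan steps; step N's output becomes temp table step_N,
    which later steps can reference."""
    for i, sql in enumerate(steps, 1):
        conn.execute(f"CREATE TEMP TABLE step_{i} AS {sql}")
    return conn.execute(f"SELECT * FROM step_{len(steps)}").fetchall()
```

For example, a two-step plan where the second query reads the first step's temp table:

```python
conn = sqlite3.connect(":memory:")
rows = execute_plan(conn, ["SELECT 1 AS n", "SELECT n + 1 AS n FROM step_1"])
```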

Stage 5: SQL Generation (Streaming)

Claude Sonnet generates a DuckDB-compatible SQL query. The prompt includes the filtered schema with column descriptions, sample data rows from each table, detected table relationships (join hints with confidence levels), user-added business context annotations, and conversation history if applicable.

The explanation text streams token-by-token to the frontend via SSE as Claude generates it. The LLM returns a structured JSON response with the SQL, explanation, and a list of assumptions.

Files: sql_generator.py (generate_sql, generate_sql_streaming), query.py (stream_query)
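The wire format for that streaming is plain Server-Sent Events. A minimal frame helper looks like the following (the event name here is illustrative, not necessarily what Echo emits):

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Format one SSE frame; a FastAPI StreamingResponse with
    media_type='text/event-stream' yields frames shaped like this."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

Each frame ends with a blank line, which is how the browser's EventSource (or a custom fetch reader) knows the event is complete.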

Stage 6: Static Validation

Before execution, the generated SQL is parsed using sqlglot and validated against the schema. The validator checks that all referenced tables exist, all referenced columns exist in their respective tables, table aliases are resolved correctly, and SELECT aliases used in ORDER BY / GROUP BY are not flagged as missing columns. If validation fails, the error is collected and the pipeline loops back to Stage 5 with correction context.

Files: validator.py (validate_sql)
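A toy version of the table-existence check can be written with a regex; this is only a stand-in for the real validator, which parses with sqlglot and also resolves aliases, columns, and CTEs:

```python
import re

def validate_tables(sql, schema):
    """Check that tables named after FROM/JOIN exist in the schema.

    `schema` is a set of known table names. Real parsing is needed for
    quoted identifiers, subqueries, and aliases; a regex cannot handle those.
    """
    referenced = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)", sql, re.I)
    return [f"unknown table: {t}" for t in referenced if t not in schema]
```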

Stage 7: Execution

The validated SQL is executed against DuckDB via the session's in-memory connection (which attaches the dataset's persistent DuckDB file as read-only). Execution time is measured. Decimal values are cast to float for JSON serialization.

Files: executor.py (execute_sql)
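The Decimal-to-float step is a small but necessary shim, since Python's json module cannot serialize Decimal. A minimal sketch (the real executor also captures timing):

```python
from decimal import Decimal

def to_json_safe(rows):
    """Cast Decimal cells (DuckDB's exact numerics) to float for JSON output."""
    return [
        [float(v) if isinstance(v, Decimal) else v for v in row]
        for row in rows
    ]
```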

Stage 8: Self-Healing Loop

If validation or execution fails, the pipeline retries up to 3 times. On each retry, Claude Sonnet receives the failed SQL and all accumulated error messages, and generates a corrected query. The retry prompt appends error context to the original prompt so the LLM can learn from its mistakes.

Files: sql_generator.py (generate_sql_with_correction), query.py (stream_query loop)
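The shape of that loop can be sketched with the generation and execution steps passed in as callables (the real code threads these through Claude and DuckDB):

```python
def run_with_retries(generate, execute, max_retries=3):
    """Self-healing loop: each retry regenerates SQL with all accumulated errors."""
    errors = []
    for attempt in range(1 + max_retries):
        sql = generate(errors)  # errors list is empty on the first attempt
        try:
            return sql, execute(sql)
        except Exception as e:
            errors.append(str(e))
    raise RuntimeError("all retries exhausted: " + "; ".join(errors))
```

Accumulating every error message, rather than only the latest one, is what lets the correction prompt steer away from mistakes the LLM already made.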

Stage 9: Failure Clarification

If all retries are exhausted, instead of returning a generic error, Claude Haiku analyzes the question and the accumulated errors to generate 2-3 targeted clarifying questions that help the user rephrase their query. This converts hard failures into a conversational UX.

Files: failure_clarifier.py (generate_failure_clarification)

Stage 10: Follow-Up Generation

On success, Claude Haiku examines the original question, result columns, and the first 3 rows of data to suggest 3 follow-up questions. Each suggestion explores a different analytical angle: a filter, a comparison, a trend, or a drill-down.

Files: followup.py (generate_follow_ups)

Stage 11: Complexity Scoring

The final SQL is parsed with sqlglot to compute a complexity score based on table count, aggregation count, presence of subqueries/CTEs, and retry count. This is shown in the UI as a badge.

Files: query.py (_compute_complexity)
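A simplified scoring heuristic conveys the idea; this regex-based version is a stand-in for the sqlglot parse, and the weights are illustrative rather than the ones `_compute_complexity` uses:

```python
import re

def compute_complexity(sql, retry_count=0):
    """Score a query by table count, aggregation count, nesting, and retries."""
    s = sql.upper()
    tables = len(re.findall(r"\b(?:FROM|JOIN)\b", s))
    aggregations = len(re.findall(r"\b(?:SUM|AVG|COUNT|MIN|MAX)\s*\(", s))
    nesting = len(re.findall(r"\bWITH\b|\(\s*SELECT", s))  # CTEs and subqueries
    return tables + aggregations + 2 * nesting + retry_count
```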

Stage 12: Turn Persistence

The question, final SQL, and result column schema are saved to the conversation_turns table. A sliding window of the last 3 turns is maintained per session for follow-up detection in future queries.

Files: context_manager.py (save_turn)
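The sliding window can be sketched with an in-memory store; the real implementation persists turns to the conversation_turns table in SQLite:

```python
from collections import deque

class TurnStore:
    """Keep the last N turns per session for follow-up detection."""

    def __init__(self, window=3):
        self.window = window
        self.sessions = {}

    def save_turn(self, session_id, question, sql, columns):
        # deque(maxlen=...) silently drops the oldest turn past the window
        turns = self.sessions.setdefault(session_id, deque(maxlen=self.window))
        turns.append({"question": question, "sql": sql, "columns": columns})

    def recent(self, session_id):
        return list(self.sessions.get(session_id, []))
```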


Data Ingestion Pipeline

When a CSV is uploaded (or preloaded at startup), it goes through its own pipeline:

  1. CSV to DuckDB --- The CSV is loaded into a persistent DuckDB file using read_csv_auto with automatic type inference
  2. Type refinement --- VARCHAR columns are sampled to detect dates, booleans, and high-null-rate columns that DuckDB's auto-detection missed
  3. Schema registration --- Table and column metadata are stored in SQLite for fast schema lookups
  4. Semantic annotation --- Claude Haiku examines each table's schema + sample data and generates a natural language description and semantic type (id, foreign_key, date, metric, categorical, text, boolean) for every column
  5. Relationship detection --- Columns ending in _id are cross-matched across tables; value overlap is sampled to assign join confidence (high/medium/low) and join type (INNER/LEFT)

Multi-CSV upload: the upload modal supports selecting multiple CSV files at once. The first file creates the dataset; subsequent files are appended as additional tables to the same DuckDB file, with relationship detection running across all tables.
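Step 5 above can be sketched as pairwise overlap between sampled *_id columns; the thresholds here are illustrative, not the project's actual cutoffs, and the sketch omits join-type selection:

```python
def detect_relationships(tables):
    """Cross-match *_id columns across tables and score join confidence.

    `tables` maps table name -> {column name: set of sampled values}.
    """
    candidates = [
        (table, col, values)
        for table, columns in tables.items()
        for col, values in columns.items()
        if col.endswith("_id")
    ]
    relationships = []
    for i, (t1, c1, v1) in enumerate(candidates):
        for t2, c2, v2 in candidates[i + 1:]:
            if t1 == t2 or c1 != c2 or not v1:
                continue
            overlap = len(v1 & v2) / len(v1 | v2)  # Jaccard overlap of samples
            if overlap > 0:
                conf = "high" if overlap > 0.8 else "medium" if overlap > 0.3 else "low"
                relationships.append((t1, t2, c1, conf))
    return relationships
```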


Running Locally

Backend:

cd backend
python -m venv .venv
.venv\Scripts\pip install -r requirements.txt
.venv\Scripts\uvicorn app.main:app --reload

(The paths above are for Windows; on macOS/Linux use .venv/bin/pip and .venv/bin/uvicorn.)

Frontend:

cd frontend
npm install
npm run dev

Open http://localhost:5173

Deploying to GCP

gcloud builds submit --tag us-central1-docker.pkg.dev/PROJECT_ID/echo-repo/echo:latest .

gcloud run deploy echo \
  --image us-central1-docker.pkg.dev/PROJECT_ID/echo-repo/echo:latest \
  --platform managed --region us-central1 --port 8080 \
  --allow-unauthenticated \
  --set-secrets ANTHROPIC_API_KEY=ANTHROPIC_API_KEY:latest \
  --memory 2Gi --cpu 2 --timeout 300 --concurrency 80 \
  --min-instances 1 --max-instances 1

Environment Variables

  • ANTHROPIC_API_KEY --- Your Anthropic API key
  • DATA_DIR --- Path to preloaded CSV files (default: ../data)
  • PRELOADED_DIR --- Path to persistent DuckDB files (default: ../preloaded)
  • METADATA_DB --- Path to SQLite metadata database (default: ../metadata.db)
