Echo is an AI-powered data analyst that lets you query CSV datasets using plain English. Upload any CSV file, ask questions about your data, and get instant SQL-backed answers with auto-generated charts, tables, and KPI cards — streamed live as Echo thinks.
Live at: https://echo.ryanxiao.com
- Frontend --- React, TypeScript, Vite, Tailwind CSS, Chart.js, Lucide icons
- Backend --- Python, FastAPI, DuckDB (in-process analytics), Anthropic Claude (Sonnet + Haiku)
- Deployment --- Docker (multi-stage), Google Cloud Run, Artifact Registry, Secret Manager
- SSE streaming --- Pipeline stages and explanation text stream live word-by-word with a typewriter effect
- Multi-table datasets --- Upload multiple CSVs into one dataset via the upload modal; cross-table relationships are auto-detected
- Session-scoped uploads --- Uploaded datasets are tied to the browser session and deleted when the tab closes
- Schema viewer --- Expand the Context panel to see all tables and columns for the active dataset
- CSV export --- Download any result table as a CSV file
- Thumbs up/down feedback --- Rate each response; negative feedback is stored for quality tracking
- Staggered result animations --- Charts, SQL, and follow-up suggestions fade in sequentially after a response
- Context hints --- Add business annotations that get injected into every query prompt (e.g. "price is in BRL, not USD")
- Preloaded demo dataset --- Olist Brazilian E-Commerce dataset available immediately without uploading
When a user submits a question, the backend processes it through a multi-stage pipeline streamed live via SSE.
The system loads the dataset's full schema from the SQLite metadata database --- column names, data types, semantic types (id, metric, date, categorical), and AI-generated column descriptions. Only tables relevant to the question are selected using keyword matching against table and column names.
Files: semantic_layer.py (get_schema_for_prompt), sql_generator.py (_select_relevant_tables)
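The table-filtering idea can be sketched as simple keyword matching between the question and table/column names. This is an illustration of the approach, not the exact logic in `_select_relevant_tables`:

```python
import re

def select_relevant_tables(question: str, schema: dict[str, list[str]]) -> list[str]:
    """Pick tables whose name or column names share keywords with the question.

    `schema` maps table name -> column names (a simplified stand-in for the
    SQLite metadata lookup).
    """
    tokens = set(re.findall(r"[a-z]+", question.lower()))
    relevant = []
    for table, columns in schema.items():
        # Split snake_case identifiers into words before matching.
        words = set(re.findall(r"[a-z]+", table.lower()))
        for col in columns:
            words |= set(re.findall(r"[a-z]+", col.lower()))
        if tokens & words:
            relevant.append(table)
    # Fall back to all tables if nothing matched, so SQL generation
    # never receives an empty schema.
    return relevant or list(schema)
```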
The system checks if the question is a follow-up to a prior turn. The last 3 conversation turns (question + SQL + result columns) are sent to Claude Haiku, which determines if the new question contains anaphoric references ("show me that by month"), comparative language ("instead of", "compared to"), or modification language ("now filter", "sort by").
If detected as a follow-up, prior context is injected into the SQL generation prompt so Claude can build on or modify previous queries.
Files: context_manager.py (is_follow_up, build_context_block)
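When a follow-up is detected, the injected context might be assembled along these lines. This is a hedged sketch of what `build_context_block` could look like; the real prompt format may differ:

```python
def build_context_block(turns: list[dict]) -> str:
    """Format the last 3 conversation turns for injection into the SQL prompt.

    Each turn holds the user question, the SQL that answered it, and the
    result columns (hypothetical dict shape for illustration).
    """
    lines = ["Previous conversation:"]
    for i, turn in enumerate(turns[-3:], 1):
        lines.append(f"Turn {i} question: {turn['question']}")
        lines.append(f"Turn {i} SQL: {turn['sql']}")
        lines.append(f"Turn {i} result columns: {', '.join(turn['columns'])}")
    return "\n".join(lines)
```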
For non-follow-up questions, the question and a summary of available tables/columns are sent to Claude Haiku to determine if the question has multiple valid SQL interpretations. If ambiguous, the pipeline short-circuits and returns 2-3 clarifying options to the user instead of guessing.
Files: ambiguity.py (check_ambiguity)
Claude Haiku classifies whether the question requires multiple sequential SQL queries (e.g., "Compare this month's revenue to last month's"). If yes, Claude Sonnet decomposes it into 2-4 ordered steps, each producing a temp table that feeds into the next. Each step is individually generated and executed with self-healing retries.
If planning fails, the pipeline falls back to single-query mode.
Files: query_planner.py (should_decompose, build_plan, execute_plan)
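The execution half of the planner can be sketched as: materialize each intermediate step as a temp table that later steps can query, and return the final step's rows. A simplified sketch of `execute_plan` (sqlite3 stands in for DuckDB here so the example is dependency-free):

```python
import sqlite3

def execute_plan(con, steps):
    """Run a decomposed plan. `steps` is an ordered list of
    {"name": ..., "sql": ...} dicts (hypothetical shape); every step but
    the last feeds a temp table into the next."""
    for step in steps[:-1]:
        con.execute(f"DROP TABLE IF EXISTS {step['name']}")
        con.execute(f"CREATE TEMP TABLE {step['name']} AS {step['sql']}")
    # The final step reads the temp tables and produces the answer.
    return con.execute(steps[-1]["sql"]).fetchall()
```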
Claude Sonnet generates a DuckDB-compatible SQL query. The prompt includes the filtered schema with column descriptions, sample data rows from each table, detected table relationships (join hints with confidence levels), user-added business context annotations, and conversation history if applicable.
The explanation text streams token-by-token to the frontend via SSE as Claude generates it. The LLM returns a structured JSON response with the SQL, explanation, and a list of assumptions.
Files: sql_generator.py (generate_sql, generate_sql_streaming), query.py (stream_query)
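The streaming side boils down to framing each stage update and explanation token as a server-sent event. A simplified sketch of the framing (the real endpoint wraps a generator like this in FastAPI's StreamingResponse with media_type "text/event-stream"):

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Format one SSE frame: an event name plus a JSON payload,
    terminated by a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def stream_tokens(tokens):
    """Yield the explanation word-by-word as SSE frames, then a done frame.
    Event names here are illustrative, not the exact ones used in query.py."""
    for tok in tokens:
        yield sse_event("token", {"text": tok})
    yield sse_event("done", {})
```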
Before execution, the generated SQL is parsed using sqlglot and validated against the schema. The validator checks that all referenced tables exist, all referenced columns exist in their respective tables, table aliases are resolved correctly, and SELECT aliases used in ORDER BY / GROUP BY are not flagged as missing columns. If validation fails, the error is collected and the pipeline loops back to Stage 5 with correction context.
Files: validator.py (validate_sql)
The validated SQL is executed against DuckDB via the session's in-memory connection (which attaches the dataset's persistent DuckDB file as read-only). Execution time is measured. Decimal values are cast to float for JSON serialization.
Files: executor.py (execute_sql)
If validation or execution fails, the pipeline retries up to 3 times. On each retry, Claude Sonnet receives the failed SQL and all accumulated error messages, and generates a corrected query. The retry prompt appends error context to the original prompt so the LLM can learn from its mistakes.
Files: sql_generator.py (generate_sql_with_correction), query.py (stream_query loop)
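The self-healing loop can be sketched with the generator, validator, and executor as injected callables (hypothetical signatures standing in for the real sql_generator/validator/executor functions):

```python
def run_with_retries(question, generate, validate, execute, max_retries=3):
    """Generate-validate-execute with up to `max_retries` corrections.

    `generate(question, errors)` returns SQL, `validate(sql)` returns a
    list of problems, `execute(sql)` returns rows or raises.
    """
    errors = []
    for attempt in range(max_retries + 1):
        # Each (re)generation sees every error accumulated so far,
        # so the model can learn from its mistakes.
        sql = generate(question, errors)
        problems = validate(sql)
        if not problems:
            try:
                return execute(sql)
            except Exception as e:
                problems = [str(e)]
        errors.extend(problems)
    raise RuntimeError(f"exhausted retries: {errors}")
```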
If all retries are exhausted, instead of returning a generic error, Claude Haiku analyzes the question and the accumulated errors to generate 2-3 targeted clarifying questions that help the user rephrase their query. This converts hard failures into a conversational UX.
Files: failure_clarifier.py (generate_failure_clarification)
On success, Claude Haiku examines the original question, result columns, and the first 3 rows of data to suggest 3 follow-up questions. Each suggestion explores a different analytical angle: a filter, a comparison, a trend, or a drill-down.
Files: followup.py (generate_follow_ups)
The final SQL is parsed with sqlglot to compute a complexity score based on table count, aggregation count, presence of subqueries/CTEs, and retry count. This is shown in the UI as a badge.
Files: query.py (_compute_complexity)
The question, final SQL, and result column schema are saved to the conversation_turns table. A sliding window of the last 3 turns is maintained per session for follow-up detection in future queries.
Files: context_manager.py (save_turn)
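The sliding window can be maintained with an insert followed by a trim. A sketch of `save_turn` against a hypothetical conversation_turns schema (the real table layout may differ):

```python
import sqlite3

def save_turn(con, session_id, question, sql, columns, window=3):
    """Persist a turn, then delete everything but the newest `window`
    turns for this session."""
    con.execute(
        "INSERT INTO conversation_turns (session_id, question, sql, columns)"
        " VALUES (?, ?, ?, ?)",
        (session_id, question, sql, ",".join(columns)),
    )
    con.execute(
        """DELETE FROM conversation_turns
           WHERE session_id = ? AND rowid NOT IN (
               SELECT rowid FROM conversation_turns
               WHERE session_id = ? ORDER BY rowid DESC LIMIT ?)""",
        (session_id, session_id, window),
    )
    con.commit()
```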
When a CSV is uploaded (or preloaded at startup), it goes through its own pipeline:
- CSV to DuckDB --- The CSV is loaded into a persistent DuckDB file using read_csv_auto with automatic type inference
- Type refinement --- VARCHAR columns are sampled to detect dates, booleans, and high-null-rate columns that DuckDB's auto-detection missed
- Schema registration --- Table and column metadata are stored in SQLite for fast schema lookups
- Semantic annotation --- Claude Haiku examines each table's schema + sample data and generates a natural language description and semantic type (id, foreign_key, date, metric, categorical, text, boolean) for every column
- Relationship detection --- Columns ending in _id are cross-matched across tables; value overlap is sampled to assign join confidence (high/medium/low) and join type (INNER/LEFT)
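The relationship-detection step can be sketched as pairwise matching of `_id` columns plus overlap grading. The thresholds and output shape here are illustrative assumptions, not the exact production values:

```python
def detect_relationships(tables: dict[str, dict[str, set]]) -> list[dict]:
    """Cross-match *_id columns across tables and grade join confidence
    by sampled value overlap. `tables` maps table -> {column: sampled values}."""
    rels = []
    names = list(tables)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = {c for c in tables[a] if c.endswith("_id")} & set(tables[b])
            for col in shared:
                va, vb = tables[a][col], tables[b][col]
                overlap = len(va & vb) / max(len(va | vb), 1)
                confidence = "high" if overlap > 0.9 else "medium" if overlap > 0.5 else "low"
                rels.append({
                    "left": a, "right": b, "column": col,
                    "confidence": confidence,
                    # Near-complete overlap suggests INNER is safe;
                    # otherwise prefer LEFT to avoid dropping rows.
                    "join": "INNER" if confidence == "high" else "LEFT",
                })
    return rels
```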
Multi-CSV upload: the upload modal supports selecting multiple CSV files at once. The first file creates the dataset; subsequent files are appended as additional tables to the same DuckDB file, with relationship detection running across all tables.
Backend:

```shell
cd backend
python -m venv .venv
.venv\Scripts\pip install -r requirements.txt
.venv\Scripts\uvicorn app.main:app --reload
```
Frontend:

```shell
cd frontend
npm install
npm run dev
```
```shell
gcloud builds submit --tag us-central1-docker.pkg.dev/PROJECT_ID/echo-repo/echo:latest .

gcloud run deploy echo \
  --image us-central1-docker.pkg.dev/PROJECT_ID/echo-repo/echo:latest \
  --platform managed --region us-central1 --port 8080 \
  --allow-unauthenticated \
  --set-secrets ANTHROPIC_API_KEY=ANTHROPIC_API_KEY:latest \
  --memory 2Gi --cpu 2 --timeout 300 --concurrency 80 \
  --min-instances 1 --max-instances 1
```
- ANTHROPIC_API_KEY --- Your Anthropic API key
- DATA_DIR --- Path to preloaded CSV files (default: ../data)
- PRELOADED_DIR --- Path to persistent DuckDB files (default: ../preloaded)
- METADATA_DB --- Path to SQLite metadata database (default: ../metadata.db)