An intelligent chatbot that answers computational queries across Olist's 9 relational CSV datasets using natural language. Built with ReAct agent architecture, DuckDB, and Gemini AI.
- Natural Language Queries: Ask questions in plain English about the Olist e-commerce data
- Multi-table Analytics: Performs joins, filters, aggregations across all 9 CSV files
- Interactive Charts: Generates visualizations (line, bar, histogram, pie, scatter) on request
- Session Memory: Remembers context within the same chat session for follow-up queries
- ReAct Agent Loop: Intelligent planning and execution with validation (max 4 steps)
- Security Guardrails: SQL injection prevention and query validation
- Modern Web UI: Beautiful, responsive chat interface with TailwindCSS
- Orchestrator: ReAct-style agent loop (interpret → plan → validate → execute → verify → respond)
- LLM Brain: Gemini Flash 2.5 for intent parsing and response generation
- Tools: Schema, Query, Chart, and Memory tools with function calling
- Data Layer: DuckDB for efficient relational operations on CSV files
- Interface: FastAPI web application with real-time chat UI
- Guardrails: Security validation, performance limits, and error handling
- Python 3.8 or higher
- Gemini API key
# Install dependencies
pip install -r requirements.txt- Visit Google AI Studio
- Create a free API key
- Create a .env file and add
set GEMINI_API_KEY=your-api-key-here# Automated startup with checks
python start.py
# Or manual startup
python main.pyNavigate to http://localhost:8000
The Olist dataset contains 100k orders from 2016-2018 across Brazil with:
orders ⟷ order_items ⟷ products ⟷ category_translation
↓ ↓
customers sellers
↓ ↓
geolocation ←→ geolocation
orders ⟷ payments
orders ⟷ reviews
- Revenue Analysis: Total sales, average order value, revenue by state/category
- Customer Insights: Geographic distribution, purchase patterns, loyalty
- Product Performance: Top categories, pricing analysis, inventory trends
- Seller Analytics: Performance metrics, geographic coverage, ratings
- Operational KPIs: Delivery times, payment methods, review scores
- "What's the total revenue for 2017?"
- "How many customers are from São Paulo?"
- "Show me the average delivery time by state"
- "Which product categories have the highest profit margins?"
- "Create a chart showing monthly revenue trends"
- "Compare average review scores between different payment methods"
- "Find the top 10 sellers by total sales volume"
- "Show revenue by customer state with delivery performance"
- "Analyze the relationship between product price and review scores"
- "Which geographic regions have the fastest delivery times?"
- SQL Injection Prevention: Query validation and sanitization
- Allowed Operations: Only SELECT and WITH statements permitted
- Resource Limits: Row limits, query complexity scoring
- Input Sanitization: Dangerous character removal
- Rate Limiting: Configurable query limits per session
set the TEST_FLAG in .env file
set TEST_FLAG=trueKey settings in config.py:
MAX_QUERY_ROWS = 1000: Maximum rows returnedMAX_CHART_POINTS = 500: Chart data point limitMAX_SESSION_HISTORY = 10: Queries remembered per sessionSUPPORTED_CHART_TYPES: Available visualization types
- Use specific filters to reduce result size
- Limit date ranges for time-based queries
- Use LIMIT clause for large datasets
- Group by categorical columns for aggregations
- Olist: For providing the Brazilian e-commerce dataset
- Google: For Gemini AI API
- DuckDB: For fast analytical processing
- Plotly: For interactive visualizations