First multi-table Text-to-Action system powered by small language models (3B params)
Query Excel/CSV files with good accuracy using only 3B models—no GPT-4, no database setup needed.
Current Limitations of Text-to-SQL Systems:
- ❌ Require large models (20B+ parameters like GPT-4, Claude)
- ❌ Need complex database setup (PostgreSQL, MySQL)
- ❌ Single-table only or manual JOIN configuration
- ❌ Expensive API costs for production use
Our Solution:
- ✅ 3B model achieves good accuracy on multi-table queries
- ✅ Zero database setup: Point directly to Excel/CSV files
- ✅ Auto-JOIN: Automatically merges related tables via metadata
- ✅ Domain-agnostic: Works across e-commerce, travel, finance, etc.
- ✅ Efficient: 7GB VRAM (full precision) or 4GB (4-bit quantization)
- ✅ 10-minute setup: From installation to first query in under 10 minutes
```bash
pip install torch transformers pandas openpyxl bitsandbytes accelerate networkx
# or
pip install -r requirements.txt
```

Hardware Requirements:
- Standard Mode: 7GB VRAM (RTX 2070, RTX 3060, T4)
- 4-bit Mode: 4GB VRAM (GTX 1650, RTX 2060) with `--use_4bit`
- CPU Mode: Works but slower (~30-60s per query)
Expected Performance:
- Query Response Time: 5-10 seconds per query (cold start)
- Cached Queries: < 1 second (after first run)
- Model Loading: ~30 seconds (one-time startup cost)
Use our pre-configured datasets from Hugging Face:
# Clone the dataset repository
git clone https://huggingface.co/datasets/darren0301/schema-action
# Or download manually from:
# https://huggingface.co/datasets/darren0301/schema-action

Example 1: Adventure Works (E-commerce)
python schema_action_querying.py \
--data_directory ./schema-action/adventureworks \
--metadata_path ./configs/adventureworks/metadata.json \
--join_config_path ./configs/adventureworks/join_config.json

For 4GB VRAM GPUs:
python schema_action_querying.py \
--data_directory ./schema-action/adventureworks \
--metadata_path ./configs/adventureworks/metadata.json \
--join_config_path ./configs/adventureworks/join_config.json \
--use_4bit  # Reduces memory to 4GB

Example 2: Travel Booking
python schema_action_querying.py \
--data_directory ./schema-action/travel \
--metadata_path ./configs/travel/metadata.json \
--join_config_path ./configs/travel/join_config.json

You: How many customers are homeowners?
Assistant: There are 8,658 customers who are homeowners.
(Time: 5.2s | Confidence: 0.95)
You: Who are the top 5 customers by annual income?
Assistant: Here are the top 5 customers by annual income:
1. Jon Yang - $170,000
2. Eugene Huang - $170,000
3. Ruben Torres - $160,000
4. Christy Zhu - $160,000
5. Elizabeth Johnson - $150,000
(Time: 6.8s | Confidence: 0.92)
You: Is there an order from customer 11000 for product 344?
Assistant: Yes, customer 11000 has placed an order for product 344.
(Time: 7.1s | Confidence: 0.98)
You: What is the age of Wilma Mcinnis?
Assistant: Wilma Mcinnis is 37 years old.
(Time: 4.9s | Confidence: 0.97)
You: Which 10 flights have the shortest duration?
Assistant: Here are the 10 flights with the shortest duration:
1. São Paulo (SP) to Brasilia (DF) - 0.92 hours
2. Recife (PE) to Salvador (BH) - 1.08 hours
...
(Time: 8.3s | Confidence: 0.93)
You: How many flights were 'economic' class?
Assistant: There were 1,247 economic class flights.
(Time: 5.6s | Confidence: 0.96)
See examples/example_queries.md for 40+ more examples.
Unlike traditional querying systems that use vector search, we employ a code-first agentic architecture:
┌─────────────────────────────────────────────────────────────┐
│ User Query: "How many customers are homeowners?" │
└─────────────────────┬───────────────────────────────────────┘
│
┌─────────────▼─────────────┐
│ Stage 1: Table Selector │ ← 3B Model (semantic matching)
│ Selected: customers.csv │
└─────────────┬─────────────┘
│
┌─────────────▼──────────────┐
│ Stage 2A: Column Identifier│ ← 3B Model (schema analysis)
│ Filter: HomeOwner │
└─────────────┬──────────────┘
│
┌─────────────▼──────────────┐
│ Stage 2B: Intent Classifier│ ← 3B Model (task routing)
│ Task: COUNT │
└─────────────┬──────────────┘
│
┌─────────────▼──────────────┐
│ Stage 3: Value Normalizer │ ← 3B Model + Code
│ Normalized: HomeOwner='Y' │
└─────────────┬──────────────┘
│
┌─────────────▼──────────────┐
│ Stage 4: Execution Engine │ ← Pure Code (pandas operations)
│ Count: 8,658 │
└─────────────┬──────────────┘
│
┌─────────────▼──────────────┐
│ Stage 5: Response Synthesis│ ← 3B Model (natural language)
│ "There are 8,658..." │
└────────────────────────────┘
- Task Specialization: Each 3B model handles ONE focused micro-task (e.g., "Is this a COUNT or FILTER?")
- Code-First Determinism: AI handles semantic understanding; code ensures precise execution
- Dynamic Planning: Schema can be logically transformed based on query complexity
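The five stages above can be sketched in plain Python for the homeowner example. Here the three model decisions (table, column, intent, value) are hardcoded for illustration; in the real pipeline each is a focused 3B-model call, and only Stage 4 runs as deterministic code:

```python
import pandas as pd

# Toy table standing in for customers.csv
customers = pd.DataFrame({
    "CustomerKey": [1, 2, 3, 4],
    "HomeOwner": ["Y", "N", "Y", "Y"],
})

# Stage 1 (3B model): select the table for "How many customers are homeowners?"
table = customers               # selected: customers.csv
# Stage 2A (3B model): identify the filter column
column = "HomeOwner"
# Stage 2B (3B model): classify the task
task = "COUNT"
# Stage 3 (3B model + code): normalize "homeowners" to the stored value
value = "Y"
# Stage 4 (pure code): deterministic pandas execution, no hallucination possible
count = int((table[column] == value).sum())
# Stage 5 (3B model): phrase the numeric result
answer = f"There are {count} customers who are homeowners."
print(answer)  # → There are 3 customers who are homeowners.
```

Because only semantic decisions touch the model, the arithmetic itself is always exact.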
Organize your CSV/Excel files in a folder:
my_data/
├── customers.csv
├── orders_2023.csv
└── orders_2024.csv
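Loading such a folder is straightforward with pandas; the sketch below uses a hypothetical `load_tables` helper (not the project's actual loader) to show the idea:

```python
from pathlib import Path
import pandas as pd

def load_tables(data_directory: str) -> dict[str, pd.DataFrame]:
    """Load every CSV/Excel file in a folder, keyed by filename."""
    tables = {}
    for path in sorted(Path(data_directory).iterdir()):
        if path.suffix == ".csv":
            tables[path.name] = pd.read_csv(path)
        elif path.suffix in (".xlsx", ".xls"):
            tables[path.name] = pd.read_excel(path)  # requires openpyxl
    return tables
```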
🎯 The Easy Way: Let AI Do the Work!
Instead of manually writing JSON, simply:
- Copy your Excel/CSV column headers
- Copy 2-3 sample rows of data
- Paste into ChatGPT/Claude with this prompt:
I need to create configuration files for the Schema-Action-Querying system.
My data files:
- customers.csv
- orders_2023.csv
- orders_2024.csv
Here are the columns and sample data:
[CUSTOMERS.CSV]
Columns: customer_id, name, email, city, age
Sample data:
1, John Smith, john@email.com, New York, 35
2, Jane Doe, jane@email.com, Los Angeles, 28
[ORDERS_2023.CSV]
Columns: order_id, customer_id, order_date, total_amount
Sample data:
101, 1, 2023-05-15, 150.00
102, 2, 2023-06-20, 89.50
Please generate:
1. metadata.json (describing the structure)
2. join_config.json (defining the relationships)
Use these templates as reference:
[PASTE THE TEMPLATES BELOW]
📋 Template 1: metadata.json
{
"files": [
{
"name": "customers.csv",
"description": "Customer demographic information",
"primary_key": "customer_id",
"foreign_keys": [],
"columns": {
"customer_id": "Unique customer identifier",
"name": "Customer full name",
"email": "Contact email address",
"city": "Customer location",
"age": "Customer age in years"
},
"keywords": ["customer", "client", "user", "demographic", "contact"]
},
{
"name": "orders_2023.csv",
"description": "Sales orders for year 2023",
"primary_key": "order_id",
"foreign_keys": ["customer_id"],
"time_range": "2023",
"columns": {
"order_id": "Unique order identifier",
"customer_id": "Customer who placed order (links to customers.csv)",
"order_date": "Date of purchase (YYYY-MM-DD format)",
"total_amount": "Order total in dollars"
},
"keywords": ["order", "sales", "purchase", "2023", "transaction"]
}
]
}

📋 Template 2: join_config.json
{
"join_relationships": [
{
"table1": "orders_2023.csv",
"table2": "customers.csv",
"left_key": "customer_id",
"right_key": "customer_id",
"join_type": "inner",
"description": "Link orders to customer information"
},
{
"table1": "orders_2024.csv",
"table2": "customers.csv",
"left_key": "customer_id",
"right_key": "customer_id",
"join_type": "inner",
"description": "Link 2024 orders to customer information"
}
]
}

⚙️ Configuration Rules:
For metadata.json:
"name": Must EXACTLY match your CSV/Excel filename"description": Brief description of what data this file contains"primary_key": The unique identifier column (like ID)"foreign_keys": Columns that link to other tables"columns": Each column with a clear description"keywords": Words that help AI select the right table
For join_config.json:
"table1"/"table2": The two files to join"left_key"/"right_key": The matching columns"join_type": Usually "inner" (keep matching rows only)- Use "left" to keep all rows from table1
- Use "outer" to keep all rows from both tables
python schema_action_querying.py \
--data_directory ./my_data \
--metadata_path ./my_metadata.json \
--join_config_path ./my_join_config.json

🎉 That's it! You're ready to query your data in natural language!
This system works across any tabular domain with zero code changes:
| Domain | Tested | Example Query |
|---|---|---|
| E-commerce | ✅ | "Top 10 customers by revenue in Q4" |
| Travel | ✅ | "How many first-class flights to Miami?" |
| Finance | 🔄 | "Transactions over $10k in January" |
| Healthcare | 🔄 | "Patients with diagnosis code X in 2024" |
| HR | 🔄 | "Employees hired after 2020 in Engineering" |
Your domain? Just provide CSV files + AI-generated config!
The system supports any instruction-tuned LLM ≥3B parameters. Edit the model path in schema_action_querying.py:
# Line ~1420 - Model initialization
self.tokenizer_3b = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
self.model_3b = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-3B-Instruct",
quantization_config=quant_config,
...
)

Tested Alternative Models:

| Model | Size | VRAM | Performance | Notes |
|---|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 3B | 7GB (4GB w/ 4-bit) | ⭐⭐⭐⭐⭐ | Recommended: best accuracy |
| Qwen/Qwen2.5-3B-Instruct | 3B | 7GB (4GB w/ 4-bit) | ⭐⭐⭐⭐ | Faster inference, slightly lower accuracy |
| microsoft/Phi-3-mini-4k-instruct | 3.8B | 8GB (4GB w/ 4-bit) | ⭐⭐⭐⭐ | Good alternative, longer context |
| google/gemma-2b-it | 2B | 5GB (3GB w/ 4-bit) | ⭐⭐⭐ | Lower memory but reduced accuracy |
Performance Notes:
- ✅ 3B+ models: good accuracy (recommended)
- ⚠️ 2B models: low accuracy (acceptable for simple queries)
- ❌ <2B models: Not recommended
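The `--use_4bit` flag presumably corresponds to a bitsandbytes configuration along these lines. This is a sketch of how the 7GB → 4GB reduction is typically achieved with `transformers`, not the exact code in schema_action_querying.py:

```python
import torch
from transformers import BitsAndBytesConfig

# NF4 4-bit weights with fp16 compute: roughly halves VRAM for a 3B model
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
# Pass quantization_config=quant_config to
# AutoModelForCausalLM.from_pretrained(...) as shown above.
```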
| Query Type | Example | Response Time | Status |
|---|---|---|---|
| COUNT | "How many customers are married?" | 5-7s | ✅ Supported |
| FILTER | "Show orders from customer 12345" | 6-8s | ✅ Supported |
| SORT | "Top 10 customers by income" | 7-10s | ✅ Supported |
| Multi-table JOIN | "Orders from 2015-2017 by customer age" | 8-12s | ✅ Supported |
| AGGREGATE | "Average income by occupation" | - | 🚧 Planned v0.2 |
| GROUP BY | "Total sales by product category" | - | 🚧 Planned v0.2 |
| Approach | Model Size | Accuracy | Setup Time |
|---|---|---|---|
| GPT-4 Text-to-SQL | ~1.7T params (estimated) | 85% | Database required |
| LlamaIndex (70B) | 70B params | 82% | Database required |
| Schema-Action-Querying | 3B params | 80%+ | 10 minutes |
Why Small Models Work:
- Task decomposition: Each 3B model solves ONE simple problem
- Code-constrained execution: Deterministic pandas operations (no hallucination)
- Result: near-GPT-4 accuracy at an estimated ~560× smaller size
| Metric | Standard Mode | 4-bit Mode | CPU Mode |
|---|---|---|---|
| VRAM Required | 7GB | 4GB | 0GB (RAM only) |
| Model Load Time | ~30s | ~30s | ~60s |
| First Query | 5-10s | 6-12s | 30-60s |
| Cached Query | < 1s | < 1s | 2-3s |
| Recommended GPU | RTX 3060, T4 | GTX 1650, RTX 2060 | N/A |
All AI decisions are cached automatically:
# First query (processes everything)
"How many customers are homeowners?" → 8.2s
# Second identical query (cache hit)
"How many customers are homeowners?" → 0.3s # 27× faster!
# Semantically similar query (cache hit)
"How many customers own homes?" → 0.3s # Smart matching!For date range queries, the system automatically splits date columns:
User: "How many orders in October 2019?"
System automatically creates:
OrderDate → OrderDate_year, OrderDate_month, OrderDate_day
Precise execution:
df[(df['OrderDate_year'] == 2019) & (df['OrderDate_month'] == 10)]
No fuzzy string matching—exact type-safe filtering!
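The split-and-filter step shown above can be reproduced with plain pandas (a sketch of the idea, not the project's exact code):

```python
import pandas as pd

df = pd.DataFrame({"OrderDate": ["2019-10-03", "2019-11-20", "2018-10-07"]})

# Split the date column into typed integer components
parsed = pd.to_datetime(df["OrderDate"])
df["OrderDate_year"] = parsed.dt.year
df["OrderDate_month"] = parsed.dt.month
df["OrderDate_day"] = parsed.dt.day

# "How many orders in October 2019?" becomes an exact integer filter
october_2019 = df[(df["OrderDate_year"] == 2019) & (df["OrderDate_month"] == 10)]
print(len(october_2019))  # → 1
```

Integer comparison on the derived columns avoids string-format ambiguity entirely.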
The system intelligently handles different data formats:
```python
# Currency:    "$150,000.50" → 150000.50
# Percentages: "85%"         → 0.85
# Dates:       "01-15-2017"  → datetime(2017, 1, 15)
# Booleans:    "Y" / "N"     → True / False
# Missing:     "" / NaN      → [MISSING] (explicit handling)
```

<details>
<summary><b>Q: CUDA out of memory error?</b></summary>

```bash
# Solution 1: Enable 4-bit quantization (7GB → 4GB)
python schema_action_querying.py --use_4bit ...

# Solution 2: Run on CPU instead
python schema_action_querying.py --device cpu ...

# Solution 3: Reduce the cache size
python schema_action_querying.py --cache_size 100 ...
```
</details>
<details>
<summary><b>Q: Model downloads are slow or failing?</b></summary>
```bash
# For users in China/Asia, use HF mirror
export HF_ENDPOINT=https://hf-mirror.com
python schema_action_querying.py ...
# Or download model manually first:
# 1. Go to https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
# 2. Download all files to ./local_models/Llama-3.2-3B-Instruct/
# 3. Edit line ~1420 in schema_action_querying.py to use the local path
```

</details>

Q: Wrong results for date queries?
Check your metadata.json has proper date column descriptions:
"OrderDate": "Date when order was placed (YYYY-MM-DD format)",
"BirthDate": "Customer date of birth (MM-DD-YYYY format)"Ensure consistent date format in your CSV:
- Good: `2017-01-15`, `2017-02-20`, `2017-03-10`
- Bad: a mix of `01/15/2017`, `2017-02-20`, `March 10, 2017`
Use explicit date format in queries:
- Good: "Show orders on 2017-01-15"
- Avoid: "Show orders on Jan 15, 2017" (ambiguous)
Q: System can't find my tables?
Common issues:
- Filename mismatch: check that the `"name"` in metadata.json EXACTLY matches the CSV filename:
  - ✅ `"name": "customers.csv"` (if the file is customers.csv)
  - ❌ `"name": "Customers.csv"` (case sensitive!)
  - ❌ `"name": "customers"` (missing .csv extension)
- Missing keywords: add relevant search terms, e.g. `"keywords": ["customer", "client", "user", "buyer", "demographic"]`
- Debug mode: run with the debug flag to see the selection process: `python schema_action_querying.py --debug ...`

Q: Queries are too slow (> 15s)?
Performance tips:
- First query is always slower (~10s) - subsequent queries use cache (< 1s)
- Enable 4-bit mode if using an older GPU: `python schema_action_querying.py --use_4bit ...`
- Close other GPU programs (browsers, games, etc.)
- Check GPU utilization: `nvidia-smi` (should show the model loaded and ~90%+ GPU usage during a query)

More issues? Open a GitHub issue with:
- Your query
- Expected vs actual output
- Your metadata.json structure (redacted if sensitive)
- GPU model and VRAM
- Core COUNT/FILTER/SORT operations
- Multi-table auto-join
- Dynamic date schema transformation
- LRU caching for repeated queries
- v0.2.0 (Q1 2025)
  - AGGREGATE operations (SUM, AVG, GROUP BY)
  - PostgreSQL/MySQL connector
  - Streaming responses for large results
  - Web UI (Gradio interface)
- v0.3.0 (Q2 2025)
  - Multi-hop reasoning (e.g., "customers who bought X but not Y")
  - Explain mode (visualize the decision process)
  - REST API server mode
  - Multi-language support (Mandarin, Spanish)
Contributions welcome! Priority areas:
High Priority:
- Support for SQL databases (PostgreSQL, MySQL, SQLite)
- AGGREGATE/GROUP BY operations
- Gradio Web UI with result visualization
Medium Priority:
- Additional model backends (Qwen, Mistral, Gemma)
- Performance benchmarking suite
- Docker container for easy deployment
Low Priority:
- Multi-language prompts
- Voice input support
- Export results to PDF/Excel
How to contribute:
- Fork the repo
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
If this work helps your research or project, please cite:
@software{schema_action_querying2024,
author = {Darren Chai Xin Lun},
title = {Schema-Action-Querying: Multi-Table Text-to-Action with Small Language Models},
year = {2024},
publisher = {GitHub},
url = {https://github.com/darrencxl0301/Schema-Action-Querying},
note = {Achieves good accuracy on multi-table queries with 3B models via code-first action planning}
}

Related Work:
- If you use the Adventure Works dataset: Microsoft Adventure Works
- If you use Llama 3.2: Meta Llama 3.2
Apache License 2.0 - see LICENSE file for details.
Key permissions:
- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
Conditions:
- ℹ️ Include license and copyright notice
- ℹ️ State changes made to the code
- Meta AI for Llama 3.2 3B model
- Hugging Face for model hosting and dataset platform
- Microsoft for Adventure Works sample dataset
- The open-source community for feedback and contributions
Author: Darren Chai Xin Lun
Email: ddcxl0301@gmail.com
GitHub: @darrencxl0301
Questions?
Built with ❤️ for democratizing AI access to structured data