# NLP-to-SQL Streamlit Agent

A Streamlit app that converts natural-language queries to SQL using OpenAI, validates them with sqlglot, and executes them against configured databases. It includes three orchestration modes: a direct pipeline, a LangChain multi-tool agent, and a full agentic pipeline with step tracing and reasoning logs.

## Overview

Features:

- 🔗 Connect to multiple databases via connection strings (PostgreSQL, MySQL, SQLite)
- 🔍 Dynamic schema introspection and semantic model generation (with table samples)
- 🧠 NLP→SQL generation via OpenAI with dialect-aware validation
- ✅ Automatic retry (up to 5 times) to fix invalid SQL
- 🛡️ SQL policy verification (read-only SELECT/DESCRIBE/EXPLAIN/SHOW only)
- 📊 Execute queries and summarize results (detailed stats or aggregations)
- 🤖 Three orchestration modes:
  - **Direct Mode**: simple generate → validate → execute flow
  - **LangChain Agent**: multi-tool orchestrator using LangChain
  - **Full Agentic Pipeline**: complete step tracing, reasoning logs, and insights

## Files

| File | Purpose |
| --- | --- |
| `app.py` | Streamlit UI with database config, semantic model builder, orchestration mode selector |
| `db.py` | DB connection, introspection, schema sampling, and query execution helpers |
| `semantic.py` | Builds the semantic model enriched with table samples |
| `nlp_sql.py` | OpenAI SQL generation with sqlglot validation and retry logic |
| `agent.py` | SQL verifier: enforces the read-only policy, rejects CRUD/DDL |
| `insights.py` | Data summarization and narrative insight generation via OpenAI |
| `agent_chain.py` | LangChain-based tool orchestrator |
| `agentic_pipeline.py` | Full end-to-end pipeline with step tracing and reasoning |
| `test_agent.py` | Unit tests for agent functions (pytest) |
| `setup_sample_db.py` | Creates a sample SQLite database for testing |
| `dry_run.py` | Dry-run example without Streamlit |
| `requirements.txt` | Python dependencies |

## Quick Start

### 1. Setup

```bash
# Create a virtualenv
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Environment Variables

```bash
# Required: OpenAI API key
export OPENAI_API_KEY="sk-..."

# Optional: OpenAI model (default: gpt-4)
export OPENAI_MODEL="gpt-4"

# Configure database URLs (JSON map)
export DB_URLS='{"default":"sqlite:////workspaces/nattu22/data.db"}'
```
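
`DB_URLS` is a JSON map from a connection name to a SQLAlchemy-style URL. As a minimal sketch (variable names here are illustrative, not necessarily what `db.py` uses), the map can be parsed and turned into engines like this:

```python
import json
import os

from sqlalchemy import create_engine

# Parse the {name: connection_url} JSON map from the environment.
db_urls = json.loads(os.environ.get("DB_URLS", "{}"))

# Build one SQLAlchemy engine per configured database.
engines = {name: create_engine(url) for name, url in db_urls.items()}
```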

### 3. Create Sample Database (optional)

```bash
python setup_sample_db.py
```

### 4. Run Tests

```bash
pytest test_agent.py -v
```

### 5. Dry-Run Example

```bash
python dry_run.py
```

### 6. Run Streamlit App

```bash
streamlit run app.py
```

Then open http://localhost:8501 in your browser.

## Usage Flow

### Direct Mode

  1. Introspect → Fetch database schema
  2. Semantic → Build model with table samples
  3. Generate SQL → OpenAI translates NL to SQL
  4. Validate → Check syntax and policy
  5. Execute → Run query with row limit
  6. Summarize → Compute statistics or aggregations
  7. Insights → OpenAI generates narrative
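
In code, this flow corresponds roughly to the sketch below. The function names are hypothetical stand-ins for the helpers in `db.py`, `semantic.py`, `nlp_sql.py`, `agent.py`, and `insights.py`, not the repo's exact API:

```python
# Hypothetical imports; the actual names in this repo may differ.
# from db import introspect_schema, execute_sql
# from semantic import build_semantic_model
# from nlp_sql import generate_sql, validate_sql
# from insights import summarize, narrative_insights

def direct_mode(question: str, engine, dialect: str, max_rows: int = 1000):
    schema = introspect_schema(engine)                  # 1. fetch DB structure
    model = build_semantic_model(engine, schema)        # 2. enrich with samples
    sql = generate_sql(question, model, dialect)        # 3. OpenAI NL -> SQL
    validate_sql(sql, dialect)                          # 4. syntax + policy checks
    rows = execute_sql(engine, sql, limit=max_rows)     # 5. run with a row limit
    summary = summarize(rows)                           # 6. stats or aggregations
    return narrative_insights(question, sql, summary)   # 7. OpenAI narrative
```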

### LangChain Agent Mode

- Uses LangChain's `ZERO_SHOT_REACT_DESCRIPTION` agent
- Exposes tools: `generate_sql`, `validate_sql`, `execute_sql`, `narrative_insights`
- The agent decides which tools to use and in what order
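
A sketch of how these tools might be wired up with LangChain's classic `initialize_agent` API; the stubbed tool bodies and exact descriptions are assumptions, not the contents of `agent_chain.py`:

```python
from langchain.agents import AgentType, Tool, initialize_agent
from langchain_openai import ChatOpenAI

# Stubs standing in for the real tool implementations.
def generate_sql_tool(question: str) -> str: ...
def validate_sql_tool(sql: str) -> str: ...
def execute_sql_tool(sql: str) -> str: ...
def insights_tool(results: str) -> str: ...

tools = [
    Tool(name="generate_sql", func=generate_sql_tool,
         description="Translate a natural-language question into SQL."),
    Tool(name="validate_sql", func=validate_sql_tool,
         description="Check SQL syntax and the read-only policy."),
    Tool(name="execute_sql", func=execute_sql_tool,
         description="Run a validated read-only query and return rows."),
    Tool(name="narrative_insights", func=insights_tool,
         description="Summarize query results in plain language."),
]

# The ReAct-style agent decides which tool to call at each step.
agent = initialize_agent(
    tools,
    ChatOpenAI(model="gpt-4", temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)
```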

### Full Agentic Pipeline Mode (Recommended)

- Step-by-step orchestration with full tracing
- Each step includes:
  - `input_data`: what the step received
  - `output_data`: what the step produced
  - `reasoning`: why the step was executed
  - `duration_ms`: execution time
  - `error`: any error message
- All traces are stored and displayed in a collapsible UI
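
Conceptually, each trace entry is a small record like the sketch below (the actual structure in `agentic_pipeline.py` may differ):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class StepTrace:
    name: str                    # e.g. "generate_sql"
    input_data: Any              # what the step received
    output_data: Any = None      # what the step produced
    reasoning: str = ""          # why the step was executed
    duration_ms: float = 0.0     # execution time
    error: Optional[str] = None  # error message, if the step failed
```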

## Architecture

```text
User Query (NL)
    ↓
[Introspect Schema] → Get DB structure
    ↓
[Build Semantic Model] → Enrich with samples
    ↓
[Generate SQL] → OpenAI + dialect-aware
    ↓
[Validate Syntax] → sqlglot parsing
    ↓
[Verify Policy] → Read-only check (agent.py)
    ↓
[Execute] → Run SELECT with row limit
    ↓
[Summarize] → describe() + percentiles or aggregates
    ↓
[Narrative] → OpenAI generates insights
    ↓
Result + Traces + Reasoning Logs
```
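
For the summarize step, a minimal pandas sketch of "describe() + percentiles" might look like this (illustrative only; `insights.py` may compute different statistics):

```python
import pandas as pd

def summarize(rows: list[dict]) -> dict:
    """Basic stats for the numeric columns of a result set."""
    df = pd.DataFrame(rows)
    numeric = df.select_dtypes("number")
    # describe() with extra percentiles beyond the default quartiles.
    stats = numeric.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])
    return {"row_count": len(df), "stats": stats.to_dict()}
```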

## Security & Safety

- ✓ SQL validation with sqlglot (syntax parsing)
- ✓ Policy enforcement: SELECT/DESCRIBE/EXPLAIN/SHOW only (`agent.py`)
- ✓ Blocks CRUD/DDL keywords: INSERT, UPDATE, DELETE, DROP, ALTER, etc.
- ✓ Rejects multiple statements (semicolon check)
- ✓ Strips comments to prevent injection bypass
- ⚠️ For production: use a read-only DB user, row limits, query timeouts, connection pooling, and audit logging
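
A minimal sketch of the kind of read-only check `agent.py` performs, using sqlglot. This illustrates the policy, not the verifier's exact code; for instance, EXPLAIN may parse as a generic command depending on dialect:

```python
import sqlglot
from sqlglot import exp

# Statement types treated as read-only in this sketch.
ALLOWED = (exp.Select, exp.Describe, exp.Show)

def verify_read_only(sql: str, dialect: str = "sqlite") -> None:
    # The check runs on the parsed AST, so keywords hidden
    # inside comments cannot change the statement type.
    statements = sqlglot.parse(sql, read=dialect)
    if len(statements) != 1:
        raise ValueError("Exactly one SQL statement is allowed.")
    stmt = statements[0]
    if not isinstance(stmt, ALLOWED):
        raise ValueError(f"Read-only statements only; got {type(stmt).__name__}.")
```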

## Environment Setup Examples

### Example 1: Local SQLite

```bash
export DB_URLS='{"test":"sqlite:////tmp/test.db"}'
```

### Example 2: PostgreSQL

```bash
export DB_URLS='{"prod":"postgresql://user:pass@localhost:5432/mydb"}'
```

### Example 3: Multiple Databases

```bash
export DB_URLS='{"sqlite":"sqlite:////tmp/test.db","postgres":"postgresql://user:pass@db.example.com/prod"}'
```

## Next Steps & Enhancements

- Add conversation memory for multi-turn refinement
- Integrate with more LLM providers (Claude, Llama, etc.)
- Add query caching and results memoization
- Implement row-level security (RLS) filters
- Add audit logging and query history
- Deploy with authentication (OAuth2, JWT)
- Support time-series and advanced SQL functions
- Add chart generation from results (Plotly, Matplotlib)

## License

Prototype / Educational Use


**Note:** This is a prototype. Review and harden the security layer before using it with production databases.

