
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

QueryGrade is a comprehensive Django-based SQL query analysis and database optimization platform powered by machine learning. The application serves three primary purposes:

  1. SQL Query Grading ✅: Users paste individual SQL queries to receive performance grades (A-F) with specific feedback and improvement recommendations
  2. System Query Analysis 🚧: Analyze queries from database logs to identify optimization opportunities within database ecosystem context
  3. Database Architecture Optimization 🚧: Analyze database architecture and recommend structural improvements

Current Implementation Status: Phase 1 (SQL Query Grading) is complete with a sophisticated hybrid ML system that learns from user feedback. The platform includes rule-based analysis, ML-powered predictions, comprehensive feedback collection, and automated model training.

Core Architecture

Django Project Structure

  • Main project: querygrade/ - Django configuration, Celery setup
  • Main app: analyzer/ - Core query analysis functionality
  • Key modules:
    • analyzer/query_analyzer.py - Core query grading engine facade (delegates to modular analyzers)
    • analyzer/analyzers/ - NEW: Modular analyzer architecture (9 specialized analyzers):
      • base.py - Base analyzer class and orchestrator
      • select_analyzer.py - SELECT clause efficiency
      • join_analyzer.py - JOIN analysis
      • where_analyzer.py - WHERE clause optimization
      • indexing_analyzer.py - NEW: Index detection and optimization
      • subquery_analyzer.py - NEW: Subquery pattern optimization
      • orderby_analyzer.py - NEW: Sorting efficiency
      • groupby_analyzer.py - NEW: Aggregation optimization
      • database/ - Database-specific analyzers (MySQL, PostgreSQL, SQLite, Oracle, SQL Server)
    • analyzer/views/ - Modular views package:
      • auth_views.py - Authentication flows
      • query_grading_views.py - Query analysis
      • history_views.py - User history
      • feedback_views.py - Feedback collection
      • upload_views.py - File processing
    • analyzer/models.py - Comprehensive models for queries, analysis, feedback, ML tracking
    • analyzer/forms.py - Query input, feedback collection, database connection forms
    • analyzer/ml/ - Machine learning subsystem:
      • hybrid_grader.py - Hybrid rule-based + ML grading system
      • feature_extractor.py - Extracts 41+ features from SQL queries
      • feedback_collector.py - Processes user feedback for ML training
      • training_pipeline.py - Automated model training and deployment
      • documentation_loader.py - Imports best practices from SQL documentation
    • analyzer/tasks.py - Celery async tasks for heavy processing
    • analyzer/api_views.py - REST API endpoints
    • analyzer/management/commands/ - CLI commands for ML operations

Query Grading Data Flow (Primary Feature)

  1. Query Submission: User pastes SQL query via web form (QueryGradeForm)
  2. Analysis Phase:
    • Query normalized and hashed for caching
    • QueryGrader.analyze_query() orchestrates 9 specialized analyzers
    • Each analyzer examines specific aspects (SELECT, JOIN, WHERE, ORDER BY, GROUP BY, indexes, subqueries)
    • Detects 18+ issue types across performance, efficiency, and best practices
    • Generates 27+ recommendation types with actionable examples
    • Calculates base score (0-100) and letter grade (A-F)
  3. ML Enhancement (if enabled):
    • FeatureExtractor extracts 41+ numerical features
    • HybridQueryGrader combines rule-based score with ML prediction
    • Confidence-based weighting adjusts between rules and ML
  4. Results Display:
    • Grade, score, issues list, and recommendations shown
    • Quick feedback UI (thumbs up/down) for learning
    • Query saved to UserQueryHistory for user
  5. Feedback Loop:
    • User feedback collected via QueryFeedback model
    • FeedbackCollector processes feedback into training data
    • Periodic retraining improves ML predictions

Analyzer Architecture (NEW - Phase 1 Complete)

9 Specialized Analyzers using Strategy Pattern:

Core Clause Analyzers:

  1. SelectAnalyzer - SELECT clause efficiency (SELECT *, DISTINCT, COUNT(*), scalar subqueries)
  2. JoinAnalyzer - JOIN analysis (types, conditions, cross joins, implicit joins)
  3. WhereAnalyzer - WHERE clause optimization (functions on columns, OR conditions, type mismatches)
  4. OrderByAnalyzer - Sorting efficiency (ORDER BY without LIMIT, functions, expressions, RAND())
  5. GroupByAnalyzer - Aggregation optimization (GROUP BY indexes, HAVING vs WHERE, DISTINCT in aggregates)

Performance Analyzers:

  6. IndexingAnalyzer - Index detection (missing indexes, LIKE wildcards, composite indexes, covering indexes)
  7. SubqueryAnalyzer - Subquery patterns (correlated subqueries, scalar subqueries, IN vs EXISTS, CTEs)

Database-Specific Analyzers:

  8. MySQLAnalyzer - MySQL patterns (storage engines, full-text search, query hints)
  9. PostgreSQLAnalyzer - PostgreSQL features (window functions, DISTINCT ON, advanced indexes)

Detection Capabilities (18+ issue types):

  • High Severity: LIKE leading wildcard, function on indexed columns, scalar subqueries, NOT IN, ORDER BY RAND()
  • Medium Severity: SELECT *, functions in WHERE/ORDER BY/GROUP BY, correlated subqueries, HAVING without aggregates
  • Low Severity: Many GROUP BY columns, DISTINCT with GROUP BY, ORDER BY in subquery

Recommendation Types (27+ types):

  • Index-related (7): WHERE, range, JOIN, ORDER BY, GROUP BY, composite, covering
  • Query rewriting (6): JOIN instead of subquery, NOT EXISTS, CTEs, eliminate scalar subqueries
  • Function optimization (4): Avoid functions on columns, computed columns for ORDER BY/GROUP BY
  • Filter optimization (2): Move HAVING to WHERE, column order by selectivity
  • Other optimizations (8+): ADD LIMIT, remove DISTINCT, COUNT optimization, etc.
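The severity tiers above feed the 0-100 score and A-F grade mentioned in the grading flow. A minimal sketch of that scoring shape is below; the penalty weights and grade cutoffs are hypothetical, and the real logic lives in analyzer/analyzers/base.py's _calculate_score().

```python
# Hypothetical severity penalties and grade cutoffs -- illustrative only.
SEVERITY_PENALTY = {"high": 15, "medium": 8, "low": 3}

def calculate_score(issues: list[dict]) -> int:
    """Start at 100 and deduct a penalty per detected issue, floored at 0."""
    score = 100
    for issue in issues:
        score -= SEVERITY_PENALTY.get(issue["severity"], 0)
    return max(score, 0)

def letter_grade(score: int) -> str:
    """Map a numeric score to an A-F letter grade."""
    for cutoff, grade in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if score >= cutoff:
            return grade
    return "F"

issues = [{"severity": "high"}, {"severity": "medium"}]
assert calculate_score(issues) == 77
assert letter_grade(77) == "C"
```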

Machine Learning System

Hybrid Grading Architecture:

  • Starts with rule-based grading (QueryGrader)
  • ML model trained on user feedback + documentation benchmarks
  • Dynamic weighting: high confidence → more ML, low confidence → more rules
  • Models stored in MLModel table with versioning and performance tracking
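The confidence-based weighting can be pictured as a simple linear blend. The exact weighting scheme used by HybridQueryGrader is an assumption here; this only illustrates the "high confidence → more ML" behavior described above.

```python
def hybrid_score(rule_score: float, ml_score: float, ml_confidence: float) -> float:
    """Blend rule-based and ML scores; higher ML confidence shifts
    weight toward the model prediction (weighting scheme assumed)."""
    w = max(0.0, min(1.0, ml_confidence))  # clamp confidence to [0, 1]
    return w * ml_score + (1 - w) * rule_score

assert hybrid_score(80, 60, 0.0) == 80   # no confidence -> pure rules
assert hybrid_score(80, 60, 1.0) == 60   # full confidence -> pure ML
assert hybrid_score(80, 60, 0.5) == 70   # even blend
```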

Feature Engineering (41+ features):

  • Query structure: table count, join count, where conditions, subqueries
  • Complexity metrics: nesting depth, aggregation usage, distinct operations
  • Performance indicators: SELECT *, function on columns, wildcards in LIKE
  • Database-specific patterns: storage engines, index hints, lock types
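A handful of the structural features can be sketched with simple pattern matching. The real FeatureExtractor computes 41+ features and presumably uses proper SQL parsing (sqlparse); these regexes are deliberately simplified illustrations.

```python
import re

def extract_features(sql: str) -> dict:
    """Toy extraction of a few structural features (illustrative only)."""
    s = sql.upper()
    return {
        "join_count": len(re.findall(r"\bJOIN\b", s)),
        "subquery_count": s.count("(SELECT"),
        "uses_select_star": bool(re.search(r"SELECT\s+\*", s)),
        # One condition for WHERE itself, plus one per AND/OR connector
        "where_conditions": len(re.findall(r"\bAND\b|\bOR\b", s))
                            + (1 if " WHERE " in s else 0),
    }

f = extract_features("SELECT * FROM a JOIN b ON a.id = b.id WHERE a.x = 1 AND a.y = 2")
assert f["join_count"] == 1
assert f["uses_select_star"] is True
assert f["where_conditions"] == 2
```

In the real pipeline these numeric features become the input vector for the trained model (Random Forest, XGBoost, etc.).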

Training Pipeline:

  • Automated via train_ml_model management command
  • Algorithms: Random Forest, Gradient Boosting, XGBoost, LightGBM
  • Validation: Cross-validation, train/test split, performance benchmarking
  • Deployment: Best model auto-deployed to production

Feedback Collection:

  • Quick feedback: thumbs up/down on analysis accuracy
  • Detailed feedback: ratings for accuracy, usefulness, clarity
  • User reliability scoring: consistent users weighted higher
  • Feedback converted to training samples via FeedbackLearning model
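Reliability-weighted sampling might look like the sketch below. The 0.6 threshold matches the --min-reliability default shown in the ML commands section; the idea of weighting samples directly by the reliability score is an assumption about how FeedbackCollector works.

```python
def weight_feedback(samples: list[dict], min_reliability: float = 0.6) -> list[tuple[dict, float]]:
    """Drop feedback from low-reliability users and weight the rest
    by their reliability score (illustrative scheme only)."""
    return [(s, s["reliability"]) for s in samples
            if s["reliability"] >= min_reliability]

samples = [
    {"reliability": 0.9, "suggested_grade": "B"},
    {"reliability": 0.3, "suggested_grade": "A"},  # filtered out
]
kept = weight_feedback(samples)
assert len(kept) == 1 and kept[0][1] == 0.9
```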

Development Commands

Local Development Setup

# Install dependencies (includes ML libraries: tensorflow, torch, transformers, xgboost, lightgbm)
pip install -r requirements.txt

# Database migrations (sets up Query, QueryAnalysis, MLModel, TrainingData, etc.)
python manage.py migrate

# Create superuser for admin access
python manage.py createsuperuser

# Collect static files for production
python manage.py collectstatic --noinput

# Run development server
python manage.py runserver

# Start Redis (required for Celery and caching)
redis-server

# Start Celery worker (for async tasks)
celery -A querygrade worker -l info

# Start Celery beat (for scheduled tasks)
celery -A querygrade beat -l info

ML System Commands

# Train ML model with user feedback
python manage.py train_ml_model --algorithm random_forest --min-samples 50

# Train with specific algorithm and validation
python manage.py train_ml_model --algorithm xgboost --validation-split 0.2 --cross-validation 5

# Process user feedback into training data
python manage.py process_ml_feedback --min-reliability 0.6

# Load best practices from SQL documentation
python manage.py load_documentation --source mysql --url https://dev.mysql.com/doc/

# View ML system analytics
python manage.py ml_analytics --metrics accuracy,f1_score,user_satisfaction

# Manage ML models (list, activate, deactivate, cleanup)
python manage.py manage_ml_models list
python manage.py manage_ml_models activate <model_id>
python manage.py manage_ml_models cleanup --keep 5

Testing

# Run all tests (includes unit, integration, ML tests)
python manage.py test

# Run specific test suites
python manage.py test analyzer                           # All analyzer tests
python manage.py test analyzer.tests.TestQueryGrader     # Query grading tests
python manage.py test analyzer.ml.tests                  # ML system tests
python manage.py test analyzer.test_api                  # API tests

# Run specific test files
python manage.py test analyzer.test_query_grader         # Query grading
python manage.py test analyzer.test_integration          # Integration tests
python manage.py test analyzer.ml.tests.test_hybrid_grader  # Hybrid ML tests

Docker Development

# Build and run with Docker Compose (includes PostgreSQL and Redis)
docker-compose up --build

# Run in detached mode
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

Configuration

Environment Variables

# Database (Production)
DB_ENGINE=django.db.backends.postgresql
DB_NAME=querygrade
DB_USER=postgres
DB_PASSWORD=your_password
DB_HOST=localhost
DB_PORT=5432

# Redis/Celery
REDIS_URL=redis://localhost:6379/1
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/0

# ML System
ML_ENABLED=True                    # Enable ML features
ML_HYBRID_GRADING=True            # Use hybrid rule+ML grading
ML_AUTO_RETRAIN=True              # Automatic model retraining
ML_FEEDBACK_COLLECTION=True       # Collect user feedback
ML_MIN_TRAINING_SAMPLES=50        # Minimum samples before training
ML_RETRAIN_THRESHOLD_DAYS=7       # Days between retraining checks
ML_PERFORMANCE_THRESHOLD=0.7      # Minimum model accuracy

# Security
DEBUG=False                        # NEVER True in production
SECRET_KEY=your_secret_key_here
ALLOWED_HOSTS=yourdomain.com,www.yourdomain.com

Redis Cache Configuration

The system uses 4 separate Redis databases for different caching purposes:

  • DB 1: Default cache (query results, general caching)
  • DB 2: Process cache (ML results, feature extractions)
  • DB 3: Query analysis cache (analysis results, 2-hour TTL)
  • DB 4: Template cache (rendered templates, 30-minute TTL)
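The mapping above corresponds to a CACHES block roughly like the following sketch. The alias names match those used in the test @override_settings later in this file; the django-redis backend name and exact OPTIONS are assumptions about this project's settings.py.

```python
# Sketch of CACHES matching the four Redis databases above (assumed backend).
REDIS_HOST = "redis://localhost:6379"

CACHES = {
    "default":              {"LOCATION": f"{REDIS_HOST}/1", "TIMEOUT": 300},
    "process_cache":        {"LOCATION": f"{REDIS_HOST}/2", "TIMEOUT": 3600},
    "query_analysis_cache": {"LOCATION": f"{REDIS_HOST}/3", "TIMEOUT": 2 * 60 * 60},  # 2-hour TTL
    "template_cache":       {"LOCATION": f"{REDIS_HOST}/4", "TIMEOUT": 30 * 60},      # 30-minute TTL
}
for alias in CACHES:
    CACHES[alias]["BACKEND"] = "django_redis.cache.RedisCache"
```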

ML System Settings (settings.py)

ML_ENABLED = True                      # Global ML toggle
ML_MODEL_PATH = 'analyzer/ml/models'   # Model storage location
ML_MIN_TRAINING_SAMPLES = 50           # Min samples for training
ML_RETRAIN_THRESHOLD_DAYS = 7          # Retraining frequency
ML_PERFORMANCE_THRESHOLD = 0.7         # Min accuracy threshold

Key Dependencies

Core Framework

  • Django 4.0-5.0 - Web framework
  • djangorestframework 3.14+ - REST API
  • djangorestframework-simplejwt 5.2+ - JWT authentication
  • celery 5.2+ - Async task processing
  • redis 4.0+ - Caching and message broker
  • gunicorn 20.1 - WSGI server
  • psycopg2-binary 2.9+ - PostgreSQL adapter

Machine Learning Stack

  • sklearn: Random Forest, Gradient Boosting, feature scaling
  • xgboost 1.7+: Gradient boosting optimization
  • lightgbm 3.3+: Efficient gradient boosting
  • tensorflow 2.10+: Deep learning (future neural network models)
  • torch 1.13+: PyTorch (flexible model architectures)
  • transformers 4.25+: NLP for query semantic analysis
  • sentence-transformers 2.2+: Query embedding generation
  • joblib 1.2+: Model serialization and caching

Data Processing

  • pandas 2.2+ - Data manipulation
  • numpy 1.26+ - Numerical operations
  • sqlparse 0.4+ - SQL parsing and analysis

Documentation & Web

  • beautifulsoup4 4.12+ - HTML parsing for documentation loading
  • requests 2.31+ - HTTP requests for external docs

API Endpoints

REST API (requires authentication)

POST /api/analyze/                    # Analyze single query
POST /api/batch-analyze/              # Batch query analysis
GET  /api/history/                    # User query history
POST /api/feedback/<analysis_id>/     # Submit feedback
GET  /api/models/                     # List ML models
GET  /api/analytics/                  # ML performance metrics

Web Views

GET  /                                # Home (log upload)
GET  /grade/                          # Query grading interface
POST /grade/                          # Submit query for grading
GET  /grade/results/<id>/             # View analysis results
POST /feedback/<id>/                  # Submit detailed feedback
POST /feedback/quick/<id>/            # Quick thumbs up/down
GET  /history/                        # User query history
GET  /ml/dashboard/                   # ML performance dashboard (admin)
GET  /login/                          # User login
GET  /register/                       # User registration

Testing Strategy

📚 For comprehensive testing guidance, see:

Test Coverage

  • Unit Tests: Individual components (QueryGrader, FeatureExtractor, etc.)
  • Integration Tests: End-to-end workflows (query submission → grading → feedback)
  • ML Tests: Model training, prediction, feedback processing
  • API Tests: REST endpoint validation
  • Database Tests: Model relationships, constraints, queries

Key Test Files

  • analyzer/test_query_grader.py - Query analysis logic (28 tests)
  • analyzer/test_integration_refactored.py - Full workflow tests with proper cache handling (5 tests)
  • analyzer/test_integration.py - Legacy integration tests (deprecated, use refactored version)
  • analyzer/test_api.py - API endpoint tests
  • analyzer/ml/tests/test_hybrid_grader.py - ML grading tests
  • analyzer/ml/tests/test_feature_extractor.py - Feature engineering
  • analyzer/ml/tests/test_feedback_collector.py - Feedback processing

Testing Best Practices

Cache Management in Tests

Critical: The global query_cache singleton in analyzer/performance.py is instantiated at module import time, before test settings apply. This causes tests to use production cache instead of DummyCache.

Solution: Always reinitialize cache in test setUp():

def setUp(self):
    from analyzer.performance import query_cache
    from django.core.cache import caches

    # Force query_cache to use test cache backend
    query_cache.cache = caches['query_analysis_cache']

    # Clear all caches
    for cache_name in ['default', 'query_analysis_cache', 'process_cache', 'template_cache']:
        try:
            caches[cache_name].clear()
        except Exception:  # avoid bare except; alias may be unconfigured in some test settings
            pass

TransactionTestCase vs TestCase

When ATOMIC_REQUESTS=True (production setting), use TransactionTestCase for integration tests:

  • TestCase wraps each test in a transaction, conflicting with ATOMIC_REQUESTS
  • TransactionTestCase allows proper transaction control
  • Requires manual cleanup in tearDown() (no automatic rollback)
  • Use factory methods for consistent test object creation

Example:

from django.test import TransactionTestCase

class IntegrationTestCase(TransactionTestCase):
    def tearDown(self):
        # Manual cleanup required
        UserQueryHistory.objects.all().delete()
        QueryAnalysis.objects.all().delete()
        Query.objects.all().delete()
        User.objects.all().delete()

Fetching Test Objects

Always fetch objects by explicit ID to avoid cache interference:

# ❌ BAD - may return cached objects
simple_query_obj = Query.objects.first()
complex_query_obj = Query.objects.last()

# ✅ GOOD - fetch by ID from response
analysis_id = int(response.url.split('/')[-2])
analysis = QueryAnalysis.objects.get(id=analysis_id)
query_obj = analysis.query

Test Configuration

Required @override_settings for integration tests:

@override_settings(
    RATELIMIT_ENABLE=False,  # Disable rate limiting
    CACHES={
        'default': {'BACKEND': 'django.core.cache.backends.dummy.DummyCache'},
        'query_analysis_cache': {'BACKEND': 'django.core.cache.backends.dummy.DummyCache'},
        'process_cache': {'BACKEND': 'django.core.cache.backends.dummy.DummyCache'},
        'template_cache': {'BACKEND': 'django.core.cache.backends.dummy.DummyCache'}
    }
)

Performance Considerations

Caching Strategy

  • Query analysis results cached by query hash (2 hours)
  • ML predictions cached with feature fingerprints
  • Template fragments cached (30 minutes)
  • Redis used for all cache backends with compression

Async Processing

  • Heavy operations (log processing, model training) via Celery
  • Task queues: heavy_processing, light_processing, maintenance
  • Worker memory limits: 200MB per child, 1000 tasks per child

Database Optimization

  • Connection pooling enabled (CONN_MAX_AGE=600)
  • Indexed fields: query_hash, user+timestamp, grades, scores
  • Database query optimization via select_related() and prefetch_related()
  • SQLite WAL mode for development

Security Features

XSS Protection

  • Content Security Policy (CSP) configured
  • Script/style sources restricted to self + CDN
  • Django template auto-escaping enabled
  • Custom security middleware for enhanced protection

CSRF Protection

  • CSRF tokens required for all POST requests
  • Cookie-based CSRF with 1-hour expiration
  • Custom CSRF failure handling

Authentication

  • User authentication required for query submission
  • JWT tokens for API access
  • Session cookies with HTTPOnly, Secure, SameSite flags
  • Password validation: 12+ chars, not common, not similar to username

Rate Limiting

  • IP-based rate limiting: 5 requests/minute for unauthenticated
  • User-based rate limiting: 10 requests/minute for authenticated
  • API throttling: 100/hour anonymous, 1000/hour authenticated
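A fixed-window counter is one minimal way to picture the per-IP limits above. The project most likely relies on an existing library (the test config references RATELIMIT_ENABLE, suggesting django-ratelimit); this sketch only illustrates the policy, not the actual implementation.

```python
import time
from collections import defaultdict
from typing import Optional

class FixedWindowLimiter:
    """Minimal fixed-window rate limiter, e.g. 5 requests/minute per IP."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counts: defaultdict = defaultdict(int)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        """Count the request and report whether it is within the limit."""
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window))  # (client, window index)
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit

limiter = FixedWindowLimiter(limit=5)
results = [limiter.allow("1.2.3.4", now=100.0) for _ in range(6)]
assert results == [True] * 5 + [False]  # 6th request in the window is rejected
```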

Development Workflow Tips

Working with ML Models

  1. Collect feedback: Users submit queries and provide ratings
  2. Process feedback: python manage.py process_ml_feedback
  3. Train model: python manage.py train_ml_model --algorithm xgboost
  4. Monitor performance: Visit /ml/dashboard/ or run ml_analytics command
  5. Model auto-deployed if accuracy > threshold

Debugging Query Analysis

  • Set DEBUG=True in settings.py
  • Check logs in logs/security.log
  • Use Django shell (python manage.py shell) to grade a query interactively:

    from analyzer.query_analyzer import QueryGrader
    grader = QueryGrader()
    query, analysis = grader.analyze_query("SELECT * FROM users")
    print(analysis.grade, analysis.score, analysis.issues_found)

For debugging test failures, see:

Adding New Analysis Rules

The analyzer uses a modular architecture with specialized analyzer classes. To add new rules:

Option 1: Add to existing analyzer

  1. Identify relevant analyzer in analyzer/analyzers/ (e.g., where_analyzer.py, join_analyzer.py)
  2. Add detection logic to the analyzer's analyze() method
  3. Append issues/recommendations to context.issues or context.recommendations
  4. Update scoring in analyzer/analyzers/base.py _calculate_score() if needed
  5. Add test case to analyzer/test_query_grader.py
  6. Run tests: python manage.py test analyzer.test_query_grader

Option 2: Create new analyzer

  1. Create new file in analyzer/analyzers/ (e.g., security_analyzer.py)
  2. Inherit from BaseAnalyzer and implement analyze() and name property
  3. Register in analyzer/analyzers/base.py _initialize_analyzers()
  4. Add comprehensive test cases
  5. Run full test suite: python manage.py test analyzer

Example new analyzer:

import re

from .base import BaseAnalyzer, AnalysisContext

class SecurityAnalyzer(BaseAnalyzer):
    @property
    def name(self) -> str:
        return "SecurityAnalyzer"

    def analyze(self, context: AnalysisContext) -> None:
        sql_upper = context.sql_text.upper()

        # Detect SQL injection patterns
        if re.search(r";\s*DROP\s+TABLE", sql_upper):
            context.issues.append({
                'type': 'sql_injection',
                'severity': 'critical',
                'message': 'Potential SQL injection detected'
            })