This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
QueryGrade is a comprehensive Django-based SQL query analysis and database optimization platform powered by machine learning. The application serves three primary purposes:
- SQL Query Grading ✅: Users paste individual SQL queries to receive performance grades (A-F) with specific feedback and improvement recommendations
- System Query Analysis 🚧: Analyze queries from database logs to identify optimization opportunities within database ecosystem context
- Database Architecture Optimization 🚧: Analyze database architecture and recommend structural improvements
Current Implementation Status: Phase 1 (SQL Query Grading) is complete with a sophisticated hybrid ML system that learns from user feedback. The platform includes rule-based analysis, ML-powered predictions, comprehensive feedback collection, and automated model training.
- Main project: `querygrade/` - Django configuration, Celery setup
- Main app: `analyzer/` - Core query analysis functionality
- Key modules:
  - `analyzer/query_analyzer.py` - Core query grading engine facade (delegates to modular analyzers)
  - `analyzer/analyzers/` - NEW: Modular analyzer architecture (9 specialized analyzers):
    - `base.py` - Base analyzer class and orchestrator
    - `select_analyzer.py` - SELECT clause efficiency
    - `join_analyzer.py` - JOIN analysis
    - `where_analyzer.py` - WHERE clause optimization
    - `indexing_analyzer.py` - NEW: Index detection and optimization
    - `subquery_analyzer.py` - NEW: Subquery pattern optimization
    - `orderby_analyzer.py` - NEW: Sorting efficiency
    - `groupby_analyzer.py` - NEW: Aggregation optimization
    - `database/` - Database-specific analyzers (MySQL, PostgreSQL, SQLite, Oracle, SQL Server)
  - `analyzer/views/` - Modular views package:
    - `auth_views.py` - Authentication flows
    - `query_grading_views.py` - Query analysis
    - `history_views.py` - User history
    - `feedback_views.py` - Feedback collection
    - `upload_views.py` - File processing
  - `analyzer/models.py` - Comprehensive models for queries, analysis, feedback, ML tracking
  - `analyzer/forms.py` - Query input, feedback collection, database connection forms
  - `analyzer/ml/` - Machine learning subsystem:
    - `hybrid_grader.py` - Hybrid rule-based + ML grading system
    - `feature_extractor.py` - Extracts 41+ features from SQL queries
    - `feedback_collector.py` - Processes user feedback for ML training
    - `training_pipeline.py` - Automated model training and deployment
    - `documentation_loader.py` - Imports best practices from SQL documentation
  - `analyzer/tasks.py` - Celery async tasks for heavy processing
  - `analyzer/api_views.py` - REST API endpoints
  - `analyzer/management/commands/` - CLI commands for ML operations
1. Query Submission: User pastes SQL query via web form (`QueryGradeForm`)
2. Analysis Phase:
   - Query normalized and hashed for caching
   - `QueryGrader.analyze_query()` orchestrates 9 specialized analyzers
   - Each analyzer examines specific aspects (SELECT, JOIN, WHERE, ORDER BY, GROUP BY, indexes, subqueries)
   - Detects 18+ issue types across performance, efficiency, and best practices
   - Generates 27+ recommendation types with actionable examples
   - Calculates base score (0-100) and letter grade (A-F)
3. ML Enhancement (if enabled):
   - `FeatureExtractor` extracts 41+ numerical features
   - `HybridQueryGrader` combines rule-based score with ML prediction
   - Confidence-based weighting adjusts between rules and ML
4. Results Display:
   - Grade, score, issues list, and recommendations shown
   - Quick feedback UI (thumbs up/down) for learning
   - Query saved to `UserQueryHistory` for user
5. Feedback Loop:
   - User feedback collected via `QueryFeedback` model
   - `FeedbackCollector` processes feedback into training data
   - Periodic retraining improves ML predictions
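The score-to-grade step in the analysis phase can be sketched as follows. This is a minimal illustration with assumed cutoffs; the actual thresholds live inside the grading engine and may differ.

```python
# Hypothetical sketch of mapping a 0-100 base score to a letter grade.
# The cutoffs below (90/80/70/60) are assumptions, not QueryGrade's actual values.
def score_to_grade(score: int) -> str:
    """Map a numeric base score to a letter grade A-F."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"

print(score_to_grade(85))  # → B
```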
9 Specialized Analyzers using Strategy Pattern:
Core Clause Analyzers:
1. SelectAnalyzer - SELECT clause efficiency (SELECT *, DISTINCT, COUNT(*), scalar subqueries)
2. JoinAnalyzer - JOIN analysis (types, conditions, cross joins, implicit joins)
3. WhereAnalyzer - WHERE clause optimization (functions on columns, OR conditions, type mismatches)
4. OrderByAnalyzer - Sorting efficiency (ORDER BY without LIMIT, functions, expressions, RAND())
5. GroupByAnalyzer - Aggregation optimization (GROUP BY indexes, HAVING vs WHERE, DISTINCT in aggregates)

Performance Analyzers:
6. IndexingAnalyzer - Index detection (missing indexes, LIKE wildcards, composite indexes, covering indexes)
7. SubqueryAnalyzer - Subquery patterns (correlated subqueries, scalar subqueries, IN vs EXISTS, CTEs)

Database-Specific Analyzers:
8. MySQLAnalyzer - MySQL patterns (storage engines, full-text search, query hints)
9. PostgreSQLAnalyzer - PostgreSQL features (window functions, DISTINCT ON, advanced indexes)
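The Strategy Pattern described above can be sketched in miniature. The class and attribute names here mirror the ones documented in this file, but the bodies are illustrative, not the project's actual implementation.

```python
# Minimal sketch of the analyzer Strategy Pattern: each analyzer inspects
# one shared context; an orchestrator runs them all. Illustrative only.
class AnalysisContext:
    def __init__(self, sql_text):
        self.sql_text = sql_text
        self.issues = []
        self.recommendations = []

class BaseAnalyzer:
    def analyze(self, context):
        raise NotImplementedError

class SelectStarAnalyzer(BaseAnalyzer):
    def analyze(self, context):
        # Flag SELECT * as a medium-severity issue (per the severity table below)
        if "SELECT *" in context.sql_text.upper():
            context.issues.append({"type": "select_star", "severity": "medium"})

def run_analyzers(sql_text, analyzers):
    # The orchestrator passes one shared context through every analyzer
    context = AnalysisContext(sql_text)
    for analyzer in analyzers:
        analyzer.analyze(context)
    return context

ctx = run_analyzers("SELECT * FROM users", [SelectStarAnalyzer()])
print(len(ctx.issues))  # → 1
```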
Detection Capabilities (18+ issue types):
- High Severity: LIKE leading wildcard, function on indexed columns, scalar subqueries, NOT IN, ORDER BY RAND()
- Medium Severity: SELECT *, functions in WHERE/ORDER BY/GROUP BY, correlated subqueries, HAVING without aggregates
- Low Severity: Many GROUP BY columns, DISTINCT with GROUP BY, ORDER BY in subquery
Recommendation Types (27+ types):
- Index-related (7): WHERE, range, JOIN, ORDER BY, GROUP BY, composite, covering
- Query rewriting (6): JOIN instead of subquery, NOT EXISTS, CTEs, eliminate scalar subqueries
- Function optimization (4): Avoid functions on columns, computed columns for ORDER BY/GROUP BY
- Filter optimization (2): Move HAVING to WHERE, column order by selectivity
- Other optimizations (8+): ADD LIMIT, remove DISTINCT, COUNT optimization, etc.
Hybrid Grading Architecture:
- Starts with rule-based grading (QueryGrader)
- ML model trained on user feedback + documentation benchmarks
- Dynamic weighting: high confidence → more ML, low confidence → more rules
- Models stored in `MLModel` table with versioning and performance tracking
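The dynamic weighting idea can be expressed as a simple linear blend. This is a hedged sketch: the real hybrid grader may combine scores differently.

```python
# Sketch of confidence-based weighting between rule and ML scores.
# A linear blend is assumed here; the project's actual scheme may differ.
def blend_scores(rule_score: float, ml_score: float, ml_confidence: float) -> float:
    """High ML confidence shifts weight toward the ML prediction."""
    ml_confidence = max(0.0, min(1.0, ml_confidence))
    return ml_confidence * ml_score + (1.0 - ml_confidence) * rule_score

# Low confidence → result stays close to the rule-based score
print(blend_scores(rule_score=80.0, ml_score=60.0, ml_confidence=0.25))  # → 75.0
```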
Feature Engineering (41+ features):
- Query structure: table count, join count, where conditions, subqueries
- Complexity metrics: nesting depth, aggregation usage, distinct operations
- Performance indicators: SELECT *, function on columns, wildcards in LIKE
- Database-specific patterns: storage engines, index hints, lock types
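A handful of the structural features listed above can be sketched with plain string and regex checks. The real extractor produces 41+ features; these names and heuristics are illustrative assumptions.

```python
# Illustrative sketch of structural feature extraction from a SQL string.
# Feature names and heuristics here are assumptions, not the real FeatureExtractor.
import re

def extract_features(sql: str) -> dict:
    sql_upper = sql.upper()
    return {
        "join_count": len(re.findall(r"\bJOIN\b", sql_upper)),
        "subquery_count": sql_upper.count("(SELECT"),
        "has_select_star": int("SELECT *" in sql_upper),
        "has_distinct": int("DISTINCT" in sql_upper),
        "like_leading_wildcard": int(bool(re.search(r"LIKE\s+'%", sql_upper))),
    }

features = extract_features("SELECT * FROM a JOIN b ON a.id = b.a_id WHERE a.x = 1")
print(features["join_count"], features["has_select_star"])  # → 1 1
```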
Training Pipeline:
- Automated via `train_ml_model` management command
- Algorithms: Random Forest, Gradient Boosting, XGBoost, LightGBM
- Validation: Cross-validation, train/test split, performance benchmarking
- Deployment: Best model auto-deployed to production
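The "best model auto-deployed" step amounts to picking the top-scoring candidate and checking it against the accuracy threshold. A sketch with hypothetical scores:

```python
# Sketch of the deployment gate: deploy the best algorithm only if it
# clears the accuracy threshold. Scores below are made-up examples.
def select_best_model(cv_scores: dict, threshold: float = 0.7):
    """Return the name of the best model, or None if none meets the threshold."""
    best_name = max(cv_scores, key=cv_scores.get)
    return best_name if cv_scores[best_name] >= threshold else None

scores = {"random_forest": 0.78, "xgboost": 0.82, "lightgbm": 0.80}
print(select_best_model(scores))  # → xgboost
```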
Feedback Collection:
- Quick feedback: thumbs up/down on analysis accuracy
- Detailed feedback: ratings for accuracy, usefulness, clarity
- User reliability scoring: consistent users weighted higher
- Feedback converted to training samples via `FeedbackLearning` model
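Reliability-weighted aggregation can be sketched as a weighted average. The weighting scheme here is an assumption for illustration, not the FeedbackCollector's actual logic.

```python
# Sketch of reliability-weighted feedback aggregation: each user's score
# counts in proportion to their reliability weight. Assumed scheme.
def weighted_label(feedbacks):
    """Combine (score, reliability) pairs into one training label, or None if empty."""
    total_weight = sum(reliability for _, reliability in feedbacks)
    if total_weight == 0:
        return None
    return sum(score * reliability for score, reliability in feedbacks) / total_weight

# Two reliable users say 80; one unreliable user says 20
print(weighted_label([(80, 0.9), (80, 0.9), (20, 0.2)]))  # → 74.0
```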
# Install dependencies (includes ML libraries: tensorflow, torch, transformers, xgboost, lightgbm)
pip install -r requirements.txt
# Database migrations (sets up Query, QueryAnalysis, MLModel, TrainingData, etc.)
python manage.py migrate
# Create superuser for admin access
python manage.py createsuperuser
# Collect static files for production
python manage.py collectstatic --noinput
# Run development server
python manage.py runserver
# Start Redis (required for Celery and caching)
redis-server
# Start Celery worker (for async tasks)
celery -A querygrade worker -l info
# Start Celery beat (for scheduled tasks)
celery -A querygrade beat -l info

# Train ML model with user feedback
python manage.py train_ml_model --algorithm random_forest --min-samples 50
# Train with specific algorithm and validation
python manage.py train_ml_model --algorithm xgboost --validation-split 0.2 --cross-validation 5
# Process user feedback into training data
python manage.py process_ml_feedback --min-reliability 0.6
# Load best practices from SQL documentation
python manage.py load_documentation --source mysql --url https://dev.mysql.com/doc/
# View ML system analytics
python manage.py ml_analytics --metrics accuracy,f1_score,user_satisfaction
# Manage ML models (list, activate, deactivate, cleanup)
python manage.py manage_ml_models list
python manage.py manage_ml_models activate <model_id>
python manage.py manage_ml_models cleanup --keep 5

# Run all tests (includes unit, integration, ML tests)
python manage.py test
# Run specific test suites
python manage.py test analyzer # All analyzer tests
python manage.py test analyzer.tests.TestQueryGrader # Query grading tests
python manage.py test analyzer.ml.tests # ML system tests
python manage.py test analyzer.test_api # API tests
# Run specific test files
python manage.py test analyzer.test_query_grader # Query grading
python manage.py test analyzer.test_integration # Integration tests
python manage.py test analyzer.ml.tests.test_hybrid_grader # Hybrid ML tests

# Build and run with Docker Compose (includes PostgreSQL and Redis)
docker-compose up --build
# Run in detached mode
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down

# Database (Production)
DB_ENGINE=django.db.backends.postgresql
DB_NAME=querygrade
DB_USER=postgres
DB_PASSWORD=your_password
DB_HOST=localhost
DB_PORT=5432
# Redis/Celery
REDIS_URL=redis://localhost:6379/1
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/0
# ML System
ML_ENABLED=True # Enable ML features
ML_HYBRID_GRADING=True # Use hybrid rule+ML grading
ML_AUTO_RETRAIN=True # Automatic model retraining
ML_FEEDBACK_COLLECTION=True # Collect user feedback
ML_MIN_TRAINING_SAMPLES=50 # Minimum samples before training
ML_RETRAIN_THRESHOLD_DAYS=7 # Days between retraining checks
ML_PERFORMANCE_THRESHOLD=0.7 # Minimum model accuracy
# Security
DEBUG=False # NEVER True in production
SECRET_KEY=your_secret_key_here
ALLOWED_HOSTS=yourdomain.com,www.yourdomain.com

The system uses 4 separate Redis databases for different caching purposes:
- DB 1: Default cache (query results, general caching)
- DB 2: Process cache (ML results, feature extractions)
- DB 3: Query analysis cache (analysis results, 2-hour TTL)
- DB 4: Template cache (rendered templates, 30-minute TTL)
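The four-database layout above maps naturally onto a Django `CACHES` setting. The cache alias names and TTLs come from this document; the exact backend options are an assumption about how the project configures it.

```python
# Hedged sketch of a Django CACHES setting for the four Redis DBs above.
# Alias names and TTLs match this document; backend details are assumptions.
CACHES = {
    'default': {  # DB 1: query results, general caching
        'BACKEND': 'django.core.cache.backends.redis.RedisCache',
        'LOCATION': 'redis://localhost:6379/1',
    },
    'process_cache': {  # DB 2: ML results, feature extractions
        'BACKEND': 'django.core.cache.backends.redis.RedisCache',
        'LOCATION': 'redis://localhost:6379/2',
    },
    'query_analysis_cache': {  # DB 3: analysis results
        'BACKEND': 'django.core.cache.backends.redis.RedisCache',
        'LOCATION': 'redis://localhost:6379/3',
        'TIMEOUT': 60 * 60 * 2,  # 2-hour TTL
    },
    'template_cache': {  # DB 4: rendered templates
        'BACKEND': 'django.core.cache.backends.redis.RedisCache',
        'LOCATION': 'redis://localhost:6379/4',
        'TIMEOUT': 60 * 30,  # 30-minute TTL
    },
}
```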
ML_ENABLED = True # Global ML toggle
ML_MODEL_PATH = 'analyzer/ml/models' # Model storage location
ML_MIN_TRAINING_SAMPLES = 50 # Min samples for training
ML_RETRAIN_THRESHOLD_DAYS = 7 # Retraining frequency
ML_PERFORMANCE_THRESHOLD = 0.7 # Min accuracy threshold

- Django 4.0-5.0 - Web framework
- djangorestframework 3.14+ - REST API
- djangorestframework-simplejwt 5.2+ - JWT authentication
- celery 5.2+ - Async task processing
- redis 4.0+ - Caching and message broker
- gunicorn 20.1 - WSGI server
- psycopg2-binary 2.9+ - PostgreSQL adapter
- sklearn: Random Forest, Gradient Boosting, feature scaling
- xgboost 1.7+: Gradient boosting optimization
- lightgbm 3.3+: Efficient gradient boosting
- tensorflow 2.10+: Deep learning (future neural network models)
- torch 1.13+: PyTorch (flexible model architectures)
- transformers 4.25+: NLP for query semantic analysis
- sentence-transformers 2.2+: Query embedding generation
- joblib 1.2+: Model serialization and caching
- pandas 2.2+ - Data manipulation
- numpy 1.26+ - Numerical operations
- sqlparse 0.4+ - SQL parsing and analysis
- beautifulsoup4 4.12+ - HTML parsing for documentation loading
- requests 2.31+ - HTTP requests for external docs
POST /api/analyze/ # Analyze single query
POST /api/batch-analyze/ # Batch query analysis
GET /api/history/ # User query history
POST /api/feedback/<analysis_id>/ # Submit feedback
GET /api/models/ # List ML models
GET /api/analytics/ # ML performance metrics

GET / # Home (log upload)
GET /grade/ # Query grading interface
POST /grade/ # Submit query for grading
GET /grade/results/<id>/ # View analysis results
POST /feedback/<id>/ # Submit detailed feedback
POST /feedback/quick/<id>/ # Quick thumbs up/down
GET /history/ # User query history
GET /ml/dashboard/ # ML performance dashboard (admin)
GET /login/ # User login
GET /register/ # User registration

📚 For comprehensive testing guidance, see:
- TESTING.md - Complete testing guide with examples, best practices, and troubleshooting
- INTEGRATION_TEST_FIX_SUMMARY.md - Case study of solving cache initialization issues in tests
- Unit Tests: Individual components (QueryGrader, FeatureExtractor, etc.)
- Integration Tests: End-to-end workflows (query submission → grading → feedback)
- ML Tests: Model training, prediction, feedback processing
- API Tests: REST endpoint validation
- Database Tests: Model relationships, constraints, queries
- `analyzer/test_query_grader.py` - Query analysis logic (28 tests)
- `analyzer/test_integration_refactored.py` - Full workflow tests with proper cache handling (5 tests)
- `analyzer/test_integration.py` - Legacy integration tests (deprecated, use refactored version)
- `analyzer/test_api.py` - API endpoint tests
- `analyzer/ml/tests/test_hybrid_grader.py` - ML grading tests
- `analyzer/ml/tests/test_feature_extractor.py` - Feature engineering
- `analyzer/ml/tests/test_feedback_collector.py` - Feedback processing
Critical: The global query_cache singleton in analyzer/performance.py is instantiated at module import time, before test settings apply. This causes tests to use production cache instead of DummyCache.
Solution: Always reinitialize cache in test setUp():
def setUp(self):
    from analyzer.performance import query_cache
    from django.core.cache import caches

    # Force query_cache to use test cache backend
    query_cache.cache = caches['query_analysis_cache']

    # Clear all caches
    for cache_name in ['default', 'query_analysis_cache', 'process_cache', 'template_cache']:
        try:
            caches[cache_name].clear()
        except Exception:
            pass

When ATOMIC_REQUESTS=True (production setting), use TransactionTestCase for integration tests:
- `TestCase` wraps each test in a transaction, conflicting with `ATOMIC_REQUESTS`
- `TransactionTestCase` allows proper transaction control
- Requires manual cleanup in `tearDown()` (no automatic rollback)
- Use factory methods for consistent test object creation
Example:
from django.test import TransactionTestCase

class IntegrationTestCase(TransactionTestCase):
    def tearDown(self):
        # Manual cleanup required
        UserQueryHistory.objects.all().delete()
        QueryAnalysis.objects.all().delete()
        Query.objects.all().delete()
        User.objects.all().delete()

Always fetch objects by explicit ID to avoid cache interference:
# ❌ BAD - may return cached objects
simple_query_obj = Query.objects.first()
complex_query_obj = Query.objects.last()
# ✅ GOOD - fetch by ID from response
analysis_id = int(response.url.split('/')[-2])
analysis = QueryAnalysis.objects.get(id=analysis_id)
query_obj = analysis.query

Required @override_settings for integration tests:
@override_settings(
    RATELIMIT_ENABLE=False,  # Disable rate limiting
    CACHES={
        'default': {'BACKEND': 'django.core.cache.backends.dummy.DummyCache'},
        'query_analysis_cache': {'BACKEND': 'django.core.cache.backends.dummy.DummyCache'},
        'process_cache': {'BACKEND': 'django.core.cache.backends.dummy.DummyCache'},
        'template_cache': {'BACKEND': 'django.core.cache.backends.dummy.DummyCache'}
    }
)

- Query analysis results cached by query hash (2 hours)
- ML predictions cached with feature fingerprints
- Template fragments cached (30 minutes)
- Redis used for all cache backends with compression
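The "cached by query hash" strategy can be sketched in a few lines: normalize the SQL so trivially different spellings share a key, then hash. The normalization rules here are assumptions about the project's actual implementation.

```python
# Sketch of query-hash cache keying: normalize whitespace and case,
# then hash. Normalization details are assumed, not QueryGrade's exact rules.
import hashlib
import re

def query_cache_key(sql: str) -> str:
    """Derive a stable cache key so equivalent queries share one entry."""
    normalized = re.sub(r"\s+", " ", sql.strip()).lower()
    return "qa:" + hashlib.sha256(normalized.encode()).hexdigest()

k1 = query_cache_key("SELECT *   FROM users")
k2 = query_cache_key("select * from users")
print(k1 == k2)  # → True
```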
- Heavy operations (log processing, model training) via Celery
- Task queues: `heavy_processing`, `light_processing`, `maintenance`
- Worker memory limits: 200MB per child, 1000 tasks per child
- Connection pooling enabled (CONN_MAX_AGE=600)
- Indexed fields: query_hash, user+timestamp, grades, scores
- Database query optimization via select_related() and prefetch_related()
- SQLite WAL mode for development
- Content Security Policy (CSP) configured
- Script/style sources restricted to self + CDN
- Django template auto-escaping enabled
- Custom security middleware for enhanced protection
- CSRF tokens required for all POST requests
- Cookie-based CSRF with 1-hour expiration
- Custom CSRF failure handling
- User authentication required for query submission
- JWT tokens for API access
- Session cookies with HTTPOnly, Secure, SameSite flags
- Password validation: 12+ chars, not common, not similar to username
- IP-based rate limiting: 5 requests/minute for unauthenticated
- User-based rate limiting: 10 requests/minute for authenticated
- API throttling: 100/hour anonymous, 1000/hour authenticated
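The API throttling limits above correspond to a standard Django REST Framework throttle configuration. Whether QueryGrade configures them exactly this way is an assumption; the rates themselves come from this document.

```python
# Sketch of DRF throttle settings matching the stated API limits
# (100/hour anonymous, 1000/hour authenticated). Assumed configuration.
REST_FRAMEWORK = {
    'DEFAULT_THROTTLE_CLASSES': [
        'rest_framework.throttling.AnonRateThrottle',
        'rest_framework.throttling.UserRateThrottle',
    ],
    'DEFAULT_THROTTLE_RATES': {
        'anon': '100/hour',   # anonymous API clients
        'user': '1000/hour',  # authenticated API clients
    },
}
```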
- Collect feedback: Users submit queries and provide ratings
- Process feedback: `python manage.py process_ml_feedback`
- Train model: `python manage.py train_ml_model --algorithm xgboost`
- Monitor performance: Visit `/ml/dashboard/` or run `ml_analytics` command
- Model auto-deployed if accuracy > threshold
- Set `DEBUG=True` in settings.py
- Check logs in `logs/security.log`
- Use Django shell:

python manage.py shell

from analyzer.query_analyzer import QueryGrader
grader = QueryGrader()
query, analysis = grader.analyze_query("SELECT * FROM users")
print(analysis.grade, analysis.score, analysis.issues_found)
For debugging test failures, see the testing documentation above (TESTING.md, INTEGRATION_TEST_FIX_SUMMARY.md).
The analyzer uses a modular architecture with specialized analyzer classes. To add new rules:
Option 1: Add to existing analyzer
- Identify relevant analyzer in `analyzer/analyzers/` (e.g., `where_analyzer.py`, `join_analyzer.py`)
- Add detection logic to the analyzer's `analyze()` method
- Append issues/recommendations to `context.issues` or `context.recommendations`
- Update scoring in `analyzer/analyzers/base.py` `_calculate_score()` if needed
- Add test case to `analyzer/test_query_grader.py`
- Run tests: `python manage.py test analyzer.test_query_grader`
Option 2: Create new analyzer
- Create new file in `analyzer/analyzers/` (e.g., `security_analyzer.py`)
- Inherit from `BaseAnalyzer` and implement `analyze()` and `name` property
- Register in `analyzer/analyzers/base.py` `_initialize_analyzers()`
- Add comprehensive test cases
- Run full test suite: `python manage.py test analyzer`
Example new analyzer:
import re

from .base import BaseAnalyzer, AnalysisContext

class SecurityAnalyzer(BaseAnalyzer):
    @property
    def name(self) -> str:
        return "SecurityAnalyzer"

    def analyze(self, context: AnalysisContext) -> None:
        sql_upper = context.sql_text.upper()
        # Detect SQL injection patterns (stacked DROP TABLE statements)
        if re.search(r";\s*DROP\s+TABLE", sql_upper):
            context.issues.append({
                'type': 'sql_injection',
                'severity': 'critical',
                'message': 'Potential SQL injection detected'
            })