A tool that uses an LLM (Databricks Model Serving, OpenAI, Azure OpenAI, Anthropic) to analyze Databricks SQL Profiler JSON logs, identify bottlenecks, and provide optimization recommendations.
- Profiler JSON Analysis: Parse SQL Profiler output, including graphs and metrics
- Metrics Extraction: Execution time, data volume, cache efficiency, node details
- Bottleneck Detection: Skew, spill, shuffle, I/O hotspots, Photon efficiency
- Prioritized Recommendations: HIGH/MEDIUM/LOW optimization suggestions
- Iterative Optimization: Up to 3 optimization attempts with progressive improvement
- EXPLAIN/EXPLAIN COST Analysis: Execution plan-based optimization verification
- Multi-language Output: Japanese/English report generation
- Multi-LLM Provider: Databricks, OpenAI, Azure OpenAI, Anthropic
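The multi-provider support can be pictured as a small dispatch layer that selects a client from `LLM_CONFIG["provider"]`. The sketch below is purely illustrative: `LLMRequest`, `EchoClient`, and `make_client` are hypothetical names, not the actual API in `src/llm/`.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the real provider clients in src/llm/;
# names and signatures here are illustrative only.
@dataclass
class LLMRequest:
    prompt: str
    max_tokens: int = 16000
    temperature: float = 0.0

class EchoClient:
    """Placeholder client that echoes the prompt instead of calling an API."""
    def __init__(self, provider: str):
        self.provider = provider

    def complete(self, request: LLMRequest) -> str:
        return f"[{self.provider}] {request.prompt}"

SUPPORTED_PROVIDERS = {"databricks", "openai", "azure_openai", "anthropic"}

def make_client(config: dict) -> EchoClient:
    """Select a client from LLM_CONFIG['provider'] (Strategy-style dispatch)."""
    provider = config["provider"]
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported provider: {provider}")
    return EchoClient(provider)

client = make_client({"provider": "databricks"})
print(client.complete(LLMRequest(prompt="Analyze this profile")))
# → [databricks] Analyze this profile
```

In the real tool each provider would have its own client class behind a common interface, which is what makes swapping providers a one-line config change.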
- Python 3.9+
- Databricks Runtime (for notebook execution)
- `requests` library
```bash
# Clone repository
git clone https://github.com/MitsuhiroItagaki/query_profiler_analysis.git
cd query_profiler_analysis

# Development setup
pip install -e ".[dev]"
```

```text
query_profiler_analysis/
├── notebooks/
│   └── main_full.py              # Databricks notebook (recommended)
├── src/                          # Modular source code
│   ├── config.py                 # Configuration management
│   ├── models.py                 # Data models
│   ├── llm/                      # LLM clients
│   ├── profiler/                 # Profiler analysis
│   ├── optimization/             # Query optimization
│   ├── report/                   # Report generation
│   └── utils/                    # Utilities
├── tests/                        # Test code
├── query_profiler_analysis.py    # Full analysis logic
└── pyproject.toml
```
- Clone the repository to Databricks Repos
  - Repos → Add Repo → https://github.com/MitsuhiroItagaki/query_profiler_analysis.git
- Open `notebooks/main_full.py`
- Edit the configuration cell:
```python
# Required settings
JSON_FILE_PATH = '/Volumes/your_catalog/your_schema/your_volume/query-profile.json'
OUTPUT_FILE_DIR = './output'
OUTPUT_LANGUAGE = 'en'   # 'ja' or 'en'
EXPLAIN_ENABLED = 'Y'    # 'Y' = execute EXPLAIN, 'N' = skip
CATALOG = 'your_catalog'
DATABASE = 'your_database'
DEBUG_ENABLED = 'N'      # 'Y' = keep intermediate files, 'N' = final files only

# LLM configuration
LLM_CONFIG = {
    "provider": "databricks",
    "databricks": {
        "endpoint_name": "databricks-claude-opus-4-5",
        "max_tokens": 32000,
        "temperature": 0.0,
    },
}
```

- Run All to execute all cells
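Before running, it can help to confirm that `JSON_FILE_PATH` points at a readable JSON file. The helper below is an illustrative pre-flight check, not part of the tool; the local file it writes merely stands in for a real profile on a Unity Catalog volume.

```python
import json
import os

def validate_profile_json(path: str) -> dict:
    """Basic pre-flight check: the file exists and parses as a JSON object.
    (Illustrative helper, not part of the tool itself.)"""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Profiler JSON not found: {path}")
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    return data

# A locally created file stands in for the /Volumes/... path here.
with open("query-profile.json", "w", encoding="utf-8") as f:
    json.dump({"graphs": []}, f)
profile = validate_profile_json("query-profile.json")
print(sorted(profile.keys()))
# → ['graphs']
```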
| Setting | Description | Default |
|---|---|---|
| `JSON_FILE_PATH` | SQL Profiler JSON file path | Required |
| `OUTPUT_FILE_DIR` | Output directory | `./output` |
| `OUTPUT_LANGUAGE` | Output language (`ja`/`en`) | `en` |
| `EXPLAIN_ENABLED` | Execute EXPLAIN (`Y`/`N`) | `Y` |
| `CATALOG` | Catalog to use | Required (for EXPLAIN) |
| `DATABASE` | Database to use | Required (for EXPLAIN) |
| `DEBUG_ENABLED` | Debug mode (`Y`/`N`) | `N` |
| `MAX_OPTIMIZATION_ATTEMPTS` | Number of optimization attempts | `3` |
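Since the project notes mention type-safe configuration with dataclasses, the settings above might map onto something like the sketch below. This is hypothetical: the actual field names and validation in `src/config.py` may differ.

```python
from dataclasses import dataclass

# Hypothetical mirror of the settings table; the real fields in
# src/config.py may differ.
@dataclass(frozen=True)
class AnalysisConfig:
    json_file_path: str
    catalog: str
    database: str
    output_file_dir: str = "./output"
    output_language: str = "en"        # 'ja' or 'en'
    explain_enabled: bool = True
    debug_enabled: bool = False
    max_optimization_attempts: int = 3

    def __post_init__(self):
        if self.output_language not in ("ja", "en"):
            raise ValueError("output_language must be 'ja' or 'en'")

config = AnalysisConfig(
    json_file_path="/Volumes/main/default/files/query-profile.json",
    catalog="main",
    database="default",
)
print(config.max_optimization_attempts)
# → 3
```

A frozen dataclass gives immutable, typed settings, and `__post_init__` rejects invalid values at construction time instead of mid-run.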
Databricks (default):

```python
LLM_CONFIG = {
    "provider": "databricks",
    "databricks": {
        "endpoint_name": "databricks-claude-opus-4-5",
        "max_tokens": 32000,
        "temperature": 0.0,
        "thinking_enabled": False,
    },
}
```

OpenAI:

```python
LLM_CONFIG = {
    "provider": "openai",
    "openai": {
        "api_key": "",  # or use OPENAI_API_KEY environment variable
        "model": "gpt-4o",
        "max_tokens": 16000,
        "temperature": 0.0,
    },
}
```

Anthropic:

```python
LLM_CONFIG = {
    "provider": "anthropic",
    "anthropic": {
        "api_key": "",  # or use ANTHROPIC_API_KEY environment variable
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 16000,
        "temperature": 0.0,
    },
}
```

| File | Description |
|---|---|
| `output_original_query_*.sql` | Original query |
| `output_optimized_query_*.sql` | Optimized query |
| `output_optimization_report_*.md` | Optimization report |
| `output_final_report_*.md` | LLM-refined final report |

With `DEBUG_ENABLED = 'Y'`, intermediate files are also kept:

- `output_explain_*.txt` - EXPLAIN results
- `output_explain_cost_*.txt` - EXPLAIN COST results
- `output_performance_judgment_log_*.txt` - Performance judgment logs
- Other intermediate files
- Verify `CATALOG` and `DATABASE` are correctly configured
- Check access permissions for tables used in the query
- Ensure table names use the full path (`catalog.schema.table`)
```python
# Test EXPLAIN manually
spark.sql("USE CATALOG your_catalog")
spark.sql("USE DATABASE your_database")
spark.sql("EXPLAIN SELECT * FROM your_table LIMIT 1").show(truncate=False)
```

When the LLM generates incomplete SQL, the tool automatically falls back to the original query. The log will show: `Warning: Generated SQL is missing SELECT clause`.
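The fallback described above can be sketched as a simple completeness check. The function name `choose_query` and the check itself are illustrative, not the tool's actual implementation.

```python
import re

def choose_query(original_sql: str, generated_sql: str) -> str:
    """Fall back to the original query when the generated SQL looks incomplete.
    Illustrative sketch of the behavior described above, not the tool's code."""
    if not re.search(r"\bSELECT\b", generated_sql, re.IGNORECASE):
        print("Warning: Generated SQL is missing SELECT clause")
        return original_sql
    return generated_sql

# An incomplete generation triggers the warning and returns the original query.
print(choose_query("SELECT * FROM t", "WITH cte AS ("))
```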
Ensure `DEBUG_ENABLED = 'N'` is set. The cleanup cell at the end of the notebook will automatically delete intermediate files.
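Conceptually, the cleanup cell might look like the sketch below, which removes files matching the intermediate-file patterns listed earlier. This is an illustrative stand-in, not the notebook's actual code.

```python
import glob
import os

# Patterns for intermediate files, taken from the output-files list above.
INTERMEDIATE_PATTERNS = [
    "output_explain_*.txt",
    "output_explain_cost_*.txt",
    "output_performance_judgment_log_*.txt",
]

def cleanup_intermediate_files(output_dir: str, debug_enabled: str = "N") -> int:
    """Delete intermediate files unless DEBUG_ENABLED is 'Y'.
    Illustrative sketch of the cleanup cell, not the actual implementation."""
    if debug_enabled == "Y":
        return 0
    removed = 0
    for pattern in INTERMEDIATE_PATTERNS:
        for path in glob.glob(os.path.join(output_dir, pattern)):
            os.remove(path)
            removed += 1
    return removed

# Demo with a temporary file standing in for a real intermediate file.
os.makedirs("output", exist_ok=True)
open(os.path.join("output", "output_explain_20240101.txt"), "w").close()
print(cleanup_intermediate_files("output"))
```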
```bash
# Run tests
pytest

# Single test
pytest tests/test_profiler.py -v

# Lint
ruff check src/ tests/

# Type check
mypy src/
```

- Added `notebooks/main_full.py` as recommended entry point
- Load full analysis logic via `%run`
- `SKIP_AUTO_CLEANUP` flag for proper cleanup timing control
- Fallback feature for incomplete SQL generation
- Automatic intermediate file deletion when `DEBUG_ENABLED='N'`
- Split 20,000-line single file into 20+ modules
- Type-safe configuration with dataclasses
- Strategy Pattern for LLM client abstraction
- Added pytest unit tests
- Single file implementation
- Basic bottleneck analysis and optimization
MIT