This pipeline transforms raw IMD (Indian Meteorological Department) Best Tracks cyclone data into a comprehensive dataset suitable for Large Language Model (LLM) training, specifically designed for cyclone intensity prediction and tracking using transformers.
Original Dataset: c08063_Best Tracks__Data (1982-2024).xls
Final Output: LLM-ready dataset with rich text context
Time Period: 1982-2024 (43 years)
Data Retention: 8,070 records (86.6% retention rate)
```
Cyclone_Processing_Pipeline/
│
├── data/
│   └── c08063_Best Tracks__Data (1982-2024).xls   # Original IMD data
│
├── scripts/
│   ├── 00_Data_Cleaning.py                # Initial data cleaning
│   ├── final_cleanup.py                   # Final cleanup and validation
│   ├── 01_Data_Preprocessing_and_EDA.py   # EDA and analysis
│   ├── llm_data_cleaning.py               # LLM-focused data extraction
│   ├── analyze_text_patterns.py           # Text pattern analysis
│   └── analyze_llm_dataset.py             # LLM dataset analysis
│
├── output/
│   └── [Generated output files will be here]
│
├── logs/
│   └── pipeline_YYYYMMDD_HHMMSS.log       # Pipeline execution logs
│
├── README.md                              # This file
└── LLM_Data_Cleaning_Methodology.md       # Detailed methodology
```
- Python 3.8+
- Required packages: pandas, numpy, matplotlib, seaborn, openpyxl
- Virtual environment (recommended)
```bash
# Clone or navigate to the project directory
cd Cyclone_Processing_Pipeline

# Install required packages
pip install pandas numpy matplotlib seaborn openpyxl

# Option 1: Run complete pipeline automatically
python run_pipeline.py
# All output will be saved to logs/pipeline_YYYYMMDD_HHMMSS.log

# Option 2: Run individual steps
cd scripts

# Step 1: Initial data cleaning
python 00_Data_Cleaning.py

# Step 2: Final cleanup and validation
python final_cleanup.py

# Step 3: EDA and analysis
python 01_Data_Preprocessing_and_EDA.py

# Step 4: LLM data extraction
python llm_data_cleaning.py

# Step 5: Text pattern analysis
python analyze_llm_dataset.py
```

Objective: Clean the raw Excel data while preserving maximum information
Key Features:
- Multi-sheet Excel processing (43 sheets, one per year)
- Header standardization across different years
- Data type validation and cleaning
- Time format standardization (HHMM)
- Geographic coordinate validation
- Meteorological parameter validation
Output: cleaned_cyclone_data.csv
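The sheet-by-sheet processing and header standardization described above can be sketched roughly as follows. This is a minimal illustration, not the actual `00_Data_Cleaning.py` logic; the header variants in `rename_map` and the in-memory "sheets" are assumptions standing in for `pd.read_excel(path, sheet_name=None)` on the real workbook:

```python
import pandas as pd

def standardize_headers(df):
    # Map hypothetical header variants seen across years to canonical names.
    rename_map = {"LAT": "Latitude", "LONG": "Longitude", "LON": "Longitude"}
    return df.rename(columns=lambda c: rename_map.get(str(c).strip(), str(c).strip()))

def combine_sheets(sheets):
    # sheets: dict of {sheet_name: DataFrame}, as returned by
    # pd.read_excel(path, sheet_name=None) for a multi-sheet workbook.
    frames = []
    for year, df in sheets.items():
        df = standardize_headers(df)
        df["Year"] = year  # one sheet per year, so the sheet name is the year
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# Toy "sheets" standing in for two years of the Excel workbook
sheets = {
    "1982": pd.DataFrame({"LAT": [15.0], "LONG": [85.5]}),
    "1983": pd.DataFrame({"Latitude": [12.3], "LON": [88.0]}),
}
combined = combine_sheets(sheets)
print(combined.columns.tolist())
```

The same pattern scales to all 43 sheets: each sheet is normalized independently before concatenation, which keeps memory usage bounded.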
Objective: Remove unwanted columns and finalize dataset structure
Key Features:
- Remove duplicate and corrupted columns
- Reorder columns in logical sequence
- Final data quality validation
- Duplicate row removal
- Dataset summary generation
Output: cleaned_cyclone_data_final.csv
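A minimal sketch of this cleanup step (duplicate-row removal plus column reordering). The canonical column order below is inferred from the dataset schema listed later in this README, not taken from `final_cleanup.py` itself:

```python
import pandas as pd

# Assumed logical column order, based on the schema described in this README.
ORDER = ["Year", "Name", "Basin", "Date", "Time",
         "Latitude", "Longitude", "Grade", "Wind_Speed", "Central_Pressure"]

def final_cleanup(df):
    # Drop exact duplicate rows, then move known columns into ORDER,
    # keeping any remaining columns at the end.
    df = df.drop_duplicates().reset_index(drop=True)
    ordered = [c for c in ORDER if c in df.columns]
    extras = [c for c in df.columns if c not in ordered]
    return df[ordered + extras]

df = pd.DataFrame({"Latitude": [15.0, 15.0], "Year": [1982, 1982],
                   "Name": ["TEST", "TEST"]})
clean = final_cleanup(df)
print(clean.columns.tolist(), len(clean))
```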
Objective: Comprehensive exploratory data analysis
Key Features:
- Temporal analysis (yearly, monthly patterns)
- Geographic analysis (basin distribution)
- Intensity analysis (grade distribution)
- Correlation analysis
- Visualization generation
- Data quality assessment
Output: Multiple analysis plots and statistics
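As a rough illustration of the temporal-analysis portion, yearly record counts and mean intensity can be computed with a simple groupby. A real run would load `cleaned_cyclone_data_final.csv`; the toy frame here is only for demonstration:

```python
import pandas as pd

# Toy stand-in for the cleaned dataset
df = pd.DataFrame({"Year": [1982, 1982, 1983, 1984],
                   "Wind_Speed": [35, 45, 55, 25]})

yearly_counts = df.groupby("Year").size()            # records per year
mean_wind = df.groupby("Year")["Wind_Speed"].mean()  # mean wind speed per year
print(yearly_counts.to_dict())
print(mean_wind.to_dict())
```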
Objective: Create LLM-ready dataset with rich text context
Key Features:
- Text annotation extraction and classification
- Merged content preservation
- Context text generation
- Rich feature engineering
- Multi-format data preservation
Output: llm_cyclone_dataset.csv (8,070 records, 23 columns)
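The context-generation idea can be sketched as combining structured fields and any text annotations into one natural-language string per record. Field names follow the schema in this README; the wording, the example record, and the annotation format are illustrative assumptions, not the script's actual template:

```python
def build_context(row):
    # Combine structured fields into a natural-language sentence,
    # appending any text annotations as notes.
    parts = [f"Cyclone {row['Name']} ({row['Year']}) at "
             f"{row['Latitude']}N, {row['Longitude']}E with grade {row['Grade']}."]
    if row.get("Text_Annotations"):
        parts.append(f"Notes: {row['Text_Annotations']}")
    return " ".join(parts)

# Illustrative record (annotation format is a hypothetical example)
row = {"Name": "PHAILIN", "Year": 2013, "Latitude": 18.0, "Longitude": 85.0,
       "Grade": "VSCS", "Text_Annotations": "CROSSING_EVENT: crossed the coast"}
context = build_context(row)
print(context)
```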
Objective: Analyze text patterns in specific sheets (2009, 2024)
Key Features:
- Text pattern identification
- Merged content analysis
- Annotation type classification
- Pattern statistics generation
Output: text_patterns_analysis.csv, merged_content_analysis.csv
Objective: Comprehensive analysis of the final LLM dataset
Key Features:
- Rich text content analysis
- Context length statistics
- Annotation type distribution
- Visualization generation
- Training example extraction
Output: Multiple analysis files and visualizations
| Metric | Value | Description |
|---|---|---|
| Data Retention Rate | 86.6% | 8,070 out of 8,928 original rows |
| Text Annotation Coverage | 8.1% | 651 records with text annotations |
| Merged Content Coverage | 3.2% | 258 records with merged content |
| Average Context Length | 155.3 chars | Rich context for LLM training |
| Annotation Types | 5 | CROSSING_EVENT, WEAKENING_EVENT, etc. |
| Time Period | 43 years | 1982-2024 complete coverage |
| Unique Cyclones | 115 | Diverse cyclone patterns |
- Basic Cyclone Data: Year, Name, Basin, Date, Time
- Geographic Data: Latitude, Longitude
- Meteorological Data: Grade, Wind_Speed, Central_Pressure, CI_Number
- Derived Features: Pressure_Drop, Outermost_Isobar, Size
- Text_Annotations: Structured text annotations with type classification
- Annotation_Count: Number of annotations per record
- Merged_Content: Multi-column text content
- Has_Merged_Content: Boolean flag for merged content
- Context_Text: Rich natural language context
- Context_Length: Character count for context richness
- CROSSING_EVENT: 235 instances (landfall descriptions)
- WEAKENING_EVENT: 295 instances (intensity changes)
- COORDINATE_REFERENCE: 78 instances (geographic details)
- TIME_REFERENCE: 41 instances (temporal information)
- LOCATION_REFERENCE: 8 instances (place names)
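The distribution above can be reproduced from the dataset with a simple tally. This sketch uses a toy frame and assumes the five type labels appear verbatim inside `Text_Annotations`:

```python
import pandas as pd

TYPES = ["CROSSING_EVENT", "WEAKENING_EVENT", "COORDINATE_REFERENCE",
         "TIME_REFERENCE", "LOCATION_REFERENCE"]

def annotation_counts(df):
    # Count records whose Text_Annotations mention each type label.
    return {t: int(df["Text_Annotations"].str.contains(t, na=False).sum())
            for t in TYPES}

# Toy stand-in for output/llm_cyclone_dataset.csv
df = pd.DataFrame({"Text_Annotations": [
    "CROSSING_EVENT: crossed coast", None, "WEAKENING_EVENT: weakened into DD"]})
counts = annotation_counts(df)
print(counts)
```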
- Input: Historical cyclone data + text descriptions
- Output: Predicted intensity changes (wind speed, pressure)
- Model: Transformer with sequence modeling
- Input: Current cyclone state + historical path
- Output: Predicted path coordinates and landfall location
- Model: Transformer with geographic attention
- Input: Structured cyclone data
- Output: Natural language cyclone reports
- Model: Transformer with text generation capabilities
- Input: Historical cyclone descriptions
- Output: Structured meteorological insights
- Model: Transformer with information extraction
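As a rough illustration of the intensity-prediction framing, consecutive records of the same cyclone can be paired into (input text, next-intensity) training examples. This is a hedged sketch under the README's schema, not the project's actual data loader:

```python
import pandas as pd

def make_pairs(df):
    # Pair each record with the next record's wind speed for the same cyclone.
    pairs = []
    for _, grp in df.groupby("Name"):
        grp = grp.reset_index(drop=True)
        for i in range(len(grp) - 1):
            inp = (f"{grp.loc[i, 'Name']} at {grp.loc[i, 'Latitude']}N, "
                   f"{grp.loc[i, 'Longitude']}E, wind {grp.loc[i, 'Wind_Speed']} kt")
            pairs.append((inp, grp.loc[i + 1, "Wind_Speed"]))
    return pairs

# Toy track with three consecutive fixes for one cyclone
df = pd.DataFrame({"Name": ["A", "A", "A"], "Latitude": [10, 11, 12],
                   "Longitude": [85, 85.5, 86], "Wind_Speed": [35, 45, 55]})
pairs = make_pairs(df)
print(len(pairs), pairs[0])
```

For the full dataset, `Context_Text` would replace the hand-built input string, so the text annotations flow into the model input as well.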
- `output/cleaned_cyclone_data.csv` (564KB, 7,715 records)
  - Purpose: Initial cleaned dataset after basic data cleaning
  - Significance: First step in data quality improvement; contains standardized headers and cleaned data types
  - Use Case: Baseline for comparison with final dataset
- `output/cleaned_cyclone_data_final.csv` (491KB, 7,715 records)
  - Purpose: Final cleaned dataset with optimized structure
  - Significance: Removes duplicate/corrupted columns, reorders for logical sequence
  - Use Case: Primary dataset for traditional cyclone analysis and modeling
- `output/llm_cyclone_dataset.csv` (2.1MB, 8,070 records)
  - Purpose: LLM-ready dataset with rich text context
  - Significance: Core training dataset for transformer models with structured + text features
  - Use Case: Primary dataset for cyclone intensity prediction and tracking with transformers
- `output/rich_text_records.csv` (351KB, 668 records)
  - Purpose: Subset of records containing rich text annotations
  - Significance: High-value training examples with natural language context
  - Use Case: Fine-tuning models on text-rich cyclone events
- `output/training_examples.csv` (26KB, 102 records)
  - Purpose: Curated sample of diverse training examples
  - Significance: Representative examples showing different annotation types and patterns
  - Use Case: Model validation, testing, and demonstration
- `output/dataset_summary.csv` (247B, 3 records)
  - Purpose: Statistical summary of the complete dataset
  - Significance: Quick overview of data quality, coverage, and distribution
  - Use Case: Data quality assessment and reporting
- `output/text_patterns_analysis.csv` (17KB, 688 records)
  - Purpose: Detailed analysis of text patterns in specific sheets (2009, 2024)
  - Significance: Understanding text annotation patterns and classification
  - Use Case: Improving text extraction algorithms and pattern recognition
- `output/merged_content_analysis.csv` (9.6KB, 64 records)
  - Purpose: Analysis of merged cell content across sheets
  - Significance: Preserves complex multi-column text information
  - Use Case: Enhanced context generation for LLM training
- `output/processed_cyclone_data.csv` (786KB, 7,715 records)
  - Purpose: Intermediate processed dataset with enhanced features
  - Significance: Contains derived features and enhanced data structure
  - Use Case: Feature engineering and advanced analysis
- `logs/pipeline_YYYYMMDD_HHMMSS.log`: Complete pipeline execution logs with timestamps
- `cyclone_eda_plots.png`: EDA visualizations
- `correlation_heatmap.png`: Correlation analysis
- `time_series.png`: Temporal analysis
- `geographic_distribution.png`: Geographic analysis
- `llm_dataset_analysis_fixed.png`: LLM dataset analysis
- pandas: Data manipulation and cleaning
- numpy: Numerical operations
- matplotlib/seaborn: Data visualization
- openpyxl: Excel file processing
- re: Regular expressions for text processing
- Sheet-by-Sheet Processing: Handle each year's data individually
- Header Detection: Automatically identify and standardize headers
- Text Extraction: Preserve all descriptive text content
- Data Standardization: Consistent column names and data types
- Context Generation: Create rich context for LLM training
- Quality Validation: Ensure data integrity and completeness
- Memory Efficient: Process sheets individually to manage memory
- Scalable: Methodology applicable to larger datasets
- Reproducible: Consistent results across different runs
- Logging: Complete execution logs saved for debugging and reproducibility
- ✅ 86.6% Data Retention: Preserved most of the original data
- ✅ 100% Text Content Preserved: All annotations and merged content extracted
- ✅ 5 Annotation Types: Comprehensive text classification system
- ✅ Rich Context: Average 155.3 characters of context per record
- ✅ 43-Year Coverage: Complete temporal coverage (1982-2024)
- ✅ 115 Unique Cyclones: Comprehensive cyclone diversity

- ✅ Structured + Text Data: Well suited for transformer model training
- ✅ Rich Context: Natural language descriptions for each record
- ✅ Diverse Patterns: Multiple annotation types and content styles
- ✅ Temporal Patterns: Long-term cyclone evolution data
- ✅ Geographic Coverage: Multiple basins and regions
- Advanced Text Classification: Machine learning-based annotation classification
- Semantic Analysis: Extract deeper meaning from text descriptions
- Multi-Modal Features: Incorporate satellite imagery data
- Real-Time Processing: Stream processing for live cyclone data
- Cross-Dataset Integration: Combine with other meteorological datasets
- Fine-tune Large Language Models: Specialized cyclone prediction models
- Develop Chatbots: Interactive cyclone information systems
- Automated Report Generation: Natural language cyclone reports
- Knowledge Discovery: Extract patterns from historical data
- Educational Tools: AI-powered cyclone learning systems
```python
import pandas as pd

# Load the LLM dataset
df = pd.read_csv("output/llm_cyclone_dataset.csv")

# Filter for intensity-related annotations
intensity_data = df[df['Text_Annotations'].str.contains('WEAKENING_EVENT', na=False)]
# Use for transformer training
# Input: Context_Text + structured features
# Output: Next intensity prediction

# Filter for crossing events
tracking_data = df[df['Text_Annotations'].str.contains('CROSSING_EVENT', na=False)]
# Use for path prediction
# Input: Historical coordinates + text context
# Output: Next position prediction

# Prepare training data from records with substantial context
training_data = df[df['Context_Length'] > 50].copy()
# Use for transformer fine-tuning
# This dataset is ready for cyclone prediction model training
```

This pipeline is designed to be:
- Reproducible: All steps are documented and automated
- Extensible: Easy to add new processing steps
- Maintainable: Clear code structure and documentation
- Scalable: Applicable to larger datasets
For questions or improvements, please refer to the detailed methodology in LLM_Data_Cleaning_Methodology.md.
This cyclone data processing pipeline was developed to address the challenges of transforming raw meteorological data into a format suitable for modern transformer-based models. The pipeline incorporates advanced text extraction techniques, data cleaning methodologies, and context generation to create a comprehensive dataset for cyclone intensity prediction and tracking.
This project is developed for cyclone prediction research using transformers. The methodology and scripts are provided for research and educational purposes.
Total Processing Time: ~15 minutes
Data Quality Score: 86.6% retention with 100% text preservation
LLM Training Readiness: Excellent (8,070 rich context records)
This pipeline successfully demonstrates the potential of combining structured meteorological data with natural language processing techniques for advanced cyclone prediction systems.