SeedheCode-AI/Cyclone_Processing_Pipeline
Cyclone Data Processing Pipeline for LLM Training

πŸ“‹ Project Overview

This pipeline transforms raw IMD (India Meteorological Department) Best Track cyclone data into a comprehensive dataset suitable for Large Language Model (LLM) training, specifically designed for cyclone intensity prediction and tracking with transformer models.

Original Dataset: c08063_Best Tracks__Data (1982-2024).xls
Final Output: LLM-ready dataset with rich text context
Time Period: 1982-2024 (43 years)
Data Retention: 8,070 records (86.6% retention rate)


πŸ—‚οΈ Directory Structure

Cyclone_Processing_Pipeline/
β”‚
β”œβ”€β”€ data/
β”‚   └── c08063_Best Tracks__Data (1982-2024).xls  # Original IMD data
β”‚
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ 00_Data_Cleaning.py                    # Initial data cleaning
β”‚   β”œβ”€β”€ final_cleanup.py                       # Final cleanup and validation
β”‚   β”œβ”€β”€ 01_Data_Preprocessing_and_EDA.py       # EDA and analysis
β”‚   β”œβ”€β”€ llm_data_cleaning.py                   # LLM-focused data extraction
β”‚   β”œβ”€β”€ analyze_text_patterns.py               # Text pattern analysis
β”‚   └── analyze_llm_dataset.py                 # LLM dataset analysis
β”‚
β”œβ”€β”€ output/
β”‚   └── [Generated output files will be here]
β”‚
β”œβ”€β”€ logs/
β”‚   └── pipeline_YYYYMMDD_HHMMSS.log          # Pipeline execution logs
β”‚
β”œβ”€β”€ README.md                                  # This file
└── LLM_Data_Cleaning_Methodology.md          # Detailed methodology

πŸš€ Quick Start

Prerequisites

  • Python 3.8+
  • Required packages: pandas, numpy, matplotlib, seaborn, openpyxl
  • Virtual environment (recommended)

Installation

# Clone or navigate to the project directory
cd Cyclone_Processing_Pipeline

# Install required packages
pip install pandas numpy matplotlib seaborn openpyxl

Running the Pipeline

# Option 1: Run complete pipeline automatically
python run_pipeline.py
# All output will be saved to logs/pipeline_YYYYMMDD_HHMMSS.log

# Option 2: Run individual steps
cd scripts

# Step 1: Initial data cleaning
python 00_Data_Cleaning.py

# Step 2: Final cleanup and validation
python final_cleanup.py

# Step 3: EDA and analysis
python 01_Data_Preprocessing_and_EDA.py

# Step 4: LLM data extraction
python llm_data_cleaning.py

# Step 5: Text pattern analysis
python analyze_text_patterns.py

# Step 6: LLM dataset analysis
python analyze_llm_dataset.py

πŸ“Š Processing Pipeline Overview

Phase 1: Initial Data Cleaning (00_Data_Cleaning.py)

Objective: Clean the raw Excel data while preserving maximum information

Key Features:

  • Multi-sheet Excel processing (43 sheets, one per year)
  • Header standardization across different years
  • Data type validation and cleaning
  • Time format standardization (HHMM)
  • Geographic coordinate validation
  • Meteorological parameter validation

Output: cleaned_cyclone_data.csv
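The multi-sheet processing in this phase can be sketched with pandas as below. This is a minimal illustration, assuming the sheets have already been read with `pd.read_excel(path, sheet_name=None)`; `combine_sheets` is a hypothetical helper name, and the actual logic lives in scripts/00_Data_Cleaning.py.

```python
import pandas as pd

def combine_sheets(sheets):
    """Merge a {sheet_name: DataFrame} mapping, as returned by
    pd.read_excel(path, sheet_name=None), into one DataFrame,
    tagging each row with its source sheet (the year)."""
    frames = []
    for name, frame in sheets.items():
        tagged = frame.copy()
        tagged["Year_Sheet"] = name  # keep provenance for later validation
        frames.append(tagged)
    return pd.concat(frames, ignore_index=True)
```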

Phase 2: Final Cleanup (final_cleanup.py)

Objective: Remove unwanted columns and finalize dataset structure

Key Features:

  • Remove duplicate and corrupted columns
  • Reorder columns in logical sequence
  • Final data quality validation
  • Duplicate row removal
  • Dataset summary generation

Output: cleaned_cyclone_data_final.csv
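The de-duplication and column reordering in this phase can be sketched roughly as follows. `LOGICAL_ORDER` here is an illustrative subset of the dataset's columns, not the exact list used by final_cleanup.py.

```python
import pandas as pd

# Illustrative subset of the dataset's columns in logical order;
# the exact list used by final_cleanup.py may differ.
LOGICAL_ORDER = ["Year", "Name", "Basin", "Date", "Time",
                 "Latitude", "Longitude", "Grade", "Wind_Speed",
                 "Central_Pressure"]

def finalize(df):
    """Drop exact duplicate rows, then move known columns to the front."""
    df = df.drop_duplicates().reset_index(drop=True)
    ordered = [c for c in LOGICAL_ORDER if c in df.columns]
    rest = [c for c in df.columns if c not in ordered]
    return df[ordered + rest]
```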

Phase 3: EDA and Analysis (01_Data_Preprocessing_and_EDA.py)

Objective: Comprehensive exploratory data analysis

Key Features:

  • Temporal analysis (yearly, monthly patterns)
  • Geographic analysis (basin distribution)
  • Intensity analysis (grade distribution)
  • Correlation analysis
  • Visualization generation
  • Data quality assessment

Output: Multiple analysis plots and statistics
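A flavour of the temporal and intensity summaries this phase produces, sketched on the dataset's own column names ("Year", "Grade"); the exact statistics computed by 01_Data_Preprocessing_and_EDA.py may differ.

```python
import pandas as pd

def yearly_summary(df):
    """Records per year and the share of each intensity grade --
    the kind of temporal/intensity summary Phase 3 reports."""
    records_per_year = df.groupby("Year").size()
    grade_share = df["Grade"].value_counts(normalize=True)
    return records_per_year, grade_share
```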

Phase 4: LLM Data Extraction (llm_data_cleaning.py)

Objective: Create LLM-ready dataset with rich text context

Key Features:

  • Text annotation extraction and classification
  • Merged content preservation
  • Context text generation
  • Rich feature engineering
  • Multi-format data preservation

Output: llm_cyclone_dataset.csv (8,070 records, 23 columns)
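The context text generation step can be sketched as below. `record` is a plain dict keyed by the dataset's column names; the sentence template (and the knot unit for Wind_Speed) are assumptions — the real generation logic lives in scripts/llm_data_cleaning.py.

```python
def build_context_text(record):
    """Assemble a natural-language context string from one record."""
    parts = [
        f"Cyclone {record.get('Name', 'UNNAMED')} in {record.get('Year')}",
        f"located at {record.get('Latitude')}N, {record.get('Longitude')}E",
    ]
    if record.get("Wind_Speed"):
        parts.append(f"with winds of {record['Wind_Speed']} knots")
    if record.get("Text_Annotations"):
        parts.append(f"notes: {record['Text_Annotations']}")
    return ", ".join(parts) + "."
```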

Phase 5: Text Pattern Analysis (analyze_text_patterns.py)

Objective: Analyze text patterns in specific sheets (2009, 2024)

Key Features:

  • Text pattern identification
  • Merged content analysis
  • Annotation type classification
  • Pattern statistics generation

Output: text_patterns_analysis.csv, merged_content_analysis.csv
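Annotation type classification of this kind is typically keyword-driven. The sketch below uses the five types the pipeline reports, but the regexes are illustrative assumptions — the patterns actually used by analyze_text_patterns.py may differ.

```python
import re

# Illustrative keyword patterns for the five annotation types reported
# by the pipeline; first match wins.
ANNOTATION_PATTERNS = {
    "CROSSING_EVENT": re.compile(r"\bcross(ed|ing)?\b|\blandfall\b", re.I),
    "WEAKENING_EVENT": re.compile(r"\bweaken(ed|ing)?\b|\bintensif", re.I),
    "COORDINATE_REFERENCE": re.compile(r"\d+\.?\d*\s*[NS]\b.*\d+\.?\d*\s*[EW]\b", re.I),
    "TIME_REFERENCE": re.compile(r"\b\d{4}\s*UTC\b|\b\d{1,2}(st|nd|rd|th)\b", re.I),
    "LOCATION_REFERENCE": re.compile(r"\bcoast\b|\bnear\b", re.I),
}

def classify_annotation(text):
    """Return the first matching annotation type, or None."""
    for label, pattern in ANNOTATION_PATTERNS.items():
        if pattern.search(text):
            return label
    return None
```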

Phase 6: LLM Dataset Analysis (analyze_llm_dataset.py)

Objective: Comprehensive analysis of the final LLM dataset

Key Features:

  • Rich text content analysis
  • Context length statistics
  • Annotation type distribution
  • Visualization generation
  • Training example extraction

Output: Multiple analysis files and visualizations


πŸ“ˆ Data Quality Metrics

| Metric | Value | Description |
| --- | --- | --- |
| Data Retention Rate | 86.6% | 8,070 out of 8,928 original rows |
| Text Annotation Coverage | 8.1% | 651 records with text annotations |
| Merged Content Coverage | 3.2% | 258 records with merged content |
| Average Context Length | 155.3 chars | Rich context for LLM training |
| Annotation Types | 5 | CROSSING_EVENT, WEAKENING_EVENT, etc. |
| Time Period | 43 years | 1982-2024 complete coverage |
| Unique Cyclones | 115 | Diverse cyclone patterns |

🎯 LLM Training Features

Structured Features

  1. Basic Cyclone Data: Year, Name, Basin, Date, Time
  2. Geographic Data: Latitude, Longitude
  3. Meteorological Data: Grade, Wind_Speed, Central_Pressure, CI_Number
  4. Derived Features: Pressure_Drop, Outermost_Isobar, Size

Text Features

  1. Text_Annotations: Structured text annotations with type classification
  2. Annotation_Count: Number of annotations per record
  3. Merged_Content: Multi-column text content
  4. Has_Merged_Content: Boolean flag for merged content
  5. Context_Text: Rich natural language context
  6. Context_Length: Character count for context richness
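The flag and length features above can be derived mechanically from the text fields. A sketch on a plain dict; the semicolon separator assumed for Text_Annotations is an assumption about the stored format.

```python
def add_text_features(record):
    """Derive Has_Merged_Content, Annotation_Count, and Context_Length
    from the raw text fields of one record."""
    merged = record.get("Merged_Content") or ""
    context = record.get("Context_Text") or ""
    annotations = record.get("Text_Annotations") or ""
    record["Has_Merged_Content"] = bool(merged.strip())
    # assumes annotations are stored semicolon-separated
    record["Annotation_Count"] = sum(1 for a in annotations.split(";") if a.strip())
    record["Context_Length"] = len(context)
    return record
```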

Annotation Types Discovered

  • CROSSING_EVENT: 235 instances (landfall descriptions)
  • WEAKENING_EVENT: 295 instances (intensity changes)
  • COORDINATE_REFERENCE: 78 instances (geographic details)
  • TIME_REFERENCE: 41 instances (temporal information)
  • LOCATION_REFERENCE: 8 instances (place names)

πŸš€ Applications for Cyclone Prediction

1. Intensity Prediction

  • Input: Historical cyclone data + text descriptions
  • Output: Predicted intensity changes (wind speed, pressure)
  • Model: Transformer with sequence modeling

2. Path Tracking

  • Input: Current cyclone state + historical path
  • Output: Predicted path coordinates and landfall location
  • Model: Transformer with geographic attention

3. Natural Language Generation

  • Input: Structured cyclone data
  • Output: Natural language cyclone reports
  • Model: Transformer with text generation capabilities

4. Knowledge Extraction

  • Input: Historical cyclone descriptions
  • Output: Structured meteorological insights
  • Model: Transformer with information extraction
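For any of the applications above, each record must be serialized into an input/target pair. A sketch only: the field names match the dataset, but the prompt template and the `Next_Wind_Speed` label column are hypothetical, not part of any pipeline script.

```python
def record_to_prompt(record):
    """Format one dataset row as an (input, target) pair for fine-tuning."""
    prompt = (
        f"Cyclone state on {record['Date']} {record['Time']} UTC: "
        f"position ({record['Latitude']}, {record['Longitude']}), "
        f"wind {record['Wind_Speed']} kt, pressure {record['Central_Pressure']} hPa. "
        f"Context: {record['Context_Text']} "
        "Predict the intensity at the next observation."
    )
    target = str(record.get("Next_Wind_Speed", ""))  # hypothetical label column
    return prompt, target
```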

πŸ“ Output Files

Main Datasets (3 CSV Files)

  1. output/cleaned_cyclone_data.csv (564KB, 7,715 records)

    • Purpose: Initial cleaned dataset after basic data cleaning
    • Significance: First step in data quality improvement, contains standardized headers and cleaned data types
    • Use Case: Baseline for comparison with final dataset
  2. output/cleaned_cyclone_data_final.csv (491KB, 7,715 records)

    • Purpose: Final cleaned dataset with optimized structure
    • Significance: Removes duplicate/corrupted columns, reorders for logical sequence
    • Use Case: Primary dataset for traditional cyclone analysis and modeling
  3. output/llm_cyclone_dataset.csv (2.1MB, 8,070 records)

    • Purpose: LLM-ready dataset with rich text context
    • Significance: Core training dataset for transformer models with structured + text features
    • Use Case: Primary dataset for cyclone intensity prediction and tracking with transformers

Analysis Files (6 CSV Files)

  1. output/rich_text_records.csv (351KB, 668 records)

    • Purpose: Subset of records containing rich text annotations
    • Significance: High-value training examples with natural language context
    • Use Case: Fine-tuning models on text-rich cyclone events
  2. output/training_examples.csv (26KB, 102 records)

    • Purpose: Curated sample of diverse training examples
    • Significance: Representative examples showing different annotation types and patterns
    • Use Case: Model validation, testing, and demonstration
  3. output/dataset_summary.csv (247B, 3 records)

    • Purpose: Statistical summary of the complete dataset
    • Significance: Quick overview of data quality, coverage, and distribution
    • Use Case: Data quality assessment and reporting
  4. output/text_patterns_analysis.csv (17KB, 688 records)

    • Purpose: Detailed analysis of text patterns in specific sheets (2009, 2024)
    • Significance: Understanding text annotation patterns and classification
    • Use Case: Improving text extraction algorithms and pattern recognition
  5. output/merged_content_analysis.csv (9.6KB, 64 records)

    • Purpose: Analysis of merged cell content across sheets
    • Significance: Preserves complex multi-column text information
    • Use Case: Enhanced context generation for LLM training
  6. output/processed_cyclone_data.csv (786KB, 7,715 records)

    • Purpose: Intermediate processed dataset with enhanced features
    • Significance: Contains derived features and enhanced data structure
    • Use Case: Feature engineering and advanced analysis

Log Files

  • logs/pipeline_YYYYMMDD_HHMMSS.log: Complete pipeline execution logs with timestamps
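The timestamped log file naming can be reproduced with the standard library. A minimal sketch — run_pipeline.py may configure logging differently.

```python
import logging
from datetime import datetime
from pathlib import Path

def setup_logger(log_dir="logs"):
    """Create a pipeline_YYYYMMDD_HHMMSS.log file and route logging to it."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = Path(log_dir) / f"pipeline_{stamp}.log"
    logging.basicConfig(
        filename=log_path,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    return log_path
```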

Visualizations

  • cyclone_eda_plots.png: EDA visualizations
  • correlation_heatmap.png: Correlation analysis
  • time_series.png: Temporal analysis
  • geographic_distribution.png: Geographic analysis
  • llm_dataset_analysis_fixed.png: LLM dataset analysis

πŸ”§ Technical Details

Key Libraries Used

  • pandas: Data manipulation and cleaning
  • numpy: Numerical operations
  • matplotlib/seaborn: Data visualization
  • openpyxl: Excel file processing
  • re: Regular expressions for text processing

Processing Pipeline

  1. Sheet-by-Sheet Processing: Handle each year's data individually
  2. Header Detection: Automatically identify and standardize headers
  3. Text Extraction: Preserve all descriptive text content
  4. Data Standardization: Consistent column names and data types
  5. Context Generation: Create rich context for LLM training
  6. Quality Validation: Ensure data integrity and completeness
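Step 2 (header detection) is the trickiest, since yearly sheets put the header row at different offsets. A hypothetical implementation under that assumption; the keywords and threshold used by the real scripts may differ.

```python
import pandas as pd

# Illustrative header keywords expected in IMD sheets.
EXPECTED_HEADERS = {"name", "date", "time", "lat", "long", "grade"}

def find_header_row(frame, max_scan=10):
    """Scan the first rows of a raw sheet for the one that looks like a
    header: a row containing at least three expected header words."""
    for i in range(min(max_scan, len(frame))):
        cells = {str(c).strip().lower() for c in frame.iloc[i]}
        if len(cells & EXPECTED_HEADERS) >= 3:
            return i
    return None
```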

Performance Considerations

  • Memory Efficient: Process sheets individually to manage memory
  • Scalable: Methodology applicable to larger datasets
  • Reproducible: Consistent results across different runs
  • Logging: Complete execution logs saved for debugging and reproducibility

🎯 Success Metrics

Data Quality Achievements

βœ… 86.6% Data Retention: Preserved most original data
βœ… 100% Text Content Preserved: All annotations and merged content extracted
βœ… 5 Annotation Types: Comprehensive text classification system
βœ… Rich Context: Average 155.3 characters of context per record
βœ… 43-Year Coverage: Complete temporal coverage (1982-2024)
βœ… 115 Unique Cyclones: Comprehensive cyclone diversity

LLM Training Readiness

βœ… Structured + Text Data: Perfect for transformer model training
βœ… Rich Context: Natural language descriptions for each record
βœ… Diverse Patterns: Multiple annotation types and content styles
βœ… Temporal Patterns: Long-term cyclone evolution data
βœ… Geographic Coverage: Multiple basins and regions


πŸš€ Future Enhancements

Potential Improvements

  1. Advanced Text Classification: Machine learning-based annotation classification
  2. Semantic Analysis: Extract deeper meaning from text descriptions
  3. Multi-Modal Features: Incorporate satellite imagery data
  4. Real-Time Processing: Stream processing for live cyclone data
  5. Cross-Dataset Integration: Combine with other meteorological datasets

LLM Training Applications

  1. Fine-tune Large Language Models: Specialized cyclone prediction models
  2. Develop Chatbots: Interactive cyclone information systems
  3. Automated Report Generation: Natural language cyclone reports
  4. Knowledge Discovery: Extract patterns from historical data
  5. Educational Tools: AI-powered cyclone learning systems

πŸ“ Usage Examples

For Intensity Prediction

import pandas as pd

# Load the LLM dataset
df = pd.read_csv("output/llm_cyclone_dataset.csv")

# Filter for intensity-related annotations
intensity_data = df[df['Text_Annotations'].str.contains('WEAKENING_EVENT', na=False)]

# Use for transformer training
# Input: Context_Text + structured features
# Output: Next intensity prediction

For Path Tracking

# Filter for crossing events (df loaded as in the previous example)
tracking_data = df[df['Text_Annotations'].str.contains('CROSSING_EVENT', na=False)]

# Use for path prediction
# Input: Historical coordinates + text context
# Output: Next position prediction

For Model Training

# Load the complete dataset
df = pd.read_csv("output/llm_cyclone_dataset.csv")

# Prepare training data
training_data = df[df['Context_Length'] > 50].copy()

# Use for transformer fine-tuning
# This dataset is ready for cyclone prediction model training

🀝 Contributing

This pipeline is designed to be:

  • Reproducible: All steps are documented and automated
  • Extensible: Easy to add new processing steps
  • Maintainable: Clear code structure and documentation
  • Scalable: Applicable to larger datasets

For questions or improvements, please refer to the detailed methodology in LLM_Data_Cleaning_Methodology.md.

πŸ‘¨β€πŸ’» Development

This cyclone data processing pipeline was developed to address the challenges of transforming raw meteorological data into a format suitable for modern transformer-based models. The pipeline incorporates advanced text extraction techniques, data cleaning methodologies, and context generation to create a comprehensive dataset for cyclone intensity prediction and tracking.


πŸ“„ License

This project is developed for cyclone prediction research using transformers. The methodology and scripts are provided for research and educational purposes.


πŸ“ž Contact

For questions about the cyclone data processing pipeline or LLM training methodology, please refer to the comprehensive documentation in LLM_Data_Cleaning_Methodology.md.

🎯 Project Achievements

Total Processing Time: ~15 minutes
Data Quality Score: 86.6% retention with 100% text preservation
LLM Training Readiness: Excellent (8,070 rich context records)

This pipeline successfully demonstrates the potential of combining structured meteorological data with natural language processing techniques for advanced cyclone prediction systems.
