HuggingFace Data Processing Pipelines (Modular)

This repository contains modular data processing pipelines that fetch, process, and analyze data from the HuggingFace Hub.

📁 Project Structure

├── config.py                    # Configuration for models pipeline
├── config_datasets.py           # Configuration for datasets pipeline
├── config_papers.py             # Configuration for papers pipeline
├── utils.py                     # Shared utility functions and logging
├── data_fetcher.py              # Data fetching for models
├── data_fetcher_datasets.py     # Data fetching for datasets
├── data_fetcher_papers.py       # Data fetching for papers
├── tag_processor.py             # Tag processing for models
├── tag_processor_datasets.py    # Tag processing for datasets
├── data_processor.py            # Main processing logic for models
├── data_processor_datasets.py   # Main processing logic for datasets
├── data_processor_papers.py     # Semantic taxonomy mapping for papers
├── main.py                      # Models pipeline orchestrator
├── main_datasets.py             # Datasets pipeline orchestrator
├── main_papers.py               # Papers pipeline orchestrator
├── test_pipeline.py             # Integration test for models
├── test_pipeline_datasets.py    # Integration test for datasets
├── test_pipeline_papers.py      # Integration test for papers
├── hub_download.py              # Weekly snapshot downloader
├── integrated_ml_taxonomy.json  # ML taxonomy for papers
├── requirements.txt             # Python dependencies
└── README.md                    # This documentation

🚀 Available Pipelines

1. Models Pipeline (main.py)

Processes HuggingFace model data with feature extraction and categorization.

2. Datasets Pipeline (main_datasets.py)

Processes data about HuggingFace datasets.

3. Papers Pipeline (main_papers.py) ⭐ NEW

Processes academic papers with semantic taxonomy mapping using spaCy NLP.

📄 Papers Pipeline Details

The papers pipeline includes advanced semantic analysis and citation tracking:

  • Data Source: Loads papers from the cfahlgren1/hub-stats dataset (daily_papers.parquet)
  • Semantic Taxonomy: Uses spaCy's en_core_web_lg model for semantic similarity
  • Hierarchical Classification: Maps paper keywords to ML taxonomy:
    • Categories (e.g., Computer Vision, NLP, Deep Learning)
    • Subcategories (e.g., Object Detection, Text Classification, GANs)
    • Topics (e.g., YOLO, BERT, Transformers)
  • Multi-Label Classification: Papers can receive multiple categories when their similarity scores are close (within 90% of the top score); see the sketch after this list
  • Citation Tracking: Fetches citation counts using paperscraper (via DOI and title)
  • Rich Metadata: Preserves all 33+ original columns (authors, GitHub repos, upvotes, etc.)
  • Reports & Analytics: Generates detailed matching reports and statistics
  • Auto-Upload: Uploads results to the HuggingFace repo evijit/paperverse_daily_data
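
A minimal sketch of the similarity matching described above. The names here (classify_keyword, taxonomy_labels) are illustrative, not the pipeline's actual API; the real logic lives in data_processor_papers.py. The defaults mirror the documented config values:

import spacy

nlp = spacy.load("en_core_web_lg")  # large English model with word vectors

def classify_keyword(keyword, taxonomy_labels,
                     threshold=0.55, multi_ratio=0.90, max_k=5):
    """Score a paper keyword against taxonomy labels by vector similarity.
    Keeps labels above `threshold`, then prunes to those within
    `multi_ratio` of the top score, capped at `max_k` results."""
    kw_doc = nlp(keyword)
    scored = [(label, kw_doc.similarity(nlp(label))) for label in taxonomy_labels]
    scored = [(label, score) for label, score in scored if score >= threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if not scored:
        return []
    top = scored[0][1]
    return [(l, s) for l, s in scored if s >= multi_ratio * top][:max_k]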

Papers Pipeline Output

The pipeline generates:

  1. papers_with_semantic_taxonomy.parquet - Full dataset with taxonomy
  2. papers_with_semantic_taxonomy.csv - CSV version
  3. taxonomy_report.txt - Detailed text report
  4. taxonomy_distribution.json - Statistics in JSON format
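
To inspect the main output locally, a short usage sketch with pandas (the exact taxonomy column names follow whatever data_processor_papers.py writes):

import pandas as pd

# Load the enriched dataset and list its columns: the 33+ original
# fields plus the taxonomy columns added by the pipeline.
df = pd.read_parquet("papers_with_semantic_taxonomy.parquet")
print(df.columns.tolist())
print(df.head())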

🔧 Configuration

Key settings in respective config files:

Models (config.py):

  • MODEL_ID_TO_DEBUG: Specific model ID for debugging
  • TAG_MAP: Feature flags and keywords
  • MODEL_SIZE_RANGES: Size categorization thresholds

Papers (config_papers.py):

  • TAXONOMY_FILE_PATH: Path to ML taxonomy JSON
  • SIMILARITY_THRESHOLD: Minimum cosine similarity (default: 0.55)
  • SPACY_MODEL: NLP model to use (default: en_core_web_lg)
  • HF_REPO_ID: Target HuggingFace repository
  • ENABLE_CITATION_FETCHING: Enable/disable citation fetching (default: True)
  • CITATION_BATCH_SIZE: Batch size for progress updates (default: 100)
  • MULTI_CLASS_ENABLED: Allow multiple classifications per paper (default: True)
  • MULTI_CLASS_SCORE_THRESHOLD: Include classes within 90% of top score (default: 0.90)
  • MAX_CLASSIFICATIONS: Maximum classifications per level (default: 5)
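
Put together, the papers settings above correspond to a config_papers.py along these lines (a sketch using only the documented defaults; the real file may define more):

# config_papers.py -- documented defaults, illustrative layout
TAXONOMY_FILE_PATH = "integrated_ml_taxonomy.json"
SIMILARITY_THRESHOLD = 0.55          # minimum cosine similarity for a match
SPACY_MODEL = "en_core_web_lg"       # spaCy model with word vectors
HF_REPO_ID = "evijit/paperverse_daily_data"
ENABLE_CITATION_FETCHING = True      # set False to skip citation lookups
CITATION_BATCH_SIZE = 100            # papers between progress updates
MULTI_CLASS_ENABLED = True           # allow multiple labels per paper
MULTI_CLASS_SCORE_THRESHOLD = 0.90   # keep classes within 90% of top score
MAX_CLASSIFICATIONS = 5              # cap per taxonomy level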

🧪 Testing Individual Modules

You can test each pipeline independently:

# Test models pipeline (small subset)
export TEST_DATA_LIMIT=100
python test_pipeline.py

# Test datasets pipeline (small subset)
export TEST_DATA_LIMIT=100
python test_pipeline_datasets.py

# Test papers pipeline (small subset)
export TEST_DATA_LIMIT=50
python test_pipeline_papers.py

# Run full pipelines
python main.py           # Models
python main_datasets.py  # Datasets
python main_papers.py    # Papers
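
Inside the test scripts, TEST_DATA_LIMIT can be honored with a pattern like this (a sketch; apply_test_limit is a hypothetical helper, only the environment variable name comes from this README):

import os

def apply_test_limit(records):
    """Truncate the working set when TEST_DATA_LIMIT is set,
    so test runs stay small; process everything otherwise."""
    limit = int(os.environ.get("TEST_DATA_LIMIT", "0"))
    return records[:limit] if limit > 0 else records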

📦 Installation

Basic Installation

pip install -r requirements.txt

Papers Pipeline - Additional Setup

The papers pipeline requires the spaCy language model and citation scraper:

# Download the spaCy model (will auto-download if missing)
python -m spacy download en_core_web_lg

# Install paperscraper for citation tracking
pip install paperscraper

Notes:

  • The en_core_web_lg model is ~500MB and will auto-download if not found
  • paperscraper fetches citation counts from Semantic Scholar and Google Scholar
  • Citation fetching can be disabled by setting ENABLE_CITATION_FETCHING = False in config
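
The auto-download behavior noted above is commonly implemented like this (a sketch, not necessarily the pipeline's exact code):

import spacy

def load_spacy_model(name="en_core_web_lg"):
    """Load the spaCy model, downloading it first if it is missing."""
    try:
        return spacy.load(name)
    except OSError:
        from spacy.cli import download
        download(name)  # same effect as `python -m spacy download en_core_web_lg`
        return spacy.load(name)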

☁️ HuggingFace Upload

To enable automatic upload to HuggingFace:

# Set your HuggingFace token
export HF_TOKEN="your_huggingface_token_here"

# Run the papers pipeline
python main_papers.py

The papers pipeline will upload results to: evijit/paperverse_daily_data
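
Under the hood, such an upload can be done with huggingface_hub (a sketch; the repo ID and file name come from this README, while repo_type="dataset" is an assumption about the target repo):

import os
from huggingface_hub import HfApi

# Authenticate with the token from the environment and push one output file.
api = HfApi(token=os.environ["HF_TOKEN"])
api.upload_file(
    path_or_fileobj="papers_with_semantic_taxonomy.parquet",
    path_in_repo="papers_with_semantic_taxonomy.parquet",
    repo_id="evijit/paperverse_daily_data",
    repo_type="dataset",  # assumption: results land in a dataset repo
)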

Getting a HuggingFace Token

  1. Go to https://huggingface.co/settings/tokens
  2. Create a new token with write permissions
  3. Copy the token and set it as an environment variable

🔄 GitHub Actions / CI/CD

For automated runs, add HF_TOKEN to your repository secrets:

  1. Go to repository Settings → Secrets and variables → Actions
  2. Add new secret: HF_TOKEN with your token value
  3. The workflow will automatically upload results
