DataExtractor Pro 🚀

DataExtractor Pro is a high-performance, developer-focused CLI tool designed to bridge the gap between messy, unformatted HuggingFace datasets and actionable training data for LLMs and Machine Learning models.

Stop spending hours writing custom scripts to handle schema mismatches. DataExtractor Pro identifies, analyzes, and transforms datasets through an intuitive interactive interface.

🌟 Key Features

HuggingFace Native: Seamlessly pulls datasets via ID or URL. Supports private repositories with secure token authentication.
Deep Structural Inspection: Recursively analyzes dataset hierarchies, including nested JSON objects and complex lists.
Memory-Efficient Streaming: Built on the streaming API of the Hugging Face datasets library. Rows are processed as they arrive rather than loaded all at once, so memory use stays low whether a dataset holds thousands or millions of rows.
Dot-Notation Field Extraction: Effortlessly extract nested data using dot-notated paths (e.g., metadata.user.response).
Interactive Remapping: Rename fields on the fly to match your training script requirements (e.g., text_v1 -> instruction).
Multi-Format Export: Export to JSONL, CSV, or JSON in a single step. Output files are automatically organized into a local data/ directory.
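
To make dot-notation extraction and field remapping concrete, here is a minimal sketch in plain Python. It is not the tool's actual implementation: the function names `extract_path` and `select_and_rename` and the sample record are hypothetical, chosen only to illustrate the idea.

```python
from typing import Any, Dict, Mapping


def extract_path(record: Mapping[str, Any], path: str, default: Any = None) -> Any:
    """Follow a dot-notated path (e.g. 'metadata.user.response') through nested dicts.

    Returns `default` if any segment of the path is missing.
    """
    current: Any = record
    for key in path.split("."):
        if isinstance(current, Mapping) and key in current:
            current = current[key]
        else:
            return default
    return current


def select_and_rename(record: Mapping[str, Any], mapping: Mapping[str, str]) -> Dict[str, Any]:
    """Build a flat row. `mapping` maps source dot-paths to output field names."""
    return {new_name: extract_path(record, source) for source, new_name in mapping.items()}


# Hypothetical input record, used only for illustration.
row = {"text_v1": "Summarize this.", "metadata": {"user": {"response": "Sure."}}}
flat = select_and_rename(
    row,
    {"text_v1": "instruction", "metadata.user.response": "output"},
)
print(flat)
```

Returning a default for missing paths, instead of raising, is one possible design choice. It keeps a long extraction run going when a few rows lack a field, at the cost of silently producing empty values.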

🛠️ Technical Architecture

The tool follows a modular "Pipeline" design:

Loader: Manages HF API connections, authentication, and streaming configurations.
Inspector: Analyzes schema features and identifies data types and nesting levels.
Processor: Handles field selection, mapping, and extraction logic using generator patterns to maintain a low memory footprint.
Exporter: A dedicated factory module for extensible file formatting.
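
The Processor and Exporter stages described above can be sketched as a generator feeding a small format registry. This is an illustrative sketch under assumptions, not the code in core/: the names `process`, `export`, and `EXPORTERS` are hypothetical, and the writers target any text stream rather than a file in data/.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, Iterator, List, TextIO

Row = Dict[str, object]


def process(rows: Iterable[Row], fields: List[str]) -> Iterator[Row]:
    """Lazily keep only the selected fields, yielding one row at a time."""
    for row in rows:
        yield {name: row.get(name) for name in fields}


def write_jsonl(rows: Iterable[Row], fields: List[str], out: TextIO) -> None:
    # One JSON object per line, so rows can be written as they arrive.
    for row in rows:
        out.write(json.dumps(row, ensure_ascii=False) + "\n")


def write_csv(rows: Iterable[Row], fields: List[str], out: TextIO) -> None:
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def write_json(rows: Iterable[Row], fields: List[str], out: TextIO) -> None:
    # A single JSON array needs every row in memory before it is written.
    json.dump(list(rows), out, ensure_ascii=False, indent=2)


# Hypothetical registry: supporting a new format means adding one entry here.
EXPORTERS: Dict[str, Callable[[Iterable[Row], List[str], TextIO], None]] = {
    "jsonl": write_jsonl,
    "csv": write_csv,
    "json": write_json,
}


def export(rows: Iterable[Row], fields: List[str], fmt: str, out: TextIO) -> None:
    """Look up the writer for `fmt` and stream `rows` into `out`."""
    try:
        writer = EXPORTERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"Unsupported format: {fmt!r}") from None
    writer(rows, fields, out)


# Illustrative data; in practice `source` would be a streamed dataset split.
source = iter([
    {"instruction": "Hi", "output": "Hello", "extra": 1},
    {"instruction": "Bye", "output": "Goodbye", "extra": 2},
])
fields = ["instruction", "output"]
out = io.StringIO()
export(process(source, fields), fields, "jsonl", out)
jsonl_text = out.getvalue()
print(jsonl_text, end="")
```

In this sketch the JSONL and CSV writers consume the generator row by row, while the JSON writer has to materialize the full list first, so JSONL or CSV is the safer choice for very large datasets.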

🚀 Getting Started

Prerequisites

Python 3.10+
A HuggingFace Access Token (optional, for private/gated datasets)

Installation

Clone the repository

git clone https://github.com/shri-the-tree/Dataset-Extractor.git
cd Dataset-Extractor

Install dependencies
```
pip install -r requirements.txt
```

Usage

Simply launch the interactive CLI:

python main.py

Step-by-Step Workflow:

Source Selection: Paste a HuggingFace Dataset ID or full URL.
Path Discovery: The tool automatically detects available subsets (configs) and splits (train, validation, harmful, etc.).
Select & Transform: Choose the fields you want and rename them if necessary.
Export: Choose your format and let the tool handle the heavy lifting.
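
As an illustration of the Source Selection step, the sketch below shows one way a pasted dataset ID or full URL could be reduced to a plain ID. The function name `normalize_dataset_id`, the list of page segments, and the example IDs are assumptions made for this sketch; the tool's real parsing may differ.

```python
from urllib.parse import urlparse

# URL path segments assumed to belong to Hub pages rather than to a dataset ID.
_PAGE_SEGMENTS = {"viewer", "tree", "blob", "resolve", "discussions", "commits"}


def normalize_dataset_id(source: str) -> str:
    """Return 'namespace/name' (or a bare 'name') from a dataset ID or a Hub URL."""
    source = source.strip()
    if "://" not in source:
        # Already looks like an ID; just trim stray slashes.
        dataset_id = source.strip("/")
        if not dataset_id:
            raise ValueError("Empty dataset ID")
        return dataset_id

    parts = [p for p in urlparse(source).path.split("/") if p]
    if parts and parts[0] == "datasets":
        parts = parts[1:]

    id_parts = []
    for part in parts:
        # Stop at page segments, and never keep more than namespace/name.
        if part in _PAGE_SEGMENTS or len(id_parts) == 2:
            break
        id_parts.append(part)

    if not id_parts:
        raise ValueError(f"No dataset ID found in URL: {source!r}")
    return "/".join(id_parts)


# Hypothetical IDs, used only for illustration.
print(normalize_dataset_id("some-org/some-dataset"))
print(normalize_dataset_id("https://huggingface.co/datasets/some-org/some-dataset"))
print(normalize_dataset_id("https://huggingface.co/datasets/some-org/some-dataset/viewer/default/train"))
```

All three calls yield the same ID, which could then be passed on to whatever loads the dataset.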

📂 Project Structure

├── core/
│   ├── loader.py       # HF Hub Integration
│   ├── inspector.py    # Schema & Hierarchy Analysis
│   ├── processor.py    # Field Mapping Logic
│   └── exporter.py     # CSV/JSONL/JSON Writers
├── data/               # Default output directory
├── main.py             # CLI Entrypoint
└── requirements.txt    # Production dependencies

⚖️ License

Distributed under the MIT License. See LICENSE for more information.

Developed for rapid LLM fine-tuning and data engineering workflows.
