- Chain multiple LLM handlers in sequence
- Automatic retry with exponential backoff (up to 5 retries)
- Skip failed datapoints after exhausting retries, continue processing
- Automatic progress saving to `cache.jsonl` with resume capability
- Built-in rate limiting and concurrency control
- JSON schema validation and Pydantic model integration
- File-based logging to `run.logs` with minimal console output
- Better error handling: if all retries fail, the error is logged and the datapoint is skipped
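The retry-and-skip behavior above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: `handler`, the datapoint shape, and the logging call are hypothetical stand-ins.

```python
import time
import random

MAX_RETRIES = 5  # matches the "up to 5 retries" behavior above

def process_with_retry(handler, data_point, max_retries=MAX_RETRIES):
    """Call a handler with exponential backoff; return None to skip on failure."""
    for attempt in range(max_retries + 1):
        try:
            return handler(data_point)
        except Exception as exc:
            if attempt == max_retries:
                # Retries exhausted: log the error and skip this datapoint
                print(f"Skipping datapoint after {max_retries} retries: {exc}")
                return None
            # Exponential backoff with jitter: ~1s, 2s, 4s, 8s, 16s
            time.sleep(2 ** attempt + random.random())
```

A skipped datapoint yields `None`, so the pipeline can filter it out and keep processing the rest of the dataset.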
- Clean up codebase structure for better readability
- Optimize the code for larger datasets (in particular, avoid loading all data into memory, since caching is already in place)
- Representative question selection using `SemHash` and `Qwen-3-4B-embeddings` (or proprietary embeddings, depending on cost)
- Convert the representative question selection into a YAML-configurable handler
- Add Image inputs
- Implement OpenAI function calling support
- More sophisticated error handling strategies
- Naive cost tracking
- Better checkpoint management and recovery
- YAML schema validation
- Visual representation of pipeline flow
P.S.: (Generated using Claude-Sonnet-3.7-Thinking; needs refinement, but does the basic job)
```bash
# Copy and configure environment variables
cp src/.env.template src/.env
# Edit src/.env with your OpenAI API key
```

```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install -r requirements.txt
```

```bash
cd src
python main.py --config config/pipeline.yaml --output output.jsonl
```

To directly push the output to a Hugging Face repo, add the `--hf_repo` argument:
```bash
python main.py --config config/pipeline.yaml --output output.jsonl --hf_repo your_hf_username/your_repo_name
```

Set the required environment variable in `src/.env`:

```bash
# Required
OPENAI_API_KEY=your_openai_api_key_here
```

Each handler is configured in its own YAML file:

```yaml
# config/your_handler.yaml
system_prompt: "You are a helpful assistant."
prompt: "Process this text: {text}"
prompt_variable_maps:
  "{text}": "input_text"  # Maps {text} to data_point["input_text"]

model:
  name: "gpt-4"
  base_url: null  # Optional custom endpoint
  max_req_per_minute: 100
  max_concurrent_req: 10

config:
  generation_config:
    temperature: 0.7
    max_tokens: 1000

  # Optional: For structured output
  response_schema:
    type: "object"
    properties:
      result:
        type: "string"
    required: ["result"]
```

The pipeline config lists the data path and the handler chain:

```yaml
# config/pipeline.yaml
data_path: "data/input.jsonl"
handlers:
  - analyzer: "config/doc_analysis.yaml"
  - generator: "config/question_generation.yaml"
  - reasoner: "config/reasoning.yaml"
```

Input data should be in JSONL format:

```jsonl
{"input_text": "Your text to process", "metadata": "optional"}
{"input_text": "Another text sample", "metadata": "optional"}
```
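To make the `prompt_variable_maps` mechanism concrete, the substitution of datapoint fields into the prompt template can be sketched like this. The `render_prompt` helper is illustrative, not the project's actual API; the values are taken from the handler config and input data above.

```python
def render_prompt(prompt: str, variable_maps: dict, data_point: dict) -> str:
    """Replace each placeholder (e.g. "{text}") with the mapped datapoint field."""
    for placeholder, field in variable_maps.items():
        prompt = prompt.replace(placeholder, str(data_point[field]))
    return prompt

# Values from config/your_handler.yaml and the JSONL input above
prompt = "Process this text: {text}"
variable_maps = {"{text}": "input_text"}
data_point = {"input_text": "Your text to process", "metadata": "optional"}

rendered = render_prompt(prompt, variable_maps, data_point)
# rendered == "Process this text: Your text to process"
```

In the chained pipeline, each handler renders its prompt this way from the current datapoint, and its output is merged back into the datapoint for the next handler in the list.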