- Chain multiple LLM handlers in sequence
- Automatic retry with exponential backoff (up to 5 retries)
- Skip failed datapoints after exhausting retries, continue processing
- Automatic progress saving to `cache.jsonl` with resume capability
- Built-in rate limiting and concurrency control
- JSON schema validation and Pydantic model integration
- File-based logging to `run.logs` with minimal console output
- Better error handling: if all retries fail, the error is logged and the datapoint is skipped
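The retry-and-skip behavior above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: `handler`, the datapoint shape, and the logging call are hypothetical stand-ins.

```python
import time
import random

MAX_RETRIES = 5  # matches the "up to 5 retries" behavior above

def process_with_retry(handler, data_point, max_retries=MAX_RETRIES):
    """Call a handler with exponential backoff; return None to skip on failure."""
    for attempt in range(max_retries + 1):
        try:
            return handler(data_point)
        except Exception as exc:
            if attempt == max_retries:
                # Retries exhausted: log the error and skip this datapoint
                print(f"Skipping datapoint after {max_retries} retries: {exc}")
                return None
            # Exponential backoff with jitter: ~1s, 2s, 4s, 8s, 16s
            time.sleep(2 ** attempt + random.random())
```

A skipped datapoint yields `None`, so the pipeline can filter it out and keep processing the rest of the dataset.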
- Clean up codebase structure for better readability
- Optimize the code for larger datasets (in particular, avoid loading all data into memory, since caching is already in place)
- Representative question selection using `SemHash` and `Qwen-3-4B-embeddings` (or proprietary embeddings, depending on cost)
- Convert the representative question selection into a YAML-configurable handler
- Add Image inputs
- Implement OpenAI function calling support
- More sophisticated error handling strategies
- Naive cost tracking
- Better checkpoint management and recovery
- YAML schema validation
- Visual representation of pipeline flow
P.S.: (Generated using Claude-Sonnet-3.7-Thinking; needs refinement, but does the basic job)
```bash
# Copy and configure environment variables
cp src/.env.template src/.env
# Edit src/.env with your OpenAI API key
```

```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install -r requirements.txt
```

```bash
cd src
python main.py --config config/pipeline.yaml --output output.jsonl
```

To directly push the output to a Hugging Face repo, add the `--hf_repo` argument:
```bash
python main.py --config config/pipeline.yaml --output output.jsonl --hf_repo your_hf_username/your_repo_name
```

Set the required environment variable in `src/.env`:

```bash
# Required
OPENAI_API_KEY=your_openai_api_key_here
```

Each handler is configured in its own YAML file:

```yaml
# config/your_handler.yaml
system_prompt: "You are a helpful assistant."
prompt: "Process this text: {text}"
prompt_variable_maps:
  "{text}": "input_text"  # Maps {text} to data_point["input_text"]

model:
  name: "gpt-4"
  base_url: null  # Optional custom endpoint
  max_req_per_minute: 100
  max_concurrent_req: 10

config:
  generation_config:
    temperature: 0.7
    max_tokens: 1000

  # Optional: For structured output
  response_schema:
    type: "object"
    properties:
      result:
        type: "string"
    required: ["result"]
```

The pipeline config lists the data path and the handler chain:

```yaml
# config/pipeline.yaml
data_path: "data/input.jsonl"
handlers:
  - analyzer: "config/doc_analysis.yaml"
  - generator: "config/question_generation.yaml"
  - reasoner: "config/reasoning.yaml"
```

Input data should be in JSONL format:

```jsonl
{"input_text": "Your text to process", "metadata": "optional"}
{"input_text": "Another text sample", "metadata": "optional"}
```
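To make the `prompt_variable_maps` mechanism concrete, the substitution of datapoint fields into the prompt template can be sketched like this. The `render_prompt` helper is illustrative, not the project's actual API; the values are taken from the handler config and input data above.

```python
def render_prompt(prompt: str, variable_maps: dict, data_point: dict) -> str:
    """Replace each placeholder (e.g. "{text}") with the mapped datapoint field."""
    for placeholder, field in variable_maps.items():
        prompt = prompt.replace(placeholder, str(data_point[field]))
    return prompt

# Values from config/your_handler.yaml and the JSONL input above
prompt = "Process this text: {text}"
variable_maps = {"{text}": "input_text"}
data_point = {"input_text": "Your text to process", "metadata": "optional"}

rendered = render_prompt(prompt, variable_maps, data_point)
# rendered == "Process this text: Your text to process"
```

In the chained pipeline, each handler renders its prompt this way from the current datapoint, and its output is merged back into the datapoint for the next handler in the list.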