QuData is a comprehensive data processing pipeline designed to transform raw multi-format data into high-quality datasets optimized for LLM training. It handles everything from data ingestion and cleaning to annotation and export in various training formats.
QuData supports a wide range of formats:
- Documents: PDF, DOCX, ODT, RTF, TXT, MD
- Web: HTML, XML
- Structured: CSV, JSON, JSONL, YAML
- Images: PNG, JPG, JPEG, TIFF (for OCR)
- Archives: ZIP, TAR, GZ
- Code: Jupyter notebooks, source code files
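As an illustration of how a format list like this might drive dispatch, here is a hypothetical extension-to-category map (this is not QuData's internal routing table; the names are invented for the sketch):

```python
import os

# Hypothetical extension-to-category routing, mirroring the format list
# above. Illustrative only -- not QuData's actual dispatch code.
EXTENSION_CATEGORIES = {
    ".pdf": "document", ".docx": "document", ".odt": "document",
    ".rtf": "document", ".txt": "document", ".md": "document",
    ".html": "web", ".xml": "web",
    ".csv": "structured", ".json": "structured",
    ".jsonl": "structured", ".yaml": "structured",
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".tiff": "image",
    ".zip": "archive", ".tar": "archive", ".gz": "archive",
    ".ipynb": "code",
}

def route(path: str) -> str:
    """Return the processing category for a file, or 'unsupported'."""
    ext = os.path.splitext(path.lower())[1]
    return EXTENSION_CATEGORIES.get(ext, "unsupported")
```

Matching is case-insensitive, so `route("report.PDF")` still resolves to the document category.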
QuData uses multi-dimensional quality scoring that evaluates:
- Content quality: Informativeness, coherence, completeness
- Language quality: Grammar, spelling, fluency
- Structure quality: Organization, formatting
- Metadata completeness: Author, date, source information
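One common way to combine such dimensions into a single score is a weighted average; the weights below are assumptions for illustration, not QuData's actual scoring formula:

```python
# Illustrative weighted combination of the four quality dimensions above.
# The weights are assumptions, not QuData's real formula.
WEIGHTS = {"content": 0.4, "language": 0.3, "structure": 0.2, "metadata": 0.1}

def overall_quality(scores: dict) -> float:
    """Weighted average of per-dimension scores, each expected in [0, 1]."""
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)
```

A document scoring 0.9 on content but 0.0 on metadata would still rate 0.74 under these example weights, which is why threshold tuning (covered below) matters.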
Yes! QuData is highly configurable through YAML configuration files. You can:
- Enable/disable processing stages
- Adjust quality thresholds
- Customize cleaning rules
- Define custom taxonomies
- Set export formats and options
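A hypothetical configuration sketch touching each of those knobs (key names here are illustrative; consult `configs/templates/` for the actual schema):

```yaml
pipeline:
  name: "my_pipeline"

quality:
  enabled: true               # toggle a whole stage on or off

clean:
  min_quality_score: 0.7      # quality threshold

annotate:
  taxonomy_file: "configs/my-taxonomy.yaml"

export:
  formats: ["jsonl"]
```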
Minimum Requirements:
- Python 3.8 or higher
- 4GB RAM
- 10GB free disk space
Recommended:
- Python 3.9+
- 8GB+ RAM
- 50GB+ free disk space
- Multi-core CPU for parallel processing
```bash
# Basic installation
pip install -e .

# With optional dependencies
pip install -e ".[ml,web,dev]"

# For development
pip install -e ".[dev]"
```

Yes, for OCR functionality you need Tesseract:

Ubuntu/Debian:

```bash
sudo apt-get install tesseract-ocr libtesseract-dev
```

macOS:

```bash
brew install tesseract
```

Windows: Download the installer from https://github.com/UB-Mannheim/tesseract/wiki
QuData supports multiple databases:

PostgreSQL (recommended):

```bash
# Install PostgreSQL
sudo apt-get install postgresql postgresql-contrib

# Create database and user
sudo -u postgres createdb qudata
sudo -u postgres createuser qudata_user
```

SQLite (simple setup): No additional setup required - QuData will create the database file automatically.
Processing time depends on:
- File size and count: Larger datasets take longer
- File types: PDFs with images take longer than plain text
- Quality settings: Higher quality thresholds require more processing
- Hardware: More CPU cores and RAM speed up processing
Typical rates:
- Plain text: 100-500 documents/minute
- PDFs: 10-50 documents/minute
- Web scraping: 30-100 pages/minute
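For a rough ETA, divide your document counts by these rates; a back-of-the-envelope sketch (the midpoint rates below are assumptions drawn from the ranges above):

```python
# Rough throughput midpoints from the ranges above, in documents/minute.
# These are illustrative assumptions; measure on your own hardware.
RATES = {"text": 300, "pdf": 30, "web": 65}

def estimated_minutes(counts: dict) -> float:
    """Estimate total processing time for a mix of document types."""
    return sum(n / RATES[kind] for kind, n in counts.items())

# e.g. 3000 text files plus 600 PDFs
eta = estimated_minutes({"text": 3000, "pdf": 600})
```

Note how the PDFs dominate: 600 PDFs take twice as long as 3000 plain-text files under these rates.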
Yes! Enable parallel processing in your configuration:

```yaml
pipeline:
  parallel_processing: true
  max_workers: 8  # Adjust based on CPU cores
  batch_size: 100
```

QuData has robust error handling:
- Continue on error: Processing continues with other files
- Error logging: Detailed logs of what went wrong
- Retry logic: Automatic retries for transient failures
- Checkpointing: Resume from where processing stopped
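These behaviors are typically driven by configuration; a hypothetical sketch (the key names here are illustrative and may differ in your QuData version):

```yaml
pipeline:
  error_handling:
    continue_on_error: true
    max_retries: 3
    retry_delay_seconds: 5
    checkpoint_directory: ".checkpoints"
```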
For large datasets:

- Enable streaming mode:

```yaml
pipeline:
  streaming_mode: true
  max_memory_usage: "4GB"
```

- Process in chunks:

```python
def process_large_dataset(files, chunk_size=1000):
    for i in range(0, len(files), chunk_size):
        chunk = files[i:i + chunk_size]
        pipeline.process_files(chunk)
```

- Use incremental processing:

```python
from qudata.database import IncrementalProcessor

processor = IncrementalProcessor(connection)
new_docs = processor.get_new_documents(since=last_run)
```

- Start with a template:
```bash
cp configs/templates/academic-papers.yaml configs/my-config.yaml
```

- Modify settings:

```yaml
pipeline:
  name: "my_custom_pipeline"

clean:
  min_quality_score: 0.8  # Higher threshold

export:
  formats: ["jsonl", "parquet"]
```

- Validate the configuration:

```python
from qudata.config import ConfigManager

config_manager = ConfigManager()
config = config_manager.load_pipeline_config("configs/my-config.yaml")
```

- academic-papers.yaml: Optimized for research papers, higher quality thresholds, citation extraction
- web-content.yaml: Handles web articles, aggressive boilerplate removal, lower quality thresholds
- code-documentation.yaml: Preserves code blocks, technical entity recognition, programming language detection
```yaml
clean:
  boilerplate:
    custom_patterns:
      - "Your custom pattern here"
      - "Copyright \\d{4}.*"
      - "All rights reserved.*"
  html:
    remove_elements:
      - "custom-ad-class"
      - "social-media-widget"
```

Yes! Use `${VARIABLE_NAME}` syntax:

```yaml
database:
  host: "${DB_HOST}"
  username: "${DB_USER}"
  password: "${DB_PASSWORD}"
```

Set the environment variables:

```bash
export DB_HOST="localhost"
export DB_USER="qudata_user"
export DB_PASSWORD="secure_password"
```

Common reasons for low quality scores:
- Short documents: Increase minimum length or lower thresholds
- Poor language quality: Enable OCR correction or language filtering
- Boilerplate content: Improve boilerplate removal patterns
- Mixed languages: Set target languages or improve detection
Debug quality issues:

```python
from qudata.score import QualityScorer

scorer = QualityScorer({'detailed_analysis': True})
result = scorer.score_document(document)
print(result.detailed_scores)
print(result.issues)
```

For PDFs:
```yaml
ingest:
  pdf:
    preserve_layout: true
    extract_tables: true
    ocr_fallback: true
    ocr_confidence_threshold: 0.7
```

For web content:

```yaml
ingest:
  web:
    extract_main_content: true
    remove_navigation: true
    use_readability: true
```

For documents:

```yaml
ingest:
  document:
    preserve_formatting: true
    extract_tables: true
    extract_properties: true
```

QuData supports multiple export formats:
- JSONL: Standard format for LLM training
- ChatML: Conversational format for chat models
- Alpaca: Instruction-following format
- Parquet: Efficient columnar format for analytics
- CSV: Simple tabular format
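To make the differences concrete, here are single-record examples in the styles these formats commonly use (the field names follow community conventions and are not guaranteed to match QuData's exact output schema):

```python
import json

# One training example expressed three ways (hypothetical field names).
jsonl_record = {"text": "The mitochondria is the powerhouse of the cell."}

chatml_record = {"messages": [
    {"role": "user", "content": "What is the powerhouse of the cell?"},
    {"role": "assistant", "content": "The mitochondria."},
]}

alpaca_record = {
    "instruction": "Answer the biology question.",
    "input": "What is the powerhouse of the cell?",
    "output": "The mitochondria.",
}

# JSONL is simply one JSON object per line:
line = json.dumps(jsonl_record)
```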
```yaml
export:
  splitting:
    enabled: true
    train_ratio: 0.8
    validation_ratio: 0.1
    test_ratio: 0.1
    stratify_by: "domain"  # Ensure balanced splits
    random_seed: 42
```

- Increase parallel workers:
```yaml
pipeline:
  max_workers: 16  # More workers
```

- Use faster algorithms:

```yaml
clean:
  deduplication:
    algorithm: "minhash"  # Faster than jaccard
```

- Reduce quality checks:

```yaml
quality:
  enabled: false  # Skip quality scoring
```

- Enable caching:

```yaml
pipeline:
  enable_caching: true
  cache_directory: ".cache"
```

Common bottlenecks:
- Large files: PDFs with many images
- Complex cleaning: Extensive deduplication
- Quality scoring: Detailed quality analysis
- Single-threaded: Not using parallel processing
- Memory constraints: Frequent garbage collection
Profile performance:

```python
import cProfile

cProfile.run('pipeline.process_directory("data/raw", "data/processed")')
```

Memory usage depends on:
- Batch size: Larger batches use more memory
- File sizes: Large documents require more memory
- Processing stages: Some stages are memory-intensive
Monitor memory usage:

```python
import os
import psutil

process = psutil.Process(os.getpid())
print(f"Memory usage: {process.memory_info().rss / 1024 / 1024:.1f} MB")
```

Reduce memory usage:

```yaml
pipeline:
  batch_size: 50  # Smaller batches
  streaming_mode: true
  max_memory_usage: "2GB"
```

To hand datasets off to LLMBuilder automatically, enable the integration in the export section:

```yaml
export:
  llmbuilder:
    enabled: true
    llmbuilder_path: "/path/to/llmbuilder"
    auto_trigger: true
    model_config:
      model_type: "llama"
      size: "7b"
```

Yes! QuData exports standard formats:
- Hugging Face: Use JSONL or Parquet exports
- OpenAI: Use JSONL format
- Custom frameworks: Use CSV or JSON exports
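Because JSONL is plain text, any framework can consume an export with the standard library alone; a minimal sketch (no QuData-specific reader is assumed):

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Demo against a throwaway file standing in for a QuData export.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"text": "doc one"}\n{"text": "doc two"}\n')
    path = f.name
records = list(read_jsonl(path))
os.unlink(path)
```

For Hugging Face specifically, the same file loads directly with `datasets.load_dataset("json", data_files=path)`.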
Python API:

```python
from qudata import QuDataPipeline

pipeline = QuDataPipeline(config_path="my-config.yaml")
dataset = pipeline.process_files(file_list)
export_path = pipeline.export_dataset(dataset, "jsonl")
```

REST API:

```bash
curl -X POST "https://api.qudata.com/v1/datasets" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"name": "My Dataset", "files": ["file1.pdf"]}'
```

Command line:

```bash
qudata process --input data/raw --output data/processed --config my-config.yaml
```

Solutions:
- Reduce batch size
- Enable streaming mode
- Increase system memory
- Process in smaller chunks
```yaml
pipeline:
  batch_size: 25
  streaming_mode: true
  max_memory_usage: "2GB"
```

Solutions:
- Enable OCR fallback
- Improve OCR preprocessing
- Check PDF integrity

```yaml
ingest:
  pdf:
    ocr_fallback: true
    ocr_confidence_threshold: 0.6
    preprocessing:
      enhance_contrast: true
      denoise: true
```

Solutions:
- Reduce request rate
- Use a proper user agent
- Add delays between requests

```yaml
ingest:
  web:
    requests_per_minute: 30
    delay_between_requests: 2
    user_agent: "Mozilla/5.0 (compatible; QuData/1.0)"
```

Check connection settings:
```python
from qudata.database import DatabaseConnector

config = {
    'type': 'postgresql',
    'host': 'localhost',
    'port': 5432,
    'database': 'qudata',
    'username': 'user',
    'password': 'password',
}
try:
    connector = DatabaseConnector(config)
    connection = connector.connect()
    print("Connection successful!")
except Exception as e:
    print(f"Connection failed: {e}")
```

Common issues:
- Invalid YAML syntax
- Missing required fields
- Invalid parameter values
Validate configuration:

```python
from pydantic import ValidationError

from qudata.config import ConfigManager

try:
    config_manager = ConfigManager()
    config = config_manager.load_pipeline_config("my-config.yaml")
except ValidationError as e:
    for error in e.errors():
        print(f"{error['loc']}: {error['msg']}")
```

Yes! Extend the base extractor class:
```python
from qudata.models import BaseExtractor, ExtractedContent

class CustomExtractor(BaseExtractor):
    def extract(self, file_path: str) -> ExtractedContent:
        # Your custom extraction logic
        content = self.extract_custom_format(file_path)
        return ExtractedContent(
            content=content,
            metadata=self.extract_metadata(file_path),
        )

    def supports_format(self, file_type: str) -> bool:
        return file_type == "custom_format"
```

Yes! Subclass the built-in scorer:

```python
from qudata.score import QualityScorer

class CustomQualityScorer(QualityScorer):
    def calculate_custom_score(self, document):
        # Your custom quality logic
        return score

    def score_document(self, document):
        base_result = super().score_document(document)
        custom_score = self.calculate_custom_score(document)
        # Combine scores
        base_result.custom_score = custom_score
        return base_result
```

Yes! For classification and NER:
```python
from qudata.annotate import TaxonomyClassifier

class CustomClassifier(TaxonomyClassifier):
    def __init__(self, config):
        super().__init__(config)
        self.model = self.load_custom_model()

    def classify_document(self, document):
        # Use your custom model
        predictions = self.model.predict(document.content)
        return self.format_results(predictions)
```

Docker deployment:
```bash
docker-compose up -d --scale qudata-worker=5
```

Kubernetes deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qudata-workers
spec:
  replicas: 10
  selector:
    matchLabels:
      app: qudata-worker
  template:
    metadata:
      labels:
        app: qudata-worker  # must match the selector above
    spec:
      containers:
        - name: qudata-worker
          image: qudata:latest
          command: ["qudata", "worker"]
```

Distributed processing:
```python
from qudata.orchestrate import WorkflowOrchestrator

orchestrator = WorkflowOrchestrator({
    'backend': 'celery',
    'broker_url': 'redis://localhost:6379/0',
    'workers': 10,
})
# Distribute processing across workers
orchestrator.process_dataset_distributed(large_dataset)
```

- Module documentation: check the README files in src/forge/*/
- API documentation: see docs/api/
- Examples: look at the examples/ directory
- Configuration: review configs/templates/
- Check existing issues: Search the issue tracker
- Provide details: Include configuration, error messages, sample data
- Minimal reproduction: Create a simple example that reproduces the issue
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Follow code style guidelines
- Submit a pull request
- GitHub Discussions: For questions and community help
- Issue Tracker: For bug reports and feature requests
- Documentation: Comprehensive guides and API reference