This guide covers the various input and output formats supported by QuData, including configuration options and best practices for each format.
QuData supports a wide variety of input formats for maximum flexibility in data ingestion.
PDF documents:

- Extensions: `.pdf`
- Features: Text extraction, table detection, image extraction, OCR for scanned PDFs
- Limitations: Complex layouts may affect extraction quality
```python
from qudata.ingest import PDFExtractor

extractor = PDFExtractor()
result = extractor.extract("document.pdf")

print(f"Text content: {len(result.content)} characters")
print(f"Tables found: {len(result.tables)}")
print(f"Images found: {len(result.images)}")
```

Configuration:
```yaml
ingest:
  pdf:
    extract_tables: true
    extract_images: true
    ocr_fallback: true
    ocr_languages: ["eng", "spa"]
    preserve_layout: false
```

Word documents:

- Extensions: `.docx`, `.doc`
- Features: Text extraction, formatting preservation, table extraction, embedded objects
- Limitations: Complex formatting may be simplified
```python
from qudata.ingest import DocumentExtractor

extractor = DocumentExtractor()
result = extractor.extract("document.docx")

print(f"Title: {result.metadata.title}")
print(f"Author: {result.metadata.author}")
print(f"Content: {result.content}")
```

Configuration:
```yaml
ingest:
  docx:
    preserve_formatting: true
    extract_tables: true
    extract_images: false
    include_headers_footers: false
```

HTML pages:

- Extensions: `.html`, `.htm`
- Features: Content extraction, link preservation, metadata extraction, boilerplate removal
- Limitations: JavaScript-generated content is not supported
```python
from qudata.ingest import WebExtractor

extractor = WebExtractor()

# From a local file
result = extractor.extract("webpage.html")

# From a URL
result = extractor.extract_from_url("https://example.com/article")

print(f"Title: {result.metadata.title}")
print(f"Clean content: {result.content}")
```

Configuration:
```yaml
ingest:
  html:
    remove_scripts: true
    remove_styles: true
    preserve_links: false
    extract_metadata: true
    readability_threshold: 0.7
```
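For intuition, the kind of cleanup that `remove_scripts` and `remove_styles` imply can be reproduced standalone with BeautifulSoup; this is an illustrative sketch, not QuData's internal implementation:

```python
from bs4 import BeautifulSoup

with open("webpage.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Drop script and style tags, then flatten the remaining markup to text
for tag in soup(["script", "style"]):
    tag.decompose()
text = soup.get_text(separator="\n", strip=True)
```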
Plain text files:

- Extensions: `.txt`, `.md`, `.rst`
- Features: Direct text ingestion, encoding detection, metadata extraction from headers
- Limitations: No structural information
```python
from qudata.ingest import PlainTextExtractor

extractor = PlainTextExtractor()
result = extractor.extract("document.txt")

print(f"Encoding: {result.metadata.encoding}")
print(f"Content: {result.content}")
```
CSV files:

- Extensions: `.csv`
- Features: Column-based data extraction, header detection, data type inference
- Use Cases: Tabular data, survey responses, structured datasets
```python
from qudata.ingest import StructuredExtractor

extractor = StructuredExtractor()
result = extractor.extract("data.csv")

# Access the structured data row by row
for row in result.structured_data:
    print(f"Row: {row}")
```

Configuration:
```yaml
ingest:
  csv:
    delimiter: ","
    quote_char: '"'
    encoding: "utf-8"
    skip_blank_lines: true
    infer_types: true
```

JSON and JSONL files:

- Extensions: `.json`, `.jsonl`
- Features: Hierarchical data extraction, schema inference, nested object handling
- Use Cases: API responses, configuration files, structured datasets
result = extractor.extract("data.json")
# Access JSON structure
json_data = result.structured_data
print(f"Keys: {list(json_data.keys())}")Configuration:
```yaml
ingest:
  json:
    flatten_nested: false
    max_depth: 10
    extract_text_fields: true
    text_field_patterns: ["content", "text", "description"]
```
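To illustrate what `flatten_nested: true` would do to a nested object, here is a minimal standalone flattening helper (illustrative only, not QuData's implementation):

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Collapse nested dicts into dot-separated keys, e.g. {"a": {"b": 1}} becomes {"a.b": 1}."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat
```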
Images (OCR):

- Extensions: `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`
- Features: OCR text extraction, image preprocessing, confidence scoring
- Use Cases: Scanned documents, screenshots, diagrams with text
```python
from qudata.ingest import OCRProcessor

processor = OCRProcessor()
result = processor.extract("scanned_document.png")

print(f"Extracted text: {result.content}")
print(f"OCR confidence: {result.metadata.ocr_confidence}")
```

Configuration:
```yaml
ingest:
  ocr:
    languages: ["eng", "spa", "fra"]
    confidence_threshold: 0.8
    preprocessing:
      enhance_contrast: true
      remove_noise: true
      deskew: true
    tesseract_config: "--psm 6"  # page segmentation mode 6: assume a single uniform block of text
```

SQL databases:

- Supported: PostgreSQL, MySQL, SQLite
- Features: Query-based extraction, schema introspection, batch processing
- Use Cases: Content management systems, application databases
```python
from qudata.ingest import DatabaseExtractor

extractor = DatabaseExtractor()
result = extractor.extract_from_query(
    connection_string="postgresql://user:pass@localhost/db",
    query="SELECT title, content FROM articles WHERE published = true"
)
```

Configuration:
```yaml
ingest:
  database:
    connections:
      - name: "content_db"
        type: "postgresql"
        host: "localhost"
        database: "content"
        username: "user"
        password: "password"
    batch_size: 1000
    timeout: 30
```

NoSQL databases:

- Supported: MongoDB, Elasticsearch
- Features: Document-based extraction, flexible schema handling
- Use Cases: Content repositories, search indexes
```python
# Assumes the DatabaseExtractor from the SQL example above
extractor = DatabaseExtractor()
result = extractor.extract_from_mongodb(
    connection_string="mongodb://localhost:27017/content",
    collection="articles",
    query={"status": "published"}
)
```

Web scraping:

- Features: URL-based extraction, sitemap crawling, rate limiting
- Use Cases: News articles, blog posts, documentation sites
```python
from qudata.ingest import WebScraper

scraper = WebScraper()

# Single URL
result = scraper.scrape_url("https://example.com/article")

# Multiple URLs
urls = ["https://example.com/page1", "https://example.com/page2"]
results = scraper.scrape_urls(urls)

# Sitemap crawling
results = scraper.scrape_sitemap("https://example.com/sitemap.xml")
```

Configuration:
```yaml
ingest:
  web_scraping:
    rate_limit: 10  # requests per second
    timeout: 30
    user_agent: "QuData/1.0"
    respect_robots_txt: true
    max_pages: 1000
```
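The `rate_limit` setting amounts to spacing out requests. As a rough standalone illustration using the requests library (independent of QuData's scraper):

```python
import time
import requests

RATE_LIMIT = 10  # requests per second, matching the config above

urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    response = requests.get(url, timeout=30, headers={"User-Agent": "QuData/1.0"})
    # ... hand response.text to the extraction pipeline ...
    time.sleep(1.0 / RATE_LIMIT)  # fixed spacing keeps us under the rate limit
```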
RSS feeds:

- Features: Feed parsing, content extraction, update tracking
- Use Cases: News feeds, blog updates, content syndication
```python
from qudata.ingest import RSSExtractor

extractor = RSSExtractor()
result = extractor.extract_feed("https://example.com/feed.xml")

for item in result.items:
    print(f"Title: {item.title}")
    print(f"Content: {item.content}")
```

QuData generates training-ready datasets in multiple formats optimized for different use cases.
JSON Lines format for general-purpose LLM training.
Structure:

```jsonl
{"text": "Document content", "metadata": {"source": "file.pdf", "quality": 0.85}, "labels": ["category1"]}
{"text": "Another document", "metadata": {"source": "file2.pdf", "quality": 0.92}, "labels": ["category2"]}
```

Usage:
```python
from qudata.pack import JSONLFormatter

formatter = JSONLFormatter()
formatter.export_to_file(
    documents=processed_docs,
    output_path="training_data.jsonl",
    fields=["text", "metadata", "quality_score", "labels"]
)
```

Configuration:
```yaml
export:
  jsonl:
    fields: ["text", "metadata", "quality_score", "labels"]
    filter_low_quality: true
    min_quality_score: 0.7
    max_tokens_per_line: 8192
    include_empty_lines: false
```
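Because each line is an independent JSON object, the exported file can be streamed back with just the standard library, for example:

```python
import json

# Iterate the exported file one record at a time without loading it all into memory
with open("training_data.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["metadata"]["source"], len(record["text"]))
```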
Conversational format optimized for chat model training.

Structure:
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is a subset of AI..."}
  ],
  "metadata": {
    "source": "ml_textbook.pdf",
    "quality_score": 0.89,
    "topics": ["machine learning", "AI"]
  }
}
```

Usage:
```python
from qudata.pack import ChatMLFormatter

formatter = ChatMLFormatter()
chatml_data = formatter.format_documents(
    documents=processed_docs,
    system_message="You are a helpful assistant.",
    include_metadata=True
)
```

Configuration:
```yaml
export:
  chatml:
    system_message: "You are a helpful assistant."
    include_metadata: true
    max_tokens_per_message: 4096
    conversation_format: "qa"  # or "instruction", "dialogue"
```

Columnar format optimized for analytics and large-scale processing.
Structure:
- Efficient storage and querying
- Schema evolution support
- Compression and encoding optimizations
Usage:
```python
from qudata.pack import ParquetFormatter

formatter = ParquetFormatter()
formatter.export_to_file(
    documents=processed_docs,
    output_path="training_data.parquet",
    compression="snappy"
)
```
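Since Parquet is columnar, exported datasets can be inspected or queried with standard tooling. For instance, assuming pandas with a Parquet engine such as pyarrow is installed, and using the field names from the export configuration:

```python
import pandas as pd

# Load only the columns needed for a quick quality audit
df = pd.read_parquet("training_data.parquet", columns=["text", "quality_score"])
print(df["quality_score"].describe())
```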
Configuration:

```yaml
export:
  parquet:
    compression: "snappy"  # or "gzip", "lz4", "brotli"
    row_group_size: 50000
    page_size: 1024
    schema_validation: true
```

Human-readable format for inspection and debugging.
Structure:

```text
=== Document 1 ===
Source: document.pdf
Quality: 0.85
Topics: technology, AI
Language: en

Content goes here...

---

=== Document 2 ===
...
```
Usage:
```python
from qudata.pack import PlainTextFormatter

formatter = PlainTextFormatter()
text_output = formatter.format_documents(
    documents=processed_docs,
    separator="\n---\n",
    include_headers=True
)
```

Configuration:
```yaml
export:
  plain_text:
    separator: "\n---\n"
    include_headers: true
    include_metadata: true
    max_line_length: 120
    wrap_text: true
```

Create custom export formats for specific requirements.
```python
import json

from qudata.pack import BaseFormatter

class CustomFormatter(BaseFormatter):
    def format_document(self, document):
        return {
            "id": document.id,
            "content": document.content,
            "custom_field": self.extract_custom_data(document)
        }

    def export_to_file(self, documents, output_path):
        formatted_data = [self.format_document(doc) for doc in documents]
        # Custom export logic here -- for example, one JSON object per line:
        with open(output_path, "w", encoding="utf-8") as f:
            for record in formatted_data:
                f.write(json.dumps(record) + "\n")
```
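Usage then mirrors the built-in formatters (a sketch, assuming `BaseFormatter` subclasses can be instantiated without arguments; `processed_docs` is the processed document list used throughout this guide):

```python
formatter = CustomFormatter()
formatter.export_to_file(processed_docs, "custom_output.jsonl")
```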
A complete export configuration can combine output formats, quality filtering, dataset splitting, and metadata selection:

```yaml
export:
  output_dir: "/data/processed"
  formats: ["jsonl", "chatml", "parquet"]

  # Quality filtering
  filter_low_quality: true
  min_quality_score: 0.7

  # Dataset splitting
  splits:
    enabled: true
    ratios: [0.8, 0.1, 0.1]  # train, validation, test
    stratify_by: "domain"
    random_seed: 42

  # Metadata inclusion
  include_metadata: true
  metadata_fields: ["source", "quality_score", "language", "topics"]
```
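To make the splitting behaviour concrete, here is a minimal standalone sketch of a stratified split with those ratios, grouping by the `stratify_by` key (illustrative only; QuData performs this internally):

```python
import random

def stratified_split(docs, ratios=(0.8, 0.1, 0.1), key=lambda d: d.domain, seed=42):
    """Split docs into train/val/test while preserving per-domain proportions."""
    rng = random.Random(seed)
    groups = {}
    for doc in docs:
        groups.setdefault(key(doc), []).append(doc)

    train, val, test = [], [], []
    for group in groups.values():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```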
Format-specific options can be set alongside the global block. JSONL:

```yaml
export:
  jsonl:
    fields: ["text", "metadata", "quality_score"]
    filter_empty: true
    max_tokens_per_line: 8192
    encoding: "utf-8"
    compression: "gzip"
```
ChatML:

```yaml
export:
  chatml:
    system_message: "You are a helpful assistant."
    conversation_format: "qa"
    include_metadata: true
    max_tokens_per_message: 4096
    role_mapping:
      instruction: "user"
      response: "assistant"
```
Parquet:

```yaml
export:
  parquet:
    compression: "snappy"
    row_group_size: 50000
    page_size: 1024
    use_dictionary: true
    write_statistics: true
```

PDF extraction quality depends on the source:

- Text-based PDFs: Highest quality, direct text extraction
- Scanned PDFs: OCR-dependent, may have errors
- Complex layouts: Tables and multi-column layouts may be challenging
HTML content quality varies by page type:

- Clean articles: High quality with proper content extraction
- Complex pages: Navigation and ads may affect quality
- Dynamic content: JavaScript-generated content not captured
OCR accuracy depends on the input image (see the preprocessing sketch below):

- Image resolution: Higher resolution improves accuracy
- Text clarity: Clear, high-contrast text works best
- Language support: Accuracy varies by language
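For example, the kind of preprocessing that typically helps can be done standalone with Pillow; this sketch covers similar ground to the `preprocessing` options above but is not QuData's implementation:

```python
from PIL import Image, ImageEnhance, ImageFilter

img = Image.open("scanned_document.png").convert("L")  # grayscale
img = ImageEnhance.Contrast(img).enhance(2.0)          # boost contrast
img = img.filter(ImageFilter.MedianFilter(size=3))     # reduce speckle noise
img.save("preprocessed.png")
```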
Quality scores can be inspected directly on processed documents:

```python
# Check the quality distribution
quality_scores = [doc.quality_score for doc in documents]
print(f"Average quality: {sum(quality_scores) / len(quality_scores):.2f}")

# Filter by quality
high_quality = [doc for doc in documents if doc.quality_score >= 0.8]
print(f"High-quality documents: {len(high_quality)}")
```
For dataset-level checks, use the `DatasetValidator`:

```python
from qudata.validation import DatasetValidator

validator = DatasetValidator()
result = validator.validate_dataset(documents)

if not result.is_valid:
    for issue in result.issues:
        print(f"{issue.severity}: {issue.message}")
```
Best practices for ingestion:

- Format Selection
  - Prefer text-based formats over image-based when possible
  - Use structured formats (JSON, CSV) for tabular data
  - Consider OCR quality for scanned documents
- Quality Preprocessing
  - Clean HTML content before processing
  - Validate structured data schemas
  - Check encoding for text files
- Batch Processing
  - Process similar formats together
  - Use parallel processing for large datasets
  - Monitor memory usage with large files

Best practices for export:

- Format Choice
  - Use JSONL for general training datasets
  - Use ChatML for conversational models
  - Use Parquet for analytics and large datasets
- Quality Control
  - Set appropriate quality thresholds
  - Review sample outputs manually
  - Validate export formats
- Dataset Management
  - Use consistent naming conventions
  - Version control configuration files
  - Document format specifications
Performance tuning:

- Memory Management

  ```yaml
  performance:
    streaming_mode: true
    batch_size: 100
    max_memory_usage: "8GB"
  ```

- Parallel Processing

  ```yaml
  ingest:
    parallel_processing: true
    max_workers: 8
  ```

- Caching

  ```yaml
  performance:
    caching:
      enabled: true
      cache_dir: "/tmp/qudata_cache"
  ```
Troubleshooting common issues:

- Encoding Problems

  ```python
  # Check for encoding issues: read the bytes once, then try fallbacks
  with open("document.txt", "rb") as f:
      raw = f.read()
  try:
      content = raw.decode("utf-8")
  except UnicodeDecodeError:
      content = raw.decode("latin-1")  # latin-1 accepts any byte sequence
  ```

- Memory Issues

  ```python
  # Monitor memory usage
  import psutil

  memory_percent = psutil.virtual_memory().percent
  if memory_percent > 80:
      print("Warning: High memory usage")
  ```

- Quality Issues

  ```python
  # Investigate low quality scores
  low_quality = [doc for doc in documents if doc.quality_score < 0.5]
  for doc in low_quality[:5]:
      print(f"File: {doc.source_path}")
      print(f"Issues: {doc.quality_issues}")
  ```
This guide provides comprehensive coverage of all supported formats and their optimal usage patterns. For specific format requirements or custom implementations, refer to the API documentation and configuration examples.