This document describes the synthetic data generation pipeline for KoViDoRe v2.
The pipeline consists of four main stages:
- Corpus Building - Convert PDF documents to structured markdown
- Summary Generation - Generate single and cross-section summaries
- Query Generation - Generate synthetic queries from context or summaries
- False Negative Filtering - Filter out false negative pages for each query
Convert PDF documents to page-level markdown with image captioning and section splitting.
Command:
python build_corpus.py --subsets {subset}
Process:
- Convert each PDF page to image
- Parse images using Upstage Document Parser
- Extract markdown content and section elements
- Save as corpus parquet file
Output: data/{subset}/corpus/data.parquet
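The section-splitting step above can be sketched in pure Python. This is a minimal illustration of turning page-level markdown into section elements keyed by heading; the function name and the `{"heading", "text"}` record schema are assumptions for this sketch, not the actual structure returned by the Upstage Document Parser.

```python
import re

def split_sections(markdown: str) -> list[dict]:
    """Split page-level markdown into section elements at H1/H2 headings.

    A simplified stand-in for a document parser's structured output;
    each element keeps its heading and the body text beneath it.
    """
    sections = []
    current = {"heading": "", "content": []}
    for line in markdown.splitlines():
        if re.match(r"^#{1,2}\s", line):  # a new section starts at an H1/H2
            if current["heading"] or current["content"]:
                sections.append(current)
            current = {"heading": line.lstrip("# ").strip(), "content": []}
        else:
            current["content"].append(line)
    if current["heading"] or current["content"]:
        sections.append(current)
    # Join body lines so each element is {"heading": str, "text": str}
    return [{"heading": s["heading"], "text": "\n".join(s["content"]).strip()}
            for s in sections]

page_md = "# Intro\nSome text.\n## Methods\nMore text."
elements = split_sections(page_md)
```

In the real pipeline these elements would be stored per page in the corpus parquet file.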
Generate summaries for each individual section.
Command:
bash scripts/run.sh --subsets {subset} --task single_section_summary
Process:
- Explode corpus by section elements
- Generate summary for each section using LLM
Output: data/{subset}/single_section_summary/
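The "explode by section elements" step can be sketched as follows. The row schema (`doc_id`, `page`, `sections`) is an assumption for illustration; the real seed files are parquet and the summarization call itself (an LLM request) is omitted.

```python
def explode_by_sections(corpus_rows: list[dict]) -> list[dict]:
    """Turn one row per page (holding a list of section elements) into
    one row per section, so each section can be summarized independently."""
    exploded = []
    for row in corpus_rows:
        for idx, section in enumerate(row["sections"]):
            exploded.append({
                "doc_id": row["doc_id"],
                "page": row["page"],
                "section_idx": idx,
                "section_text": section,
            })
    return exploded

corpus = [{"doc_id": "d1", "page": 0, "sections": ["sec A", "sec B"]},
          {"doc_id": "d1", "page": 1, "sections": ["sec C"]}]
rows = explode_by_sections(corpus)
```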
Combine multiple sections and generate cross-section summaries.
Command:
bash scripts/run.sh --subsets {subset} --task cross_section_summary
Process:
- Randomly combine 3, 5, or 7 sections per document
- Generate cross-section summaries that synthesize information across sections
Output: data/{subset}/cross_section_summary/
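The random combination step can be sketched like this, assuming a seeded RNG for reproducibility. The function name and parameters are hypothetical; the LLM call that actually writes the cross-section summary is omitted.

```python
import random

def sample_section_groups(sections, group_sizes=(3, 5, 7),
                          n_groups=2, seed=42):
    """Randomly combine sections into groups of 3, 5, or 7 per document.

    Each group size is drawn from group_sizes, restricted to sizes that
    the available number of sections can support.
    """
    rng = random.Random(seed)
    feasible = [s for s in group_sizes if s <= len(sections)]
    groups = []
    for _ in range(n_groups):
        size = rng.choice(feasible)
        groups.append(rng.sample(sections, size))  # sample without replacement
    return groups

sections = [f"section-{i}" for i in range(10)]
groups = sample_section_groups(sections)
```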
Filter summaries to ensure diversity.
Note: This step was not used in the current pipeline. Instead, diverse summaries were considered during the query quality control process.
Two approaches for generating synthetic queries:
Generate queries directly from page markdown with surrounding context.
Command:
bash scripts/run.sh --subsets {subset} --task query_from_context
Process:
- Create sliding windows of 5 or 7 consecutive pages
- Generate queries that require information from multiple pages
Output: data/{subset}/query_from_context/
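The sliding-window construction can be sketched as a small generator; the function name is an assumption, and in the real pipeline each window of page markdown would be passed to the LLM to generate queries.

```python
def sliding_windows(pages, window_sizes=(5, 7)):
    """Yield every window of 5 or 7 consecutive pages, so queries can be
    generated from multi-page context."""
    for size in window_sizes:
        for start in range(len(pages) - size + 1):
            yield pages[start:start + size]

pages = list(range(10))  # stand-ins for page markdown strings
windows = list(sliding_windows(pages))
```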
Generate queries from cross-section summaries.
Command:
bash scripts/run.sh --subsets {subset} --task query_from_summary
Process:
- Use cross-section summaries as input
- Generate queries based on synthesized information
Output: data/{subset}/query_from_summary/
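A minimal sketch of turning a cross-section summary into an LLM prompt for query generation. The wording, language choice (Korean, given the KoViDoRe target), and function name are all assumptions, not the pipeline's actual prompt.

```python
def build_query_prompt(summary: str, n_queries: int = 3) -> str:
    """Assemble a simple instruction prompt asking an LLM to write search
    queries answerable only via the summarized information."""
    return (
        "Below is a cross-section summary of a document.\n\n"
        f"Summary:\n{summary}\n\n"
        f"Write {n_queries} Korean search queries whose answers require "
        "the information synthesized above."
    )

prompt = build_query_prompt("두 절을 종합한 요약입니다.", n_queries=2)
```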
Filter out false negative pages for each query type (query_from_context, query_from_summary).
Command:
# Filter false negatives for queries from context
bash scripts/run.sh --subsets {subset} --task filter_query_from_context
# Filter false negatives for queries from summary
bash scripts/run.sh --subsets {subset} --task filter_query_from_summary
Process:
- For each query, check relevance against all corpus pages using LLM
- Identify pages that could incorrectly answer the query (false negatives)
- Filter out these false negative pages from the relevance set
Output: data/{subset}/filter_query_from_context/ or data/{subset}/filter_query_from_summary/
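The filtering loop above can be sketched with a pluggable relevance judge. Here `is_relevant` stands in for the LLM relevance call; the function name, qrels shape, and the toy keyword judge are assumptions for illustration only.

```python
def filter_false_negatives(query, qrels, corpus, is_relevant):
    """Find pages that could answer the query but sit outside the labeled
    relevance set (false negatives), so they can be excluded.

    is_relevant(query, page_text) stands in for an LLM relevance check.
    Returns the set of page ids to drop from the negative pool.
    """
    gold = set(qrels[query])
    false_negatives = set()
    for page_id, text in corpus.items():
        if page_id not in gold and is_relevant(query, text):
            false_negatives.add(page_id)
    return false_negatives

corpus = {"p1": "capital of France is Paris",
          "p2": "Paris facts and figures",
          "p3": "slow-cooker recipes"}
qrels = {"capital of France?": ["p1"]}
judge = lambda q, t: "Paris" in t  # toy stand-in for the LLM judge
fns = filter_false_negatives("capital of France?", qrels, corpus, judge)
```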
data/{subset}/
├── pdfs/ # Place PDF files here before running the pipeline
├── images/ # Generated page images from PDFs
├── corpus/
│ └── data.parquet # Parsed corpus data
├── seed/ # Preprocessed seed files for each task
│ ├── seed_for_single_section_summary.parquet
│ ├── seed_for_cross_section_summary.parquet
│ ├── seed_for_query_from_context.parquet
│ ├── seed_for_query_from_summary.parquet
│ ├── seed_for_filter_query_from_context.parquet
│ └── seed_for_filter_query_from_summary.parquet
├── single_section_summary/ # Pipeline output
├── cross_section_summary/ # Pipeline output
├── query_from_context/ # Pipeline output
├── query_from_summary/ # Pipeline output
├── filter_query_from_context/ # Pipeline output
└── filter_query_from_summary/ # Pipeline output
Note: Before running the pipeline, place your PDF files in the
data/{subset}/pdfs/ directory.
You can customize model providers in src/kovidore_data_generator/config.py:
from data_designer.essentials import ModelProvider, ModelConfig, ChatCompletionInferenceParams

# Define a new provider
custom_provider = ModelProvider(
    name="custom",
    endpoint="https://api.custom-provider.com/v1",
    api_key="CUSTOM_API_KEY",  # Environment variable name
)

# Add to the model_providers list
model_providers = [upstage_provider, openai_provider, custom_provider]

# Define the model configuration
model_configs = [
    ModelConfig(
        alias="custom-model",  # Use this alias with the --model_alias flag
        model="model-name",    # Actual model name from the provider
        provider="custom",     # Must match the provider name
        inference_parameters=ChatCompletionInferenceParams(
            max_tokens=4096,
            temperature=0.7,
            top_p=0.9,
        ),
    ),
]

Usage:
bash scripts/run.sh --subsets {subset} --task {task} --model_alias custom-model

# 1. Build corpus
uv run python build_corpus.py --subsets {subset}
# 2. Summary generation
bash scripts/run.sh --subsets {subset} --task single_section_summary
bash scripts/run.sh --subsets {subset} --task cross_section_summary
# 3. Query generation (choose one or both)
bash scripts/run.sh --subsets {subset} --task query_from_context
bash scripts/run.sh --subsets {subset} --task query_from_summary
# 4. False negative filtering
bash scripts/run.sh --subsets {subset} --task filter_query_from_context
bash scripts/run.sh --subsets {subset} --task filter_query_from_summary
# 5. Quality control (manual review)





