PDF Extraction Pipeline

An end-to-end, automated PDF-to-CSV processing system leveraging computer vision, OCR, and large language models (LLMs) for highly accurate tabular data extraction.

This project is designed to eliminate manual QA bottlenecks: it produces fully structured outputs with traceable logs and retry mechanisms aimed at 100% accuracy, reducing manual verification effort by 90% and cutting operational costs by roughly $100,000 annually.

Key features include:

  • Robust PDF preprocessing and table detection using state-of-the-art CV methods.
  • Precise OCR extraction of text with positional data.
  • Structured data parsing and integration into LLM-driven CSV generation.
  • Dynamic postprocessing, evaluation, and retry loops for automatic error correction.
  • Detailed logging for real-time monitoring and traceability.

Key Highlights

  • End-to-end automation: From raw PDF detection to validated CSV outputs.
  • Error resilience: Automatic retries for LLM outputs and conversion issues.
  • Validation & QA: Dynamic checks at every stage, with detailed evaluation reports.
  • Cost savings: Reduces manual verification effort by 90%, cutting operational costs.
  • Traceable outputs: Timestamped logs for every processing step.

Folder Structure

archive/
notebooks/
src/
  detect_extract/
    __init__.py
    config.py
    doc.md
    run_detect_extract.py
    utils.py
  evaluation/
    comparison.py
    evaluation.py
    locating_failures.py
    summarization.py
    types.py
    utils.py
  get_metadata/
    get_metadata.py
    utils.py
  llm/
    input_llm.py
    utils.py
    prompts/
      description_factory.py
      notes.md
      system_prompt.j2
      user_prompt.j2
  pipeline/
    main.py
    types.py
    utils.py
  postprocessing/
    postprocess.py
    types.py
    utils.py
  utils/
    __init__.py
    constants.py
    types.py
    utils.py
    verbose_printer.py

Overall System Flow

The current architecture implements a fully automated document-to-dataset pipeline, starting from raw PDF detection to validated, structured outputs. Each stage builds on the previous one, ensuring robustness, traceability, and quality control through iterative retries and evaluation.

detect_extract → get_metadata → llm → pipeline

Description: The architecture integrates postprocessing, retry logic, and evaluation within a unified pipeline. It introduces dynamic feedback loops to automatically recover from LLM or data inconsistencies.

Step-by-Step Flow

Pipeline Diagram

  1. detect_extract

    1. Detects and extracts table boundaries.
    2. Extracts raw words with positional data from PDFs.
    3. Formats words into lines.
    4. Outputs structured .json with keys: words and file_name.
  2. get_metadata

    1. Reads the files from detect_extract.
    2. Derives operator name and download date from file_name.
    3. Extracts and formats total values for downstream validation.
    4. Outputs structured .json with the additional keys: totals, operator, and date.
  3. llm

    • Converts extracted page data into structured CSV-like content.
    • Uses a central prompt template, with operator-specific entries.
    • Saves initial LLM outputs for each PDF in .json with keys: words and page_num.
  4. pipeline

    • Central controller handling postprocessing, evaluation, and performance-based retrial mechanisms.

    • Executes the following decision loop:

    1. Postprocess

      • Handles structured data cleaning and unexpected outputs.
      • Handles retries for errors when converting LLM output to DataFrames.
      • Logs every change from the raw LLM output to the structured CSV.
    2. Evaluation

      • Evaluates each PDF against agreed-upon validation checks (most prominently the total check).
      • Flags and locates problems, narrowing down to specific lines in each page.
    3. Retrial Mechanism

      • If evaluation fails → triggers a selective LLM retry (page-level).
      • Appends to the prompt the specific problem at each line, together with the previous output.
      • Re-runs postprocessing + evaluation for retried segments.
      • Continues until all PDFs pass or the maximum number of retry attempts is reached.
    • Outputs:
      • Cleaned and verified postprocessed data.
      • Evaluation reports and flags.
      • Logs for every step, timestamped for traceability.
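The postprocess → evaluate → retry decision loop described above can be sketched as follows. All names here (postprocess, evaluate, retry_pages, MAX_RETRIES) are illustrative stand-ins, not the project's actual API:

```python
# Minimal sketch of the pipeline decision loop, assuming three injected
# callables: postprocess (clean + merge), evaluate (returns failed pages),
# and retry_pages (re-queries only the failed pages with extra context).
MAX_RETRIES = 3

def run_pipeline(llm_pages, postprocess, evaluate, retry_pages):
    for attempt in range(MAX_RETRIES + 1):
        data = postprocess(llm_pages)           # clean + merge page outputs
        report = evaluate(data)                 # e.g., total check per page
        if not report["failed_pages"]:
            return data, report                 # all checks passed
        if attempt == MAX_RETRIES:
            break                               # give up, return flagged result
        # selective page-level retry: only failed pages are re-queried,
        # with the located problems appended to the prompt
        llm_pages = retry_pages(llm_pages, report["failed_pages"])
    return data, report
```

The loop terminates either when every page passes evaluation or when the retry budget is exhausted, in which case the flagged report is returned for inspection.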

High-Level Summary

| Stage | Responsibility | Output | Error Handling |
| --- | --- | --- | --- |
| detect_extract | Extract words, tables, and positions from PDFs | Raw extracted JSON/text | Skips corrupted pages |
| get_metadata | Derive operator, date, and totals | Enriched metadata JSON | Logs missing/invalid files |
| llm | Generate structured tabular predictions | LLM page-level CSVs | Logs model failures |
| pipeline | Integrate postprocessing + evaluation + reruns | Final verified tables + reports | Retries failed pages automatically |

Folder Descriptions

This is a detailed description of each folder and file. Each stage takes the previous stage's output as input, unless stated otherwise.

detect_extract/

This folder handles table detection and text extraction from PDFs using the Microsoft Table Transformer.

  1. Loads page images using fitz (PyMuPDF).
    • pdf2image had similar accuracy; however, fitz was ~25x faster and ~3x more memory efficient (supported by notebooks/fitz_pdfplumber_pdf2image).
    • Functions: extract_text_from_pdf
  2. Detects and extracts table boundaries using Microsoft Table Transformer.
    • Functions: extract_text_from_pdf
  3. Extracts raw words with positional data from PDFs using fitz.
    • PDFPlumber had similar accuracy; however, fitz was ~150x faster (supported by notebooks/fitz_pdfplumber_pdf2image).
    • Functions: extract_text_from_pdf
  4. Formats words into lines with separators (e.g., |) so the content is more structured for LLM ingestion.
    • Functions: reconstruct_text_add_location
  5. Outputs structured .json with keys: words and file_name.
    • Output words is a list of strings. Each index is a page, and rows within a page are separated by a line break (\n).
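The line-reconstruction step can be sketched as below. The function name and input structure are illustrative assumptions, not the project's actual API; the output matches the "rowN: |x0|word|x1|" format shown in the example output:

```python
# Hypothetical sketch of line reconstruction: words with x-coordinates are
# grouped by row and joined with "|" separators for LLM ingestion.
def reconstruct_lines(words_by_row):
    """words_by_row: list of rows; each row is a list of (x0, text, x1)."""
    lines = []
    for i, row in enumerate(words_by_row):
        cells = " ".join(f"|{x0}|{text}|{x1}|" for x0, text, x1 in row)
        lines.append(f"row{i}: {cells}")
    return "\n".join(lines)
```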

Detailed Description

Entry Point: run_detect_extract.py: run this script to extract and format text from PDF tables.

Files:

  • config.py: Configures the Table Transformer pipeline, sets devices, relative paths, and builds absolute paths for models and configs.
  • utils.py: Contains the main functions for detecting tables, extracting text, reconstructing lines, and adding locations in PDFs.
  • run_detect_extract.py: Script to run table detection and extraction. Initializes the pipeline and calls process_pdfs.
  • doc.md: Optional documentation or notes specific to this module.

Key functions in utils.py:

  1. scale_pdf_coord: Converts bounding box coordinates between PDF points and image pixels.
  2. extract_text_from_pdf: Detects table regions and extracts word tokens from PDFs.
  3. reconstruct_text_add_location: Reconstructs lines from extracted words with horizontal positions.
  4. process_pdfs: Processes one or multiple PDFs and optionally saves output to JSON.
  5. handle_detect_extract_args: Parses standardized command-line arguments for this module.
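The coordinate conversion that scale_pdf_coord performs can be illustrated with a small sketch. The signature below is an assumption (the README does not show the real one); it simply rescales a bounding box from PDF point space to image pixel space:

```python
# Illustrative coordinate scaling between PDF points and image pixels,
# the role described for scale_pdf_coord. Name and signature are assumed.
def scale_bbox(bbox, pdf_size, image_size):
    """Scale an (x0, y0, x1, y1) box from PDF point space to pixel space."""
    sx = image_size[0] / pdf_size[0]   # horizontal scale factor
    sy = image_size[1] / pdf_size[1]   # vertical scale factor
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
```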

Example Output (Simplified)

{
    "words": [
        "row0: |245.2|Prod|261.0| |291.0|Oil|300.3| |332.0|Oil|341.3| |367.2|Gas|380.3| |409.2|Gas|422.3| |442.3|Water|461.6| |479.5|Water|498.9| |513.4|Casing|536.6| |547.6|Tubing|570.8| |615.4|Down|634.3|\nrow1: |20.3|Well|34.6| |36.5|ID|43.5| |63.1|Well|77.3| |79.3|Name|98.2| |173.3|API|184.9| |246.1|Date|261.1| |284.5|Prod|300.3| |323.1|Sales|341.3| |364.4|Prod|380.3| |404.1|Sales|422.3| |445.7|Prod|461.5| |480.6|Inject|498.8| |521.5|Pres|536.6| |553.8|Pres.|570.8| |581.6|Choke|602.8| |620.0|time|634.3| |646.8|Downtime|680.0| |681.9|Reason|707.0|
        \nrow2: |18.8|303969|39.7| |56.8|BINDER|80.4| |82.1|FED|94.7| |96.4|2-2HZ|113.7| |115.5|CLCDFH|140.8| |173.8|05123507300000|222.5| |221.3|2025-03-31|253.3| |285.0|44.46|300.7| |326.5|44.52|342.2| |361.0|479.46|380.2| |403.3|479.46|422.4| |451.0|8.94|463.2| |488.0|0.00|500.2| |518.5|317.07|537.7| |551.0|184.48|570.2| |584.0|162.00|603.2|",
        
        "row0: |245.2|Prod|261.0| |291.0|Oil|300.3| |332.0|Oil|341.3| |367.2|Gas|380.3| |409.2|Gas|422.3| |442.3|Water|461.6| |479.5|Water|498.9| |513.4|Casing|536.6| |547.6|Tubing|570.8| |615.4|Down|634.3|\nrow1: |20.3|Well|34.6| |36.5|ID|43.5| |63.1|Well|77.3| |79.3|Name|98.2| |173.3|API|184.9| |246.1|Date|261.1| |284.5|Prod|300.3| |323.1|Sales|341.3| |364.4|Prod|380.3| |404.1|Sales|422.3| |445.7|Prod|461.5| |480.6|Inject|498.8| |521.5|Pres|536.6| |553.8|Pres.|570.8| |581.6|Choke|602.8| |620.0|time|634.3| |646.8|Downtime|680.0| |681.9|Reason|707.0|"
        ],
    "file_name": "PDSWDX-DP-Anadarko-2025-07-31"
}

Example Run

python -m src.detect_extract.run_detect_extract --pages-to-detect 0 -1 --verbosity 0 --show-time --input-path ./data/raw --output-path ./data/extracted/no_detect_trial

get_metadata/

This folder handles the gathering of metadata. The operator and the PDF download date are trivially extracted from the file name, and column totals/sums are extracted from the preprocessed words (essential for live evaluation).

  1. Derives operator name and download date from file_name.
    • Functions: get_operator_download_date
  2. Extracts and formats total values for downstream validation.
    • Functions: get_totals
  3. Outputs structured .json with the detect_extract keys plus totals, operator, and date.
    • totals is a list of dicts with page_num, row_num, row_num_adjusted (row_num minus the rows taken up by the column headers), row_data (lines that contain 'total'), and numbers (the extracted numbers).
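Deriving the operator and download date from a file name like "PDSWDX-DP-camino-2025-07-30" (the naming convention shown in the example outputs) can be sketched as below; the real get_operator_download_date may handle more variations:

```python
# Hypothetical sketch of file-name parsing for the convention
# "<prefix>-<type>-<operator>-YYYY-MM-DD" seen in the example outputs.
def operator_and_date(file_name):
    parts = file_name.split("-")
    operator = parts[2].lower()        # third segment is the operator
    date = "-".join(parts[-3:])        # last three segments form YYYY-MM-DD
    return operator, date
```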

Detailed Description

Entry Point: get_metadata.py: Main entry point that gets metadata and saves it in an output directory.

Files:

utils.py: Contains all helper functions for metadata processing.

get_metadata.py: Main entry point that gets metadata and saves it in an output directory.

Key functions in utils.py:

process_metadata_files: Iterates over .json files in an input directory, updates each file with extracted operator, download date, and computed totals, and writes the enriched JSON to an output directory.

parse_number: Converts numeric strings (with commas) to floats; returns None if conversion fails.
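A minimal sketch of parse_number's described behavior (the actual implementation may differ): numeric strings with thousands separators become floats, and anything unparsable returns None.

```python
# Sketch: "3,128,216.28" -> 3128216.28; non-numeric input -> None.
def parse_number(s):
    try:
        return float(s.replace(",", ""))
    except (ValueError, AttributeError):
        return None
```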

get_operator_download_date: Extracts operator name and download date from a PDF file name following a specific naming convention.

get_totals: Identifies total rows in reconstructed PDF text, parsing numeric values and adjusting row indices for multi-line rows.

handle_get_metadata_args: Parses and returns standardized command-line arguments for metadata processing, including input/output directories and logging options.

extract_triplets: Extracts structured triplets from lines of text using regex patterns, typically for numeric and textual data.
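The triplet extraction can be illustrated with a regex over the "|x0|word|x1|" line format produced by detect_extract; the project's extract_triplets may use different patterns:

```python
import re

# Illustrative triplet extraction for lines like
# "row3: |311.9|Total:|335.7| |421.4|208,316.47|461.9|".
TRIPLET_RE = re.compile(r"\|(\d+(?:\.\d+)?)\|([^|]+)\|(\d+(?:\.\d+)?)\|")

def extract_triplets(line):
    """Return (x0, text, x1) triplets parsed from one reconstructed row."""
    return [(float(x0), word, float(x1))
            for x0, word, x1 in TRIPLET_RE.findall(line)]
```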

Example Output (Simplified)

{
  "totals": [
    {
      "page_num": 68,
      "row_num": 3,
      "row_num_adjusted": 0,
      "row_data": "row3: |311.9|Total:|335.7| |421.4|208,316.47|461.9| |470.2|3,128,216.28|517.4| |532.4|434,021.68|572.9|",
      "numbers": [
        208316.47,
        3128216.28,
        434021.68
      ]
    }
  ],
  "operator": "camino",
  "date": "2025-07-30"
}

Example Run

python -m src.get_metadata.get_metadata --input-path ./data/extracted/test --output-path ./data/preprocessed/test --verbosity 5 --show-time

llm/

This folder handles processing PDFs with LLMs, generating structured outputs from PDF text using system and user prompts. It supports page-level processing, including safe argument handling and prompt templating.

  • Uses central user and system prompt templates, with operator-specific entries.
  • Saves initial LLM outputs for each PDF in .json with keys: words and page_num.
  • Each page is a different LLM request.
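The per-page request pattern can be sketched as below. The message layout mirrors the OpenAI-style chat format the module converts to; the prompt contents and the call_llm callable are hypothetical stand-ins for the project's templated prompts and client:

```python
# Sketch of one LLM request per page. call_llm is an injected stand-in for
# the actual client call; prompts come from the Jinja2 templates in practice.
def build_page_messages(system_prompt, user_prompt, page_words):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{user_prompt}\n\n{page_words}"},
    ]

def process_pdf_pages(pages, call_llm, system_prompt, user_prompt):
    outputs = []
    for page_num, words in enumerate(pages):
        msgs = build_page_messages(system_prompt, user_prompt, words)
        # each saved page output carries the keys noted above
        outputs.append({"llm_output": call_llm(msgs),
                        "page_num": page_num,
                        "words": words})
    return outputs
```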

Detailed Description

Entry Point: input_llm.py: Main entry point for processing PDFs with LLMs, orchestrating prompt construction, page-level calls, and output saving.

Files:

utils.py: Contains all helper functions for LLM processing.

prompts/: Stores prompt templates and related resources:

  • description_factory.py: Generates structured descriptions for prompts.
  • notes.md: Notes on prompt design and usage.
  • system_prompt.j2: Jinja2 template for system prompts.
  • user_prompt.j2: Jinja2 template for user prompts.

Key functions in utils.py:

get_llm_output_pdf: Processes a PDF document with an LLM, generating outputs for each page in a specified range and optionally saving results.

get_llm_output_page: Processes a single page using an LLM and returns structured output including page number and original words.

convert_to_messages: Converts message objects to OpenAI-compatible format (system, user, assistant).

get_llm_output_raw: Processes a PDF using the OpenRouter API, saving output per page and/or per PDF.

get_llm_output_page_raw: Processes a single page using OpenRouter API and returns structured output.

get_prompt_template: Retrieves a prompt template from a local file or LangSmith service.

get_prompt_template_jinja: Loads a Jinja2 template from a directory and compiles it for rendering.

load_path_default: Loads JSON configuration from a path, falling back to a default dictionary.

handle_llm_args: Sets up and parses command-line arguments for LLM processing, initializing clients and configurations.

get_llm_output_pdfs: Processes multiple PDFs from a directory or a single PDF file using an LLM with given prompt templates and configurations.

get_llm_output_page_wrapper: Wrapper for get_llm_output_page to handle prompt formatting, additional context, and previous outputs.

call_llm_with_safe_args: Calls an LLM function with filtered arguments, ensuring only valid parameters are passed.

get_safe_llm_args: Filters a dictionary of arguments to include only those accepted by a specific function.
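The argument-filtering idea behind get_safe_llm_args (and call_llm_with_safe_args) can be sketched with inspect.signature; the real helpers may apply additional rules:

```python
import inspect

# Sketch: keep only the keyword arguments that a target function accepts,
# so extra CLI/config keys never cause a TypeError at call time.
def get_safe_args(func, args):
    accepted = inspect.signature(func).parameters
    return {k: v for k, v in args.items() if k in accepted}
```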

Example Output (Simplified)

{
  "llm_output": "API,Well Name,Production Date,Oil Production,Gas Production,Water Production\n3305308768,\"A JOHNSON 5397 43-33 10B\",2025-08-14,7.25,98.53,21.25\n3305308768,\"A JOHNSON 5397 43-33 10B\",2025-08-13,6.63,110.91,20.00\n3305308768,\"A JOHNSON 5397 43-33 10B\",2025-08-12,5.59,102.44,23.07\n3305308768,\"A JOHNSON 5397 43-33 10B\",2025-08-11,8.85,101.89,22.04\n3305308768,\"A JOHNSON 5397 43-33 10B\",2025-08-10,7.45,103.58,21.11",
  "page_num": 0,
  "words": "row0: |509.5|Oil|518.7| |542.9|Gas|556.0| |573.8|Water|593.1| |608.1|Casing|631.3| |646.5|Tubing|669.7| |722.7|Water|742.1| |752.3|Down|771.3|\nrow1: |24.0|API|35.6| |75.0|Well|89.3| |91.2|Name|110.1| |169.0|WellNum|198.7| |222.0|Compl.|245.6| |247.5|N|252.5| |262.0|Compl.|285.6| |287.5|Name|306.4| |329.4|FieldName|364.5| |390.6|WellState|421.9| |433.2|Prod|451.0| |453.1|Date|470.0| |502.9|Prod|518.7| |540.2|Prod|556.0| |577.2|Prod|593.0| |610.4|Press.|631.3| |648.9|Press.|669.8| |686.0|Choke|707.3| |723.8|Inject|742.0| |756.9|time|771.2|"
}

Example Run

python -m src.llm.input_llm --file-per-page --input-path ./data/preprocessed/test --output-path ./data/llm_output/test_trial --model-name google/gemini-2.5-pro --start-page 1 --end-page 5 --verbosity 5 --show-time

postprocessing/

This folder handles the final transformation and cleanup of LLM-extracted outputs, combining page-level CSV text into structured tabular data. It validates, fixes, and merges all outputs into finalized DataFrames, ready for analysis or downstream use. It also manages retries for malformed pages and structured logging of all modifications. This is used in the pipeline module.

  • Handles the retry mechanism depending on the error encountered when converting to DataFrames.
  • Handles logging of retries including llm_attempts_by_page, retried_pages, changed_rows_by_page and changed_pages.
  • Each page has an LLM attempt with a list of outputs and each output the associated problems.
  • Handles unexpected output like tags, triple quotes etc.
    • Functions: handle_unexpected_output
  • Fixes unexpected numbers of commas (empty columns and trailing columns) and logs the changed rows.
    • Functions: fix_rows
  • Adjusts the totals' global row indices.
  • Logs global start and end row index of each page number.
  • Outputs pickle files with keys words, file_name, totals, operator, date, data, and page_row_mapper.
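The comma-fixing behavior described above can be sketched as follows; the name mirrors fix_row but the details are an assumption, not the project's exact logic:

```python
# Sketch: normalize one CSV row to the expected column count.
# Empty cells become None; trailing empties are trimmed, missing cells padded.
def fix_row(row, expected_cols):
    cells = [c if c != "" else None for c in row.split(",")]
    original_len = len(cells)
    while len(cells) > expected_cols and cells[-1] is None:
        cells.pop()                                      # trailing empty columns
    if len(cells) < expected_cols:
        cells += [None] * (expected_cols - len(cells))   # missing columns
    return cells, len(cells) != original_len             # (fixed row, changed?)
```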

Detailed Description

Entry Point: postprocess.py: Main entry point for orchestrating postprocessing of all PDFs, managing prompts, arguments, retries, and output persistence.

Files:

utils.py: Contains all helper functions for loading, cleaning, and combining LLM data.

types.py: Defines core dataclasses and typed dictionaries used throughout postprocessing (e.g., LLMData, PageData, PostprocessedPDF, FixRowResult, etc.).

Key functions in utils.py:

get_llm_data: Loads and validates LLM output files and metadata for a given PDF directory. Ensures page-level JSON files are sorted and consistent.

fix_row: Cleans a single CSV row to match the expected number of columns, replacing empty cells with None and trimming trailing empties.

fix_rows: Applies fix_row across all rows, removing trailing empty rows and tracking which were modified.

get_safe_args: Filters a dictionary of arguments to retain only those compatible with a given function signature.

handle_unexpected_output: Cleans LLM responses by stripping Markdown code block syntax and normalizing irregular outputs.
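Stripping Markdown fences from an LLM response, as handle_unexpected_output is described to do, can be sketched like this (the actual cleaning rules are likely broader, covering tags and triple quotes as well):

```python
import re

# Sketch: remove a surrounding ```lang ... ``` fence if present,
# otherwise return the text unchanged.
def strip_code_fences(text):
    text = text.strip()
    m = re.match(r"^```[a-zA-Z]*\n(.*?)\n?```$", text, re.DOTALL)
    return m.group(1) if m else text
```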

postprocess_pdf: Aggregates LLM-generated page outputs into a unified DataFrame. Cleans, retries, and merges page data, aligning totals and recording detailed logs of all fixes and LLM retries.

remove_quotes: Cleans DataFrame columns by removing surrounding quotation marks from string values while preserving numeric data.

handle_postprocessing_args: Builds and parses standardized command-line arguments for postprocessing, supporting flexible prefixes and consistent CLI configuration.

strip_metadata_totals: Removes totals from metadata, returning a reduced version for combination with processed data.

postprocess_pdfs: Top-level driver that applies postprocess_pdf to all PDFs in a directory, merges results with metadata, and optionally saves outputs and logs per PDF.

Example Output

This is the JSON equivalent, shown for easier portrayal.

{
   "words":[
   "\nrow25: |23.9|BRANDON|64.4| |66.7|12-5-6|90.1| |92.3|1H|102.7| |193.4|35051239440000|256.5| |269.2|2025-07-05|310.6| |316.4|Producing|352.9| |441.7|22.91|461.9| |496.4|81.00|516.7| |546.7|116.66|571.4| |598.4|871.68|623.2| |651.7|91.64|671.9| |704.9|0.00|720.7| |752.2|24.00|772.4|\nrow26: |23.9|BRANDON|64.4| |66.7|12-5-6|90.1| |92.3|1H|102.7| |193.4|35051239440000|256.5| |269.2|2025-07-04|310.6| |316.4|Producing|352.9| |441.7|26.25|461.9| |491.9|131.00|516.7| |546.7|120.00|571.4| |598.4|887.64|623.2| |651.7|96.37|671.9| |704.9|0.00|720.7| |752.2|24.00|772.4|",
 "row0: |452.3|Oil|462.7| |492.1|Gas|506.9|\nrow1: |24.6|Well|40.7| |42.9|Name|64.1| |194.9|API|207.9| |269.1|Prod|286.9| |289.1|Date|306.0| |317.2|Prod|334.9| |337.1|Status|360.9| |444.9|Prod|462.7| |489.1|Prod|506.9| |549.7|Water|571.4| |605.5|CSG|622.4| |655.5|TBG|671.9| |696.8|Choke|720.6| |748.6|HrsOn|772.4|\nrow2: |553.7|Prod|571.4| |603.3|Pres.|622.4| |652.8|Pres.|671.9|\nrow3: |23.9|BRANDON|64.4| |66.7|12-5-6|90.1| |92.3|1H|102.7| |193.4|35051239440000|256.5| |269.2|2025-07-03|310.6| |316.4|Producing|352.9|"
 ],
 "file_name": "PDSWDX-DP-camino-2025-07-30",
 "totals": [
            {"page_num": 0,
               "row_num_adjusted": 14,
               "global_row_number": 14,
               "numbers": [297.8, 465.95, 592.41]
            },
            {
               "page_num": 1,
               "row_num_adjusted": 12,
               "global_row_number": 28,
               "numbers": [440.81, 344.16, 863.54]}
             ],
 "operator": "camino",
 "date": "2025-07-30",
 "page_row_mapper": [
                     {"start_row": 0, "end_row": 16, "page_num": 1},
                     {"start_row": 16, "end_row": 40, "page_num": 2},
                     {"start_row": 40, "end_row": 62, "page_num": 3}
                     ],
   "data": "DATAFRAME"
}

Example Run

python -m src.postprocessing.postprocess --input-path ./data/llm_output/test --log-path ./data/llm_output/log/test --output-path ./data/postprocessed/test --verbosity 1 --show-time

evaluation/

This folder handles evaluation of the postprocessed LLM output, providing live evaluation using rules and logic based on industry experts' experience with the PDFs. This is used in the pipeline module.

  1. Formats columns to the specified data types.
    • Functions: format_types
  2. Handles the evaluations (explained below).
    • Functions: evaluate_dataframe
  3. Aggregates evaluation results and categorizes outcomes into discrete statuses such as Pass, Warning, Failure, etc.
    • Functions: summarize_evaluation
  4. Formats the summarization into a more reader-friendly form.
    • Functions: format_summary
  5. Performs the final aggregation over the whole PDF conversion.
    • Functions: compute_evaluation_status
  6. Locates and specifies the line and page number for each problem.
    • Functions: locate_evaluation_failures
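The most prominent check, validating column sums against the totals parsed from the PDF, can be sketched as below. The structure and tolerance are illustrative assumptions, not the project's sum_checker itself:

```python
# Sketch of the total check: compare computed column sums against totals
# parsed from the PDF's "Total:" rows, within a small tolerance.
def sum_checker(column_sums, expected_totals, tol=0.01):
    """column_sums / expected_totals: lists of floats in matching order."""
    failures = [i for i, (got, want) in enumerate(zip(column_sums, expected_totals))
                if abs(got - want) > tol]
    return {"passed": not failures, "failed_columns": failures}
```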

Detailed Description

Entry Points:

  • evaluation.py: Main entry point for evaluating the processed dataframes of each PDF after postprocessing.
  • comparison.py: Entry point for comparing a ground-truth CSV with the LLM-generated CSV output.

Files:

  • summarization.py: Contains functions that aggregate evaluation results and return overall flags, e.g., warning, failure, passed, or not ran.
  • locating_failures.py: Defines Problems that map detected issues to PDF pages, including page details and descriptions.
  • types.py: Contains TypedDicts and Pydantic models to properly define function arguments and return types.
  • utils.py: Helper functions supporting evaluation, comparison, and summarization workflows.

Key functions in utils.py:

Evaluations

  • format_types: Safely convert specified DataFrame columns to given data types and track per-cell failures.
  • column_checker: Check for missing columns in a DataFrame.
  • all_nan_columns_checker: Identify columns in a DataFrame that contain only NaN values.
  • find_invalid_api_length: Return rows where the API number length is invalid.
  • find_invalid_api_content: Return rows where the API contains characters other than digits or dashes.
  • sum_checker: Validate whether column sums in a DataFrame match precomputed totals.
  • duplicate_production_date_checker: Identify and summarize duplicate (primary key, date) pairs in a DataFrame.
  • check_primary_key_date_consistency: Check consistency of production dates across groups defined by a primary key.
  • data_frequency_checker: Analyze the distribution of date differences for each group.

Others

  • classify_dataframe: Classify a production DataFrame into high-level structural states.
  • evaluate_dataframe: Evaluate a production DataFrame across multiple quality, consistency, and formatting checks.
  • evaluate_dataframes: Evaluate multiple pickled DataFrames within a directory and summarize results.
  • handle_evaluation: Perform a full evaluation workflow for a production DataFrame.
  • save_evaluation_results: Save evaluation results and status summaries to a timestamped directory.
  • handle_evaluation_args: Parse and return standardized evaluation arguments.
  • compare_dfs_target_columns: Compare two DataFrames for column name and content accuracy.
  • handle_comparison_args: Parse and return standardized comparison arguments.

Example Run

Evaluation

python -m src.evaluation.evaluation --verbosity 5 --show-time --input-path ./data/postprocessed/dir --output-path ./data/evaluation/dir

Comparison

python -m src.evaluation.comparison --test-df-path ./data/postprocessed/pipeline_trial/sub_100/high_reasoning/2025-10-14_15-33-34 --ground-truth-df-path ./data/ground_truth/trial_sub_100/dfs --comparison-output-path ./data/comparison/trial_test

pipeline/

This folder integrates the postprocessing and evaluation stages with LLM retrial into a unified execution pipeline. It automates the full end-to-end process of transforming LLM outputs into finalized structured tables, validating them against expected totals, retrying problematic pages, and saving detailed logs and evaluation summaries.

Detailed Description

Entry Point: main.py: Top-level script that initializes LLM and pipeline configurations, loads templates, and executes the combined postprocessing–evaluation pipeline across all PDFs.

Files:

utils.py: Contains all helper functions responsible for orchestrating postprocessing, evaluation, and retry logic.

types.py: Defines structured type hints and dataclasses used throughout the pipeline (e.g., PostprocessEvaluateResults, ColumnConfig, MetaData, etc.).

Key functions in utils.py:

postprocess_evaluate: Core unified function that executes both postprocessing and evaluation for a single PDF. It merges LLM outputs into structured DataFrames, validates column sums and totals, identifies mismatches, and retries failed pages with refined prompts until convergence or retry limits are reached.

postprocess_evaluate_pdfs: Batch-level orchestrator that runs postprocess_evaluate for each PDF directory. It loads LLM outputs and metadata, performs retries, and writes results to disk (processed data, evaluation summaries, and logs). Designed for large-scale automated evaluation runs.

handle_pipeline_args: Parses and prepares standardized command-line arguments for the full pipeline. Handles path setup, timestamping, verbosity, and all constants (e.g., column configs, formats, and operator mappings). Returns a fully resolved argument dictionary for downstream use.

Example Run

python -m src.pipeline.main `
  --pipeline-folder-name sub_100 `
  --pipeline-input-path ./data/llm_output/test `
  --pipeline-output-path-postprocess ./data/postprocessed/test `
  --pipeline-output-path-evaluation ./data/evaluation/test `
  --pipeline-log-path ./data/log/test `
  --pipeline-timestamped `
  --pipeline-verbosity 5 `
  --pipeline-show-time

Example Output

Same output as the postprocessing stage.

utils/

This folder provides shared utility components used across all modules (postprocessing, evaluation, metadata extraction, and pipeline). It centralizes constants, type definitions, verbose logging, and helper functions for clean, consistent, and maintainable workflows.

Files:

constants.py: Defines all static configuration values, mappings, and global constants used throughout the project (e.g., column names, operator identifiers, regex patterns, directory structures, and thresholds).

types.py: Contains all TypedDict, dataclass, and Pydantic model definitions that standardize data structures across modules. Defines type-safe schemas for metadata, evaluation results, pipeline configurations, and processed DataFrames.

utils.py: General-purpose helper functions for file I/O, data manipulation, JSON/CSV handling, path management, and validation logic. These utilities are imported across all stages of the pipeline.

verbose_printer.py: Implements a configurable logging utility that prints messages according to verbosity levels and timestamps (when enabled). Supports color-coded or prefixed outputs for clearer traceability across long multi-stage runs.
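A verbosity-gated printer in the spirit described above can be sketched as follows; the class name, levels, and timestamp format are assumptions rather than the project's actual implementation:

```python
import time

# Sketch of a verbosity-gated, optionally timestamped printer.
class VerbosePrinter:
    def __init__(self, verbosity=0, show_time=False):
        self.verbosity = verbosity
        self.show_time = show_time

    def print(self, msg, level=1):
        """Print msg only if its level is within the configured verbosity."""
        if level > self.verbosity:
            return None
        prefix = time.strftime("[%H:%M:%S] ") if self.show_time else ""
        line = prefix + msg
        print(line)
        return line
```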

Run

Setup Models

This section covers the detect_extract stage, which uses computer vision models (Microsoft's Table Transformer) for table detection.

  1. Create a folder named model at table_transformer\table_transformer.
  2. Download pubtables1m_detection_detr_r18 and TATR-v1.1-All-msft from these links: DETR R18 | TATR-v1.1-All.

Environments

This pipeline needs two different environments because of their differing dependencies. The detect_extract.yml is purely for the detect_extract stage and is based upon the Microsoft table-transformer environment. The pipeline.yml is used for the rest of the stages in the whole pipeline (not only the pipeline module).

detect_extract

conda env create -f envs/detect_extract.yml
conda activate detect_extract

Then the table-transformer module is installed to be used as a package:

pip install ./table_transformer

Rest of pipeline

  1. Create a virtual environment:
python -m venv .venv
  2. Install dependencies using pip:
pip install -r envs/requirements.txt

Commands

To run any file stated to be an entry point, make sure you are in the root directory:

python -m src.dir.file_name --arg-1 val1 --arg-2 val2

Inspect arguments

All argument descriptions can be viewed using:

python -m src.path.to.file --help

Future Work

  • Aggregate identical problems into a single problem with a range of row indices for more efficient prompts.

    Instead of:

    - Row index: [20] — 'API' column should be either 10 or 12 or 14 characters
    - Row index: [21] — 'API' column should be either 10 or 12 or 14 characters
    - Row index: [22] — 'API' column should be either 10 or 12 or 14 characters

    Correct:

    - Row indices: [20,21,22] — 'API' column should be either 10 or 12 or 14 characters
  • When a problem occurs a certain number of times, use different LLM arguments (e.g., model, reasoning_effort, temperature) in the retrial mechanism.

Acknowledgements

This project is built on top of the Microsoft Table Transformer.

Some modifications have been made to enable the project to be used smoothly as a package. A pull request with these changes has been submitted: PR link.

We thank the original authors for their work.
