An end-to-end, automated PDF-to-CSV processing system leveraging computer vision, OCR, and large language models (LLMs) for highly accurate tabular data extraction.
This project is designed to eliminate manual QA bottlenecks, providing fully structured outputs with traceable logs and retry mechanisms to ensure 100% accuracy, reducing operational costs by 90% (~$100,000 annually).
Key features include:
- Robust PDF preprocessing and table detection using state-of-the-art CV methods.
- Precise OCR extraction of text with positional data.
- Structured data parsing and integration into LLM-driven CSV generation.
- Dynamic postprocessing, evaluation, and retry loops for automatic error correction.
- Detailed logging for real-time monitoring and traceability.
- End-to-end automation: From raw PDF detection to validated CSV outputs.
- Error resilience: Automatic retries for LLM outputs and conversion issues.
- Validation & QA: Dynamic checks at every stage, with detailed evaluation reports.
- Cost savings: Reduces manual verification effort by 90%, cutting operational costs.
- Traceable outputs: Timestamped logs for every processing step.
```
archive/
notebooks/
src/
├── detect_extract/
│   ├── __init__.py
│   ├── config.py
│   ├── doc.md
│   ├── run_detect_extract.py
│   └── utils.py
├── evaluation/
│   ├── comparison.py
│   ├── evaluation.py
│   ├── locating_failures.py
│   ├── summarization.py
│   ├── types.py
│   └── utils.py
├── get_metadata/
│   ├── get_metadata.py
│   └── utils.py
├── llm/
│   ├── input_llm.py
│   ├── utils.py
│   └── prompts/
│       ├── description_factory.py
│       ├── notes.md
│       ├── system_prompt.j2
│       └── user_prompt.j2
├── pipeline/
│   ├── main.py
│   ├── types.py
│   └── utils.py
├── postprocessing/
│   ├── postprocess.py
│   ├── types.py
│   └── utils.py
└── utils/
    ├── __init__.py
    ├── constants.py
    ├── types.py
    ├── utils.py
    └── verbose_printer.py
```
The current architecture implements a fully automated document-to-dataset pipeline, starting from raw PDF detection to validated, structured outputs. Each stage builds on the previous one, ensuring robustness, traceability, and quality control through iterative retries and evaluation.
detect_extract → get_metadata → llm → pipeline
Description: The new architecture integrates postprocessing, retry logic, and evaluation within a unified pipeline. It introduces dynamic feedback loops to automatically recover from LLM or data inconsistencies.
- `detect_extract`
  - Detects and extracts table boundaries.
  - Extracts raw words with positional data from PDFs.
  - Formats words into lines.
  - Outputs structured `.json` with keys: `words` and `file_name`.
- `get_metadata`
  - Reads the files from `detect_extract`.
  - Derives operator name and download date from `file_name`.
  - Extracts and formats total values for downstream validation.
  - Outputs structured `.json` with additional keys: `totals`, `operator`, and `date`.
- `llm`
  - Converts extracted page data into structured CSV-like content.
  - Uses a central prompt template, with operator-specific entries.
  - Saves initial LLM outputs for each PDF as `.json` with keys: `llm_output`, `page_num`, and `words`.
- `pipeline`
  - Central controller handling postprocessing, evaluation, and performance-based retry mechanisms.
  - Executes the following decision loop:
    - Postprocess
      - Handles structured data cleaning and unexpected outputs.
      - Handles retries for errors in the conversion to DataFrames.
      - Logs any change from the raw LLM output to the structured CSV.
    - Evaluation
      - Evaluates each PDF according to agreed-upon validation checks (most prominently: the total check).
      - Flags and locates problems, narrowing them down to specific lines on each page.
    - Retry mechanism
      - If evaluation fails → triggers a selective LLM retry (page-level).
      - Appends the specific problem at each line, along with the previous output, to the prompt.
      - Re-runs postprocessing + evaluation for the retried segments.
      - Continues until all PDFs pass or the maximum number of retry attempts is reached.
  - Outputs:
    - Cleaned and verified postprocessed data.
    - Evaluation reports and flags.
    - Logs for every step, timestamped for traceability.
| Stage | Responsibility | Output | Error Handling |
|---|---|---|---|
| detect_extract | Extract words, tables, and positions from PDFs | Raw extracted JSON/text | Skips corrupted pages |
| get_metadata | Derive operator, date, and totals | Enriched metadata JSON | Logs missing/invalid files |
| llm | Generate structured tabular predictions | LLM page-level CSVs | Logs model failures |
| pipeline | Integrate postprocessing + evaluation + reruns | Final verified tables + reports | Retries failed pages automatically |
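The decision loop summarized in the table above can be sketched as follows. This is a minimal illustration only: the `postprocess`, `evaluate`, and `retry_llm_pages` callables and the `max_retries` limit are hypothetical stand-ins, not the actual functions in `src/pipeline`.

```python
def run_pipeline(pages, postprocess, evaluate, retry_llm_pages, max_retries=3):
    """Sketch of the postprocess -> evaluate -> retry loop (hypothetical helpers)."""
    data = postprocess(pages)
    for _attempt in range(max_retries):
        problems = evaluate(data)  # e.g. failed total checks, each tagged with a page
        if not problems:
            return data, "passed"
        failed = sorted({p["page_num"] for p in problems})
        # Selective, page-level retry: only re-run the LLM on failing pages,
        # feeding the located problems back into the prompt.
        pages = retry_llm_pages(pages, failed, problems)
        data = postprocess(pages)
    return data, "failed"
```

The key design point, as described above, is that retries are page-scoped rather than document-scoped, so one bad page does not cost a full re-run.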
This is a detailed description of each folder and file. Each stage takes the previous stage's output as input, unless stated otherwise.
This folder handles table detection and text extraction from PDFs using the Microsoft Table Transformer.
- Loads images using `fitz`.
  - pdf2image had similar accuracy; however, fitz was ~25x faster and ~3x more memory efficient (supported by `notebooks/fitz_pdfplumber_pdf2image`).
  - Functions: `extract_text_from_pdf`
- Detects and extracts table boundaries using the Microsoft Table Transformer.
  - Functions: `extract_text_from_pdf`
- Extracts raw words with positional data from PDFs using `fitz`.
  - PDFPlumber had similar accuracy; however, fitz was ~150x faster (supported by `notebooks/fitz_pdfplumber_pdf2image`).
  - Functions: `extract_text_from_pdf`
- Formats words into lines with separators (e.g., `|`) so that the content is more structured for LLM ingestion.
  - Functions: `reconstruct_text_add_location`
- Outputs structured `.json` with keys: `words` and `file_name`.
  - The output `words` is a list of strings. Each index is a page, and rows within a page are separated by a line break (`\n`).
Entry Point: `run_detect_extract.py` — run this script to extract and format text from the PDF tables.

- `config.py`: Configures the Table Transformer pipeline, sets devices and relative paths, and builds absolute paths for models and configs.
- `utils.py`: Contains the main functions for detecting tables, extracting text, reconstructing lines, and adding locations in PDFs.
- `run_detect_extract.py`: Script to run table detection and extraction. Initializes the pipeline and calls `process_pdfs`.
- `doc.md`: Optional documentation or notes specific to this module.

Functions:

- `scale_pdf_coord`: Converts bounding box coordinates between PDF points and image pixels.
- `extract_text_from_pdf`: Detects table regions and extracts word tokens from PDFs.
- `reconstruct_text_add_location`: Reconstructs lines from extracted words with horizontal positions.
- `process_pdfs`: Processes one or multiple PDFs and optionally saves output to JSON.
- `handle_detect_extract_args`: Parses standardized command-line arguments for this module.
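A conversion like `scale_pdf_coord` typically just rescales a bounding box by the ratio of rendered image size to page size. A minimal sketch of the idea (the signature and the assumption that PDF pages are measured in points are ours, not necessarily the module's):

```python
def scale_pdf_coord(bbox, page_size, image_size):
    """Rescale an (x0, y0, x1, y1) box from PDF points to image pixels.

    PDF pages are measured in points (72 per inch); a rendered image is in
    pixels, so each axis is scaled by image_size / page_size independently.
    """
    page_w, page_h = page_size
    img_w, img_h = image_size
    sx, sy = img_w / page_w, img_h / page_h
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
```

For example, rendering a US-Letter page (612×792 pt) at 144 DPI doubles both dimensions, so every coordinate is simply doubled.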
```json
{
  "words": [
    "row0: |245.2|Prod|261.0| |291.0|Oil|300.3| |332.0|Oil|341.3| |367.2|Gas|380.3| |409.2|Gas|422.3| |442.3|Water|461.6| |479.5|Water|498.9| |513.4|Casing|536.6| |547.6|Tubing|570.8| |615.4|Down|634.3|\nrow1: |20.3|Well|34.6| |36.5|ID|43.5| |63.1|Well|77.3| |79.3|Name|98.2| |173.3|API|184.9| |246.1|Date|261.1| |284.5|Prod|300.3| |323.1|Sales|341.3| |364.4|Prod|380.3| |404.1|Sales|422.3| |445.7|Prod|461.5| |480.6|Inject|498.8| |521.5|Pres|536.6| |553.8|Pres.|570.8| |581.6|Choke|602.8| |620.0|time|634.3| |646.8|Downtime|680.0| |681.9|Reason|707.0|\nrow2: |18.8|303969|39.7| |56.8|BINDER|80.4| |82.1|FED|94.7| |96.4|2-2HZ|113.7| |115.5|CLCDFH|140.8| |173.8|05123507300000|222.5| |221.3|2025-03-31|253.3| |285.0|44.46|300.7| |326.5|44.52|342.2| |361.0|479.46|380.2| |403.3|479.46|422.4| |451.0|8.94|463.2| |488.0|0.00|500.2| |518.5|317.07|537.7| |551.0|184.48|570.2| |584.0|162.00|603.2|",
    "row0: |245.2|Prod|261.0| |291.0|Oil|300.3| |332.0|Oil|341.3| |367.2|Gas|380.3| |409.2|Gas|422.3| |442.3|Water|461.6| |479.5|Water|498.9| |513.4|Casing|536.6| |547.6|Tubing|570.8| |615.4|Down|634.3|\nrow1: |20.3|Well|34.6| |36.5|ID|43.5| |63.1|Well|77.3| |79.3|Name|98.2| |173.3|API|184.9| |246.1|Date|261.1| |284.5|Prod|300.3| |323.1|Sales|341.3| |364.4|Prod|380.3| |404.1|Sales|422.3| |445.7|Prod|461.5| |480.6|Inject|498.8| |521.5|Pres|536.6| |553.8|Pres.|570.8| |581.6|Choke|602.8| |620.0|time|634.3| |646.8|Downtime|680.0| |681.9|Reason|707.0|"
  ],
  "file_name": "PDSWDX-DP-Anadarko-2025-07-31"
}
```

```shell
python -m src.detect_extract.run_detect_extract --pages-to-detect 0 -1 --verbosity 0 --show-time --input-path ./data/raw --output-path ./data/extracted/no_detect_trial
```

This folder handles the gathering of metadata. The operator and the download date of the PDF are trivially extracted from the file name. Totals/sums of columns are extracted from the preprocessed words (essential for live evaluation).
- Derives operator name and download date from `file_name`.
  - Functions: `get_operator_download_date`
- Extracts and formats total values for downstream validation.
  - Functions: `get_totals`
- Outputs structured `.json` with the `detect_extract` keys plus `totals`, `operator`, and `date`.
  - `totals` is a dict with `page_num`, `row_num`, `row_num_adjusted` (`row_num` minus the rows taken by column headers), `row_data` (lines that contain "total"), and `numbers` (the extracted numbers).
Entry Point: `get_metadata.py`

- `get_metadata`: Main entry point that gets metadata and saves it in an output directory.
- `utils.py`: Contains all helper functions for metadata processing.

Functions:

- `process_metadata_files`: Iterates over `.json` files in an input directory, updates each file with the extracted operator, download date, and computed totals, and writes the enriched JSON to an output directory.
- `parse_number`: Converts numeric strings (with commas) to floats; returns `None` if conversion fails.
- `get_operator_download_date`: Extracts the operator name and download date from a PDF file name following a specific naming convention.
- `get_totals`: Identifies total rows in reconstructed PDF text, parsing numeric values and adjusting row indices for multi-line rows.
- `handle_get_metadata_args`: Parses and returns standardized command-line arguments for metadata processing, including input/output directories and logging options.
- `extract_triplets`: Extracts structured triplets from lines of text using regex patterns, typically for numeric and textual data.
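The file names in this project's examples (e.g. `PDSWDX-DP-camino-2025-07-30`) follow a `PDSWDX-DP-<operator>-<YYYY-MM-DD>` pattern, and totals contain thousands separators. A sketch of how `parse_number` and `get_operator_download_date` might work; the real implementations in `utils.py` may differ (in particular, lowercasing the operator is our assumption based on the sample outputs):

```python
import re

def parse_number(s):
    """Convert a numeric string with thousands separators to float, else None."""
    try:
        return float(s.replace(",", ""))
    except ValueError:
        return None

def get_operator_download_date(file_name):
    """Extract (operator, date) from names like 'PDSWDX-DP-camino-2025-07-30'."""
    m = re.match(r"PDSWDX-DP-(.+)-(\d{4}-\d{2}-\d{2})$", file_name)
    if m is None:
        return None, None
    # Lowercasing mirrors the sample output ("operator": "camino") -- an assumption.
    return m.group(1).lower(), m.group(2)
```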
"totals": [
{
"page_num": 68,
"row_num": 3,
"row_num_adjusted": 0,
"row_data": "row3: |311.9|Total:|335.7| |421.4|208,316.47|461.9| |470.2|3,128,216.28|517.4| |532.4|434,021.68|572.9|",
"numbers": [
208316.47,
3128216.28,
434021.68
]
}
],
"operator": "camino",
"date": "2025-07-30"python -m src.get_metadata.get_metadata --input-path ./data/extracted/test --output-path ./data/preprocessed/test --verbosity 5 --show-timeThis folder handles processing PDFs with LLMs, generating structured outputs from PDF text using system and user prompts. It supports page-level processing, including safe argument handling and prompt templating.
- Uses central user and system prompt templates, with operator-specific entries.
- Saves initial LLM outputs for each PDF as `.json` with keys: `llm_output`, `page_num`, and `words`.
- Each page is a separate LLM request.
Entry Point: `input_llm.py`

- `input_llm.py`: Main entry point for processing PDFs with LLMs, orchestrating prompt construction, page-level calls, and output saving.
- `utils.py`: Contains all helper functions for LLM processing.
- `prompts/`: Stores prompt templates and related resources:
  - `description_factory.py`: Generates structured descriptions for prompts.
  - `notes.md`: Notes on prompt design and usage.
  - `system_prompt.j2`: Jinja2 template for system prompts.
  - `user_prompt.j2`: Jinja2 template for user prompts.
Functions:

- `get_llm_output_pdf`: Processes a PDF document with an LLM, generating outputs for each page in a specified range and optionally saving results.
- `get_llm_output_page`: Processes a single page using an LLM and returns structured output including page number and original words.
- `convert_to_messages`: Converts message objects to OpenAI-compatible format (system, user, assistant).
- `get_llm_output_raw`: Processes a PDF using the OpenRouter API, saving output per page and/or per PDF.
- `get_llm_output_page_raw`: Processes a single page using the OpenRouter API and returns structured output.
- `get_prompt_template`: Retrieves a prompt template from a local file or the LangSmith service.
- `get_prompt_template_jinja`: Loads a Jinja2 template from a directory and compiles it for rendering.
- `load_path_default`: Loads JSON configuration from a path, falling back to a default dictionary.
- `handle_llm_args`: Sets up and parses command-line arguments for LLM processing, initializing clients and configurations.
- `get_llm_output_pdfs`: Processes multiple PDFs from a directory or a single PDF file using an LLM with given prompt templates and configurations.
- `get_llm_output_page_wrapper`: Wrapper for `get_llm_output_page` that handles prompt formatting, additional context, and previous outputs.
- `call_llm_with_safe_args`: Calls an LLM function with filtered arguments, ensuring only valid parameters are passed.
- `get_safe_llm_args`: Filters a dictionary of arguments to include only those accepted by a specific function.
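The `get_safe_llm_args` / `call_llm_with_safe_args` pattern can be sketched with `inspect.signature`. The helper names match the list above, but the bodies below are our guess at the idea, not the actual implementation:

```python
import inspect

def get_safe_llm_args(func, args):
    """Keep only the keys of args that appear in func's signature."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(args)  # func accepts **kwargs: every argument is safe
    return {k: v for k, v in args.items() if k in params}

def call_llm_with_safe_args(func, args):
    """Call func with only the arguments it can actually accept."""
    return func(**get_safe_llm_args(func, args))
```

This lets one shared argument dictionary (model name, temperature, reasoning effort, paths, ...) be passed around the pipeline while each callee silently ignores the keys it does not understand.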
```json
{
  "llm_output": "API,Well Name,Production Date,Oil Production,Gas Production,Water Production\n3305308768,\"A JOHNSON 5397 43-33 10B\",2025-08-14,7.25,98.53,21.25\n3305308768,\"A JOHNSON 5397 43-33 10B\",2025-08-13,6.63,110.91,20.00\n3305308768,\"A JOHNSON 5397 43-33 10B\",2025-08-12,5.59,102.44,23.07\n3305308768,\"A JOHNSON 5397 43-33 10B\",2025-08-11,8.85,101.89,22.04\n3305308768,\"A JOHNSON 5397 43-33 10B\",2025-08-10,7.45,103.58,21.11",
  "page_num": 0,
  "words": "row0: |509.5|Oil|518.7| |542.9|Gas|556.0| |573.8|Water|593.1| |608.1|Casing|631.3| |646.5|Tubing|669.7| |722.7|Water|742.1| |752.3|Down|771.3|\nrow1: |24.0|API|35.6| |75.0|Well|89.3| |91.2|Name|110.1| |169.0|WellNum|198.7| |222.0|Compl.|245.6| |247.5|N|252.5| |262.0|Compl.|285.6| |287.5|Name|306.4| |329.4|FieldName|364.5| |390.6|WellState|421.9| |433.2|Prod|451.0| |453.1|Date|470.0| |502.9|Prod|518.7| |540.2|Prod|556.0| |577.2|Prod|593.0| |610.4|Press.|631.3| |648.9|Press.|669.8| |686.0|Choke|707.3| |723.8|Inject|742.0| |756.9|time|771.2|"
}
```

```shell
python -m src.llm.input_llm --file-per-page --input-path ./data/preprocessed/test --output-path ./data/llm_output/test_trial --model-name google/gemini-2.5-pro --start-page 1 --end-page 5 --verbosity 5 --show-time
```

This folder handles the final transformation and cleanup of LLM-extracted outputs, combining page-level CSV text into structured tabular data. It validates, fixes, and merges all outputs into finalized DataFrames, ready for analysis or downstream use. It also manages retries for malformed pages and structured logging of all modifications. This module is used in the pipeline module.
- Handles the retry mechanism, depending on the error raised during conversion to DataFrames.
- Logs retries, including `llm_attempts_by_page`, `retried_pages`, `changed_rows_by_page`, and `changed_pages`.
  - Each page has an LLM attempt with a list of `outputs`, and each output has its associated `problems`.
- Handles unexpected output such as tags, triple quotes, etc.
  - Functions: `handle_unexpected_output`
- Fixes an unexpected number of commas (empty columns and trailing columns) and logs the changed rows.
  - Functions: `fix_rows`
- Adjusts the global row indices of totals.
- Logs the global start and end row index of each page number.
- Outputs `pickle` files with keys `words`, `file_name`, `totals`, `operator`, `date`, `data`, and `page_row_mapper`.
Entry Point: `postprocess.py`

- `postprocess.py`: Main entry point for orchestrating postprocessing of all PDFs, managing prompts, arguments, retries, and output persistence.
- `utils.py`: Contains all helper functions for loading, cleaning, and combining LLM data.
- `types.py`: Defines the core dataclasses and typed dictionaries used throughout postprocessing (e.g., `LLMData`, `PageData`, `PostprocessedPDF`, `FixRowResult`, etc.).
Functions:

- `get_llm_data`: Loads and validates LLM output files and metadata for a given PDF directory. Ensures page-level JSON files are sorted and consistent.
- `fix_row`: Cleans a single CSV row to match the expected number of columns, replacing empty cells with `None` and trimming trailing empties.
- `fix_rows`: Applies `fix_row` across all rows, removing trailing empty rows and tracking which were modified.
- `get_safe_args`: Filters a dictionary of arguments to retain only those compatible with a given function signature.
- `handle_unexpected_output`: Cleans LLM responses by stripping Markdown code-block syntax and normalizing irregular outputs.
- `postprocess_pdf`: Aggregates LLM-generated page outputs into a unified DataFrame. Cleans, retries, and merges page data, aligning totals and recording detailed logs of all fixes and LLM retries.
- `remove_quotes`: Cleans DataFrame columns by removing surrounding quotation marks from string values while preserving numeric data.
- `handle_postprocessing_args`: Builds and parses standardized command-line arguments for postprocessing, supporting flexible prefixes and consistent CLI configuration.
- `strip_metadata_totals`: Removes totals from metadata, returning a reduced version for combination with processed data.
- `postprocess_pdfs`: Top-level driver that applies `postprocess_pdf` to all PDFs in a directory, merges results with metadata, and optionally saves outputs and logs per PDF.
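A sketch of what `handle_unexpected_output` and `fix_row` might do, based on their descriptions above (stripping Markdown fences from an LLM response, and padding or trimming a parsed row to the expected column count). The real implementations likely handle more edge cases:

```python
import re

def handle_unexpected_output(text):
    """Strip Markdown code-fence syntax (```csv ... ```) from an LLM response."""
    text = re.sub(r"^```[a-zA-Z]*\s*", "", text.strip())
    return re.sub(r"\s*```$", "", text).strip()

def fix_row(row, n_cols):
    """Pad or trim a parsed CSV row to n_cols, mapping empty cells to None."""
    cells = [c if c != "" else None for c in row]
    while len(cells) > n_cols and cells[-1] is None:
        cells.pop()            # drop trailing empty cells first
    cells = cells[:n_cols]     # hard-trim anything still too long
    cells += [None] * (n_cols - len(cells))
    return cells
```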
Below is the JSON equivalent of the pickle output, shown for easier portrayal.
```json
{
  "words": [
    "\nrow25: |23.9|BRANDON|64.4| |66.7|12-5-6|90.1| |92.3|1H|102.7| |193.4|35051239440000|256.5| |269.2|2025-07-05|310.6| |316.4|Producing|352.9| |441.7|22.91|461.9| |496.4|81.00|516.7| |546.7|116.66|571.4| |598.4|871.68|623.2| |651.7|91.64|671.9| |704.9|0.00|720.7| |752.2|24.00|772.4|\nrow26: |23.9|BRANDON|64.4| |66.7|12-5-6|90.1| |92.3|1H|102.7| |193.4|35051239440000|256.5| |269.2|2025-07-04|310.6| |316.4|Producing|352.9| |441.7|26.25|461.9| |491.9|131.00|516.7| |546.7|120.00|571.4| |598.4|887.64|623.2| |651.7|96.37|671.9| |704.9|0.00|720.7| |752.2|24.00|772.4|",
    "row0: |452.3|Oil|462.7| |492.1|Gas|506.9|\nrow1: |24.6|Well|40.7| |42.9|Name|64.1| |194.9|API|207.9| |269.1|Prod|286.9| |289.1|Date|306.0| |317.2|Prod|334.9| |337.1|Status|360.9| |444.9|Prod|462.7| |489.1|Prod|506.9| |549.7|Water|571.4| |605.5|CSG|622.4| |655.5|TBG|671.9| |696.8|Choke|720.6| |748.6|HrsOn|772.4|\nrow2: |553.7|Prod|571.4| |603.3|Pres.|622.4| |652.8|Pres.|671.9|\nrow3: |23.9|BRANDON|64.4| |66.7|12-5-6|90.1| |92.3|1H|102.7| |193.4|35051239440000|256.5| |269.2|2025-07-03|310.6| |316.4|Producing|352.9|"
  ],
  "file_name": "PDSWDX-DP-camino-2025-07-30",
  "totals": [
    {
      "page_num": 0,
      "row_num_adjusted": 14,
      "global_row_number": 14,
      "numbers": [297.8, 465.95, 592.41]
    },
    {
      "page_num": 1,
      "row_num_adjusted": 12,
      "global_row_number": 28,
      "numbers": [440.81, 344.16, 863.54]
    }
  ],
  "operator": "camino",
  "date": "2025-07-30",
  "page_row_mapper": [
    {"start_row": 0, "end_row": 16, "page_num": 1},
    {"start_row": 16, "end_row": 40, "page_num": 2},
    {"start_row": 40, "end_row": 62, "page_num": 3}
  ],
  "data": "DATAFRAME"
}
```

```shell
python -m src.postprocessing.postprocess --input-path ./data/llm_output/test --log-path ./data/llm_output/log/test --output-path ./data/postprocessed/test --verbosity 1 --show-time
```

This folder handles evaluation of postprocessed LLM output, providing live evaluation using rules and logic based on industry experts' experience with the PDFs. This module is used in the pipeline module.
- Formats types to the specified column types.
  - Functions: `format_types`
- Handles evaluations (explained below).
  - Functions: `evaluate_dataframe`
- Aggregates evaluation results and categorizes outcomes into discrete statuses such as Pass, Warning, Failure, etc.
  - Functions: `summarize_evaluation`
- Formats the summarization into a more reader-friendly format.
  - Functions: `format_summary`
- Performs the final aggregation for the whole PDF conversion.
  - Functions: `compute_evaluation_status`
- Locates and specifies the line and page number for each problem.
  - Functions: `locate_evaluation_failures`
Entry Points:

- `evaluation.py`: Main entry point for evaluating the processed DataFrames of each PDF after postprocessing.
- `comparison.py`: Entry point for comparing a ground-truth CSV with the LLM-generated CSV output.
- `summarization.py`: Contains functions that aggregate evaluation results and return overall flags, e.g., `warning`, `failure`, `passed`, or `not ran`.
- `locating_failures.py`: Defines `Problem`s that map detected issues to PDF pages, including page details and descriptions.
- `types.py`: Contains `TypedDict`s and `Pydantic` models to properly define function arguments and return types.
- `utils.py`: Helper functions supporting evaluation, comparison, and summarization workflows.
- `format_types`: Safely convert specified DataFrame columns to given data types and track per-cell failures.
- `column_checker`: Check for missing columns in a DataFrame.
- `all_nan_columns_checker`: Identify columns in a DataFrame that contain only NaN values.
- `find_invalid_api_length`: Return rows where the API number length is invalid.
- `find_invalid_api_content`: Return rows where the API contains characters other than digits or dashes.
- `sum_checker`: Validate whether column sums in a DataFrame match precomputed totals.
- `duplicate_production_date_checker`: Identify and summarize duplicate (primary key, date) pairs in a DataFrame.
- `check_primary_key_date_consistency`: Check consistency of production dates across groups defined by a primary key.
- `data_frequency_checker`: Analyze the distribution of date differences for each group.
- `classify_dataframe`: Classify a production DataFrame into high-level structural states.
- `evaluate_dataframe`: Evaluate a production DataFrame across multiple quality, consistency, and formatting checks.
- `evaluate_dataframes`: Evaluate multiple pickled DataFrames within a directory and summarize results.
- `handle_evaluation`: Perform a full evaluation workflow for a production DataFrame.
- `save_evaluation_results`: Save evaluation results and status summaries to a timestamped directory.
- `handle_evaluation_args`: Parse and return standardized evaluation arguments.
- `compare_dfs_target_columns`: Compare two DataFrames for column name and content accuracy.
- `handle_comparison_args`: Parse and return standardized comparison arguments.
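The total check, the most prominent validation, amounts to comparing column sums against the totals extracted in `get_metadata`. A simplified, DataFrame-free sketch of the `sum_checker` idea; the tolerance value and the column-to-total pairing are our assumptions:

```python
def sum_checker(columns, expected_totals, tol=0.01):
    """Compare summed numeric columns against precomputed totals.

    columns: dict mapping column name -> list of floats (a stand-in for
    DataFrame columns); expected_totals: floats in the same order as the
    numeric columns being checked. Returns one problem dict per mismatch.
    """
    problems = []
    for (name, values), expected in zip(columns.items(), expected_totals):
        got = sum(v for v in values if v is not None)
        if abs(got - expected) > tol:
            problems.append({"column": name, "expected": expected, "got": round(got, 2)})
    return problems
```

An empty return means the page's numbers reconcile with the printed totals; any entries feed the retry mechanism with a located, explainable failure.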
Evaluation:

```shell
python -m src.evaluation.evaluation --verbosity 5 --show-time --input-path ./data/postprocessed/dir --output-path ./data/evaluation/dir
```

Comparison:

```shell
python -m src.evaluation.comparison --test-df-path ./data/postprocessed/pipeline_trial/sub_100/high_reasoning/2025-10-14_15-33-34 --ground-truth-df-path ./data/ground_truth/trial_sub_100/dfs --comparison-output-path ./data/comparison/trial_test
```

This folder integrates the postprocessing and evaluation stages with LLM retries into a unified execution pipeline. It automates the full end-to-end process of transforming LLM outputs into finalized structured tables, validating them against expected totals, retrying problematic pages, and saving detailed logs and evaluation summaries.
Entry Point: `main.py`

- `main.py`: Top-level script that initializes LLM and pipeline configurations, loads templates, and executes the combined postprocessing–evaluation pipeline across all PDFs.
- `types.py`: Defines the structured type hints and dataclasses used throughout the pipeline for consistent information exchange (e.g., `PostprocessEvaluateResults`, `ColumnConfig`, `MetaData`, etc.).
- `utils.py`: Contains all helper functions responsible for orchestrating postprocessing, evaluation, and retry logic.
Functions:

- `postprocess_evaluate`: Core unified function that executes both postprocessing and evaluation for a single PDF. It merges LLM outputs into structured DataFrames, validates column sums and totals, identifies mismatches, and retries failed pages with refined prompts until convergence or the retry limit is reached.
- `postprocess_evaluate_pdfs`: Batch-level orchestrator that runs `postprocess_evaluate` for each PDF directory. It loads LLM outputs and metadata, performs retries, and writes results to disk (processed data, evaluation summaries, and logs). Designed for large-scale automated evaluation runs.
- `handle_pipeline_args`: Parses and prepares standardized command-line arguments for the full pipeline. Handles path setup, timestamping, verbosity, and all constants (e.g., column configs, formats, and operator mappings). Returns a fully resolved argument dictionary for downstream use.
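The `--pipeline-*` flags used by this module come from the prefix convention that `handle_pipeline_args` and `handle_postprocessing_args` share ("supporting flexible prefixes"). A minimal sketch of that idea with `argparse`; the helper below and its flag set are illustrative, not the project's actual parser:

```python
import argparse

def build_prefixed_parser(prefix):
    """Build a parser whose flags all share a module prefix (e.g. --pipeline-)."""
    parser = argparse.ArgumentParser()
    for name, kwargs in [
        ("input-path", {"type": str}),
        ("verbosity", {"type": int, "default": 0}),
        ("show-time", {"action": "store_true"}),
    ]:
        # argparse turns --pipeline-input-path into args.pipeline_input_path
        parser.add_argument(f"--{prefix}-{name}", **kwargs)
    return parser

args = build_prefixed_parser("pipeline").parse_args(
    ["--pipeline-input-path", "./data/llm_output/test", "--pipeline-verbosity", "5"]
)
```

Prefixing keeps flags unambiguous when one driver script forwards arguments to several stages that would otherwise all want `--input-path`.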
```shell
python -m src.pipeline.main `
  --pipeline-folder-name sub_100 `
  --pipeline-input-path ./data/llm_output/test `
  --pipeline-output-path-postprocess ./data/postprocessed/test `
  --pipeline-output-path-evaluation ./data/evaluation/test `
  --pipeline-log-path ./data/log/test `
  --pipeline-timestamped `
  --pipeline-verbosity 5 `
  --pipeline-show-time
```

The output is the same as the postprocessing stage's.
This folder provides shared utility components used across all modules (postprocessing, evaluation, metadata extraction, and pipeline). It centralizes constants, type definitions, verbose logging, and helper functions for clean, consistent, and maintainable workflows.
Files:
- `constants.py`: Defines all static configuration values, mappings, and global constants used throughout the project (e.g., column names, operator identifiers, regex patterns, directory structures, and thresholds).
- `types.py`: Contains all `TypedDict`, dataclass, and Pydantic model definitions that standardize data structures across modules. Defines type-safe schemas for metadata, evaluation results, pipeline configurations, and processed DataFrames.
- `utils.py`: General-purpose helper functions for file I/O, data manipulation, JSON/CSV handling, path management, and validation logic. These utilities are imported across all stages of the pipeline.
- `verbose_printer.py`: Implements a configurable logging utility that prints messages according to verbosity levels and timestamps (when enabled). Supports color-coded or prefixed outputs for clearer traceability across long multi-stage runs.
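The verbosity-gated printer described above might look roughly like this. This is a sketch of the concept only; the actual interface in `verbose_printer.py` is not shown in this README, so the class and method names here are assumptions:

```python
from datetime import datetime

class VerbosePrinter:
    """Print messages whose level is within the configured verbosity."""

    def __init__(self, verbosity=0, show_time=False):
        self.verbosity = verbosity
        self.show_time = show_time

    def format(self, message, level=1):
        """Return the formatted line, or None if suppressed at this verbosity."""
        if level > self.verbosity:
            return None
        if self.show_time:
            stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            return f"[{stamp}] {message}"
        return message

    def print(self, message, level=1):
        line = self.format(message, level)
        if line is not None:
            print(line)
```

Gating on a numeric level is what lets the CLI flags `--verbosity` and `--show-time`, which every stage accepts, behave uniformly across the whole pipeline.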
This section is for the detect_extract stage, which uses computer vision models for table detection via Microsoft's table-transformer.
- Create a folder named `model` at `table_transformer\table_transformer`.
- Download `pubtables1m_detection_detr_r18` and `TATR-v1.1-All-msft` from these links: DETR R18 | TATR-v1.1-All.
This pipeline needs two different environments, due to the difference in usage. `detect_extract.yml` is purely for the detect_extract stage and is based upon the Microsoft table-transformer environment. `pipeline.yml` is used for the rest of the stages in the whole pipeline (not only the pipeline module).
```shell
conda env create -f envs/detect_extract.yml
conda activate detect_extract
```

Then the table-transformer module is installed so it can be used as a package:

```shell
pip install ./table_transformer
```

- Create a virtual environment:

  ```shell
  python -m venv .venv
  ```

- Install dependencies using pip:

  ```shell
  pip install -r envs/requirements.txt
  ```

To run any file stated to be an entry point:
Make sure you are at the root directory:

```shell
python -m src.dir.file_name --arg-1 val1 -- --arg-2 val2
```

All argument descriptions can be viewed using:

```shell
python -m src.path.to.file --help
```
- Aggregate identical problems into a single problem with a range of row indices, for more efficient prompts.

  Instead of:

  ```
  - Row index: [20] — 'API' column should be either 10 or 12 or 14 characters
  - Row index: [21] — 'API' column should be either 10 or 12 or 14 characters
  - Row index: [22] — 'API' column should be either 10 or 12 or 14 characters
  ```

  Correct:

  ```
  - Row indices: [20, 21, 22] — 'API' column should be either 10 or 12 or 14 characters
  ```

- When a problem occurs a certain number of times, use different LLM arguments (e.g., model, reasoning_effort, temperature) in the retry mechanism.
This project is built on top of Microsoft's Table Transformer.
Some modifications have been made to enable the project to be used smoothly as a package. A pull request with these changes has been submitted: PR link.
We thank the original authors for their work.
