Staging Merge by findbhavin · Pull Request #37 · findbhavin/OmniFormat_XML_JATS_PMD

findbhavin · 2026-01-22T03:25:19Z

No description provided.

- Deleted DIRECT_PDF_CONVERSION.md (PDF functionality removed)
- Updated README.md: Removed all PDF references and examples
- Updated SETUP.md: Removed PDF dependencies (LibreOffice, WeasyPrint)
- Updated SUMMARY.md: Removed PDF-specific fixes
- Updated FIX_SUMMARY.md: Changed PDF errors to HTML errors
- Updated FINAL_IMPLEMENTATION_SUMMARY.md: Removed PDF conversion features
- Updated IMPLEMENTATION_SUMMARY.md: Removed PDF generation steps
- Updated VERIFICATION_REPORT.md: Removed PDF features and references
- Updated examples/README.md: Removed PDF examples section
- Updated Output files/README.txt: Removed PDF files from list
- Updated test file to remove PDF output expectations

The system now only generates JATS XML and HTML output.

gemini-code-assist · 2026-01-22T03:25:54Z

Summary of Changes

Hello @findbhavin, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the OmniJAX conversion system to concentrate on high-quality JATS XML and HTML outputs by deprecating all PDF generation functionalities. This strategic change simplifies the architecture and reduces external library overhead. Concurrently, the HTML conversion has been significantly upgraded to provide more accurate and visually consistent results, notably through dynamic article title extraction and a robust solution for preserving complex table layouts that were previously problematic due to Pandoc's conversion quirks.

Highlights

- Removal of PDF Generation: The core conversion pipeline and all related documentation have been updated to remove PDF generation capabilities, focusing solely on JATS XML and HTML outputs. This includes the removal of weasyprint, pydyf, and reportlab dependencies.
- Enhanced HTML Generation: Significant improvements have been made to the HTML conversion process. This includes dynamically extracting article titles from the source DOCX document and robustly preserving table structures by fixing column preservation issues caused by Pandoc's JATS-to-HTML conversion.
- Documentation and UI Updates: All internal and external documentation, including READMEs, summaries, and UI descriptions, have been thoroughly updated to reflect the shift from dual PDF output to enhanced HTML output. A new HTML_IMPROVEMENTS_SUMMARY.md file details these changes.
- Dependency Changes: The project's dependencies have been streamlined by removing PDF-related libraries (weasyprint, pydyf, reportlab) and adding python-docx to facilitate direct DOCX content extraction for HTML improvements.
- Improved XML Post-processing: The XML post-processing now includes logic to prevent empty <tbody> elements in tables for DTD compliance and removes duplicate article type paragraphs from the body, ensuring cleaner and more accurate XML output.
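
The summary does not show how the post-processing avoids empty <tbody> elements. As a minimal sketch of one way such a guard could look, assuming the post-processor wraps loose table rows in a <tbody>, the following uses only the standard library. The function name `wrap_loose_rows` is hypothetical and is not taken from MasterPipeline.py.

```python
import xml.etree.ElementTree as ET


def wrap_loose_rows(table: ET.Element) -> None:
    """Move direct <tr> children of a table into a new <tbody>.

    Hypothetical helper: a <tbody> is created only when there are rows
    to put in it, so the function never produces an empty <tbody>.
    """
    loose_rows = [child for child in table if child.tag == "tr"]
    if not loose_rows:
        # Nothing to wrap: creating a <tbody> here would leave it empty.
        return
    tbody = ET.SubElement(table, "tbody")
    for row in loose_rows:
        table.remove(row)
        tbody.append(row)


table = ET.fromstring(
    "<table><thead><tr><th>A</th></tr></thead><tr><td>1</td></tr></table>"
)
wrap_loose_rows(table)
print([child.tag for child in table])
```

A table whose rows all sit in <thead> is left untouched by this sketch; whether that case needs further handling depends on the DTD in use.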

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a significant refactoring by removing the PDF generation capabilities and focusing on improving the HTML output. The changes are extensive, touching code, tests, and documentation across the repository.

Key improvements include:

- Dynamic extraction of the article type from the DOCX file to be used as the HTML title.
- Robust post-processing of the generated HTML to fix table structure issues caused by Pandoc.
- Smarter logic in the XML post-processing to avoid creating empty <tbody> elements in certain tables.
- Removal of duplicate article type information from the document body.

The removal of PDF-related dependencies (weasyprint, pydyf, reportlab) and the addition of python-docx are consistent with these changes. New tests have been added to cover the new table and article type fixing logic, which is great.

Overall, this is a solid set of changes that streamlines the pipeline and improves the quality of the HTML output. I have a few minor suggestions to improve code clarity and fix a small issue in the documentation.

gemini-code-assist · 2026-01-22T03:27:53Z

MasterPipeline.py

+                ]
+
+                # Check first few paragraphs (up to 3) for article type markers
+                paragraphs_to_check = 3


The value 3 is a magic number. For better readability and maintainability, please consider defining it as a constant at the top of the method, for example: NUM_PARAGRAPHS_TO_CHECK = 3.
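
A minimal sketch of the suggested change, assuming the check compares the first few paragraph strings against a list of known markers. Only `NUM_PARAGRAPHS_TO_CHECK` comes from the comment above; `ARTICLE_TYPE_MARKERS`, its contents, and `find_article_type` are hypothetical stand-ins, since the diff does not show the surrounding method.

```python
# Name suggested in the review comment.
NUM_PARAGRAPHS_TO_CHECK = 3

# Hypothetical marker list; the real one in MasterPipeline.py is not shown.
ARTICLE_TYPE_MARKERS = ("ORIGINAL ARTICLE", "REVIEW ARTICLE", "CASE REPORT")


def find_article_type(paragraphs):
    """Return the first article type marker found in the leading paragraphs."""
    for text in paragraphs[:NUM_PARAGRAPHS_TO_CHECK]:
        normalized = text.strip().upper()
        if normalized in ARTICLE_TYPE_MARKERS:
            return normalized
    # No marker within the first NUM_PARAGRAPHS_TO_CHECK paragraphs.
    return None


print(find_article_type(["", "Review Article", "Introduction ..."]))
```

The sketch places the constant at module level; defining it at the top of the method, as the comment proposes, works the same way and keeps it closer to its only use.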

gemini-code-assist · 2026-01-22T03:27:53Z

MasterPipeline.py

+                        if first_p_text and len(first_p_text) > 0:
+                            uppercase_ratio = sum(1 for c in first_p_text if c.isupper()) / len(first_p_text)
+                            # If 80%+ uppercase and short (likely metadata), remove it
+                            if uppercase_ratio > 0.8 and len(first_p_text) < 100:


The values 0.8 and 100 are magic numbers. They should be defined as constants with descriptive names, like UPPERCASE_RATIO_THRESHOLD and MAX_METADATA_LIKE_LENGTH, at the top of the method. This improves readability and makes the logic easier to understand and modify.
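
As a sketch of how the quoted condition might read with named constants, assuming it is pulled into a small helper: the constant names are the ones proposed above, while `looks_like_metadata` is a hypothetical function name that does not appear in the diff.

```python
# Constant names proposed in the review comment.
UPPERCASE_RATIO_THRESHOLD = 0.8
MAX_METADATA_LIKE_LENGTH = 100


def looks_like_metadata(text: str) -> bool:
    """Return True for short, mostly uppercase text (likely metadata)."""
    if not text:
        # An empty string has no ratio to compute; this also avoids
        # dividing by zero below.
        return False
    uppercase_ratio = sum(1 for c in text if c.isupper()) / len(text)
    return (
        uppercase_ratio > UPPERCASE_RATIO_THRESHOLD
        and len(text) < MAX_METADATA_LIKE_LENGTH
    )


print(looks_like_metadata("ORIGINAL ARTICLE"))
print(looks_like_metadata("Original article"))
```

The truthiness check on `text` already covers the empty string, so the extra `len(first_p_text) > 0` test in the quoted diff is redundant.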

Copilot AI and others added 25 commits January 21, 2026 15:24

Initial plan

f30f5a4

Implement article type extraction and table column preservation

e0d0d0b

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Update .gitignore to exclude .deb files

84494ea

Address code review feedback: extract constants and reduce code dupli…

e7a9b8e

…cation Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Add comprehensive documentation for HTML improvements

0603ad9

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Merge pull request #32 from findbhavin/copilot/update-html-creation-p…

dca3088

…rocess Extract article type from Word document and fix table column loss in HTML generation

Initial plan

956700a

Add python-docx==1.2.0 to requirements.txt to fix ModuleNotFoundError

c7c03fd

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Merge pull request #33 from findbhavin/copilot/add-python-docx-to-req…

7822dca

…uirements Add missing python-docx dependency to requirements.txt

Initial plan

d607172

Fix table empty rows and duplicate article type issues

85bcb08

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Add comprehensive tests for table and article-type fixes

4a9993a

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Enhance article type removal to handle multiple paragraphs and additi…

07c335a

…onal patterns Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Fix potential division by zero in article type removal logic

cd8238e

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Merge pull request #34 from findbhavin/copilot/fix-html-table-issues

53efb51

Fix empty table rows and duplicate article type in HTML output

Initial plan

6a85424

Fix HTML table generation - handle empty HTML tables from Pandoc

2dcf23f

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Remove PDF generation code from MasterPipeline.py and app.py

c808d8a

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Remove PDF-related files, dependencies, and test references

97b8903

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Address code review comments - move import and improve documentation

9444e77

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Merge pull request #35 from findbhavin/copilot/fix-html-table-contents

4ee4d02

Fix HTML table generation and remove PDF functionality

Initial plan

565d219

Update index page, README, and improve table formatting

0ec8670

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Merge pull request #36 from findbhavin/copilot/update-docs-and-format…

74e58b9

…ting Remove legacy PDF references and enhance table readability

findbhavin merged commit e129eac into main Jan 22, 2026
4 of 8 checks passed

gemini-code-assist bot reviewed Jan 22, 2026

findbhavin merged 25 commits into main from Staging
