Staging Merge #37

Merged
findbhavin merged 25 commits into main from Staging
Jan 22, 2026

Conversation

@findbhavin
Owner

No description provided.

Copilot AI and others added 25 commits January 21, 2026 15:24
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…cation

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…rocess

Extract article type from Word document and fix table column loss in HTML generation
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…uirements

Add missing python-docx dependency to requirements.txt
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…onal patterns

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Fix empty table rows and duplicate article type in HTML output
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
- Deleted DIRECT_PDF_CONVERSION.md (PDF functionality removed)
- Updated README.md: Removed all PDF references and examples
- Updated SETUP.md: Removed PDF dependencies (LibreOffice, WeasyPrint)
- Updated SUMMARY.md: Removed PDF-specific fixes
- Updated FIX_SUMMARY.md: Changed PDF errors to HTML errors
- Updated FINAL_IMPLEMENTATION_SUMMARY.md: Removed PDF conversion features
- Updated IMPLEMENTATION_SUMMARY.md: Removed PDF generation steps
- Updated VERIFICATION_REPORT.md: Removed PDF features and references
- Updated examples/README.md: Removed PDF examples section
- Updated Output files/README.txt: Removed PDF files from list
- Updated test file to remove PDF output expectations

The system now only generates JATS XML and HTML output.
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Fix HTML table generation and remove PDF functionality
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…ting

Remove legacy PDF references and enhance table readability
@findbhavin findbhavin merged commit e129eac into main Jan 22, 2026
4 of 8 checks passed
@gemini-code-assist
Summary of Changes

Hello @findbhavin, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the OmniJAX conversion system to focus on JATS XML and HTML output by removing all PDF generation functionality. The change simplifies the architecture and reduces external library overhead. The HTML conversion is also upgraded to give more accurate and visually consistent results, notably through dynamic article title extraction and a fix that preserves complex table layouts previously broken by Pandoc's conversion quirks.

Highlights

  • Removal of PDF Generation: The core conversion pipeline and all related documentation have been updated to remove PDF generation capabilities, focusing solely on JATS XML and HTML outputs. This includes the removal of weasyprint, pydyf, and reportlab dependencies.
  • Enhanced HTML Generation: Significant improvements have been made to the HTML conversion process. This includes dynamically extracting article titles from the source DOCX document and robustly preserving table structures by fixing column preservation issues caused by Pandoc's JATS-to-HTML conversion.
  • Documentation and UI Updates: All internal and external documentation, including READMEs, summaries, and UI descriptions, have been thoroughly updated to reflect the shift from dual PDF output to enhanced HTML output. A new HTML_IMPROVEMENTS_SUMMARY.md file details these changes.
  • Dependency Changes: The project's dependencies have been streamlined by removing PDF-related libraries (weasyprint, pydyf, reportlab) and adding python-docx to facilitate direct DOCX content extraction for HTML improvements.
  • Improved XML Post-processing: The XML post-processing now includes logic to prevent empty <tbody> elements in tables for DTD compliance and removes duplicate article type paragraphs from the body, ensuring cleaner and more accurate XML output.
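The empty `<tbody>` condition mentioned in the last highlight can be checked mechanically. The following is a minimal sketch, not the code from this pull request: the function name `count_empty_tbody` and the sample XML are assumptions made for illustration, and the sketch only detects the condition rather than reproducing the PR's fix.

```python
import xml.etree.ElementTree as ET


def count_empty_tbody(root: ET.Element) -> int:
    """Count <tbody> elements that contain no direct <tr> child.

    Hypothetical helper: the summary above says such elements break
    DTD compliance, so a post-processing step might use a check like
    this to decide whether a table still needs repair.
    """
    return sum(1 for tbody in root.iter("tbody") if tbody.find("tr") is None)


# Illustrative input (not taken from the repository): the first table
# has a header row but an empty body, the second has one body row.
SAMPLE = (
    "<sec>"
    "<table><thead><tr><th>A</th></tr></thead><tbody/></table>"
    "<table><tbody><tr><td>1</td></tr></tbody></table>"
    "</sec>"
)

if __name__ == "__main__":
    print(count_empty_tbody(ET.fromstring(SAMPLE)))
```

A check like this could run after XML post-processing as a cheap regression guard before full DTD validation.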

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant refactoring by removing the PDF generation capabilities and focusing on improving the HTML output. The changes are extensive, touching code, tests, and documentation across the repository.

Key improvements include:

  • Dynamic extraction of the article type from the DOCX file to be used as the HTML title.
  • Robust post-processing of the generated HTML to fix table structure issues caused by Pandoc.
  • Smarter logic in the XML post-processing to avoid creating empty <tbody> elements in certain tables.
  • Removal of duplicate article type information from the document body.

The removal of PDF-related dependencies (weasyprint, pydyf, reportlab) and the addition of python-docx are consistent with these changes. New tests have been added to cover the new table and article type fixing logic, which is great.

Overall, this is a solid set of changes that streamlines the pipeline and improves the quality of the HTML output. I have a few minor suggestions to improve code clarity and fix a small issue in the documentation.

]

# Check first few paragraphs (up to 3) for article type markers
paragraphs_to_check = 3

medium

The value 3 is a magic number. For better readability and maintainability, please consider defining it as a constant at the top of the method, for example: NUM_PARAGRAPHS_TO_CHECK = 3.
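A minimal sketch of what this suggestion might look like. Everything apart from the name `NUM_PARAGRAPHS_TO_CHECK` is an assumption: the marker list, the function name `find_article_type`, and the plain list-of-strings input are hypothetical stand-ins for the project's actual code. With python-docx, the strings would presumably come from the `.text` of each entry in `Document(path).paragraphs`.

```python
from typing import Optional, Sequence

# Named constant replacing the literal 3, as the review suggests.
NUM_PARAGRAPHS_TO_CHECK = 3

# Hypothetical marker list; the project's real markers are not shown here.
ARTICLE_TYPE_MARKERS = ("ORIGINAL ARTICLE", "REVIEW ARTICLE", "CASE REPORT")


def find_article_type(paragraph_texts: Sequence[str]) -> Optional[str]:
    """Return the first article type found in the leading paragraphs, else None."""
    for text in paragraph_texts[:NUM_PARAGRAPHS_TO_CHECK]:
        # Collapse runs of whitespace and ignore case before matching.
        normalized = " ".join(text.split()).upper()
        for marker in ARTICLE_TYPE_MARKERS:
            if marker in normalized:
                return marker.title()
    return None


if __name__ == "__main__":
    print(find_article_type(["", "Original  article", "A Study of Something"]))
```

The constant is placed at module level here for brevity; defining it at the top of the method, as the comment proposes, would work the same way.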

if first_p_text and len(first_p_text) > 0:
    uppercase_ratio = sum(1 for c in first_p_text if c.isupper()) / len(first_p_text)
    # If 80%+ uppercase and short (likely metadata), remove it
    if uppercase_ratio > 0.8 and len(first_p_text) < 100:

medium

The values 0.8 and 100 are magic numbers. They should be defined as constants with descriptive names, like UPPERCASE_RATIO_THRESHOLD and MAX_METADATA_LIKE_LENGTH, at the top of the method. This improves readability and makes the logic easier to understand and modify.
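One way the suggested constants could be applied is sketched below. The constant names come from the comment above; the wrapper function `looks_like_metadata` is a hypothetical name, and the surrounding removal logic from the pull request is not reproduced.

```python
# Thresholds named as suggested in the review comment.
UPPERCASE_RATIO_THRESHOLD = 0.8
MAX_METADATA_LIKE_LENGTH = 100


def looks_like_metadata(text: str) -> bool:
    """Return True for short, mostly-uppercase text (likely a metadata line).

    Mirrors the reviewed condition: the share of uppercase characters,
    measured over the whole string including spaces, must exceed the
    ratio threshold, and the text must be shorter than the length limit.
    """
    if not text:
        return False
    uppercase_ratio = sum(1 for c in text if c.isupper()) / len(text)
    return (
        uppercase_ratio > UPPERCASE_RATIO_THRESHOLD
        and len(text) < MAX_METADATA_LIKE_LENGTH
    )


if __name__ == "__main__":
    print(looks_like_metadata("ORIGINAL ARTICLE"))
    print(looks_like_metadata("Introduction to the study"))
```

Pulling the test into a small predicate also makes the thresholds straightforward to unit-test in isolation.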
