Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…cation
…rocess Extract article type from Word document and fix table column loss in HTML generation
…uirements Add missing python-docx dependency to requirements.txt
…onal patterns
Fix empty table rows and duplicate article type in HTML output
- Deleted DIRECT_PDF_CONVERSION.md (PDF functionality removed)
- Updated README.md: Removed all PDF references and examples
- Updated SETUP.md: Removed PDF dependencies (LibreOffice, WeasyPrint)
- Updated SUMMARY.md: Removed PDF-specific fixes
- Updated FIX_SUMMARY.md: Changed PDF errors to HTML errors
- Updated FINAL_IMPLEMENTATION_SUMMARY.md: Removed PDF conversion features
- Updated IMPLEMENTATION_SUMMARY.md: Removed PDF generation steps
- Updated VERIFICATION_REPORT.md: Removed PDF features and references
- Updated examples/README.md: Removed PDF examples section
- Updated Output files/README.txt: Removed PDF files from list
- Updated test file to remove PDF output expectations

The system now only generates JATS XML and HTML output.
Fix HTML table generation and remove PDF functionality
…ting Remove legacy PDF references and enhance table readability
Summary of Changes
Hello @findbhavin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refactors the OmniJAX conversion system to concentrate on high-quality JATS XML and HTML outputs by deprecating all PDF generation functionality. This change simplifies the architecture and reduces external library overhead. At the same time, the HTML conversion has been upgraded to produce more accurate and visually consistent results, notably through dynamic article title extraction and a fix that preserves complex table layouts that Pandoc's conversion previously broke.
Code Review
This pull request introduces a significant refactoring by removing the PDF generation capabilities and focusing on improving the HTML output. The changes are extensive, touching code, tests, and documentation across the repository.
Key improvements include:
- Dynamic extraction of the article type from the DOCX file to be used as the HTML title.
- Robust post-processing of the generated HTML to fix table structure issues caused by Pandoc.
- Smarter logic in the XML post-processing to avoid creating empty <tbody> elements in certain tables.
- Removal of duplicate article type information from the document body.
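The empty `<tbody>` point above can be illustrated with a minimal sketch. It is not the code from this pull request: the function name `drop_empty_tbodies` is hypothetical, it assumes un-namespaced `table`/`tbody`/`tr` elements, and it removes empty bodies after the fact rather than avoiding their creation as the PR does.

```python
import xml.etree.ElementTree as ET


def drop_empty_tbodies(xml_text: str) -> str:
    """Remove <tbody> elements that contain no <tr> rows (illustrative only)."""
    root = ET.fromstring(xml_text)
    # Materialize the list first so the tree is not mutated while iterating.
    for table in list(root.iter("table")):
        for tbody in table.findall("tbody"):
            # A body without any direct <tr> child carries no table data.
            if tbody.find("tr") is None:
                table.remove(tbody)
    return ET.tostring(root, encoding="unicode")


sample = (
    "<article>"
    "<table><thead><tr><th>A</th></tr></thead><tbody/></table>"
    "<table><tbody><tr><td>1</td></tr></tbody></table>"
    "</article>"
)
cleaned = drop_empty_tbodies(sample)
```

In this sample the header-only table loses its empty body, while the table that has a data row is left unchanged.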
The removal of PDF-related dependencies (weasyprint, pydyf, reportlab) and the addition of python-docx are consistent with these changes. New tests have been added to cover the new table and article type fixing logic, which is great.
Overall, this is a solid set of changes that streamlines the pipeline and improves the quality of the HTML output. I have a few minor suggestions to improve code clarity and fix a small issue in the documentation.
]

# Check first few paragraphs (up to 3) for article type markers
paragraphs_to_check = 3
if first_p_text and len(first_p_text) > 0:
    uppercase_ratio = sum(1 for c in first_p_text if c.isupper()) / len(first_p_text)
    # If 80%+ uppercase and short (likely metadata), remove it
    if uppercase_ratio > 0.8 and len(first_p_text) < 100:
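For readers following the hunk above, the uppercase-ratio check can be pulled into a small standalone predicate. This is a sketch under assumptions, not the PR's implementation: the name `looks_like_metadata_label` and its parameters are hypothetical, and only the 0.8 ratio and 100-character limit are taken from the diff.

```python
def looks_like_metadata_label(text: str, min_ratio: float = 0.8, max_len: int = 100) -> bool:
    """Return True for short, mostly-uppercase text such as 'ORIGINAL ARTICLE'."""
    stripped = text.strip()
    if not stripped:
        # Guard against division by zero on empty paragraphs.
        return False
    # As in the diff, the ratio is taken over all characters, so spaces,
    # digits, and punctuation all count against the uppercase share.
    uppercase_ratio = sum(1 for c in stripped if c.isupper()) / len(stripped)
    return uppercase_ratio > min_ratio and len(stripped) < max_len


is_label = looks_like_metadata_label("ORIGINAL ARTICLE")
is_prose = looks_like_metadata_label("Introduction to the study")
```

Because non-letter characters lower the ratio, a label with many digits or punctuation marks may fall below the threshold; computing the ratio over letters only would be one alternative design.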