Skip to content

Staging#45

Merged
findbhavin merged 36 commits intomainfrom
Staging
Jan 22, 2026
Merged

Staging#45
findbhavin merged 36 commits intomainfrom
Staging

Conversation

@findbhavin
Copy link
Owner

No description provided.

Copilot AI and others added 30 commits January 22, 2026 05:10
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…ijax

Add GCS persistence for AI-native data collection
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…idden row

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
… for empty td.text

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…rrors

Fix table DTD validation: enforce element order and tbody requirement
…e rows from HTML

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
[WIP] Fix DTD validation error for tr element class attribute
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…cross-platform paths

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…ors-again

Strip data-* attributes for DTD/XSD compliance and enhance table CSS
- Created tools/fetch_conversion.py for fetching conversion details by ID
- Added GET /conversion/<conversion_id> API endpoint in app.py
- Updated README.md with conversion ID format and debugging documentation
- Script supports GCS file download and metrics retrieval

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
- Add unique id attributes to tex-math elements (texmath1, texmath2, etc.)
- Add unique id attributes to mml:math elements (mml1, mml2, etc.)
- Convert named-content anchors to proper ref elements in ref-list
- Update xref rid attributes to lowercase (ref1 instead of Ref1)
- Extract and preserve reference text from named-content context
- Remove orphaned empty named-content elements
- Add comprehensive test suite for DTD fixes

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
- Remove redundant re module import in MasterPipeline.py
- Fix formatting issues in fetch_conversion.py for numeric fields
- Replace deprecated datetime.utcnow() with datetime.now()
- Add proper type checking before applying numeric formatting

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
- Create comprehensive IMPLEMENTATION_DETAILS.md documentation
- Document conversion ID format and usage
- Explain DTD validation fixes in detail
- Include before/after examples
- Add test coverage summary
- Document impact and usage guidelines

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Add conversion ID lookup and fix PMC DTD validation errors
- Implement _fix_tex_math_citations() method to convert problematic tex-math elements containing citation superscripts to PMC-compliant xref elements
- Handle single citations (^{5}), multiple citations (^{5,6}), and range citations (^{1-3})
- Preserve leading punctuation and tail text
- Create placeholder ref elements in ref-list for cited references
- Add comprehensive unit tests for all citation patterns
- Integration with _post_process_xml() method

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Copilot AI and others added 6 commits January 22, 2026 12:52
- Add comprehensive integration tests for PMC compliance
- Test exact problematic pattern from issue (Line 515: .^{5,6})
- Verify mixed content handling (citations + real formulas)
- Add PMC Style Checker validation test (skipped if xsltproc not available)
- All tests passing successfully

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
- Document problem, solution, implementation details
- Include before/after examples
- List all test coverage
- Verify all acceptance criteria met
- Add references to PMC guidelines and JATS documentation

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
- Complete end-to-end validation showing all features working
- Implementation summary with before/after examples
- All acceptance criteria met and verified
- Ready for code review and merge

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…rors

Fix PMC Style Check errors: Convert tex-math citation fragments to xref elements
@gemini-code-assist
Copy link

Summary of Changes

Hello @findbhavin, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces several key enhancements to the OmniFormat XML JATS PMD conversion pipeline. It focuses on improving DTD compliance, adding Google Cloud Storage integration for data persistence, and providing debugging tools. The changes ensure that the generated XML files meet PMC submission requirements and improve the overall robustness and maintainability of the conversion process.

Highlights

  • DTD Validation Fixes: Addresses DTD validation errors by stripping invalid data attributes, adding missing IDs to math elements, and converting named-content references to proper ref elements.
  • GCS Integration: Adds Google Cloud Storage support for persisting application data, including input files, output packages, and conversion metrics, with non-blocking operations.
  • Conversion ID Lookup Utility: Introduces a conversion ID lookup utility via a standalone script and API endpoint to facilitate debugging and retrieval of conversion details.
  • tex-math Citation Conversion Fix: Fixes an issue where citation numbers in superscript are incorrectly interpreted as LaTeX math expressions, converting them to proper PMC-compliant xref elements.
  • HTML Table DTD Validation Errors: Fixes HTML table DTD validation errors by reordering table elements and enforcing tbody requirements, ensuring compliance with JATS DTD structure.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@findbhavin findbhavin merged commit 40d58d2 into main Jan 22, 2026
0 of 4 checks passed
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces significant improvements across several areas, including robust Google Cloud Storage (GCS) integration for persistence and enhanced debugging capabilities with a new conversion ID lookup utility. Crucially, it addresses multiple DTD validation errors, particularly concerning tex-math and mml:math ID attributes, and the conversion of named-content to proper ref elements. The changes also include a comprehensive fix for HTML table DTD compliance, ensuring correct element ordering and the presence of tbody elements. However, there are inconsistencies in the documentation and example outputs regarding the handling of data-dtd-compliance attributes versus class="dtd-compliance-row" which need to be resolved to accurately reflect the implementation.

I am having trouble creating individual review comments. Click here to see my feedback.

templates/style.css (82-84)

critical

This CSS rule targets tr.dtd-compliance-row. However, the MasterPipeline.py code (line 1246) sets data-dtd-compliance="true" on the XML element, not a class attribute. The _rebuild_table_section method (lines 2166-2168) also explicitly filters out rows with data-dtd-compliance="true" from the HTML output. This means these rows should not appear in the HTML at all, rendering this CSS rule ineffective for its stated purpose. Please update the Python code to set the class attribute if this styling is intended, or remove this CSS rule if the rows are filtered out entirely.

TABLE_DTD_FIX_DOCUMENTATION.md (82-84)

high

This CSS rule targets tr.dtd-compliance-row. As noted in previous comments, the Python code sets data-dtd-compliance="true", not a class attribute. This CSS rule will not apply to the elements generated by the code. Please update the CSS rule to tr[data-dtd-compliance="true"] or ensure the Python code sets the class attribute if this styling is intended.

PR_TABLE_FIX_SUMMARY.md (70)

high

This example XML output shows class="dtd-compliance-row". This is inconsistent with the MasterPipeline.py implementation (line 1246) which sets data-dtd-compliance="true". Additionally, the data-* attributes are stripped from the XML output (MasterPipeline.py lines 1761-1773). Please update this example to accurately reflect the attribute being set by the code and clarify its presence in the final XML.

IMPLEMENTATION_DETAILS.md (171-181)

high

The code snippet provided in the "Key Code" section is incomplete and contains undefined functions and variables (e.g., refs, extract_reference_text, create_ref_element, update_xrefs, remove_named_content). This makes the example misleading and unhelpful for understanding the implementation. Please provide a complete and accurate code example or refer to the actual implementation in MasterPipeline.py.

TABLE_DTD_FIX_DOCUMENTATION.md (180-182)

high

This example XML output shows class="dtd-compliance-row". This is inconsistent with the MasterPipeline.py implementation (line 1246) which sets data-dtd-compliance="true". Additionally, the data-* attributes are stripped from the XML output (MasterPipeline.py lines 1761-1773). Please update this example to accurately reflect the attribute being set by the code and clarify its presence in the final XML.

Output files/article.xml (70-72)

high

This example XML output shows class="dtd-compliance-row" for the placeholder <tr> element. However, the MasterPipeline.py code (line 1246) sets data-dtd-compliance="true", not a class attribute. Additionally, the data-* attributes are stripped from the XML output (MasterPipeline.py lines 1761-1773). This example output is inconsistent with the actual implementation and the stated stripping logic. Please update this example to reflect the correct attribute (data-dtd-compliance="true") and clarify if these attributes are intended to be present in the final XML or stripped.

Output files/article.xml (299-301)

high

Similar to the previous comment, this example XML output shows class="dtd-compliance-row" which contradicts the implementation setting data-dtd-compliance="true" and the subsequent stripping of data-* attributes. Please ensure example outputs accurately reflect the code's behavior.

TABLE_DTD_FIX_DOCUMENTATION.md (141-143)

high

This example XML output shows class="dtd-compliance-row". This is inconsistent with the MasterPipeline.py implementation (line 1246) which sets data-dtd-compliance="true". Additionally, the data-* attributes are stripped from the XML output (MasterPipeline.py lines 1761-1773). Please update this example to accurately reflect the attribute being set by the code and clarify its presence in the final XML.

Output files/article.xml (443-445)

high

This example XML output shows class="dtd-compliance-row" which contradicts the implementation setting data-dtd-compliance="true" and the subsequent stripping of data-* attributes. Please ensure example outputs accurately reflect the code's behavior.

PR_TABLE_FIX_SUMMARY.md (32-34)

high

This CSS rule targets tr.dtd-compliance-row. However, the MasterPipeline.py code (line 1246) sets data-dtd-compliance="true" on the XML element, not a class attribute. The _rebuild_table_section method (lines 2166-2168) also filters based on data-dtd-compliance="true". This means the CSS rule as written here will not apply to the rows generated by the Python code. Please update the CSS rule to tr[data-dtd-compliance="true"] or ensure the Python code sets the class attribute if this styling is intended.

Output files/article.xml (602-604)

high

This example XML output shows class="dtd-compliance-row" which contradicts the implementation setting data-dtd-compliance="true" and the subsequent stripping of data-* attributes. Please ensure example outputs accurately reflect the code's behavior.

TABLE_DTD_FIX_DOCUMENTATION.md (64)

high

This code snippet in the documentation shows tr.set('class', 'dtd-compliance-row'). However, the actual implementation in MasterPipeline.py (line 1246) sets tr.set('data-dtd-compliance', 'true'). This is a significant inconsistency between the documentation and the code. Please update the documentation to reflect the correct attribute being set.

DATA_ATTRIBUTE_STRIPPING_FIX.md (72)

medium

This CSS rule targets tr.dtd-compliance-row and tr[data-dtd-compliance="true"]. However, the MasterPipeline.py code (line 1246) sets data-dtd-compliance="true" on the XML element, and the _rebuild_table_section method (lines 2166-2168) explicitly filters out rows with data-dtd-compliance="true" from the HTML output. This means these rows should not appear in the HTML at all, rendering this CSS rule ineffective for its stated purpose. Please clarify if these rows are intended to be present in the HTML and hidden, or if they are filtered out entirely. If filtered out, this CSS rule is unnecessary and misleading in the documentation.

IMPLEMENTATION_SUMMARY_TEX_MATH_FIX.md (33)

medium

The example for range citations ^{1-3} shows a conversion to <sup><xref>1</xref>-<xref>2</xref>-<xref>3</xref></sup>. While the implementation in MasterPipeline.py (lines 1053-1056) supports this, rendering a range like 1-3 as 1-2-3 with multiple dashes is an unusual format. Typically, ranges are rendered as 1-3 or 1,2,3. Please confirm if this multi-dash format is the desired and PMC-compliant output, or if it should be adjusted to a more standard representation.

MasterPipeline.py (1211-1212)

medium

These lines contain only whitespace. Please remove them to maintain code cleanliness and adhere to standard formatting practices.

TEX_MATH_CITATION_FIX.md (28)

medium

The example for range citations ^{1-3} shows a conversion to <sup><xref>1</xref>-<xref>2</xref>-<xref>3</xref></sup>. While the implementation in MasterPipeline.py (lines 1053-1056) supports this, rendering a range like 1-3 as 1-2-3 with multiple dashes is an unusual format. Typically, ranges are rendered as 1-3 or 1,2,3. Please confirm if this multi-dash format is the desired and PMC-compliant output, or if it should be adjusted to a more standard representation.

IMPLEMENTATION_SUMMARY_TEX_MATH_FIX.md (94-96)

medium

This example further illustrates the multi-dash rendering for range citations (10-11-12-13-14-15). As noted previously, this is an unconventional format for a citation range. Please confirm if this is the intended output or if a more standard representation (e.g., 10-15 or 10,11,12,13,14,15) is preferred for PMC compliance and readability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants