Staging by findbhavin · Pull Request #45 · findbhavin/OmniFormat_XML_JATS_PMD

findbhavin · 2026-01-22T15:28:47Z

No description provided.

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

…ijax Add GCS persistence for AI-native data collection

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

…idden row Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

… for empty td.text Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

…rrors Fix table DTD validation: enforce element order and tbody requirement

…e rows from HTML Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

[WIP] Fix DTD validation error for tr element class attribute

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

…cross-platform paths Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

…ors-again Strip data-* attributes for DTD/XSD compliance and enhance table CSS

- Created tools/fetch_conversion.py for fetching conversion details by ID - Added GET /conversion/<conversion_id> API endpoint in app.py - Updated README.md with conversion ID format and debugging documentation - Script supports GCS file download and metrics retrieval Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

- Add unique id attributes to tex-math elements (texmath1, texmath2, etc.) - Add unique id attributes to mml:math elements (mml1, mml2, etc.) - Convert named-content anchors to proper ref elements in ref-list - Update xref rid attributes to lowercase (ref1 instead of Ref1) - Extract and preserve reference text from named-content context - Remove orphaned empty named-content elements - Add comprehensive test suite for DTD fixes Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

- Remove redundant re module import in MasterPipeline.py - Fix formatting issues in fetch_conversion.py for numeric fields - Replace deprecated datetime.utcnow() with datetime.now() - Add proper type checking before applying numeric formatting Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

- Create comprehensive IMPLEMENTATION_DETAILS.md documentation - Document conversion ID format and usage - Explain DTD validation fixes in detail - Include before/after examples - Add test coverage summary - Document impact and usage guidelines Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Add conversion ID lookup and fix PMC DTD validation errors

- Implement _fix_tex_math_citations() method to convert problematic tex-math elements containing citation superscripts to PMC-compliant xref elements - Handle single citations (^{5}), multiple citations (^{5,6}), and range citations (^{1-3}) - Preserve leading punctuation and tail text - Create placeholder ref elements in ref-list for cited references - Add comprehensive unit tests for all citation patterns - Integration with _post_process_xml() method Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

- Add comprehensive integration tests for PMC compliance - Test exact problematic pattern from issue (Line 515: .^{5,6}) - Verify mixed content handling (citations + real formulas) - Add PMC Style Checker validation test (skipped if xsltproc not available) - All tests passing successfully Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

- Document problem, solution, implementation details - Include before/after examples - List all test coverage - Verify all acceptance criteria met - Add references to PMC guidelines and JATS documentation Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

- Complete end-to-end validation showing all features working - Implementation summary with before/after examples - All acceptance criteria met and verified - Ready for code review and merge Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

…rors Fix PMC Style Check errors: Convert tex-math citation fragments to xref elements

gemini-code-assist · 2026-01-22T15:30:05Z

Summary of Changes

Hello @findbhavin, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces several key enhancements to the OmniFormat XML JATS PMD conversion pipeline. It focuses on improving DTD compliance, adding Google Cloud Storage integration for data persistence, and providing debugging tools. The changes ensure that the generated XML files meet PMC submission requirements and improve the overall robustness and maintainability of the conversion process.

Highlights

DTD Validation Fixes: Addresses DTD validation errors by stripping invalid data attributes, adding missing IDs to math elements, and converting named-content references to proper ref elements.
GCS Integration: Adds Google Cloud Storage support for persisting application data, including input files, output packages, and conversion metrics, with non-blocking operations.
Conversion ID Lookup Utility: Introduces a conversion ID lookup utility via a standalone script and API endpoint to facilitate debugging and retrieval of conversion details.
tex-math Citation Conversion Fix: Fixes an issue where citation numbers in superscript are incorrectly interpreted as LaTeX math expressions, converting them to proper PMC-compliant xref elements.
HTML Table DTD Validation Errors: Fixes HTML table DTD validation errors by reordering table elements and enforcing tbody requirements, ensuring compliance with JATS DTD structure.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

The pull request introduces significant improvements across several areas, including robust Google Cloud Storage (GCS) integration for persistence and enhanced debugging capabilities with a new conversion ID lookup utility. Crucially, it addresses multiple DTD validation errors, particularly concerning tex-math and mml:math ID attributes, and the conversion of named-content to proper ref elements. The changes also include a comprehensive fix for HTML table DTD compliance, ensuring correct element ordering and the presence of tbody elements. However, there are inconsistencies in the documentation and example outputs regarding the handling of data-dtd-compliance attributes versus class="dtd-compliance-row" which need to be resolved to accurately reflect the implementation.

I am having trouble creating individual review comments. Click here to see my feedback.

templates/style.css (82-84)

This CSS rule targets tr.dtd-compliance-row. However, the MasterPipeline.py code (line 1246) sets data-dtd-compliance="true" on the XML element, not a class attribute. The _rebuild_table_section method (lines 2166-2168) also explicitly filters out rows with data-dtd-compliance="true" from the HTML output. This means these rows should not appear in the HTML at all, rendering this CSS rule ineffective for its stated purpose. Please update the Python code to set the class attribute if this styling is intended, or remove this CSS rule if the rows are filtered out entirely.

TABLE_DTD_FIX_DOCUMENTATION.md (82-84)

This CSS rule targets tr.dtd-compliance-row. As noted in previous comments, the Python code sets data-dtd-compliance="true", not a class attribute. This CSS rule will not apply to the elements generated by the code. Please update the CSS rule to tr[data-dtd-compliance="true"] or ensure the Python code sets the class attribute if this styling is intended.

PR_TABLE_FIX_SUMMARY.md (70)

This example XML output shows class="dtd-compliance-row". This is inconsistent with the MasterPipeline.py implementation (line 1246) which sets data-dtd-compliance="true". Additionally, the data-* attributes are stripped from the XML output (MasterPipeline.py lines 1761-1773). Please update this example to accurately reflect the attribute being set by the code and clarify its presence in the final XML.

IMPLEMENTATION_DETAILS.md (171-181)

The code snippet provided in the "Key Code" section is incomplete and contains undefined functions and variables (e.g., refs, extract_reference_text, create_ref_element, update_xrefs, remove_named_content). This makes the example misleading and unhelpful for understanding the implementation. Please provide a complete and accurate code example or refer to the actual implementation in MasterPipeline.py.

TABLE_DTD_FIX_DOCUMENTATION.md (180-182)

This example XML output shows class="dtd-compliance-row". This is inconsistent with the MasterPipeline.py implementation (line 1246) which sets data-dtd-compliance="true". Additionally, the data-* attributes are stripped from the XML output (MasterPipeline.py lines 1761-1773). Please update this example to accurately reflect the attribute being set by the code and clarify its presence in the final XML.

Output files/article.xml (70-72)

This example XML output shows class="dtd-compliance-row" for the placeholder <tr> element. However, the MasterPipeline.py code (line 1246) sets data-dtd-compliance="true", not a class attribute. Additionally, the data-* attributes are stripped from the XML output (MasterPipeline.py lines 1761-1773). This example output is inconsistent with the actual implementation and the stated stripping logic. Please update this example to reflect the correct attribute (data-dtd-compliance="true") and clarify if these attributes are intended to be present in the final XML or stripped.

Output files/article.xml (299-301)

Similar to the previous comment, this example XML output shows class="dtd-compliance-row" which contradicts the implementation setting data-dtd-compliance="true" and the subsequent stripping of data-* attributes. Please ensure example outputs accurately reflect the code's behavior.

TABLE_DTD_FIX_DOCUMENTATION.md (141-143)

This example XML output shows class="dtd-compliance-row". This is inconsistent with the MasterPipeline.py implementation (line 1246) which sets data-dtd-compliance="true". Additionally, the data-* attributes are stripped from the XML output (MasterPipeline.py lines 1761-1773). Please update this example to accurately reflect the attribute being set by the code and clarify its presence in the final XML.

Output files/article.xml (443-445)

This example XML output shows class="dtd-compliance-row" which contradicts the implementation setting data-dtd-compliance="true" and the subsequent stripping of data-* attributes. Please ensure example outputs accurately reflect the code's behavior.

PR_TABLE_FIX_SUMMARY.md (32-34)

This CSS rule targets tr.dtd-compliance-row. However, the MasterPipeline.py code (line 1246) sets data-dtd-compliance="true" on the XML element, not a class attribute. The _rebuild_table_section method (lines 2166-2168) also filters based on data-dtd-compliance="true". This means the CSS rule as written here will not apply to the rows generated by the Python code. Please update the CSS rule to tr[data-dtd-compliance="true"] or ensure the Python code sets the class attribute if this styling is intended.

Output files/article.xml (602-604)

This example XML output shows class="dtd-compliance-row" which contradicts the implementation setting data-dtd-compliance="true" and the subsequent stripping of data-* attributes. Please ensure example outputs accurately reflect the code's behavior.

TABLE_DTD_FIX_DOCUMENTATION.md (64)

This code snippet in the documentation shows tr.set('class', 'dtd-compliance-row'). However, the actual implementation in MasterPipeline.py (line 1246) sets tr.set('data-dtd-compliance', 'true'). This is a significant inconsistency between the documentation and the code. Please update the documentation to reflect the correct attribute being set.

DATA_ATTRIBUTE_STRIPPING_FIX.md (72)

This CSS rule targets tr.dtd-compliance-row and tr[data-dtd-compliance="true"]. However, the MasterPipeline.py code (line 1246) sets data-dtd-compliance="true" on the XML element, and the _rebuild_table_section method (lines 2166-2168) explicitly filters out rows with data-dtd-compliance="true" from the HTML output. This means these rows should not appear in the HTML at all, rendering this CSS rule ineffective for its stated purpose. Please clarify if these rows are intended to be present in the HTML and hidden, or if they are filtered out entirely. If filtered out, this CSS rule is unnecessary and misleading in the documentation.

IMPLEMENTATION_SUMMARY_TEX_MATH_FIX.md (33)

The example for range citations ^{1-3} shows a conversion to <sup><xref>1</xref>-<xref>2</xref>-<xref>3</xref></sup>. While the implementation in MasterPipeline.py (lines 1053-1056) supports this, rendering a range like 1-3 as 1-2-3 with multiple dashes is an unusual format. Typically, ranges are rendered as 1-3 or 1,2,3. Please confirm if this multi-dash format is the desired and PMC-compliant output, or if it should be adjusted to a more standard representation.

MasterPipeline.py (1211-1212)

These lines contain only whitespace. Please remove them to maintain code cleanliness and adhere to standard formatting practices.

TEX_MATH_CITATION_FIX.md (28)

The example for range citations ^{1-3} shows a conversion to <sup><xref>1</xref>-<xref>2</xref>-<xref>3</xref></sup>. While the implementation in MasterPipeline.py (lines 1053-1056) supports this, rendering a range like 1-3 as 1-2-3 with multiple dashes is an unusual format. Typically, ranges are rendered as 1-3 or 1,2,3. Please confirm if this multi-dash format is the desired and PMC-compliant output, or if it should be adjusted to a more standard representation.

IMPLEMENTATION_SUMMARY_TEX_MATH_FIX.md (94-96)

This example further illustrates the multi-dash rendering for range citations (10-11-12-13-14-15). As noted previously, this is an unconventional format for a citation range. Please confirm if this is the intended output or if a more standard representation (e.g., 10-15 or 10,11,12,13,14,15) is preferred for PMC compliance and readability.

Copilot AI and others added 30 commits January 22, 2026 05:10

Initial plan

cf930db

Add GCS support with gcs_utils module and app.py integration

77089aa

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Add comprehensive tests for GCS utilities

c545759

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Add GCS integration documentation and complete implementation

e04e4ab

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Set omnijaxstorage as default GCS bucket name

c175681

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Merge pull request #38 from findbhavin/copilot/add-gcs-support-to-omn…

9783a6b

…ijax Add GCS persistence for AI-native data collection

Add files via upload

874f231

Initial plan

2168295

Add table element reordering to fix DTD validation errors

246518e

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Fix tbody requirement for tables with thead - always add tbody with h…

c43c668

…idden row Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Address code review: use CSS class instead of inline styles, use None…

b083647

… for empty td.text Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Add comprehensive documentation for table DTD fix

1178284

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Merge pull request #40 from findbhavin/copilot/fix-table-validation-e…

6800089

…rrors Fix table DTD validation: enforce element order and tbody requirement

Initial plan

e241fde

Replace class attribute with data-dtd-compliance and filter complianc…

454244f

…e rows from HTML Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Add comprehensive unit tests for DTD compliance fixes

fbf9294

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Merge pull request #41 from findbhavin/copilot/fix-dtd-validation-error

81cbc52

[WIP] Fix DTD validation error for tr element class attribute

Initial plan

80c1a5c

Add data-* attribute stripping and enhance table CSS styling

8300cdb

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Add comprehensive documentation for data-* attribute stripping fix

bd00332

Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Address code review feedback - remove redundant CSS selector and fix …

890c0eb

…cross-platform paths Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>

Merge pull request #42 from findbhavin/copilot/fix-dtd-validation-err…

abe3f91

…ors-again Strip data-* attributes for DTD/XSD compliance and enhance table CSS

Initial plan

2f82548

Merge pull request #43 from findbhavin/copilot/add-conversion-id-lookup

f7b34c0

Add conversion ID lookup and fix PMC DTD validation errors

Initial plan

040c124

Copilot AI and others added 6 commits January 22, 2026 12:52

Merge pull request #44 from findbhavin/copilot/fix-pmc-style-check-er…

36886e1

…rors Fix PMC Style Check errors: Convert tex-math citation fragments to xref elements

Update MasterPipeline.py

0fe6340

Merge branch 'main' into Staging

aa17f3f

findbhavin merged commit 40d58d2 into main Jan 22, 2026
0 of 4 checks passed

gemini-code-assist bot reviewed Jan 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Staging#45

Staging#45
findbhavin merged 36 commits intomainfrom
Staging

findbhavin commented Jan 22, 2026

Uh oh!

gemini-code-assist bot commented Jan 22, 2026

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

findbhavin commented Jan 22, 2026

Uh oh!

gemini-code-assist bot commented Jan 22, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

templates/style.css (82-84)

TABLE_DTD_FIX_DOCUMENTATION.md (82-84)

PR_TABLE_FIX_SUMMARY.md (70)

IMPLEMENTATION_DETAILS.md (171-181)

TABLE_DTD_FIX_DOCUMENTATION.md (180-182)

Output files/article.xml (70-72)

Output files/article.xml (299-301)

TABLE_DTD_FIX_DOCUMENTATION.md (141-143)

Output files/article.xml (443-445)

PR_TABLE_FIX_SUMMARY.md (32-34)

Output files/article.xml (602-604)

TABLE_DTD_FIX_DOCUMENTATION.md (64)

DATA_ATTRIBUTE_STRIPPING_FIX.md (72)

IMPLEMENTATION_SUMMARY_TEX_MATH_FIX.md (33)

MasterPipeline.py (1211-1212)

TEX_MATH_CITATION_FIX.md (28)

IMPLEMENTATION_SUMMARY_TEX_MATH_FIX.md (94-96)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants