Conversation
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…ijax Add GCS persistence for AI-native data collection
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…idden row Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
… for empty td.text Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…rrors Fix table DTD validation: enforce element order and tbody requirement
…e rows from HTML Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
[WIP] Fix DTD validation error for tr element class attribute
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…cross-platform paths Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…ors-again Strip data-* attributes for DTD/XSD compliance and enhance table CSS
- Created tools/fetch_conversion.py for fetching conversion details by ID - Added GET /conversion/<conversion_id> API endpoint in app.py - Updated README.md with conversion ID format and debugging documentation - Script supports GCS file download and metrics retrieval Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
- Add unique id attributes to tex-math elements (texmath1, texmath2, etc.) - Add unique id attributes to mml:math elements (mml1, mml2, etc.) - Convert named-content anchors to proper ref elements in ref-list - Update xref rid attributes to lowercase (ref1 instead of Ref1) - Extract and preserve reference text from named-content context - Remove orphaned empty named-content elements - Add comprehensive test suite for DTD fixes Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
- Remove redundant re module import in MasterPipeline.py - Fix formatting issues in fetch_conversion.py for numeric fields - Replace deprecated datetime.utcnow() with datetime.now() - Add proper type checking before applying numeric formatting Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
- Create comprehensive IMPLEMENTATION_DETAILS.md documentation - Document conversion ID format and usage - Explain DTD validation fixes in detail - Include before/after examples - Add test coverage summary - Document impact and usage guidelines Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
Add conversion ID lookup and fix PMC DTD validation errors
- Implement _fix_tex_math_citations() method to convert problematic tex-math elements containing citation superscripts to PMC-compliant xref elements
- Handle single citations (^{5}), multiple citations (^{5,6}), and range citations (^{1-3})
- Preserve leading punctuation and tail text
- Create placeholder ref elements in ref-list for cited references
- Add comprehensive unit tests for all citation patterns
- Integration with _post_process_xml() method
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
- Add comprehensive integration tests for PMC compliance
- Test exact problematic pattern from issue (Line 515: .^{5,6})
- Verify mixed content handling (citations + real formulas)
- Add PMC Style Checker validation test (skipped if xsltproc not available)
- All tests passing successfully
Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
- Document problem, solution, implementation details - Include before/after examples - List all test coverage - Verify all acceptance criteria met - Add references to PMC guidelines and JATS documentation Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
- Complete end-to-end validation showing all features working - Implementation summary with before/after examples - All acceptance criteria met and verified - Ready for code review and merge Co-authored-by: findbhavin <19400906+findbhavin@users.noreply.github.com>
…rors Fix PMC Style Check errors: Convert tex-math citation fragments to xref elements
Summary of ChangesHello @findbhavin, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces several key enhancements to the OmniFormat XML JATS PMD conversion pipeline. It focuses on improving DTD compliance, adding Google Cloud Storage integration for data persistence, and providing debugging tools. The changes ensure that the generated XML files meet PMC submission requirements and improve the overall robustness and maintainability of the conversion process. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request introduces significant improvements across several areas, including robust Google Cloud Storage (GCS) integration for persistence and enhanced debugging capabilities with a new conversion ID lookup utility. Crucially, it addresses multiple DTD validation errors, particularly concerning tex-math and mml:math ID attributes, and the conversion of named-content to proper ref elements. The changes also include a comprehensive fix for HTML table DTD compliance, ensuring correct element ordering and the presence of tbody elements. However, there are inconsistencies in the documentation and example outputs regarding the handling of data-dtd-compliance attributes versus class="dtd-compliance-row" which need to be resolved to accurately reflect the implementation.
I am having trouble creating individual review comments. Click here to see my feedback.
templates/style.css (82-84)
This CSS rule targets tr.dtd-compliance-row. However, the MasterPipeline.py code (line 1246) sets data-dtd-compliance="true" on the XML element, not a class attribute. The _rebuild_table_section method (lines 2166-2168) also explicitly filters out rows with data-dtd-compliance="true" from the HTML output. This means these rows should not appear in the HTML at all, rendering this CSS rule ineffective for its stated purpose. Please update the Python code to set the class attribute if this styling is intended, or remove this CSS rule if the rows are filtered out entirely.
TABLE_DTD_FIX_DOCUMENTATION.md (82-84)
This CSS rule targets tr.dtd-compliance-row. As noted in previous comments, the Python code sets data-dtd-compliance="true", not a class attribute. This CSS rule will not apply to the elements generated by the code. Please update the CSS rule to tr[data-dtd-compliance="true"] or ensure the Python code sets the class attribute if this styling is intended.
PR_TABLE_FIX_SUMMARY.md (70)
This example XML output shows class="dtd-compliance-row". This is inconsistent with the MasterPipeline.py implementation (line 1246) which sets data-dtd-compliance="true". Additionally, the data-* attributes are stripped from the XML output (MasterPipeline.py lines 1761-1773). Please update this example to accurately reflect the attribute being set by the code and clarify its presence in the final XML.
IMPLEMENTATION_DETAILS.md (171-181)
The code snippet provided in the "Key Code" section is incomplete and contains undefined functions and variables (e.g., refs, extract_reference_text, create_ref_element, update_xrefs, remove_named_content). This makes the example misleading and unhelpful for understanding the implementation. Please provide a complete and accurate code example or refer to the actual implementation in MasterPipeline.py.
TABLE_DTD_FIX_DOCUMENTATION.md (180-182)
This example XML output shows class="dtd-compliance-row". This is inconsistent with the MasterPipeline.py implementation (line 1246) which sets data-dtd-compliance="true". Additionally, the data-* attributes are stripped from the XML output (MasterPipeline.py lines 1761-1773). Please update this example to accurately reflect the attribute being set by the code and clarify its presence in the final XML.
Output files/article.xml (70-72)
This example XML output shows class="dtd-compliance-row" for the placeholder <tr> element. However, the MasterPipeline.py code (line 1246) sets data-dtd-compliance="true", not a class attribute. Additionally, the data-* attributes are stripped from the XML output (MasterPipeline.py lines 1761-1773). This example output is inconsistent with the actual implementation and the stated stripping logic. Please update this example to reflect the correct attribute (data-dtd-compliance="true") and clarify if these attributes are intended to be present in the final XML or stripped.
Output files/article.xml (299-301)
Similar to the previous comment, this example XML output shows class="dtd-compliance-row" which contradicts the implementation setting data-dtd-compliance="true" and the subsequent stripping of data-* attributes. Please ensure example outputs accurately reflect the code's behavior.
TABLE_DTD_FIX_DOCUMENTATION.md (141-143)
This example XML output shows class="dtd-compliance-row". This is inconsistent with the MasterPipeline.py implementation (line 1246) which sets data-dtd-compliance="true". Additionally, the data-* attributes are stripped from the XML output (MasterPipeline.py lines 1761-1773). Please update this example to accurately reflect the attribute being set by the code and clarify its presence in the final XML.
Output files/article.xml (443-445)
This example XML output shows class="dtd-compliance-row" which contradicts the implementation setting data-dtd-compliance="true" and the subsequent stripping of data-* attributes. Please ensure example outputs accurately reflect the code's behavior.
PR_TABLE_FIX_SUMMARY.md (32-34)
This CSS rule targets tr.dtd-compliance-row. However, the MasterPipeline.py code (line 1246) sets data-dtd-compliance="true" on the XML element, not a class attribute. The _rebuild_table_section method (lines 2166-2168) also filters based on data-dtd-compliance="true". This means the CSS rule as written here will not apply to the rows generated by the Python code. Please update the CSS rule to tr[data-dtd-compliance="true"] or ensure the Python code sets the class attribute if this styling is intended.
Output files/article.xml (602-604)
This example XML output shows class="dtd-compliance-row" which contradicts the implementation setting data-dtd-compliance="true" and the subsequent stripping of data-* attributes. Please ensure example outputs accurately reflect the code's behavior.
TABLE_DTD_FIX_DOCUMENTATION.md (64)
This code snippet in the documentation shows tr.set('class', 'dtd-compliance-row'). However, the actual implementation in MasterPipeline.py (line 1246) sets tr.set('data-dtd-compliance', 'true'). This is a significant inconsistency between the documentation and the code. Please update the documentation to reflect the correct attribute being set.
DATA_ATTRIBUTE_STRIPPING_FIX.md (72)
This CSS rule targets tr.dtd-compliance-row and tr[data-dtd-compliance="true"]. However, the MasterPipeline.py code (line 1246) sets data-dtd-compliance="true" on the XML element, and the _rebuild_table_section method (lines 2166-2168) explicitly filters out rows with data-dtd-compliance="true" from the HTML output. This means these rows should not appear in the HTML at all, rendering this CSS rule ineffective for its stated purpose. Please clarify if these rows are intended to be present in the HTML and hidden, or if they are filtered out entirely. If filtered out, this CSS rule is unnecessary and misleading in the documentation.
IMPLEMENTATION_SUMMARY_TEX_MATH_FIX.md (33)
The example for range citations ^{1-3} shows a conversion to <sup><xref>1</xref>-<xref>2</xref>-<xref>3</xref></sup>. While the implementation in MasterPipeline.py (lines 1053-1056) supports this, rendering a range like 1-3 as 1-2-3 with multiple dashes is an unusual format. Typically, ranges are rendered as 1-3 or 1,2,3. Please confirm if this multi-dash format is the desired and PMC-compliant output, or if it should be adjusted to a more standard representation.
MasterPipeline.py (1211-1212)
These lines contain only whitespace. Please remove them to maintain code cleanliness and adhere to standard formatting practices.
TEX_MATH_CITATION_FIX.md (28)
The example for range citations ^{1-3} shows a conversion to <sup><xref>1</xref>-<xref>2</xref>-<xref>3</xref></sup>. While the implementation in MasterPipeline.py (lines 1053-1056) supports this, rendering a range like 1-3 as 1-2-3 with multiple dashes is an unusual format. Typically, ranges are rendered as 1-3 or 1,2,3. Please confirm if this multi-dash format is the desired and PMC-compliant output, or if it should be adjusted to a more standard representation.
IMPLEMENTATION_SUMMARY_TEX_MATH_FIX.md (94-96)
This example further illustrates the multi-dash rendering for range citations (10-11-12-13-14-15). As noted previously, this is an unconventional format for a citation range. Please confirm if this is the intended output or if a more standard representation (e.g., 10-15 or 10,11,12,13,14,15) is preferred for PMC compliance and readability.
No description provided.