OmniJAX is a tool that converts your Microsoft Word documents into professional formats required for publishing scientific articles in academic journals and digital libraries. Think of it as a translator that takes your familiar Word document and converts it into specialized formats that publishers need, while ensuring everything meets strict formatting standards.
In simple terms: Upload your Word document → Get publication-ready files that meet journal requirements.
- Researchers & Authors: Submit manuscripts to journals without worrying about complex formatting requirements
- Publishers: Convert submitted manuscripts to standardized formats for your digital library
- Academic Institutions: Prepare research papers and dissertations for institutional repositories
- Students: Format thesis and research papers according to publication standards
Before diving into the details, here are the key terms you'll encounter:
| Term | Simple Explanation |
|---|---|
| JATS (Journal Article Tag Suite) | A standardized format for scientific articles. Think of it like HTML but specifically designed for research papers. It ensures your article can be properly displayed and searched in digital libraries. |
| PMC (PubMed Central) | A free digital library run by the US National Institutes of Health that hosts biomedical and life sciences research. It requires articles to be submitted in specific formats. |
| XML (eXtensible Markup Language) | A structured format for organizing information, similar to HTML. It uses tags like <title> and <author> to mark up different parts of your document. |
| DTD (Document Type Definition) | A set of rules that define what's allowed in an XML document, like a grammar book for documents. |
| XSD (XML Schema Definition) | Another way to define rules for XML documents, more modern than DTD. |
| Validation | The process of checking if your document follows all the required rules and standards. |
| Metadata | Information about your article (like title, authors, publication date) that helps people find and cite it. |
- ✅ Easy to Use: Simply upload your Word document and download the converted files
- ✅ No Formatting Expertise Needed: The tool automatically handles complex formatting requirements
- ✅ Time Saving: No manual reformatting or learning complex XML editors
- ✅ Error Prevention: Automatic validation ensures your submission meets journal requirements
- ✅ Multiple Outputs: Get several versions of your document for different purposes
- ✅ Standards Compliant: Full JATS 1.4 and PMC compliance
- ✅ Automated Validation: Built-in validation against official schemas
- ✅ Extensible Pipeline: Modular Python architecture for customization
- ✅ Multiple Output Formats: XML and HTML generation
- ✅ API Access: RESTful API for integration with other systems
Using the Web Interface (Easiest for most users):
- Open your web browser
- Navigate to the OmniJAX website (URL provided by your administrator)
- You'll see a simple upload interface
Using the Command Line (For technical users):
python app.py
# Then open http://localhost:8080 in your browser
- Click the "Choose File" button or drag-and-drop your Word document onto the page
- Supported format: Microsoft Word (.docx) files
- Maximum file size: 50 MB
- Click "Convert & Download Package"
The tool will show you a progress bar with updates like:
- "Processing document..." (0-20%)
- "Converting to JATS XML..." (20-40%)
- "Generating HTML..." (40-80%)
- "Validating output..." (80-100%)
This usually takes 30 seconds to 2 minutes depending on document size.
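As a rough illustration of how the percentage ranges above map to messages, here is a minimal Python sketch. The function name `stage_for_progress` is hypothetical, and treating each upper bound as exclusive is an assumption, since the listed ranges share their boundary values.

```python
# Stage boundaries taken from the progress messages listed above.
# Treating each upper bound as exclusive (except 100) is an assumption.
STAGES = [
    (20, "Processing document..."),
    (40, "Converting to JATS XML..."),
    (80, "Generating HTML..."),
    (100, "Validating output..."),
]


def stage_for_progress(percent: int) -> str:
    """Return the progress-bar message for a percentage between 0 and 100."""
    if not 0 <= percent <= 100:
        raise ValueError(f"progress out of range: {percent}")
    for upper_bound, message in STAGES:
        if percent < upper_bound:
            return message
    return STAGES[-1][1]  # exactly 100% still belongs to the final stage


print(stage_for_progress(35))  # Converting to JATS XML...
```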
Once complete (100%), click "Download Package" to get a ZIP file containing:
- 📄 article.xml - JATS XML for schema validation (XSD-compliant)
- 📄 articledtd.xml - JATS XML for PMC submission (DTD-compliant)
- 📄 article.html - HTML version with embedded images, for viewing in your browser
- 📁 media/ - All extracted images from your document
- 📄 validation_report.json - Detailed validation results
- 📄 README.txt - Package documentation that explains what each file is for
┌─────────────────────┐
│ Your Word Document  │
│   (article.docx)    │
└──────────┬──────────┘
           │
           │ Upload to OmniJAX
           ▼
┌──────────────────────┐
│  OmniJAX Converter   │
│                      │
│ • Reads content      │
│ • Structures data    │
│ • Applies standards  │
│ • Validates format   │
└──────────┬───────────┘
           │
           │ Generates multiple outputs
           ▼
┌──────────────────────────────────────────────┐
│           Publication-Ready Files            │
├──────────────────────────────────────────────┤
│                                              │
│ ├─ article.html (For viewing)                │
│ │    Web-ready HTML version                  │
│ │                                            │
│ ├─ article.xml (For validation)              │
│ │    Technical format for quality checks     │
│ │                                            │
│ ├─ articledtd.xml (For PMC submission)       │
│ │    Format required by PubMed Central       │
│ │                                            │
│ └─ validation_report.json (Quality report)   │
│      Shows what passed and what needs review │
│                                              │
└──────────────────────────────────────────────┘
Goal: Submit your biomedical research article to PMC
Steps:
- Upload your Word manuscript to OmniJAX
- Download the converted package
- Review validation_report.json to see if any issues were found
- Upload articledtd.xml to the PMC Style Checker
- If validation passes, submit to PMC
- If there are warnings, review and fix them in your original Word document, then re-convert
Goal: Get a web-ready HTML version of your article
Steps:
- Upload your Word document to OmniJAX
- Download the converted package
- Use article.html, which has professional styling with proper tables, figures, and formatting
- This HTML follows publication standards and looks great in any web browser
Goal: Submit to a journal that requires JATS XML format
Steps:
- Upload your manuscript to OmniJAX
- Download the package
- Submit article.xml to your journal's submission system
- Include article.html as a preview version
- The journal can validate your XML against their requirements
The tool reads your Word document and identifies:
- Title and authors
- Abstract and keywords
- Main sections (Introduction, Methods, Results, etc.)
- Tables and figures
- References
- Acknowledgments
Converts your document into multiple professional formats:
- JATS XML: The standard format for scientific articles
- HTML: Web-friendly version with images
Automatically validates that your converted document:
- Meets JATS 1.4 standards (the current version of the scientific article format)
- Follows PMC requirements (if you're submitting to PubMed Central)
- Has proper structure (sections, metadata, citations)
- Contains all required elements
The tool enhances your document with:
- Professional table styling (borders, colors, spacing)
- Proper figure sizing and alignment
- Correct reference formatting
- Standardized metadata (author info, publication details)
Any content added by the tool for compliance is highlighted in yellow with a 📋 icon, so you can:
- See exactly what was added versus what was in your original document
- Review and update these sections with your specific information
- Understand why certain elements were added
If your document is missing required elements (like an abstract or specific metadata), OmniJAX can:
- Detect what's missing
- Add placeholder content that meets requirements
- Highlight these additions so you can review them
- Ensure your document passes validation
Different users need different things:
- Researchers: Get validated JATS XML
- Publishers: Get validated JATS XML
- Reviewers: Get easy-to-read HTML versions
- Archives: Get properly structured XML for long-term preservation
Watch the conversion happen:
- See exactly what step is running
- Know how long it will take
- Get immediate feedback if something goes wrong
| File Name | What It's For | When to Use It |
|---|---|---|
| article.html | HTML version for viewing | Viewing in a web browser, sharing online |
| article.xml | Technical XML file | For journal submissions |
| validation_report.json | Quality check results | See if there are any issues to fix |
| README.txt | File descriptions | Learn what each file does |
| File Name | Format | Purpose | Standards |
|---|---|---|---|
| article.xml | JATS XML 1.4 | Schema validation, XSD tools | No DOCTYPE, includes xsi:schemaLocation |
| articledtd.xml | JATS XML 1.4 | PMC submission, DTD validation | Includes DOCTYPE declaration |
| article.html | HTML5 | Web display | W3C compliant |
| media/ | Images | Extracted figures | Referenced from HTML/XML |
| validation_report.json | JSON | Validation results | JATS/PMC compliance report |
OmniJAX provides a RESTful API for integration with other systems:
curl -X POST -F "file=@document.docx" http://localhost:8080/convert
Response:
{
"conversion_id": "20260120_103000_abcd1234",
"status": "queued",
"message": "Conversion started"
}
curl http://localhost:8080/status/20260120_103000_abcd1234
Response:
{
"status": "processing",
"progress": 40,
"message": "Validating JATS XML",
"filename": "document.docx"
}
curl -O http://localhost:8080/download/20260120_103000_abcd1234
Get comprehensive conversion information including file paths and metrics:
curl http://localhost:8080/conversion/20260120_152731_42a34914
Response:
{
"conversion_id": "20260120_152731_42a34914",
"status": "completed",
"filename": "article.docx",
"processing_time": 45.23,
"input_size_mb": 2.5,
"output_size_mb": 3.1,
"input_file_gcs_path": "gs://omnijaxstorage/inputs/20260120_152731_42a34914_article.docx",
"output_file_gcs_path": "gs://omnijaxstorage/outputs/OmniJAX_20260120_152731_42a34914_article.zip",
"timestamp": "2026-01-20T15:28:16.123456Z"
}
Every conversion is assigned a unique Conversion ID for tracking and debugging purposes.
YYYYMMDD_HHMMSS_<8-char-hex>
Components:
- YYYYMMDD: Date (e.g., 20260120 for January 20, 2026)
- HHMMSS: Time in 24-hour format (e.g., 152731 for 3:27:31 PM)
- <8-char-hex>: Random 8-character hexadecimal string for uniqueness
Example: 20260120_152731_42a34914
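The ID format described above can be generated and checked with a few lines of Python. This is a minimal sketch, not the code OmniJAX itself uses: the helper names `new_conversion_id` and `parse_conversion_id` are hypothetical, and the lowercase-hex suffix is an assumption based on the example ID.

```python
import re
import secrets
from datetime import datetime

# Pattern for YYYYMMDD_HHMMSS_<8-char-hex>; lowercase hex is an assumption
# based on the example ID.
CONVERSION_ID_RE = re.compile(r"(\d{8})_(\d{6})_([0-9a-f]{8})")


def new_conversion_id(now: datetime) -> str:
    """Build an ID in the documented format from a timestamp."""
    return f"{now:%Y%m%d_%H%M%S}_{secrets.token_hex(4)}"


def parse_conversion_id(conversion_id: str) -> tuple:
    """Split an ID into (timestamp, random suffix), or raise ValueError."""
    match = CONVERSION_ID_RE.fullmatch(conversion_id)
    if match is None:
        raise ValueError(f"not a conversion ID: {conversion_id!r}")
    date_part, time_part, suffix = match.groups()
    # strptime also rejects impossible dates such as month 13.
    created = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return created, suffix


created, suffix = parse_conversion_id("20260120_152731_42a34914")
print(created.isoformat(), suffix)  # 2026-01-20T15:27:31 42a34914
```

Validating an ID before passing it to the API or to fetch_conversion.py catches copy-and-paste mistakes early.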
When debugging a conversion issue, you can retrieve all information about a conversion using its ID:
Method 1: Using the API endpoint
# Get conversion details
curl http://localhost:8080/conversion/20260120_152731_42a34914
# Or with pretty formatting
curl http://localhost:8080/conversion/20260120_152731_42a34914 | jq
Method 2: Using the fetch_conversion.py script
The fetch_conversion.py script provides a convenient way to fetch conversion information and download files from Google Cloud Storage (GCS).
# Get conversion information
python tools/fetch_conversion.py 20260120_152731_42a34914
# Download files locally for inspection
python tools/fetch_conversion.py 20260120_152731_42a34914 --download
# Download to a specific directory
python tools/fetch_conversion.py 20260120_152731_42a34914 --download --output-dir /tmp/debug
# Output as JSON for scripting
python tools/fetch_conversion.py 20260120_152731_42a34914 --json
Sample Output:
======================================================================
CONVERSION ID: 20260120_152731_42a34914
======================================================================
📥 INPUT FILE:
Path: gs://omnijaxstorage/inputs/20260120_152731_42a34914_article.docx
Size: 2.50 MB
Created: 2026-01-20T15:27:31.123456Z
Content Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
📤 OUTPUT FILE:
Path: gs://omnijaxstorage/outputs/OmniJAX_20260120_152731_42a34914_article.zip
Size: 3.10 MB
Created: 2026-01-20T15:28:16.123456Z
Content Type: application/zip
📊 CONVERSION METRICS:
Status: completed
Filename: article.docx
Processing Time: 45.23s
Input Size: 2.50 MB
Output Size: 3.10 MB
Timestamp: 2026-01-20T15:28:16.123456Z
======================================================================
Scenario 1: Conversion failed and you need to inspect the input file
# Download the input file that caused the failure
python tools/fetch_conversion.py <conversion_id> --download
Scenario 2: Need to verify output for a specific conversion
# Download both input and output files
python tools/fetch_conversion.py <conversion_id> --download --output-dir ./debug_conv_123
Scenario 3: Checking conversion status programmatically
# Get JSON output for scripting
python tools/fetch_conversion.py <conversion_id> --json > conversion_info.json
Scenario 4: Finding conversion details from logs
# If you have the conversion ID from logs, get full details
curl http://localhost:8080/conversion/<conversion_id>
from MasterPipeline import HighFidelityConverter
# Initialize with your Word document
converter = HighFidelityConverter('path/to/document.docx')
# Run the complete conversion
converter.run()
# All outputs are saved in the output directory
print(f"Files generated in: {converter.output_dir}")
from MasterPipeline import HighFidelityConverter
converter = HighFidelityConverter('document.docx')
# Step 1: Convert to JATS XML
converter.convert_to_jats() # Creates article.xml
# Step 2: Validate the XML
converter.validate_jats() # Checks against JATS schema
# Step 3: Add PMC compliance
converter.add_doctype() # Creates articledtd.xml
# Step 4: Generate HTML
converter.convert_to_html() # Creates article.html + media/
# Step 5: Run all validations
converter.validate_all() # Creates validation_report.json
Possible Causes:
- Word document is corrupted
- File is too large (>50 MB)
- Document contains unsupported elements
Solutions:
- Try opening and re-saving your Word document
- Reduce file size by compressing images
- Remove any embedded objects that might cause issues
- Check the error message for specific details
What It Means: Your document converted successfully, but some elements don't perfectly match publication standards.
Solutions:
- Open validation_report.json to see specific warnings
- Most warnings are minor and don't prevent submission
- For critical issues, update your Word document and re-convert
- Consult your target journal's submission guidelines
What It Means: The tool added these elements to meet formatting requirements.
Solutions:
- Review each highlighted section
- Replace placeholder text with your actual information
- If an element isn't needed, note it in your submission
- This is normal and helps ensure compliance
Possible Causes:
- Very large document taking time to process
- Server is busy
- Network connection issue
Solutions:
- Wait a few more minutes (large documents can take 2-5 minutes)
- Refresh the page and check if conversion completed
- Try uploading again
- Check your internet connection
Possible Causes:
- Conversion hasn't finished yet
- Download interrupted
Solutions:
- Wait for progress to reach 100% before downloading
- Try downloading again
- Check your downloads folder for previous attempts
- Disable download managers that might interfere
If you want to install and run OmniJAX on your own system:
System Requirements:
- Operating System: Linux, macOS, or Windows with WSL2
- RAM: At least 4GB (8GB recommended)
- Disk Space: 2GB free
- Internet connection for initial setup
Software Requirements:
- Python 3.11 or newer
- Pandoc 3.x (document converter)
Detailed Setup: See SETUP.md for complete installation instructions.
When you convert a document, OmniJAX generates a validation_report.json file. Here's what it tells you:
{
"jats_validation": {
"status": "PASS",
"message": "Your XML meets JATS 1.4 standards ✓"
},
"pmc_compliance": {
"status": "PASS",
"message": "Ready for PMC submission ✓",
"warnings": [
"Consider adding keywords for better searchability"
]
},
"document_structure": {
"tables": 3,
"figures": 5,
"references": 25,
"sections": 6
}
}
- PASS: Everything is good! Your document meets all requirements.
- WARNING: Document converted successfully, but there are suggestions for improvement. Usually safe to proceed.
- FAIL: There are critical issues that need to be fixed before submission.
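To act on these statuses in a script, the report can be reduced to a single overall status plus a list of warnings. The sketch below assumes the simplified report layout shown above, where each checked section carries a "status" key and an optional "warnings" list. The function name `summarize_report` is hypothetical, and treating an unknown status as FAIL is a conservative assumption.

```python
import json


def summarize_report(report: dict) -> tuple:
    """Return (worst status found, all warning strings) for a report."""
    severity = {"PASS": 0, "WARNING": 1, "FAIL": 2}
    worst = "PASS"
    warnings = []
    for section in report.values():
        # Sections without a "status" key (e.g. document_structure) are skipped.
        if not isinstance(section, dict) or "status" not in section:
            continue
        status = section["status"]
        if status not in severity:
            status = "FAIL"  # assumption: unknown statuses are treated as failures
        if severity[status] > severity[worst]:
            worst = status
        warnings.extend(section.get("warnings", []))
    return worst, warnings


sample = json.loads("""
{
  "jats_validation": {"status": "PASS"},
  "pmc_compliance": {
    "status": "WARNING",
    "warnings": ["Missing keywords"]
  },
  "document_structure": {"tables": 3, "figures": 5}
}
""")
print(summarize_report(sample))  # ('WARNING', ['Missing keywords'])
```

For a real package, load the file with `json.load(open("validation_report.json"))` instead of the inline sample.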
| Warning | What It Means | Action Needed |
|---|---|---|
| "Missing keywords" | Your article should have keywords for searchability | Add keywords to your Word document |
| "Author affiliation incomplete" | Author institution info is partial | Add full institution details |
| "Figure caption needs alt text" | Figures need descriptions for accessibility | Add descriptive captions |
| "Reference formatting inconsistent" | Citations aren't uniform | Check reference list formatting |
| "DOI placeholder detected" | Article needs a real DOI | Get DOI from publisher or leave for now |
- Convert your document with OmniJAX
- Review validation report - Open validation_report.json and check for any critical issues
- Test with PMC Style Checker:
  - Go to: https://pmc.ncbi.nlm.nih.gov/tools/stylechecker/
  - Upload articledtd.xml (not article.xml)
  - Review results
- Fix any errors in your original Word document and re-convert
- Submit to PMC using their online system
- Include:
  - articledtd.xml (the XML file)
  - Media files from the media/ folder if applicable
Different journals have different requirements:
- Check journal guidelines - See what format they want (XML or HTML)
- Use the appropriate file:
  - JATS XML required? → Use article.xml
  - HTML preferred? → Use article.html
  - PMC/NLM compliance needed? → Use articledtd.xml
- Include supplementary materials:
  - Upload images from media/ folder if requested
  - Attach validation_report.json if journal wants proof of validation
Q: Is OmniJAX free to use? A: OmniJAX is open-source software. Your institution or organization may provide access, or you can install it on your own server.
Q: What types of documents work best? A: Research articles, review papers, case studies, and technical reports. The document should have a clear structure with sections like Introduction, Methods, Results, etc.
Q: Can I convert multiple documents at once? A: Currently, you need to convert one document at a time. For batch processing, use the API or Python library.
Q: How long does conversion take? A: Usually 30 seconds to 2 minutes, depending on document size and complexity. Large documents with many images may take up to 5 minutes.
Q: Is my document data kept private? A: Check with your system administrator. In a self-hosted setup, all data stays on your server.
Q: Does it work with older Word formats (.doc)? A: OmniJAX requires .docx format (Word 2007 and newer). To convert .doc files:
- Open in Microsoft Word
- Save As → Word Document (.docx)
- Then use OmniJAX
Q: Can I convert Google Docs? A: Yes, but you need to download first:
- In Google Docs: File → Download → Microsoft Word (.docx)
- Upload the downloaded .docx file to OmniJAX
Q: What about LibreOffice or OpenOffice documents? A: Save your document as .docx format first, then use OmniJAX.
Q: What's the difference between article.xml and articledtd.xml? A: Both have the same content, but:
- article.xml - For general XML validation and modern tools
- articledtd.xml - Required for PMC Style Checker and PMC submission
Q: Can I edit the XML files? A: Yes, but you'll need an XML editor. For most users, it's easier to edit the Word document and re-convert.
Q: The HTML doesn't look right. What should I do? A:
- Check if the original Word document looks correct
- If the Word doc looks wrong, fix formatting there first
- Re-upload and convert again
- If issues persist, check if your Word doc has unusual formatting
Q: Validation failed. Can I still submit? A: Depends on the error:
- Warnings: Usually okay to proceed, but review them
- Critical errors: Need to fix before submission
- Check your target journal's requirements
Q: Why is some text highlighted in yellow? A: This shows content that OmniJAX added to meet formatting standards. Review and update these sections with your actual information.
- This README: Overview and user guide
- SETUP.md: Installation instructions
- TESTING_GUIDE.md: For developers and testers
- JATS Website: https://jats.nlm.nih.gov/ - Learn about the JATS standard
- PMC Guidelines: https://pmc.ncbi.nlm.nih.gov/tagging-guidelines/ - PMC formatting requirements
- PMC Style Checker: https://pmc.ncbi.nlm.nih.gov/tools/stylechecker/ - Validate your XML
For technical support:
- Check the troubleshooting section above
- Review the validation report for specific errors
- Contact your system administrator
- For development issues, check the project repository
Interested in improving OmniJAX or adapting it for your needs?
- Repository: View the source code and contribute
- Testing: See TESTING_GUIDE.md for how to run tests
- Architecture: The tool is built in Python with a modular pipeline
If you need additional features:
- Check if there's an existing issue
- Create a new feature request with use case
- Consider contributing code if you have the skills
The following sections contain detailed technical information for advanced users, developers, and system administrators.
- JATS 1.4 Publishing DTD: https://public.nlm.nih.gov/projects/jats/publishing/1.4/
- PMC Tagging Guidelines: https://pmc.ncbi.nlm.nih.gov/tagging-guidelines/article/style/
- PMC Style Checker: https://pmc.ncbi.nlm.nih.gov/tools/stylechecker/
- Validates against official NLM XSD schemas
- Full PMC/NLM Style Checker compatibility
- Proper namespace declarations (XLink, MathML)
- xsi:schemaLocation injection for external validators
- MathML 2.0/3.0 support
- Automated PMC requirements checking
- Integrated PMC Style Checker XSLT validation
- DOI and metadata validation
- Author affiliation structure verification
- Table positioning (float/anchor)
- Figure and caption compliance
- Reference formatting validation
- Professional Table Styles: Enhanced borders, colors, and spacing for better readability
- Alternating row colors for improved visual clarity
- Professional header styling with subtle blue accents
- Optimized padding and spacing for clean presentation
- Smaller table font size (10pt) for better content fit
- Optimized Margins: Reduced left/right margins (0.5in) for better space utilization
- Enhanced Font Handling: CSS variables for consistent font usage across document
- Primary font stack: Liberation Serif, Times New Roman, DejaVu Serif
- Header font stack: Liberation Sans, Arial, Helvetica
- Enhanced Image Handling: Proper sizing and alignment with automatic aspect ratio preservation
- Compliance Text Highlighting: Visual indicators for DTD/PMC compliance additions
- Real-time progress updates during conversion
- Non-blocking file uploads
- Status polling via REST API
- Separate download endpoint for completed conversions
- Modern drag-and-drop UI with progress bar
- Fixes truncated headers
- Ensures PMC metadata requirements
- Validates accessibility compliance
- Proper author formatting with affiliations
- Special character encoding
- Professional content formatting for consistency
- Compliance Text Marking: AI-added content for compliance is automatically marked
- Table captions with proper positioning
- Media extraction to /media folder
- Superscript/subscript preservation
- Section ID generation
- Comprehensive validation reporting
.
├── MasterPipeline.py # Main conversion pipeline with JATS 1.4 compliance
├── app.py # Flask web application with async endpoints
├── Dockerfile # Container configuration
├── requirements.txt # Python dependencies
├── JATS-journalpublishing-*.xsd # JATS schema files
├── pmc-stylechecker/ # PMC Style Checker XSLT files
│ └── README.md # Installation instructions
├── templates/
│ ├── index.html # Modern async upload interface
│ └── style.css # PMC-compliant HTML styling
├── standard-modules/ # JATS XSD modules
│ ├── mathml2/ # MathML 2.0 schema
│ ├── xlink.xsd # XLink schema
│ └── xml.xsd # XML namespace schema
└── tools/
├── safe_render.py # Validation and rendering tool
└── add_doctype.py # DOCTYPE declaration utility for PMC validation
The converter ensures all PMC-required elements are present:
- Article Root
  - dtd-version="1.4"
  - article-type attribute
  - XLink namespace: xmlns:xlink="http://www.w3.org/1999/xlink"
  - MathML namespace: xmlns:mml="http://www.w3.org/1998/Math/MathML"
- Front Matter
  - <journal-meta> with journal information
  - <article-meta> with:
    - DOI (<article-id pub-id-type="doi">)
    - Article title
    - Author contributions with proper affiliations
    - Abstract
    - Publication date
    - Keywords
- Body Structure
  - Properly nested <sec> elements with IDs
  - Section titles
  - Proper table and figure formatting
- Back Matter
  - References with unique IDs
  - Acknowledgments
  - Author contributions
  - Funding information
The pipeline performs comprehensive PMC compliance checks:
- DTD version validation
- Required metadata presence
- Author affiliation structure
- Table positioning and caption placement
- Figure elements and captions
- Reference formatting
- Section ID attributes
- Special character encoding
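A few of the checks listed above can be approximated with the Python standard library. The sketch below is illustrative only and is not the pipeline's own checker: the function name `check_pmc_basics` is hypothetical, the sample DOI is a placeholder, and only the DTD version, article type, DOI presence, and section IDs are covered.

```python
import xml.etree.ElementTree as ET


def check_pmc_basics(jats_xml: str) -> list:
    """Return human-readable problems for a few of the checks listed above."""
    root = ET.fromstring(jats_xml)
    problems = []
    if root.get("dtd-version") != "1.4":
        problems.append("article/@dtd-version is not 1.4")
    if not root.get("article-type"):
        problems.append("article/@article-type is missing")
    # A DOI is expected as <article-id pub-id-type="doi"> inside <article-meta>.
    if root.find("./front/article-meta/article-id[@pub-id-type='doi']") is None:
        problems.append("no DOI in article-meta")
    # Every <sec>, at any nesting depth, should carry an id attribute.
    for index, sec in enumerate(root.iter("sec"), start=1):
        if not sec.get("id"):
            problems.append(f"section {index} has no id")
    return problems


sample = """
<article dtd-version="1.4" article-type="research-article">
  <front><article-meta>
    <article-id pub-id-type="doi">10.0000/placeholder</article-id>
  </article-meta></front>
  <body>
    <sec id="sec1"><title>Introduction</title></sec>
    <sec><title>Methods</title></sec>
  </body>
</article>
"""
print(check_pmc_basics(sample))  # ['section 2 has no id']
```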
Tables are formatted according to PMC requirements with enhanced professional styling:
PMC Compliance:
- position="float" or position="anchor" (not "top")
- Caption as first child element
- Proper label for table numbers
- Minimal use of colspan/rowspan
Enhanced Professional Styling:
- Professional borders (#666) with subtle box shadows for depth
- Header row styling with light blue background (#e8f0f7) and accent border (#4a90d9)
- Alternating row colors (#f9f9f9) for improved readability
- Hover effects for interactive viewing
- Optimized padding (8px-10px) and tighter line-height (1.3) for clean presentation
- Smaller table font size (10pt) for better content fit
- Word-wrap handling for long content
- All styling preserves PMC/DTD compliance and does not alter content
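The two structural table rules above (allowed position values and caption placement) can be checked on generated XML with a short script. This is a sketch under stated assumptions: `check_table_wraps` is a hypothetical helper, and "caption as first child" is interpreted here as "the caption precedes the table, with at most a label in front of it".

```python
import xml.etree.ElementTree as ET

ALLOWED_POSITIONS = {"float", "anchor"}


def check_table_wraps(jats_xml: str) -> list:
    """Check position and caption placement for each <table-wrap>."""
    problems = []
    root = ET.fromstring(jats_xml)
    for wrap in root.iter("table-wrap"):
        name = wrap.get("id", "<no id>")
        if wrap.get("position") not in ALLOWED_POSITIONS:
            problems.append(f"{name}: position must be float or anchor")
        children = [child.tag for child in wrap]
        if "caption" not in children:
            problems.append(f"{name}: caption is missing")
            continue
        # Assumption: only an optional <label> may appear before the caption.
        before_caption = children[: children.index("caption")]
        if any(tag != "label" for tag in before_caption):
            problems.append(f"{name}: caption must come before the table")
    return problems


sample = """
<body>
  <table-wrap id="t1" position="float">
    <label>Table 1</label><caption><p>Good</p></caption><table/>
  </table-wrap>
  <table-wrap id="t2" position="top">
    <table/><caption><p>Misplaced</p></caption>
  </table-wrap>
</body>
"""
for problem in check_table_wraps(sample):
    print(problem)
```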
Figures include enhanced sizing and alignment:
- Unique ID attributes
- Label elements for figure numbers
- Caption elements with descriptions
- Proper graphic references with XLink namespace
- Enhanced Sizing: Maximum width of 90% to prevent oversizing, maximum height of 500pt to prevent page overflow
- Aspect Ratio Preservation: object-fit: contain ensures proper proportions
- Professional Alignment: Centered with optimized margins for clean presentation
To ensure transparency and facilitate review, any text or elements added by the AI system specifically for DTD/PMC compliance are automatically highlighted in the generated HTML output.
- AI Marking: When the AI repair system adds content for compliance (e.g., mandatory DOI elements, journal metadata), it marks them with the data-compliance="true" attribute
- Visual Highlighting: Marked content appears with:
- Light yellow background (#fff9e6)
- Orange left border (3px, #ff9900)
- Compliance icon (📋) prefix
Compliance text may include:
- Journal metadata elements added for PMC requirements
- DOI placeholders when not present in source document
- Abstract sections added for compliance
- Required front matter elements
- Structural elements needed for DTD validation
When reviewing the generated HTML:
- ✅ Yellow highlighted sections = Content added for DTD/PMC compliance
- ⚠️ Original content = Remains unhighlighted and unmodified
- 📋 Icon indicates compliance-related additions
This feature allows you to:
- Easily identify what was added versus what was in the original document
- Review compliance additions before final submission
- Update highlighted sections with actual document-specific information
- Maintain transparency in the conversion process
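Because compliance additions carry the data-compliance="true" attribute, they can also be listed programmatically instead of being spotted by eye. The following sketch uses only the standard library. The class name `ComplianceFinder` and the sample HTML are hypothetical, and the real article.html markup may differ.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "input", "link"}


class ComplianceFinder(HTMLParser):
    """Collect the text of every element marked data-compliance="true"."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside a marked element
        self.current = []   # text fragments of the element being read
        self.found = []     # finished text of each marked element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("data-compliance") == "true":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.found.append("".join(self.current).strip())
            self.current = []

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


sample_html = """
<p>Original abstract text.</p>
<p data-compliance="true">DOI placeholder: <em>to be assigned</em></p>
"""
finder = ComplianceFinder()
finder.feed(sample_html)
print(finder.found)  # ['DOI placeholder: to be assigned']
```

A list like this can serve as a checklist of placeholders to replace before submission.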
Each conversion generates a complete package with enhanced professional styling:
- article.xml - JATS 1.4 Publishing DTD XML with xsi:schemaLocation (without DOCTYPE for XSD validation)
- articledtd.xml - JATS 1.4 Publishing DTD XML with DOCTYPE declaration (for PMC Style Checker validation)
- article.html - HTML version with enhanced styling:
- Optimized Margins: 0.75in vertical, 0.65in horizontal for better space utilization
- Professional Tables: Enhanced borders, colors, and spacing
- Enhanced Images: Proper sizing with max-width 90%, max-height 500pt, aspect ratio preservation
- Compliance Highlighting: Yellow background for compliance-added text
- media/ - All extracted images
- validation_report.json - Detailed validation report with:
- JATS schema validation results
- PMC compliance check results
- PMC Style Checker results (if available)
- Critical issues and warnings
- Document structure analysis
- PMC submission checklist
- README.txt - Package documentation
The validation report includes:
{
"jats_validation": {
"status": "PASS/FAIL",
"target_version": "JATS 1.4",
"official_schema": "https://public.nlm.nih.gov/projects/jats/publishing/1.4/"
},
"pmc_compliance": {
"status": "PASS/WARNING",
"reference": "https://pmc.ncbi.nlm.nih.gov/tagging-guidelines/article/style/",
"details": {
"critical_issues": [],
"warnings": [],
"issues_count": 0,
"warnings_count": 0
}
},
"pmc_stylechecker": {
"available": true,
"status": "PASS/FAIL",
"xslt_used": "nlm-style-5-0.xsl",
"error_count": 0,
"warning_count": 0,
"errors": [],
"warnings": []
},
"document_structure": {
"dtd_version": "1.4",
"article_type": "research-article",
"table_count": 5,
"figure_count": 3,
"reference_count": 25
},
"pmc_submission_checklist": [
"Validate with PMC Style Checker",
"Ensure all figures have alt text",
"Verify references are properly formatted",
...
]
}
# Run all unit tests
pytest tests/ -v
# Run specific test suite
pytest tests/test_jats_generation.py -v
# Run with coverage report
pytest tests/ --cov=. --cov-report=html
# View coverage report
open htmlcov/index.html
# Validate JATS XML against XSD schema
python -c "
from MasterPipeline import HighFidelityConverter
converter = HighFidelityConverter('document.docx')
converter.convert_to_jats()
converter.validate_jats()
"
# Run PMC Style Checker
cd pmc-stylechecker
xsltproc --path . nlm-style-5-0.xsl ../path/to/articledtd.xml
# Validate HTML with W3C standards (requires external tool)
# Install: npm install -g html-validator-cli
html-validator path/to/article.html
The pipeline generates three main output types:
- JATS XML (XSD-Compliant): article.xml
  - Validates against JATS 1.4 XSD schema
  - No DOCTYPE declaration (optimized for schema validation)
  - Contains xsi:schemaLocation for external validators
  - Used for: Schema-based validation, XSD tools
- JATS XML (PMC-Compliant): articledtd.xml
  - Same content as article.xml
  - Includes DOCTYPE declaration for PMC Style Checker
  - Compatible with DTD-based validators
  - Used for: PMC Style Checker, PMC submission
- HTML with Embedded Media: article.html + media/
  - Semantic, W3C-compliant HTML5 output
  - Images embedded from media/ folder
  - Professional CSS styling applied
  - Responsive design
# 1. Generate all outputs
python -c "
from MasterPipeline import HighFidelityConverter
converter = HighFidelityConverter('document.docx')
converter.run()
"
# 2. Review validation report
cat /tmp/output_files/validation_report.json
# 3. Check XSD validation
# Look for: jats_validation.status = "PASS"
# 4. Check PMC compliance
# Look for: pmc_compliance.status = "PASS" or "WARNING"
# 5. Run PMC Style Checker manually (if needed)
cd pmc-stylechecker
xsltproc --path . nlm-style-5-0.xsl /tmp/output_files/articledtd.xml
# 6. Review outputs
ls -lah /tmp/output_files/
The web interface now supports asynchronous conversions with real-time progress updates:
Features:
- Drag-and-drop file upload with visual feedback
- Real-time progress bar showing conversion status
- Status polling for long-running conversions
- Download link appears when conversion completes
- Error handling with detailed error messages
API Endpoints:
- POST /convert - Upload file, returns HTTP 202 with conversion_id
- GET /status/<conversion_id> - Poll conversion status
- GET /download/<conversion_id> - Download completed package
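For scripts that call these endpoints from Python rather than a browser, the polling step might look like the sketch below. It is a sketch under assumptions: `wait_for_conversion` and `http_status` are hypothetical helpers, the "failed" status value is assumed (the examples in this document only show "queued", "processing", and "completed"), and the status function is injected so the loop can be exercised without a running server.

```python
import json
import time
import urllib.request
from typing import Callable


def http_status(conversion_id: str, base_url: str = "http://localhost:8080") -> dict:
    """Fetch GET /status/<conversion_id> from a running OmniJAX server."""
    with urllib.request.urlopen(f"{base_url}/status/{conversion_id}") as response:
        return json.load(response)


def wait_for_conversion(
    get_status: Callable[[str], dict],
    conversion_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the conversion completes, fails, or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status(conversion_id)
        if status.get("status") == "completed":
            return status
        if status.get("status") == "failed":  # assumed failure value
            raise RuntimeError(status.get("message", "conversion failed"))
        if waited >= timeout:
            raise TimeoutError(f"{conversion_id} not finished after {timeout}s")
        sleep(interval)
        waited += interval


# Offline demonstration with canned responses instead of a live server.
canned = iter([
    {"status": "queued", "progress": 0},
    {"status": "processing", "progress": 40},
    {"status": "completed", "progress": 100},
])
result = wait_for_conversion(
    lambda _id: next(canned), "20260120_103000_abcd1234", sleep=lambda s: None
)
print(result["progress"])  # 100
```

Against a live server, pass `http_status` as the first argument and keep the default `sleep`.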
Example Usage:
// Upload file
const formData = new FormData();
formData.append('file', file);
const response = await fetch('/convert', {
method: 'POST',
body: formData,
headers: {'Accept': 'application/json'}
});
const { conversion_id } = await response.json();
// Poll status
const statusResponse = await fetch(`/status/${conversion_id}`);
const status = await statusResponse.json();
// status includes: status, progress, message, etc.
// Download result when complete
window.location.href = `/download/${conversion_id}`;
Generated JATS XML now includes an xsi:schemaLocation attribute pointing to the public JATS XSD:
<article xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="https://jats.nlm.nih.gov/publishing/1.3/ https://jats.nlm.nih.gov/publishing/1.3/xsd/JATS-journalpublishing1-3.xsd"
         dtd-version="1.4"
         article-type="research-article">
This allows the external PMC Style Checker and other validators to resolve the schema without "DTD not found" errors.
The pipeline now integrates the PMC Style Checker XSLT bundle (nlm-style-5.47):
Setup:
# Download PMC style checker
./tools/fetch_pmc_style.sh
# Ensure xsltproc is installed
sudo apt-get install xsltproc # Ubuntu/Debian
brew install libxslt # macOS
apk add libxslt # Alpine/Docker
Output Files:
- pmc_style_report.html - Detailed style check report with errors and warnings
- validation_report.json - Includes PMC style check results:
{
  "pmc_style_check": {
    "status": "completed",
    "report_file": "pmc_style_report.html",
    "errors_count": 0,
    "warnings_count": 5,
    "summary": "0 errors, 5 warnings"
  }
}
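The report fields mentioned in this document can be checked from a script instead of by eye. The sketch below assumes that `validation_report.json` is a single JSON object whose top-level keys are `jats_validation`, `pmc_compliance`, and `pmc_style_check`, as the key paths in this README suggest; the function names `load_report` and `find_problems` are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_report(path: str) -> Dict:
    """Read validation_report.json from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_problems(report: Dict) -> List[str]:
    """Return human-readable problems; an empty list means the report looks clean."""
    problems = []

    jats = report.get("jats_validation", {}).get("status")
    if jats != "PASS":
        problems.append(f"jats_validation.status is {jats!r}, expected 'PASS'")

    pmc = report.get("pmc_compliance", {}).get("status")
    if pmc not in ("PASS", "WARNING"):
        problems.append(f"pmc_compliance.status is {pmc!r}, expected 'PASS' or 'WARNING'")

    # The style check is optional, so a missing section is not treated as a problem.
    errors = report.get("pmc_style_check", {}).get("errors_count", 0)
    if errors > 0:
        problems.append(f"pmc_style_check reported {errors} error(s)")

    return problems
```

For example, `find_problems(load_report("/tmp/output_files/validation_report.json"))` could gate a submission step in a CI job.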
Defensive Design:
- If xsltproc is not installed, conversion continues with a warning
- If the PMC style checker is not downloaded, conversion continues with a warning
- Pipeline never fails due to missing optional tools
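One way to get this behavior is to probe for each optional dependency before using it and to return a status record instead of raising. The following is a sketch of that pattern, not the pipeline's actual code: the function name `run_style_check` and the "skipped" status value are assumptions, while `xsltproc -o <output> <stylesheet> <input>` is the standard xsltproc invocation.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Dict


def run_style_check(xml_path: str, xslt_path: str, report_path: str,
                    tool: str = "xsltproc") -> Dict:
    """Run the optional style check; never raise because a tool is missing."""
    if shutil.which(tool) is None:
        return {"status": "skipped", "reason": f"{tool} is not installed"}
    if not Path(xslt_path).is_file():
        return {"status": "skipped", "reason": f"stylesheet not found: {xslt_path}"}
    try:
        subprocess.run(
            [tool, "-o", str(report_path), str(xslt_path), str(xml_path)],
            check=True, capture_output=True, text=True, timeout=120,
        )
    except (subprocess.SubprocessError, OSError) as exc:
        # A crashed or timed-out check is downgraded to a warning as well.
        return {"status": "skipped", "reason": f"style check failed: {exc}"}
    return {"status": "completed", "report_file": Path(report_path).name}
```

The caller can log the `reason` as a warning and carry on with the rest of the conversion.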
The current implementation uses an in-memory progress store, suitable for:
- Development environments
- Single-server deployments
- Low to moderate traffic
Limitations:
- Progress state lost on server restart
- Not suitable for multi-instance deployments
- Not suitable for load-balanced environments
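To make these limitations concrete, here is a minimal sketch of what an in-memory progress store can look like. It is an illustration, not the project's implementation; the class name `InMemoryProgressStore` and its fields are assumptions. Because the state lives in one process's dictionary, it disappears on restart and is invisible to any other server instance.

```python
import threading
import time
from typing import Dict, Optional


class InMemoryProgressStore:
    """Thread-safe progress store that lives only inside this process."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: Dict[str, Dict] = {}

    def update(self, conversion_id: str, status: str, progress: int,
               message: str = "") -> None:
        """Record the latest state for one conversion."""
        with self._lock:
            self._jobs[conversion_id] = {
                "status": status,
                "progress": progress,
                "message": message,
                "updated_at": time.time(),
            }

    def get(self, conversion_id: str) -> Optional[Dict]:
        """Return a copy of the stored state, or None for an unknown id."""
        with self._lock:
            job = self._jobs.get(conversion_id)
            return dict(job) if job else None
```

The lock matters because the conversion worker thread writes progress while request handlers for the status endpoint read it.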
For production deployments with multiple instances or load balancing:
Option 1: Redis-based Progress Store
import json
import redis
redis_client = redis.Redis(host='localhost', port=6379, db=0)
# Store progress (conversion_id and progress_data come from the running conversion)
redis_client.setex(
    f"conversion:{conversion_id}",
    3600,  # 1 hour TTL
    json.dumps(progress_data)
)
# Retrieve progress; get() returns None if the key is missing or has expired
progress_json = redis_client.get(f"conversion:{conversion_id}")
progress_data = json.loads(progress_json) if progress_json else None
Option 2: Job Queue System (Celery, RQ, etc.)
from celery import Celery
app = Celery('omnijax', broker='redis://localhost:6379/0')
@app.task(bind=True)
def convert_document(self, docx_path, conversion_id):
    # Update progress via self.update_state()
    self.update_state(state='PROGRESS', meta={'progress': 50})
    # ... conversion logic ...
Option 3: Database-backed Progress Store
# Using SQLAlchemy or similar ORM
class ConversionJob(db.Model):
    id = db.Column(db.String, primary_key=True)
    status = db.Column(db.String)
    progress = db.Column(db.Integer)
    message = db.Column(db.String)
    created_at = db.Column(db.DateTime)
Cloud Run Considerations:
- Use Cloud Tasks or Pub/Sub for background jobs
- Store progress in Cloud Firestore or Cloud SQL
- Use Cloud Storage for output files
- Set appropriate timeouts for long-running conversions
To test the new async UI and PMC style check:
- Start the server:
  python app.py
- Open browser to http://localhost:8080
- Upload a DOCX file:
  - Drag and drop or click to select
  - Watch progress bar update in real-time
  - Download package when complete
- Check output package:
  - pmc_style_report.html - Style check results (if xsltproc available)
  - validation_report.json - Includes pmc_style_check section
  - article.xml - Now includes xsi:schemaLocation for external validators
- Validate with external PMC Style Checker:
  - Upload articledtd.xml to https://pmc.ncbi.nlm.nih.gov/tools/stylechecker/
  - articledtd.xml includes DOCTYPE declaration required by PMC Style Checker
  - Should not see "DTD not found" errors
  - Should validate successfully
Progress bar not updating:
- Check browser console for JavaScript errors
- Verify /status/<conversion_id> endpoint is accessible
- Check server logs for conversion errors
PMC style check not running:
- Verify xsltproc is installed: which xsltproc
- Verify XSLT file exists: ls -l tools/pmc_style/nlm-stylechecker.xsl
- Run ./tools/fetch_pmc_style.sh if missing
- Check server logs for warnings
External validator errors:
- Verify xsi:schemaLocation is in article.xml
- Check that namespace declarations are present
- Validate XML is well-formed: xmllint --noout article.xml
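Where xmllint is not available, the external-validator checks above can be approximated with the Python standard library. This is a hedged sketch: the function name `check_article_xml` is hypothetical, and it only tests well-formedness and the presence of `xsi:schemaLocation` on the root element, not schema validity. An undeclared `xsi` prefix surfaces as a parse error, which covers the namespace-declaration check.

```python
import xml.etree.ElementTree as ET
from typing import List

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def check_article_xml(xml_bytes: bytes) -> List[str]:
    """Return a list of problems found in the XML; empty means both checks passed."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        # Covers malformed markup and prefixes without a namespace declaration.
        return [f"not well-formed: {exc}"]

    problems = []
    # ElementTree stores namespaced attributes under the {namespace}name key.
    if f"{{{XSI_NS}}}schemaLocation" not in root.attrib:
        problems.append("xsi:schemaLocation is missing on the root element")
    return problems
```

Typical use would be `check_article_xml(open("article.xml", "rb").read())`.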
The tools/add_doctype.py utility script can be used to add DOCTYPE declarations to existing JATS XML files:
# Add DOCTYPE to article.xml and save as articledtd.xml (JATS 1.4)
python tools/add_doctype.py article.xml
# Specify custom output path
python tools/add_doctype.py article.xml -o output/article_with_dtd.xml
# Specify JATS version 1.3
python tools/add_doctype.py article.xml -v 1.3
# Full example with all options
python tools/add_doctype.py input/article.xml --output output/articledtd.xml --version 1.4
When to use:
- When you need to validate an existing XML file with PMC Style Checker
- When you have article.xml without DOCTYPE and need to add it
- When you need a specific JATS version DOCTYPE (supports 1.0-1.4)
Note: The MasterPipeline automatically generates both article.xml (without DOCTYPE) and articledtd.xml (with DOCTYPE) during conversion, so you typically don't need to run this script manually.
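For readers curious what adding a DOCTYPE involves, the sketch below shows the general idea for a single version. It is not the source of tools/add_doctype.py, whose internals are not documented here; the function name `add_doctype` is an assumption. The declaration shown is the public identifier of the JATS 1.3 Journal Publishing DTD, and other versions use different identifiers that this sketch deliberately omits.

```python
import re

# Public and system identifiers for the JATS 1.3 Journal Publishing DTD.
JATS_1_3_DOCTYPE = (
    '<!DOCTYPE article PUBLIC '
    '"-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" '
    '"JATS-journalpublishing1-3.dtd">'
)

# Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>")


def add_doctype(xml_text: str, doctype: str = JATS_1_3_DOCTYPE) -> str:
    """Insert the DOCTYPE after the XML declaration, or at the top if there is none."""
    if "<!DOCTYPE" in xml_text:
        return xml_text  # already has one; leave the document untouched
    match = XML_DECL.match(xml_text)
    if match:
        head = match.group(0).strip()
        body = xml_text[match.end():].lstrip()
        return f"{head}\n{doctype}\n{body}"
    return f"{doctype}\n{xml_text.lstrip()}"
```

The DOCTYPE has to come after the XML declaration and before the root element, which is why the declaration is detected first.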
- JATS Official Site: https://jats.nlm.nih.gov/
- JATS 1.4 Publishing DTD: https://public.nlm.nih.gov/projects/jats/publishing/1.4/
- PMC Tagging Guidelines: https://pmc.ncbi.nlm.nih.gov/tagging-guidelines/article/style/
- PMC Style Checker: https://pmc.ncbi.nlm.nih.gov/tools/stylechecker/
- NLM PMC: https://pmc.ncbi.nlm.nih.gov/
This project is actively maintained and improved. Recent updates include:
- Documentation Enhancement: Removed legacy PDF generation references to better reflect current capabilities
- Validation Reports: Added comprehensive validation report files to output packages for better transparency
- UI Improvements: Enhanced table formatting with zebra striping for improved readability
- User Experience: Streamlined documentation to focus on current features
Future enhancements may include additional output formats, enhanced validation capabilities, and improved AI-powered content repair features.
Proprietary - OmniJAX Professional JATS Converter
Document Version: 2.0 - Improved for accessibility
Last Updated: January 2024
Target Audience: Non-technical and technical users