
feat: render LaTeX for LLMs #121

Merged
Moskize91 merged 3 commits into main from expr on Feb 4, 2026

Conversation

@Moskize91
Contributor

No description provided.

@coderabbitai

coderabbitai bot commented Feb 4, 2026

Caution

Review failed

The head commit changed during the review from 7b2c557 to 0d5d77d.

Walkthrough

This pull request removes the XSLT-based MathML→LaTeX conversion assets under epub_translator/data/mmltex (multiple XSL files and a README), deletes the utility function unwrap_parents from epub_translator/segment/utils.py, adds mathml2latex as a dependency in pyproject.toml, and updates epub_translator/translation/xml_interrupter.py to render buffered MathML as LaTeX via a new _render_latex helper that uses mathml2latex and BeautifulSoup.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • oomol-lab/epub-translator PR 54: Also modifies MathML→LaTeX conversion support and touches the same mmltex XSLT assets.
  • oomol-lab/epub-translator PR 120: Also changes epub_translator/translation/xml_interrupter.py and buffered math/text merging behavior.
🚥 Pre-merge checks | ✅ 1 | ❌ 1
❌ Failed checks (1 inconclusive)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ❓ Inconclusive | No description was provided by the author, making it impossible to assess whether the description relates to the changeset. | Add a pull request description explaining the motivation for rendering LaTeX, implementation details, and any related context about MathML removal. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title follows the required format `<type>(<scope>): <subject>` with 'feat' as type and 'render LaTeX for LLMs' as subject, and accurately describes the main change (adding LaTeX rendering capability). |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@epub_translator/translation/xml_interrupter.py`:
- Around line 153-154: The MathML string mathml_str is pure XML and is being
parsed with BeautifulSoup using the HTML parser; update the BeautifulSoup call
that creates soup so it uses an XML parser (e.g., "xml" or "lxml-xml") instead
of "html.parser" to preserve namespaces and proper XML handling before passing
soup into process_mathml.
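
The namespace concern can be illustrated with the standard library alone. This is a minimal sketch, assuming the math element comes from `xml.etree.ElementTree` (the thread does not show which XML library the project uses). Python's `html.parser.HTMLParser` is the parser that BeautifulSoup's `"html.parser"` option builds on, so it is used here directly. `TagCollector` is a hypothetical helper, and the thread does not confirm whether mathml2latex is sensitive to the difference.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
source = f'<math xmlns="{MATHML_NS}"><mi>x</mi></math>'

# Assumption: the element is an ElementTree element; the PR does not show this.
math_element = ET.fromstring(source)
# ElementTree serializes namespaced elements with a generated prefix.
mathml_str = ET.tostring(math_element, encoding="unicode")


class TagCollector(HTMLParser):
    """Hypothetical helper: records start-tag names as an HTML parser sees them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(mathml_str)
collector.close()

# The HTML parser keeps the prefix as part of the tag name.
html_tags = collector.tags
# An XML parser resolves the prefix back to the MathML namespace.
xml_tag = ET.fromstring(mathml_str).tag

print(html_tags)
print(xml_tag)
```

An HTML parser has no notion of namespaces, so a prefixed name is just an opaque tag name to it, while an XML parser maps the prefix to the namespace URI. Code that looks up elements by the bare name `math` or `mi` would behave differently in the two cases.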
🧹 Nitpick comments (2)
pyproject.toml (1)

36-36: Inconsistent dependency format with parentheses.

The other dependencies in this file use the standard format without parentheses (e.g., "tiktoken>=0.12.0,<1.0.0"). For consistency, consider removing the parentheses and spaces.

♻️ Suggested fix
-    "mathml2latex (>=0.2.12,<0.3.0)",
+    "mathml2latex>=0.2.12,<0.3.0",
epub_translator/translation/xml_interrupter.py (1)

151-156: Consider logging the exception for debugging purposes.

Silently swallowing exceptions with except Exception: pass makes it difficult to diagnose issues when LaTeX conversion fails. While the fallback to raw text is reasonable, logging would help identify problematic MathML inputs.

🔧 Suggested improvement
+import logging
+
+logger = logging.getLogger(__name__)
+
 # ... in _render_latex method:
         try:
             mathml_str = tostring(math_element, encoding="unicode")
             soup = BeautifulSoup(mathml_str, "html.parser")
             latex = process_mathml(soup)
-        except Exception:
-            pass
+        except Exception as e:
+            logger.debug("Failed to convert MathML to LaTeX: %s", e)
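
The log-and-fall-back pattern suggested above can be sketched in isolation. This is not the project's `_render_latex`: `render_or_fallback` and `broken_converter` are hypothetical names, and the converter is passed in as a plain callable so the example runs without mathml2latex installed.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def render_or_fallback(
    mathml: str,
    convert: Callable[[str], str],
    fallback_text: str,
) -> str:
    """Return the converted LaTeX, or fallback_text if the converter raises."""
    latex: Optional[str] = None
    try:
        latex = convert(mathml)
    except Exception:
        # exc_info=True attaches the traceback, so the failing input can be
        # diagnosed when debug logging is enabled.
        logger.debug("MathML to LaTeX conversion failed", exc_info=True)
    return fallback_text if latex is None else latex


def broken_converter(_: str) -> str:
    """Hypothetical converter that always fails, to exercise the fallback."""
    raise ValueError("unsupported element")


result_fail = render_or_fallback("<math/>", broken_converter, "x + y")
result_ok = render_or_fallback("<math/>", lambda s: r"\frac{a}{b}", "a/b")
```

Logging at debug level keeps normal runs quiet while still leaving a trace of which inputs failed. Whether debug or warning is the right level depends on how often malformed MathML is expected.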


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@epub_translator/translation/xml_interrupter.py`:
- Around line 160-167: The inline/block detection currently treats a missing
display attribute as block because it checks only for display == "inline";
update the logic in the function handling math elements (the code that uses
math_element.get("display", None), latex and text_segments) to explicitly treat
"block" as the only block case — i.e., if math_element.get("display") == "block"
use block delimiters ($$...$$), otherwise use inline delimiters ($...$) so that
omitted display defaults to inline per the MathML spec.
🧹 Nitpick comments (1)
epub_translator/translation/xml_interrupter.py (1)

153-158: Avoid silently swallowing exceptions.

The bare except Exception: pass makes debugging difficult when conversion fails. At minimum, log the exception to aid troubleshooting.

🛠️ Suggested improvement
+import logging
+
+_logger = logging.getLogger(__name__)
+
 # ... in _render_latex method:
         try:
             mathml_str = tostring(math_element, encoding="unicode")
             soup = BeautifulSoup(mathml_str, "html.parser")
             latex = process_mathml(soup)
-        except Exception:
-            pass
+        except Exception as e:
+            _logger.debug("MathML to LaTeX conversion failed: %s", e)

Comment on lines +160 to +167
        if latex is None:
            latex = "".join(t.text for t in text_segments)
        elif math_element.get("display", None) == "inline":
            latex = f"${latex}$"
        else:
            latex = f"$${latex}$$"

        return f" {latex} "

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

MathML math element display attribute default value block inline

💡 Result:

For the MathML <math> element, the display attribute accepts inline or block, and if the attribute is omitted the default is inline. [1][7]


🏁 Script executed:

cat -n epub_translator/translation/xml_interrupter.py | sed -n '155,175p'

Repository: oomol-lab/epub-translator

Length of output: 898


🏁 Script executed:

# Search for other MathML handling in the codebase to understand the full context
rg -n "display" epub_translator/translation/xml_interrupter.py

Repository: oomol-lab/epub-translator

Length of output: 336


🏁 Script executed:

# Check if there are test cases or other context about how math elements are processed
rg -n "math_element" epub_translator/translation/xml_interrupter.py | head -20

Repository: oomol-lab/epub-translator

Length of output: 559


Correct the inline math detection logic.

The code incorrectly treats missing display attributes as block math. Per the MathML specification, when the display attribute is omitted, it defaults to inline. The current logic checks display == "inline" explicitly; when the attribute is absent, get() returns None, which fails the condition and incorrectly wraps the math in $$...$$ (block delimiters) instead of $...$ (inline delimiters).

Invert the logic to check for "block" explicitly:

Suggested fix
-        elif math_element.get("display", None) == "inline":
-            latex = f"${latex}$"
-        else:
+        elif math_element.get("display", None) == "block":
             latex = f"$${latex}$$"
+        else:
+            latex = f"${latex}$"
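
The delimiter rule in the suggested fix can be checked on its own. This is a minimal sketch with a hypothetical helper, `wrap_latex`, that takes the value of the `display` attribute (or `None` when it is omitted) and is not part of the project's code.

```python
from typing import Optional


def wrap_latex(latex: str, display: Optional[str]) -> str:
    """Wrap LaTeX in $$...$$ only for display="block"; otherwise use $...$."""
    if display == "block":
        return f"$${latex}$$"
    # Covers display="inline" and a missing attribute, which defaults to inline.
    return f"${latex}$"


omitted = wrap_latex("x^2", None)
inline = wrap_latex("x^2", "inline")
block = wrap_latex("x^2", "block")
```

Testing for `"block"` instead of `"inline"` means any unexpected or missing value falls through to inline delimiters, which matches the MathML default.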

@Moskize91 Moskize91 merged commit d58cc96 into main Feb 4, 2026
3 checks passed
@Moskize91 Moskize91 deleted the expr branch February 4, 2026 10:59