Conversation
WalkthroughThis pull request removes the XSLT-based MathML→LaTeX conversion assets under epub_translator/data/mmltex (multiple XSL files and a README), deletes the utility function unwrap_parents from epub_translator/segment/utils.py, adds mathml2latex as a dependency in pyproject.toml, and updates epub_translator/translation/xml_interrupter.py to render buffered MathML as LaTeX via a new _render_latex helper that uses mathml2latex and BeautifulSoup. Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 1 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@epub_translator/translation/xml_interrupter.py`:
- Around line 153-154: The MathML string mathml_str is pure XML and is being
parsed with BeautifulSoup using the HTML parser; update the BeautifulSoup call
that creates soup so it uses an XML parser (e.g., "xml" or "lxml-xml") instead
of "html.parser" to preserve namespaces and proper XML handling before passing
soup into process_mathml.
🧹 Nitpick comments (2)
pyproject.toml (1)
36-36: Inconsistent dependency format with parentheses.The other dependencies in this file use the standard format without parentheses (e.g.,
"tiktoken>=0.12.0,<1.0.0"). For consistency, consider removing the parentheses and spaces.♻️ Suggested fix
- "mathml2latex (>=0.2.12,<0.3.0)", + "mathml2latex>=0.2.12,<0.3.0",epub_translator/translation/xml_interrupter.py (1)
151-156: Consider logging the exception for debugging purposes.Silently swallowing exceptions with
except Exception: passmakes it difficult to diagnose issues when LaTeX conversion fails. While the fallback to raw text is reasonable, logging would help identify problematic MathML inputs.🔧 Suggested improvement
+import logging + +logger = logging.getLogger(__name__) + # ... in _render_latex method: try: mathml_str = tostring(math_element, encoding="unicode") soup = BeautifulSoup(mathml_str, "html.parser") latex = process_mathml(soup) - except Exception: - pass + except Exception as e: + logger.debug("Failed to convert MathML to LaTeX: %s", e)
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@epub_translator/translation/xml_interrupter.py`:
- Around line 160-167: The inline/block detection currently treats a missing
display attribute as block because it checks only for display == "inline";
update the logic in the function handling math elements (the code that uses
math_element.get("display", None), latex and text_segments) to explicitly treat
"block" as the only block case — i.e., if math_element.get("display") == "block"
use block delimiters ($$...$$), otherwise use inline delimiters ($...$) so that
omitted display defaults to inline per the MathML spec.
🧹 Nitpick comments (1)
epub_translator/translation/xml_interrupter.py (1)
153-158: Avoid silently swallowing exceptions.The bare
except Exception: passmakes debugging difficult when conversion fails. At minimum, log the exception to aid troubleshooting.🛠️ Suggested improvement
+import logging + +_logger = logging.getLogger(__name__) + # ... in _render_latex method: try: mathml_str = tostring(math_element, encoding="unicode") soup = BeautifulSoup(mathml_str, "html.parser") latex = process_mathml(soup) - except Exception: - pass + except Exception as e: + _logger.debug("MathML to LaTeX conversion failed: %s", e)
| if latex is None: | ||
| latex = "".join(t.text for t in text_segments) | ||
| elif math_element.get("display", None) == "inline": | ||
| latex = f"${latex}$" | ||
| else: | ||
| latex = f"$${latex}$$" | ||
|
|
||
| return f" {latex} " |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
MathML math element display attribute default value block inline
💡 Result:
For the MathML <math> element, the display attribute accepts inline or block, and if the attribute is omitted the default is inline. [1][7]
🏁 Script executed:
cat -n epub_translator/translation/xml_interrupter.py | sed -n '155,175p'Repository: oomol-lab/epub-translator
Length of output: 898
🏁 Script executed:
# Search for other MathML handling in the codebase to understand the full context
rg -n "display" epub_translator/translation/xml_interrupter.pyRepository: oomol-lab/epub-translator
Length of output: 336
🏁 Script executed:
# Check if there are test cases or other context about how math elements are processed
rg -n "math_element" epub_translator/translation/xml_interrupter.py | head -20Repository: oomol-lab/epub-translator
Length of output: 559
Correct the inline math detection logic.
The code incorrectly treats missing display attributes as block math. Per the MathML specification, when the display attribute is omitted, it defaults to inline. The current logic checks display == "inline" explicitly; when the attribute is absent, get() returns None, which fails the condition and incorrectly wraps the math in $$...$$ (block delimiters) instead of $...$ (inline delimiters).
Invert the logic to check for "block" explicitly:
Suggested fix
- elif math_element.get("display", None) == "inline":
- latex = f"${latex}$"
- else:
+ elif math_element.get("display", None) == "block":
latex = f"$${latex}$$"
+ else:
+ latex = f"${latex}$"🤖 Prompt for AI Agents
In `@epub_translator/translation/xml_interrupter.py` around lines 160 - 167, The
inline/block detection currently treats a missing display attribute as block
because it checks only for display == "inline"; update the logic in the function
handling math elements (the code that uses math_element.get("display", None),
latex and text_segments) to explicitly treat "block" as the only block case —
i.e., if math_element.get("display") == "block" use block delimiters ($$...$$),
otherwise use inline delimiters ($...$) so that omitted display defaults to
inline per the MathML spec.
No description provided.