Possibility on application to LaTeX-OCR (image of arbitrary size) #12

@alephpi

Hi, thank you for this amazing work!

I think code generation based on a syntax tree is more natural than left-to-right linear generation. Do you think it's possible to apply your ideas to mathematical expression recognition, i.e. LaTeX-OCR?

There are already several solutions out there, but the pipeline is basically to first use a vision encoder to get vision tokens for the image and then feed them into a VLM decoder for typical autoregressive text/code generation. This uses no syntactic information, so the output may contain, for example, unpaired curly brackets.
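To make the bracket problem concrete, here is a minimal sketch of a post-hoc check that flags unpaired curly brackets in a decoded string. The function name `braces_balanced` is my own illustration and not part of any existing LaTeX-OCR tool; it only counts unescaped `{` and `}` and ignores every other aspect of LaTeX syntax.

```python
def braces_balanced(latex: str) -> bool:
    """Return True if every unescaped '{' has a matching '}' (hypothetical helper)."""
    depth = 0
    i = 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\":
            # Skip the backslash and the character it escapes,
            # so "\{" and "\}" are not counted as grouping braces.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                # A closing brace appeared before its opening brace.
                return False
        i += 1
    return depth == 0
```

A linear decoder can only be checked like this after the fact, whereas generation over a syntax tree would presumably rule such outputs out by construction.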

So I wonder whether you have thought about such an application. From my perspective, there may be one major difficulty: the CSG2D program in the paper produces an image of fairly regular size (a square), so resizing is not much of a concern, while a LaTeX-rendered image may be arbitrarily long, which may make it hard for the value network to estimate the program edit distance in a consistent way.
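As a rough illustration of the concern, the sketch below computes how much a rendered formula shrinks when it is fitted into a fixed square canvas while keeping its aspect ratio. The helper `fit_to_square` and the canvas side of 128 pixels are assumptions of mine for the example, not values taken from the paper or the code.

```python
from typing import Tuple


def fit_to_square(width: int, height: int, side: int) -> Tuple[float, int, int]:
    """Scale a width x height image to fit inside a side x side canvas.

    Returns (scale, new_width, new_height); the aspect ratio is preserved.
    """
    # The tighter of the two dimensions decides the scale factor.
    scale = min(side / width, side / height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return scale, new_w, new_h


# A long one-line formula, e.g. 1200 x 60 pixels, on a hypothetical 128 x 128 canvas.
scale, new_w, new_h = fit_to_square(1200, 60, 128)
print(new_w, new_h)  # the formula is only a few pixels tall after scaling
```

With these example numbers the formula ends up about 6 pixels tall, so the same edit in the program changes far fewer pixels than it would in a short formula.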

Have you explored the influence of such resizing issues?
