Open
Labels
bug (Something isn't working), good first issue (Good for newcomers)
Description
The TripletTableSerializer aims to serialize tables into a text format that LLMs can understand, for downstream applications such as RAG.
The implementation combines a column, a row, and the content of the cell at that column and row into sentences of the form A, B = C.
The implementation needs to be reviewed, because it makes assumptions about the role of the first column and the first row when building the triplet representation of the table cells.
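To make the concern concrete, here is a minimal sketch of a triplet-style serialization. It is not the actual TripletTableSerializer code: the function name `triplet_serialize` and the plain list-of-lists input are hypothetical, and the sketch simply assumes that the first row holds column headers and the first column holds row headers.

```python
def triplet_serialize(grid: list[list[str]]) -> str:
    """Hypothetical sketch: build "row, column = cell" sentences from a grid.

    Assumption (the one questioned in this issue): the first row contains
    column headers and the first column contains row headers.
    """
    # Need at least one header row plus one data row, and two columns.
    if len(grid) < 2 or len(grid[0]) < 2:
        return ""

    col_headers = [cell.strip() for cell in grid[0]]
    parts = []
    for row in grid[1:]:
        row_header = row[0].strip()
        # Skip column 0, which is treated as the row header.
        for j in range(1, min(len(row), len(col_headers))):
            parts.append(f"{row_header}, {col_headers[j]} = {row[j].strip()}")
    return ". ".join(parts)


# A table with real headers serializes sensibly.
with_headers = [
    ["", "2023", "2024"],
    ["Revenue", "10", "12"],
    ["Cost", "7", "8"],
]
print(triplet_serialize(with_headers))

# A table without headers loses its first row to the header assumption.
without_headers = [
    ["cell 0,0", "cell 0,1"],
    ["cell 1,0", "cell 1,1"],
]
print(triplet_serialize(without_headers))
```

In the second call, the first row is consumed as a header and "cell 0,0" never appears in the output, which is the kind of side effect the header assumption can cause on tables that have no header row or header column.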
For instance, the inner table in test/data/doc/rich_table.gt.html:
will be serialized as follows (refer to the ground truth file 0c_out_chunks.json):
cell 0,0, 1 = cell 0,1. cell 1,0, 1 = <em><p>text in italic</p></em>. <ul>\n<li>list item 1</li>\n<li>list item 2</li>\n</ul>, 1 = cell 2,1. cell 3,0, 1 = inner cell 0,0, 1 = inner cell 0,1. inner cell 0,0, 2 = inner cell 0,2. inner cell 1,0, 1 = inner cell 1,1. inner cell 1,0, 2 = inner cell 1,2. <p>Some text in a generic group.</p>\n<p>More text in the group.</p>, 1 = cell 4,1