
Review the serialization of tables in triplet format #528

@ceberam


The TripletTableSerializer aims to serialize tables to a text format that can be understood by LLMs for downstream applications like RAG.
The implementation connects a column, a row, and the content of the cell in the given column and row to create sentences of the style A, B = C.
The implementation needs to be reviewed, since it makes assumptions about the role of the first column and the first row when building the triplet representation of the table cells.
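
To make the assumption concrete, here is a minimal sketch of a triplet serialization. It is not the actual TripletTableSerializer code: the function name `serialize_triplets`, its `first_row_is_header` parameter, and the plain list-of-lists input are hypothetical and used only for illustration. The sketch treats the first column as row labels and, when there is no header row, falls back to the column index as the column label.

```python
from typing import List


def serialize_triplets(grid: List[List[str]], first_row_is_header: bool = True) -> str:
    """Hypothetical sketch: emit 'row label, column label = cell' sentences.

    Assumes column 0 always holds the row labels. If first_row_is_header is
    False, column labels fall back to the 1-based column index.
    """
    if not grid:
        return ""
    if first_row_is_header:
        col_labels = grid[0][1:]
        body = grid[1:]
    else:
        col_labels = [str(i) for i in range(1, len(grid[0]))]
        body = grid

    parts = []
    for row in body:
        row_label = row[0]  # first column is assumed to label the row
        for col_label, cell in zip(col_labels, row[1:]):
            parts.append(f"{row_label}, {col_label} = {cell}")
    return ". ".join(parts)


# A table that fits the assumption reads naturally:
good = [["", "2023", "2024"], ["revenue", "10", "12"], ["cost", "7", "8"]]
print(serialize_triplets(good))
# revenue, 2023 = 10. revenue, 2024 = 12. cost, 2023 = 7. cost, 2024 = 8

# A table whose first column is ordinary content does not:
plain = [["cell 0,0", "cell 0,1"], ["cell 1,0", "cell 1,1"]]
print(serialize_triplets(plain, first_row_is_header=False))
# cell 0,0, 1 = cell 0,1. cell 1,0, 1 = cell 1,1
```

In the second call the first column is not a label column, so the resulting sentences carry little meaning.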

For instance, the inner table in test/data/doc/rich_table.gt.html will be serialized as follows (refer to the ground truth file 0c_out_chunks.json):

cell 0,0, 1 = cell 0,1. cell 1,0, 1 = <em><p>text in italic</p></em>. <ul>\n<li>list item 1</li>\n<li>list item 2</li>\n</ul>, 1 = cell 2,1. cell 3,0, 1 = inner cell 0,0, 1 = inner cell 0,1. inner cell 0,0, 2 = inner cell 0,2. inner cell 1,0, 1 = inner cell 1,1. inner cell 1,0, 2 = inner cell 1,2. <p>Some text in a generic group.</p>\n<p>More text in the group.</p>, 1 = cell 4,1


Labels: bug, good first issue
