Description
Version
26.1.2
Which installation method(s) does this occur on?
No response
Describe the bug.
For structured content (tables and charts), the server writes the source_location string to the parquet text field instead of the actual extracted content. The source_location is a placeholder of the form [CHART] source=... page=... or [TABLE] source=... page=....
This appears to be intentional in the nv-ingest server's StoreEmbedTask: when doc_type is STRUCTURED, content_replace is True, so text is set to location (the source_location string) rather than content. The actual table/chart text (extracted from table_metadata.table_content) is still available in metadata["content"], but it never reaches the parquet output.
[Code]:
nv-ingest server's embed_text_upload.py:
    writer.append_row(
        {
            "text": location if content_replace else content,
            "source": metadata["source_metadata"],
            "content_metadata": metadata["content_metadata"],
            "vector": metadata["embedding"],
        }
    )
Where:
    content_replace: bool = doc_type in [ContentTypeEnum.IMAGE, ContentTypeEnum.STRUCTURED]
    location: str = metadata["source_metadata"]["source_location"]
    content = metadata["content"]
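To make the effect concrete, here is a minimal standalone sketch of the selection logic quoted above. It does not import nv-ingest: ContentTypeEnum is a stand-in with only the members needed, the sample table text is made up, and proposed_text_field is just one possible change (keeping the placeholder for IMAGE only), not a confirmed fix.

```python
from enum import Enum


class ContentTypeEnum(Enum):
    # Stand-in for the nv-ingest enum (assumption); only the members used here.
    TEXT = "text"
    IMAGE = "image"
    STRUCTURED = "structured"


def current_text_field(doc_type, metadata):
    """Mirror the selection logic quoted from embed_text_upload.py."""
    content_replace = doc_type in [ContentTypeEnum.IMAGE, ContentTypeEnum.STRUCTURED]
    location = metadata["source_metadata"]["source_location"]
    content = metadata["content"]
    return location if content_replace else content


def proposed_text_field(doc_type, metadata):
    """Hypothetical alternative: only IMAGE rows fall back to the location."""
    content_replace = doc_type == ContentTypeEnum.IMAGE
    location = metadata["source_metadata"]["source_location"]
    content = metadata["content"]
    return location if content_replace else content


# Made-up sample row for a table; the content string is illustrative only.
sample = {
    "content": "col_a | col_b",
    "source_metadata": {"source_location": "[TABLE] source=... page=..."},
}

# Current behavior: the placeholder ends up in the text field.
print(current_text_field(ContentTypeEnum.STRUCTURED, sample))
# Proposed behavior: the extracted table text ends up in the text field.
print(proposed_text_field(ContentTypeEnum.STRUCTURED, sample))
```

If the placeholder is needed downstream for structured rows, another option would be to keep it in a separate field rather than overwriting text.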
Minimum reproducible example
Upload any document with tables/charts using the V2 APIs.
Relevant log output
Other/Misc.
No response