get_many returns shared ContentStream for documents with identical content #124

@absoludity

Bug

MinioDocumentRepository.get_many() fetches the content for each unique content hash in one batch (step 3), then pairs each document's metadata with its content stream (step 4). When two or more documents share the same content_multihash, they are all assigned the same ContentStream object from content_results.

The first consumer that calls .read() on the stream exhausts it. All subsequent consumers get b"".
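
The exhaustion behaviour can be shown in isolation. This is a minimal sketch that uses io.BytesIO as a stand-in for ContentStream (an assumption; the real class may differ) and made-up payload bytes:

```python
import io

# Stand-in for the single ContentStream fetched for one unique content hash.
shared = io.BytesIO(b"example pdf bytes")

# Two documents with the same content_multihash hold the same object.
doc_a_stream = shared
doc_b_stream = shared

first = doc_a_stream.read()   # reads to EOF and leaves the position there
second = doc_b_stream.read()  # same object, already at EOF

print(first)   # b'example pdf bytes'
print(second)  # b''
```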

Reproduction

Upload the same file twice (producing two documents with identical content). Call get_many with both document IDs. The second document will have empty content_bytes after conversion.

This was observed in RBA when two spec sheet uploads of the same PDF produced two products with identical content_multihash. The first product listed had correct content; the second had content_bytes: "" and all parsed fields (id, name, description) as null.

Root cause

In get_many (around steps 3 and 4):

# Step 3: one ContentStream per unique hash
content_results = self.get_many_binary_objects(...)

# Step 4: multiple documents can share the same stream object
content_stream = content_results[content_multihash]  # shared reference

Fix

Each document should receive its own independent stream. Options:

  • Read the binary data once per hash, then wrap in a fresh BytesIO/ContentStream for each document that references it
  • Or seek to 0 before returning each stream (works only if the underlying stream is seekable, and the documents would still share a single read position)
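
A minimal sketch of the first option, assuming the streams in content_results expose a plain .read(). The helper name attach_independent_streams and the (document_id, content_multihash) pair shape are hypothetical, not part of the repository's API:

```python
import io
from typing import BinaryIO, Dict, List, Tuple


def attach_independent_streams(
    doc_hashes: List[Tuple[str, str]],     # hypothetical: (document_id, content_multihash)
    content_results: Dict[str, BinaryIO],  # one stream per unique hash, as in step 3
) -> Dict[str, io.BytesIO]:
    # Read each unique payload exactly once.
    payloads = {h: stream.read() for h, stream in content_results.items()}
    # Wrap the bytes in a fresh stream per document so read positions are independent.
    return {doc_id: io.BytesIO(payloads[h]) for doc_id, h in doc_hashes}


# Usage: two documents that reference the same content hash.
content_results = {"hash-1": io.BytesIO(b"same pdf bytes")}
streams = attach_independent_streams(
    [("doc-a", "hash-1"), ("doc-b", "hash-1")], content_results
)
bytes_a = streams["doc-a"].read()
bytes_b = streams["doc-b"].read()
```

This buffers each unique payload fully in memory, which is probably acceptable for spec-sheet-sized files but would need rethinking for very large objects.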
