Bug
MinioDocumentRepository.get_many() fetches unique content hashes in batch (step 3), then splices each document's metadata with its content stream (step 4). When two or more documents share the same content_multihash, they are assigned the same ContentStream object from content_results.
The first consumer that calls .read() on the stream exhausts it. All subsequent consumers get b"".
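The exhaustion behaviour can be reproduced in isolation. The following is a minimal sketch, assuming ContentStream behaves like a standard file-like object; io.BytesIO stands in for it, and the names content_results, doc_hashes, and streams are illustrative only, not the repository's actual code.

```python
import io

# Hypothetical stand-in for content_results: one stream per unique hash.
content_results = {"hash-a": io.BytesIO(b"%PDF-1.7 example bytes")}

# Two documents that share a content_multihash receive the same stream object.
doc_hashes = {"doc-1": "hash-a", "doc-2": "hash-a"}
streams = {doc_id: content_results[h] for doc_id, h in doc_hashes.items()}

# The first read consumes the stream; the second starts at end-of-stream.
first = streams["doc-1"].read()
second = streams["doc-2"].read()
```

Both dictionary entries point at one object with one read position, so whichever document is converted first gets all of the data.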
Reproduction
Upload the same file twice (producing two documents with identical content). Call get_many with both document IDs. The second document will have empty content_bytes after conversion.
This was observed in RBA when two spec sheet uploads of the same PDF produced two products with identical content_multihash. The first product listed had correct content; the second had content_bytes: "" and all parsed fields (id, name, description) as null.
Root cause
In get_many (around steps 3-4):
# Step 3: one ContentStream per unique hash
content_results = self.get_many_binary_objects(...)
# Step 4: multiple documents can share the same stream object
content_stream = content_results[content_multihash] # shared reference
Fix
Each document should receive its own independent stream. Options:
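- Read the binary data once per hash, then wrap it in a fresh BytesIO/ContentStream for each document that references it
- Or seek to 0 before returning each stream (works only if the underlying stream is seekable and the consumers read one after another, since the object and its read position are still shared)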
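The first option might look roughly like the sketch below. It assumes the streams are file-like and that buffering each unique payload in memory is acceptable. The helper name independent_streams and its parameters are hypothetical, and io.BytesIO is used in place of ContentStream, whose constructor is not shown here.

```python
import io
from typing import BinaryIO, Dict, Mapping


def independent_streams(
    doc_hashes: Mapping[str, str],
    content_results: Mapping[str, BinaryIO],
) -> Dict[str, io.BytesIO]:
    """Return one independent stream per document (hypothetical helper)."""
    # Read each unique hash's bytes exactly once.
    payloads = {h: stream.read() for h, stream in content_results.items()}
    # Wrap the bytes in a fresh BytesIO per document, each with its own position.
    return {doc_id: io.BytesIO(payloads[h]) for doc_id, h in doc_hashes.items()}
```

This keeps the batch fetch at one read per unique hash, but holds every payload fully in memory. For large objects, re-fetching per document or a seekable temporary file may be a better fit.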