Context
We are using MedGemma 1.5 on a medical imaging task where each sample consists of 5 correlated images from the same patient (breast thermography: frontal, left/right oblique, left/right lateral).
Current MedGemma documentation and examples focus on single-image inputs, so we would like to confirm the recommended adaptation strategy for this multi-view setting.
Our Understanding of the Ideal Approach
For a fixed multi-view medical imaging problem (5 views per case, ~3,000 cases), the most appropriate approach appears to be:
Late Fusion (Feature-Level Fusion)
- Encode each view independently using the MedGemma (or MedSigLIP) image encoder with shared weights
- Fuse per-view embeddings using concatenation, attention, or a small transformer
- Train a lightweight task-specific head on top of the fused representation
This preserves per-view semantics, scales well, and aligns with standard practice in multi-view medical imaging literature.
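To make the pattern above concrete, here is a minimal PyTorch sketch of what we have in mind. It is not taken from MedGemma documentation. `MultiViewFusionHead` and `encode_views` are hypothetical names, the small linear `stand_in_encoder` only stands in for the frozen MedGemma/MedSigLIP image encoder (assumed here to return one pooled embedding per image), and the embedding size of 64 is arbitrary. Only the binary classification output is covered, not localization.

```python
import torch
import torch.nn as nn


class MultiViewFusionHead(nn.Module):
    """Fuses per-view embeddings with a small transformer, then classifies.

    Hypothetical module for illustration; not part of MedGemma or MedSigLIP.
    """

    def __init__(self, embed_dim: int, num_views: int = 5, num_heads: int = 4):
        super().__init__()
        # Learned offset per view slot (frontal, obliques, laterals) so the
        # fuser can tell the views apart.
        self.view_embed = nn.Parameter(torch.zeros(num_views, embed_dim))
        # Learned summary token whose output is fed to the classifier.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=2 * embed_dim,
            batch_first=True,
        )
        self.fuser = nn.TransformerEncoder(layer, num_layers=1)
        self.classifier = nn.Linear(embed_dim, 1)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, embed_dim)
        x = view_feats + self.view_embed
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = self.fuser(torch.cat([cls, x], dim=1))
        return self.classifier(x[:, 0]).squeeze(-1)  # (batch,) logits


def encode_views(encoder: nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Apply one shared, frozen encoder to every view.

    images: (batch, num_views, C, H, W) -> (batch, num_views, embed_dim)
    """
    b, v = images.shape[:2]
    with torch.no_grad():  # encoder stays frozen; only the head is trained
        feats = encoder(images.flatten(0, 1))
    return feats.reshape(b, v, -1)


# Stand-in for the real image encoder (assumption: one vector per image).
torch.manual_seed(0)
stand_in_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64)).eval()

images = torch.randn(2, 5, 3, 32, 32)  # 2 cases, 5 views each
feats = encode_views(stand_in_encoder, images)
head = MultiViewFusionHead(embed_dim=64)
logits = head(feats)  # train with nn.BCEWithLogitsLoss against case labels
```

Because the encoder is frozen, the per-view embeddings could also be computed once and cached, so that only the small head is trained on the ~3,000 cases.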
Alternatives (Less Ideal)
- Image montage (early fusion): simple but loses per-view structure and resolution
- Multi-image prompt-only fusion: possible for exploration, but unclear whether the model is designed to jointly reason over multiple images in a single request
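As a rough illustration of the resolution concern with the montage option, the sketch below estimates how many pixels each view keeps once a grid of views is scaled down to a single encoder input. The helper `montage_view_side` is hypothetical, and the 896-pixel input side and the 2x3 grid layout are assumptions for the example, not values confirmed for our pipeline.

```python
import math


def montage_view_side(num_views: int, cols: int, view_side: int, target_side: int) -> int:
    """Side length (pixels) left for each square view after the montage is
    scaled, aspect ratio preserved, to fit a target_side x target_side input."""
    rows = math.ceil(num_views / cols)
    longest = max(cols, rows) * view_side  # longest edge of the montage canvas
    # Integer arithmetic avoids float rounding; never report an upscale.
    return min(view_side, view_side * target_side // longest)


# Assumed setup: 5 square views in a 3-column (2x3) grid, 896 px encoder input.
per_view = montage_view_side(num_views=5, cols=3, view_side=896, target_side=896)
pixel_ratio = (896 * 896) / (per_view * per_view)
```

Under these assumptions each view keeps 298 pixels per side, roughly a ninth of the pixels it would get if encoded on its own.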
Questions
- Is feature-level late fusion the recommended pattern for multi-view medical imaging with MedGemma?
- Can the MedGemma image encoder be reliably used as a frozen feature extractor for this setup?
- Are there reference examples, benchmarks, or internal guidance for multi-image medical use cases?
Use Case Summary
- Task: Breast cancer detection from thermography
- Input: 5 fixed views per patient
- Dataset size: ~3,000 cases
- Output: Binary classification + localization
Any confirmation or guidance on this would help ensure correct and safe use of MedGemma in multi-view medical workflows.
Thank you.