neu-vi/SocialFusion


SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models

arXiv Paper

Hamza Tahboub¹, Weiyan Shi¹, Gang Hua², Huaizu Jiang¹
¹ Northeastern University, ² Amazon.com, Inc.

Understanding social interactions from visual cues is a fundamental challenge for socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they struggle to learn multiple social perception tasks in a unified way, often exhibiting negative transfer. We trace this negative transfer to a critical issue we term "social degradation," whereby the general visual-linguistic pre-training of VLMs impairs the visual encoder's ability to represent nuanced social information. We investigate this behavior through two lenses: decodability, measured by linear representation probing, and compatibility, measured by gradient conflict analysis. Both play a role in the degradation, especially decodability, which is significantly compromised by VLM pre-training. To address these issues, we propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. In contrast to existing VLMs, it exhibits positive transfer across all five social tasks, leveraging synergies between them to enhance overall performance, and it achieves performance comparable to task-specific state-of-the-art models on various benchmarks. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence and highlight the need for more socially aware training paradigms.
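As a rough illustration of the gradient conflict idea mentioned above: a common way to measure whether two tasks pull shared parameters in opposing directions is the cosine similarity between their gradients, where a negative value indicates conflict. The sketch below applies that general idea to toy vectors. It is not the paper's analysis code, and the function names `cosine_similarity` and `gradients_conflict` are hypothetical.

```python
import math


def cosine_similarity(g_a, g_b):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(x * y for x, y in zip(g_a, g_b))
    norm_a = math.sqrt(sum(x * x for x in g_a))
    norm_b = math.sqrt(sum(y * y for y in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero gradient has no direction; treat it as neutral.
        return 0.0
    return dot / (norm_a * norm_b)


def gradients_conflict(g_a, g_b):
    """True when the two task gradients point in opposing directions."""
    return cosine_similarity(g_a, g_b) < 0.0


# Toy gradients for two tasks on the same shared parameters (made-up values).
grad_task_a = [1.0, 2.0, 0.0]
grad_task_b = [-1.0, -1.0, 0.5]
print(gradients_conflict(grad_task_a, grad_task_b))
```

In practice the vectors would be per-task gradients of the shared parameters, flattened into one list or array per task.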

Installation

pip install -r requirements.txt

Dataset Setup

Download datasets from their official sources and place them under data/datasets.

Dataset      Source
AffectNet    https://www.mohammadmahoor.com/pages/databases/affectnet/
GazeFollow   http://gazefollow.csail.mit.edu/
HaGRIDv2     https://github.com/hukenovs/hagrid
LAM          https://ego4d-data.org/
PISC         https://zenodo.org/record/1059155
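Before training, it can help to confirm that every dataset folder is in place. The sketch below checks for one subfolder per dataset under data/datasets. The subfolder names are an assumption (they mirror the dataset keys passed to train.py), and `missing_datasets` is a hypothetical helper, not part of the repository; adjust the names to whatever layout the data loaders actually expect.

```python
from pathlib import Path

# Assumed subfolder names, mirroring the dataset keys used with train.py.
EXPECTED = ("affectnet", "gazefollow", "hagrid", "lam", "pisc")


def missing_datasets(root="data/datasets", names=EXPECTED):
    """Return the dataset names that have no directory under `root`."""
    root_path = Path(root)
    return [name for name in names if not (root_path / name).is_dir()]


if __name__ == "__main__":
    missing = missing_datasets()
    if missing:
        print("Missing dataset folders:", ", ".join(missing))
    else:
        print("All expected dataset folders found.")
```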

Training & Validation

python train.py \
    --train_on affectnet gazefollow hagrid lam pisc \
    --val_on affectnet gazefollow hagrid lam pisc \
    --use_heatmap \
    --wandb
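To launch runs from a script, the same command can be assembled programmatically. The sketch below builds the argument list using only the flags shown above. `build_train_command` is a hypothetical helper, and whether train.py accepts a subset of the five tasks is an assumption based on the list-style `--train_on` and `--val_on` flags, not something this README confirms.

```python
import shlex

ALL_TASKS = ("affectnet", "gazefollow", "hagrid", "lam", "pisc")


def build_train_command(train_on=ALL_TASKS, val_on=None,
                        use_heatmap=True, wandb=False):
    """Assemble the train.py command line as a list of arguments."""
    # Validate on the training tasks unless told otherwise.
    val_on = train_on if val_on is None else val_on
    cmd = ["python", "train.py",
           "--train_on", *train_on,
           "--val_on", *val_on]
    if use_heatmap:
        cmd.append("--use_heatmap")
    if wandb:
        cmd.append("--wandb")
    return cmd


# Example: a two-task run (assumes subsets are supported).
print(shlex.join(build_train_command(train_on=("affectnet", "pisc"))))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.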

Citation

@article{tahboub2026socialfusion,
  title={SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models},
  author={Hamza Tahboub and Weiyan Shi and Gang Hua and Huaizu Jiang},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=ofYhEoKIEx}
}

About

Addressing Social Degradation in Pre-trained Vision-Language Models (TMLR 2026)
