SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models

Hamza Tahboub¹, Weiyan Shi¹, Gang Hua², Huaizu Jiang¹
¹ Northeastern University, ² Amazon.com, Inc.

Understanding social interactions from visual cues is a fundamental challenge for socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they surprisingly struggle to learn multiple social perception tasks in a unified way, often exhibiting negative transfer. We trace this negative transfer to a critical issue we term "social degradation," whereby the general visual-linguistic pre-training of VLMs impairs the visual encoder's ability to represent nuanced social information. We investigate this behavior through two lenses: decodability, measured by linear representation probing, and compatibility, measured by gradient conflict analysis. Both play a role in the degradation, especially decodability, which is significantly compromised during VLM pre-training. To address these issues, we propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. In contrast to existing VLMs, it exhibits positive transfer across all five social tasks, leveraging synergies between them to enhance overall performance, and it achieves performance comparable to task-specific state-of-the-art models on various benchmarks. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence and highlight the need for more socially aware training paradigms.
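
As a rough illustration of the gradient conflict lens mentioned above: conflict between two tasks is commonly quantified by the cosine similarity of their gradients with respect to shared parameters, with negative values indicating that the tasks pull the parameters in opposing directions. The sketch below assumes this common definition, which may differ from the exact metric used in the paper. The function name and the toy gradient values are hypothetical.

```python
import math


def gradient_cosine(grad_a, grad_b):
    """Cosine similarity between two flattened gradient vectors.

    Assumption: each argument is a flat sequence of floats holding one
    task's gradient with respect to the same shared parameters.
    """
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero gradient has no direction; treat it as neither aligned nor conflicting.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy gradients (made up for illustration) that point in opposite directions.
grad_task_a = [1.0, 0.0]
grad_task_b = [-1.0, 0.0]
print(gradient_cosine(grad_task_a, grad_task_b))  # -1.0, i.e. fully conflicting
```

Under this definition, values near 1 indicate compatible tasks, values near 0 indicate unrelated updates, and negative values indicate conflict.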

Installation

pip install -r requirements.txt

Dataset Setup

Download datasets from their official sources and place them under data/datasets.

Dataset	Source
AffectNet	https://www.mohammadmahoor.com/pages/databases/affectnet/
GazeFollow	http://gazefollow.csail.mit.edu/
HaGRIDv2	https://github.com/hukenovs/hagrid
LAM	https://ego4d-data.org/
PISC	https://zenodo.org/record/1059155
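
Before training, it can help to confirm that every dataset is in place. The sketch below checks for one folder per dataset under data/datasets. The subfolder names are an assumption borrowed from the dataset identifiers passed to train.py; the repository may expect different names, so adjust the list to match the actual layout.

```python
from pathlib import Path

# Assumed subfolder names (taken from the --train_on values); not confirmed by the repo.
EXPECTED_DATASETS = ["affectnet", "gazefollow", "hagrid", "lam", "pisc"]


def missing_datasets(root="data/datasets", names=EXPECTED_DATASETS):
    """Return the names whose folder does not exist under root."""
    root_path = Path(root)
    return [name for name in names if not (root_path / name).is_dir()]


if __name__ == "__main__":
    missing = missing_datasets()
    if missing:
        print("Missing dataset folders:", ", ".join(missing))
    else:
        print("All expected dataset folders found.")
```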

Training & Validation

python train.py \
    --train_on affectnet gazefollow hagrid lam pisc \
    --val_on affectnet gazefollow hagrid lam pisc \
    --use_heatmap \
    --wandb
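
For scripted experiments, the same command can be assembled in Python. This is a minimal sketch that uses only the flags shown above. It assumes, without confirmation from the repository, that --train_on and --val_on accept any subset of the five dataset names and that --use_heatmap and --wandb are optional switches. The helper name build_train_command is hypothetical.

```python
import sys


def build_train_command(train_on, val_on=None, use_heatmap=True, wandb=False):
    """Build the argument list for train.py from dataset names and switches.

    Assumption: validating on the training datasets is a sensible default
    when val_on is not given.
    """
    val_on = list(train_on) if val_on is None else list(val_on)
    cmd = [sys.executable, "train.py", "--train_on", *train_on, "--val_on", *val_on]
    if use_heatmap:
        cmd.append("--use_heatmap")
    if wandb:
        cmd.append("--wandb")
    return cmd


# Example: a two-dataset run without Weights & Biases logging.
command = build_train_command(["affectnet", "pisc"])
print(" ".join(command[1:]))  # train.py --train_on affectnet pisc --val_on affectnet pisc --use_heatmap
```

The resulting list can be passed to subprocess.run(command, check=True) from the repository root.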

Citation

@article{tahboub2026socialfusion,
  title={SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models},
  author={Hamza Tahboub and Weiyan Shi and Gang Hua and Huaizu Jiang},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=ofYhEoKIEx}
}
