[20230413] Weekly VLM2 - VisualGPT

Paper

[VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning](https://arxiv.org/abs/2102.10407)


**Summary**
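This paper builds an image captioning model by fine-tuning a pre-trained language model (PLM) that was trained on a large-scale dataset.
Its main advantage is that it can reach high performance even when trained on a relatively small dataset.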

Looking at the architecture:
**Figure 1**
![Figure 1](https://user-images.githubusercontent.com/39431030/232377359-90848017-4f75-4968-b79c-a1f762f6d07c.PNG)
**Figure 2**
![Figure 2](https://user-images.githubusercontent.com/39431030/232377377-4ef11861-dbbd-464f-bd11-ac9102f87199.PNG)

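First, as shown in Figure 1, there is an encoder that takes the image as input. This encoder converts the image information into vectors of size 2048. The encoder output, which carries the image information, is fed into the decoder together with the text information the PLM already holds. The image and text information are combined through cross-attention, and the Self-Resurrecting Activation Unit (SRAU) uses the cross-attention result to balance the visual information against the PLM's existing linguistic knowledge while the pre-trained weights are updated. Through this process the decoder learns to generate captions for the image. The loss function is cross-entropy loss.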
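
As a rough illustration of the SRAU step described above, here is a minimal Python sketch of the gating as I understand it from the paper: a sigmoid gate for the visual branch, its complement for the linguistic branch, and a threshold that zeroes out small gate values. The function names, the threshold value `tau=0.2`, and the toy inputs are assumptions for illustration, not the authors' code.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def srau_gates(h, tau=0.2):
    """Return complementary (visual, linguistic) gates for pre-activations h.

    tau is an illustrative threshold (assumption): gate values that do not
    exceed it are set to zero, so each branch can be switched off entirely.
    """
    b_vis, b_lan = [], []
    for x in h:
        s = sigmoid(x)
        b_vis.append(s if s > tau else 0.0)
        b_lan.append((1.0 - s) if (1.0 - s) > tau else 0.0)
    return b_vis, b_lan


def srau_mix(h, visual, linguistic, tau=0.2):
    """Element-wise mix of the cross-attention (visual) and PLM (linguistic) features."""
    b_vis, b_lan = srau_gates(h, tau)
    return [bv * v + bl * l
            for bv, v, bl, l in zip(b_vis, visual, b_lan, linguistic)]


# Toy example: three feature dimensions with different gate pre-activations.
h = [0.0, 5.0, -5.0]
b_vis, b_lan = srau_gates(h)
mixed = srau_mix(h, visual=[1.0, 1.0, 1.0], linguistic=[2.0, 2.0, 2.0])
```

In the toy example, the first dimension keeps both branches at 0.5, the second keeps only the visual branch, and the third keeps only the linguistic branch, which is the balancing behavior the unit is meant to provide.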

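**Training method**
Training starts with a pre-training stage.
The PLM itself is already pre-trained on a huge dataset,
but it is pre-trained again on a dataset such as Conceptual Captions so that it fits the image captioning task.
The resulting pre-trained model is then fine-tuned on each target dataset to produce the final model.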
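
To make the training objective concrete, here is a small Python sketch of the token-level cross-entropy loss mentioned earlier, computed for a single caption. It assumes the decoder outputs one probability distribution over the vocabulary per step; the function name and the toy probabilities are hypothetical, and whether both training stages use exactly this objective is an assumption on my part.

```python
import math


def caption_cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the reference caption tokens.

    step_probs[t] is the predicted distribution over the vocabulary at
    step t, and target_ids[t] is the index of the reference token.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution per target token is required")
    nll = [-math.log(probs[tok]) for probs, tok in zip(step_probs, target_ids)]
    return sum(nll) / len(nll)


# Toy vocabulary of 3 tokens and a 2-token reference caption (made-up numbers).
probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
loss = caption_cross_entropy(probs, [0, 2])
```

The loss shrinks toward zero as the model puts more probability on the reference tokens, which is what both the caption pre-training and the per-dataset fine-tuning would be pushing toward under this assumption.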


Speaker
@joosun7l 
