Reproducible baselines for Korean image–hashtag consistency using CLIP/koCLIP.
This repository accompanies the paper:
**Predicting Semantic Consistency between Images and Korean Hashtags on Instagram**
Jiyoon Oh & Jangmin Oh, School of AI Convergence, Sungshin Women's University
- Description
- Repository Structure
- Dataset Information
- Code Information
- Requirements
- Usage Instructions
- Methodology
- Citation
- Acknowledgments
- License

## Description
This repository investigates semantic consistency modeling between images and Korean hashtags, addressing the limitations of generic multimodal models in social media environments. On image-centric social media platforms such as Instagram, user-generated hashtags often reflect subjective feelings, slang, or abstract concepts rather than literal descriptions of visual content. This creates a substantial semantic gap that general-purpose multi-modal models (e.g., CLIP) struggle to capture.
This project presents training and evaluation scripts for a five-class semantic consistency prediction task (scores 1–5), which quantifies the alignment between images and Korean hashtags.
We compare three progressive training strategies:
- Similarity-based baselines
- Frozen-backbone classifiers
- End-to-end fine-tuning
Experiments are conducted using both CLIP and KoCLIP backbones.
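The similarity-based baselines map the cosine similarity between a CLIP image embedding and a hashtag embedding onto the five consistency scores using fixed thresholds. The sketch below illustrates that mapping; the threshold values are illustrative placeholders, not the cut-offs searched by the baseline scripts.

```python
import numpy as np

def cosine_similarity(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    """Cosine similarity between an image embedding and a text embedding."""
    return float(np.dot(img_emb, txt_emb) /
                 (np.linalg.norm(img_emb) * np.linalg.norm(txt_emb)))

def similarity_to_score(sim: float, thresholds=(0.10, 0.18, 0.24, 0.30)) -> int:
    """Bin a cosine similarity into a 1-5 consistency score.

    The thresholds here are illustrative; in the baseline scripts the
    cut-offs are fixed after a search on the training split.
    """
    return int(np.searchsorted(thresholds, sim) + 1)

# Example: a weakly related pair vs. a strongly related pair
low = similarity_to_score(0.05)   # falls below all thresholds -> score 1
high = similarity_to_score(0.35)  # exceeds all thresholds     -> score 5
```

In practice the embeddings would come from the CLIP/KoCLIP encoders (e.g., via HuggingFace `transformers`) rather than random vectors.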
## Repository Structure

```
koclip-multimodal-consistency/
│
├── README.md
├── requirements.txt
│
├── data/
│   ├── combined_data.csv            # Annotated image–hashtag pairs
│   ├── categorical_codebook.csv     # Mapping table for categorical labels (Score 1–5 definitions)
│   └── English-language-codebook.md # Full English data dictionary (variable & label schema)
│
├── model/
│   ├── baseline/
│   │   ├── baseline_1_1_clip.py     # CLIP similarity-based baseline
│   │   ├── baseline_1_2_koclip.py   # KoCLIP similarity-based baseline
│   │   ├── baseline_2_1_clip.py     # CLIP (frozen backbone + classifier head)
│   │   └── baseline_2_2_koclip.py   # KoCLIP (frozen backbone + classifier head)
│   │
│   ├── FT_clip.py                   # CLIP end-to-end fine-tuning
│   └── FT_koclip.py                 # KoCLIP end-to-end fine-tuning
│
└── results/                         # Saved checkpoints, metrics, and plots
```

## Dataset Information
The dataset consists of approximately 2,000 image–hashtag pairs collected from public Korean Instagram posts.
Posts were gathered across nine hashtag categories representing lifestyle, technology, and seasonal trends:
| Korean Hashtag | English Translation |
|---|---|
| #국내여행 | Domestic Travel |
| #ai | AI |
| #가을 | Autumn |
| #제로칼로리 | Zero Calories |
| #맛집 | Good Restaurant |
| #영화 | Movie |
| #올리브영 | Olive Young (cosmetics brand) |
| #패션 | Fashion |
| #iphone16 | iPhone 16 |
For each of the nine seed hashtags, the 20 most recent public posts were collected using a Python Selenium-based web crawler, resulting in 180 unique posts.
From each post, the primary image and all associated hashtags were extracted. After removing duplicates and invalid entries, this process yielded approximately 2,000 valid image–hashtag pairs.
The main data file data/combined_data.csv contains the following columns:
| Column | Type | Description |
|---|---|---|
| Image Filename | string | Filename of the image (e.g., `abc.jpg`). Identical numeric identifiers correspond to the same image instance. |
| Hashtag | string | Korean hashtag text associated with the image |
| Score | integer (1–5) | Semantic consistency score between the image and hashtag |
| Hashtag_en | string | English translation of the original `Hashtag` column. Provided for interpretability and reference purposes only. |
> **Important Note:** The experimental models described in this study were trained using the original `Hashtag` column. The `Hashtag_en` column is included solely to help non-Korean-speaking readers understand the semantic meaning of each hashtag and does not affect the training pipeline.
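Given the schema above, the annotation file can be inspected with pandas. The rows below are made-up examples that follow the documented columns; in practice you would load `data/combined_data.csv` directly.

```python
import pandas as pd

# Hypothetical rows following the combined_data.csv schema;
# in practice: df = pd.read_csv("data/combined_data.csv")
df = pd.DataFrame({
    "Image Filename": ["001.jpg", "001.jpg", "002.jpg"],
    "Hashtag": ["#가을", "#맛집", "#패션"],
    "Score": [5, 1, 4],
    "Hashtag_en": ["Autumn", "Good Restaurant", "Fashion"],
})

# Label distribution over the five consistency scores
score_counts = df["Score"].value_counts().sort_index()

# Distinct images vs. image–hashtag pairs (several hashtags per image)
n_images = df["Image Filename"].nunique()
n_pairs = len(df)
```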
Each image–hashtag pair is assigned a semantic consistency score on a five-point Likert scale, annotated using OpenAI GPT-4 Structured Output.
| Score | Label | Description |
|---|---|---|
| 1 | Very Low | The hashtag has no semantic connection to the image content |
| 2 | Low | The hashtag has minimal or tangential relevance to the image |
| 3 | Medium | The hashtag is partially related but not strongly aligned with the image |
| 4 | High | The hashtag is clearly relevant and semantically consistent with the image |
| 5 | Very High | The hashtag perfectly describes or complements the image content |
To assess reliability, a subset of samples was manually validated. The agreement between human annotations and GPT-based labels achieved Cohen's Kappa (κ = 0.68), indicating substantial agreement.
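The reported agreement statistic can be computed with scikit-learn's `cohen_kappa_score`. The label arrays below are toy stand-ins for the validated subset, not the study's actual annotations, so the resulting value differs from the paper's κ = 0.68.

```python
from sklearn.metrics import cohen_kappa_score

# Toy validation subset: human scores vs. GPT-4 scores on the 1-5 scale
human = [5, 4, 4, 3, 2, 1, 5, 3, 4, 2]
gpt4  = [5, 4, 3, 3, 2, 1, 4, 3, 4, 2]

kappa = cohen_kappa_score(human, gpt4)
# On the common Landis & Koch scale, 0.61-0.80 counts as "substantial" agreement
```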
Due to copyright and privacy considerations, the raw image files are not directly included in this repository.
The image files used in this study are available at:
🔗 Download Image Dataset (Google Drive)
After downloading, place the image files under `data/image/` and update the image paths in the training scripts accordingly.
## Code Information

The baseline scripts are located in `model/baseline/` and the fine-tuning scripts in `model/`; together they implement three progressive training strategies:

| Script | Model | Strategy | Description |
|---|---|---|---|
| `baseline_1_1_clip.py` | CLIP | Similarity-based | Cosine similarity with fixed thresholds |
| `baseline_1_2_koclip.py` | KoCLIP | Similarity-based | Cosine similarity with fixed thresholds |
| `baseline_2_1_clip.py` | CLIP | Frozen backbone + classifier | Frozen encoder with trainable MLP head |
| `baseline_2_2_koclip.py` | KoCLIP | Frozen backbone + classifier | Frozen encoder with trainable MLP head |
| `FT_clip.py` | CLIP | End-to-end fine-tuning | Full model fine-tuning with classifier head |
| `FT_koclip.py` | KoCLIP | End-to-end fine-tuning | Full model fine-tuning with classifier head |
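For the frozen-backbone strategy, only a small classification head is trained on top of fixed CLIP embeddings. A minimal PyTorch sketch follows; the hidden size and the choice to concatenate image and text embeddings are assumptions for illustration, not the exact head used in the scripts.

```python
import torch
import torch.nn as nn

class ConsistencyHead(nn.Module):
    """Trainable MLP head over frozen CLIP embeddings -> 5 consistency classes."""
    def __init__(self, img_dim: int = 512, txt_dim: int = 512,
                 hidden: int = 256, n_classes: int = 5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the two modalities and classify
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1))

# The backbone stays frozen: only head.parameters() would go to the optimizer.
head = ConsistencyHead()
logits = head(torch.randn(4, 512), torch.randn(4, 512))  # shape (4, 5)
```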
## Requirements

Python 3.11 is recommended. Create a virtual environment and install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate   # macOS / Linux
# .venv\Scripts\activate    # Windows
pip install -r requirements.txt
```

Minimal `requirements.txt`:

```
torch
transformers
pandas
numpy
pillow
tqdm
scikit-learn
matplotlib
seaborn
```
All experiments were conducted on an NVIDIA TITAN RTX GPU (24 GB VRAM). Classification-based models were trained for up to 20 epochs with a batch size of 32.
## Usage Instructions

1. **Prepare the dataset:** Place your images under `data/image/` and ensure `data/combined_data.csv` contains the correct filenames.
2. **Adjust file paths:** Edit the data paths inside each script to match your local directory structure.
3. **Run a baseline:**

   ```bash
   # Similarity baseline (threshold search)
   python model/baseline/baseline_1_2_koclip.py

   # Frozen backbone + classifier
   python model/baseline/baseline_2_2_koclip.py

   # End-to-end fine-tuning
   python model/FT_koclip.py
   ```

4. **Results:**
   - Dataset split: 80/20 (train/test)
   - Metrics printed: Accuracy, Precision, Recall, F1-score
   - Confusion matrices saved as image files
   - Training curves exported as `training_results.png`
   - Fine-tuned weights saved as `.pth` files
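The printed metrics and the saved confusion matrix can be reproduced with scikit-learn. The prediction arrays below are placeholders standing in for a model's test-set output.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Placeholder test-set labels and predictions on the 1-5 scale
y_true = np.array([1, 2, 3, 4, 5, 3, 4, 5, 2, 1])
y_pred = np.array([1, 2, 3, 4, 4, 3, 4, 5, 3, 1])

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

# 5x5 confusion matrix over the consistency scores;
# the scripts render and save this with matplotlib/seaborn
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5])
```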
## Methodology

- **Data Collection:** Korean Instagram posts collected via a Selenium-based web crawler across nine hashtag categories.
- **Automated Labeling:** GPT-4 Structured Output was used to assign five-level semantic consistency scores. Human validation achieved Cohen's Kappa of κ = 0.68, indicating substantial agreement.
- **Model Training:** Three progressive strategies are evaluated: similarity-based baselines, a frozen backbone with a trainable classifier head, and end-to-end fine-tuning, using both CLIP (`openai/clip-vit-base-patch32`) and KoCLIP (`Bingsu/clip-vit-large-patch14-ko`) backbones.
- **Evaluation:** 80/20 train/test split with cross-entropy loss, Adam optimizer (lr = 5e-5), and early stopping.
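The evaluation setup (cross-entropy loss, Adam at lr = 5e-5, early stopping) can be sketched as below. The synthetic tensors, the placeholder linear classifier, and the patience value stand in for the real dataloaders and models and are not taken from the scripts.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Synthetic stand-ins for precomputed features and 5-class labels
X = torch.randn(64, 32)
y = torch.randint(0, 5, (64,))

model = nn.Linear(32, 5)                  # placeholder for the real classifier
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

best_loss, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(20):                   # up to 20 epochs, as in the paper
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    # Early stopping: stop after `patience` epochs without improvement
    # (on the training loss here; a validation loss in practice)
    if loss.item() < best_loss - 1e-4:
        best_loss, bad_epochs = loss.item(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```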
For full details, please refer to the accompanying paper.
## Citation

If you use this repository in academic work, please cite:
```bibtex
@article{oh2025koclip,
  author  = {Oh, Jiyoon and Oh, Jangmin},
  title   = {Predicting Semantic Consistency between Images and Korean Hashtags on Instagram},
  journal = {PeerJ Computer Science},
  year    = {2025},
  url     = {https://github.com/askjiyun/koclip-multimodal-consistency}
}
```

## Acknowledgments

- HuggingFace Transformers
- OpenAI CLIP (`openai/clip-vit-base-patch32`)
- KoCLIP (`Bingsu/clip-vit-large-patch14-ko`)
## License

This project is released under the MIT License.