Reproducible baselines for Korean image–hashtag consistency using CLIP/koCLIP.
This repository accompanies the paper:
**Predicting Semantic Consistency between Images and Korean Hashtags on Instagram**
Jiyoon Oh & Jangmin Oh, School of AI Convergence, Sungshin Women's University
- Description
- Repository Structure
- Dataset Information
- Code Information
- Requirements
- Usage Instructions
- Methodology
- Citation
- Acknowledgments
- License

## Description
This repository investigates semantic consistency modeling between images and Korean hashtags, addressing the limitations of generic multimodal models in social media environments. On image-centric social media platforms such as Instagram, user-generated hashtags often reflect subjective feelings, slang, or abstract concepts rather than literal descriptions of visual content. This creates a substantial semantic gap that general-purpose multi-modal models (e.g., CLIP) struggle to capture.
This project presents training and evaluation scripts for a five-class semantic consistency prediction task (scores 1–5), which quantifies the alignment between images and Korean hashtags.
We compare three progressive training strategies:
- Similarity-based baselines
- Frozen-backbone classifiers
- End-to-end fine-tuning
Experiments are conducted using both CLIP and KoCLIP backbones.
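The similarity-based baselines map the cosine similarity between a CLIP image embedding and a hashtag embedding onto the five consistency scores using fixed thresholds. The sketch below illustrates that mapping; the threshold values are illustrative placeholders, not the cut-offs searched by the baseline scripts.

```python
import numpy as np

def cosine_similarity(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    """Cosine similarity between an image embedding and a text embedding."""
    return float(np.dot(img_emb, txt_emb) /
                 (np.linalg.norm(img_emb) * np.linalg.norm(txt_emb)))

def similarity_to_score(sim: float, thresholds=(0.10, 0.18, 0.24, 0.30)) -> int:
    """Bin a cosine similarity into a 1-5 consistency score.

    The thresholds here are illustrative; in the baseline scripts the
    cut-offs are fixed after a search on the training split.
    """
    return int(np.searchsorted(thresholds, sim) + 1)

# Example: a weakly related pair vs. a strongly related pair
low = similarity_to_score(0.05)   # falls below all thresholds -> score 1
high = similarity_to_score(0.35)  # exceeds all thresholds     -> score 5
```

In practice the embeddings would come from the CLIP/KoCLIP encoders (e.g., via HuggingFace `transformers`) rather than random vectors.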
## Repository Structure

```
koclip-multimodal-consistency/
│
├── README.md
├── requirements.txt
│
├── data/
│   ├── combined_data.csv            # Annotated image–hashtag pairs
│   ├── categorical_codebook.csv     # Mapping table for categorical labels (Score 1–5 definitions)
│   └── English-language-codebook.md # Full English data dictionary (variable & label schema)
│
├── model/
│   ├── baseline/
│   │   ├── baseline_1_1_clip.py     # CLIP similarity-based baseline
│   │   ├── baseline_1_2_koclip.py   # KoCLIP similarity-based baseline
│   │   ├── baseline_2_1_clip.py     # CLIP (frozen backbone + classifier head)
│   │   └── baseline_2_2_koclip.py   # KoCLIP (frozen backbone + classifier head)
│   │
│   ├── FT_clip.py                   # CLIP end-to-end fine-tuning
│   └── FT_koclip.py                 # KoCLIP end-to-end fine-tuning
│
└── results/                         # Saved checkpoints, metrics, and plots
```

## Dataset Information
The dataset consists of approximately 2,000 image–hashtag pairs collected from public Korean Instagram posts.
Posts were gathered across nine hashtag categories representing lifestyle, technology, and seasonal trends:
| Korean Hashtag | English Translation |
|---|---|
| #국내여행 | Domestic Travel |
| #ai | AI |
| #가을 | Autumn |
| #제로칼로리 | Zero Calories |
| #맛집 | Good Restaurant |
| #영화 | Movie |
| #올리브영 | Olive Young (cosmetics brand) |
| #패션 | Fashion |
| #iphone16 | iPhone 16 |
For each of the nine seed hashtags, the 20 most recent public posts were collected using a Python Selenium-based web crawler, resulting in 180 unique posts.
From each post, the primary image and all associated hashtags were extracted. After removing duplicates and invalid entries, this process yielded approximately 2,000 valid image–hashtag pairs.
The main data file data/combined_data.csv contains the following columns:
| Column | Type | Description |
|---|---|---|
| Image Filename | string | Filename of the image (e.g., `abc.jpg`). Identical numeric identifiers correspond to the same image instance. |
| Hashtag | string | Korean hashtag text associated with the image |
| Score | integer (1–5) | Semantic consistency score between the image and hashtag |
| Hashtag_en | string | English translation of the original `Hashtag` column. Provided for interpretability and reference purposes only. |
> **Important Note:** The experimental models described in this study were trained using the original `Hashtag` column. The `Hashtag_en` column is included solely to help non-Korean-speaking readers understand the semantic meaning of each hashtag and does not affect the training pipeline.
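Given the schema above, the annotation file can be inspected with pandas. The rows below are made-up examples that follow the documented columns; in practice you would load `data/combined_data.csv` directly.

```python
import pandas as pd

# Hypothetical rows following the combined_data.csv schema;
# in practice: df = pd.read_csv("data/combined_data.csv")
df = pd.DataFrame({
    "Image Filename": ["001.jpg", "001.jpg", "002.jpg"],
    "Hashtag": ["#가을", "#맛집", "#패션"],
    "Score": [5, 1, 4],
    "Hashtag_en": ["Autumn", "Good Restaurant", "Fashion"],
})

# Label distribution over the five consistency scores
score_counts = df["Score"].value_counts().sort_index()

# Distinct images vs. image–hashtag pairs (several hashtags per image)
n_images = df["Image Filename"].nunique()
n_pairs = len(df)
```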
Each image–hashtag pair is assigned a semantic consistency score on a five-point Likert scale, annotated using OpenAI GPT-4 Structured Output.
| Score | Label | Description |
|---|---|---|
| 1 | Very Low | The hashtag has no semantic connection to the image content |
| 2 | Low | The hashtag has minimal or tangential relevance to the image |
| 3 | Medium | The hashtag is partially related but not strongly aligned with the image |
| 4 | High | The hashtag is clearly relevant and semantically consistent with the image |
| 5 | Very High | The hashtag perfectly describes or complements the image content |
To assess reliability, a subset of samples was manually validated. The agreement between human annotations and GPT-based labels achieved Cohen's Kappa (κ = 0.68), indicating substantial agreement.
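The reported agreement statistic can be computed with scikit-learn's `cohen_kappa_score`. The label arrays below are toy stand-ins for the validated subset, not the study's actual annotations, so the resulting value differs from the paper's κ = 0.68.

```python
from sklearn.metrics import cohen_kappa_score

# Toy validation subset: human scores vs. GPT-4 scores on the 1-5 scale
human = [5, 4, 4, 3, 2, 1, 5, 3, 4, 2]
gpt4  = [5, 4, 3, 3, 2, 1, 4, 3, 4, 2]

kappa = cohen_kappa_score(human, gpt4)
# On the common Landis & Koch scale, 0.61-0.80 counts as "substantial" agreement
```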
Due to copyright and privacy considerations, the raw image files are not directly included in this repository.
The image files used in this study are available at:
🔗 Download Image Dataset (Google Drive)
After downloading, place the image files under `data/image/` and update the image paths in the training scripts accordingly.
## Code Information

The baseline scripts are located in `model/baseline/` and the fine-tuning scripts in `model/`; together they implement three progressive training strategies:

| Script | Model | Strategy | Description |
|---|---|---|---|
| `baseline_1_1_clip.py` | CLIP | Similarity-based | Cosine similarity with fixed thresholds |
| `baseline_1_2_koclip.py` | KoCLIP | Similarity-based | Cosine similarity with fixed thresholds |
| `baseline_2_1_clip.py` | CLIP | Frozen backbone + classifier | Frozen encoder with trainable MLP head |
| `baseline_2_2_koclip.py` | KoCLIP | Frozen backbone + classifier | Frozen encoder with trainable MLP head |
| `FT_clip.py` | CLIP | End-to-end fine-tuning | Full model fine-tuning with classifier head |
| `FT_koclip.py` | KoCLIP | End-to-end fine-tuning | Full model fine-tuning with classifier head |
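For the frozen-backbone strategy, only a small classification head is trained on top of fixed CLIP embeddings. A minimal PyTorch sketch follows; the hidden size and the choice to concatenate image and text embeddings are assumptions for illustration, not the exact head used in the scripts.

```python
import torch
import torch.nn as nn

class ConsistencyHead(nn.Module):
    """Trainable MLP head over frozen CLIP embeddings -> 5 consistency classes."""
    def __init__(self, img_dim: int = 512, txt_dim: int = 512,
                 hidden: int = 256, n_classes: int = 5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the two modalities and classify
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1))

# The backbone stays frozen: only head.parameters() would go to the optimizer.
head = ConsistencyHead()
logits = head(torch.randn(4, 512), torch.randn(4, 512))  # shape (4, 5)
```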
## Requirements

Python 3.11 is recommended. Create a virtual environment and install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate   # macOS / Linux
# .venv\Scripts\activate    # Windows
pip install -r requirements.txt
```

Minimal `requirements.txt`:

```
torch
transformers
pandas
numpy
pillow
tqdm
scikit-learn
matplotlib
seaborn
```
All experiments were conducted on an NVIDIA TITAN RTX GPU (24 GB VRAM). Classification-based models were trained for up to 20 epochs with a batch size of 32.
## Usage Instructions

1. **Prepare the dataset:** Place your images under `data/image/` and ensure `data/combined_data.csv` contains the correct filenames.
2. **Adjust file paths:** Edit the data paths inside each script to match your local directory structure.
3. **Run a baseline:**

   ```bash
   # Similarity baseline (threshold search)
   python model/baseline/baseline_1_2_koclip.py

   # Frozen backbone + classifier
   python model/baseline/baseline_2_2_koclip.py

   # End-to-end fine-tuning
   python model/FT_koclip.py
   ```

4. **Results:**
   - Dataset split: 80/20 (train/test)
   - Metrics printed: Accuracy, Precision, Recall, F1-score
   - Confusion matrices saved as image files
   - Training curves exported as `training_results.png`
   - Fine-tuned weights saved as `.pth` files
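The printed metrics and the saved confusion matrix can be reproduced with scikit-learn. The prediction arrays below are placeholders standing in for a model's test-set output.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Placeholder test-set labels and predictions on the 1-5 scale
y_true = np.array([1, 2, 3, 4, 5, 3, 4, 5, 2, 1])
y_pred = np.array([1, 2, 3, 4, 4, 3, 4, 5, 3, 1])

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

# 5x5 confusion matrix over the consistency scores;
# the scripts render and save this with matplotlib/seaborn
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5])
```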
## Methodology

- **Data Collection:** Korean Instagram posts collected via a Selenium-based web crawler across nine hashtag categories.
- **Automated Labeling:** GPT-4 Structured Output was used to assign five-level semantic consistency scores. Human validation achieved Cohen's Kappa of κ = 0.68, indicating substantial agreement.
- **Model Training:** Three progressive strategies are evaluated: similarity-based baselines, a frozen backbone with a trainable classifier head, and end-to-end fine-tuning, using both CLIP (`openai/clip-vit-base-patch32`) and KoCLIP (`Bingsu/clip-vit-large-patch14-ko`) backbones.
- **Evaluation:** 80/20 train/test split with cross-entropy loss, Adam optimizer (lr = 5e-5), and early stopping.
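The evaluation setup (cross-entropy loss, Adam at lr = 5e-5, early stopping) can be sketched as below. The synthetic tensors, the placeholder linear classifier, and the patience value stand in for the real dataloaders and models and are not taken from the scripts.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Synthetic stand-ins for precomputed features and 5-class labels
X = torch.randn(64, 32)
y = torch.randint(0, 5, (64,))

model = nn.Linear(32, 5)                  # placeholder for the real classifier
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

best_loss, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(20):                   # up to 20 epochs, as in the paper
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    # Early stopping: stop after `patience` epochs without improvement
    # (on the training loss here; a validation loss in practice)
    if loss.item() < best_loss - 1e-4:
        best_loss, bad_epochs = loss.item(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```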
For full details, please refer to the accompanying paper.
## Citation

If you use this repository in academic work, please cite:
```bibtex
@article{oh2025koclip,
  author  = {Oh, Jiyoon and Oh, Jangmin},
  title   = {Predicting Semantic Consistency between Images and Korean Hashtags on Instagram},
  journal = {PeerJ Computer Science},
  year    = {2025},
  url     = {https://github.com/askjiyun/koclip-multimodal-consistency}
}
```

## Acknowledgments

- HuggingFace Transformers
- OpenAI CLIP (`openai/clip-vit-base-patch32`)
- KoCLIP (`Bingsu/clip-vit-large-patch14-ko`)
## License

This project is released under the MIT License.