This project addresses the growing challenge of detecting hateful memes across multiple languages on social media platforms. Hateful memes combine text and images to convey offensive messages that target individuals or groups based on characteristics such as race, gender, ethnicity, and religion. The detection of such content requires advanced multimodal analysis techniques that can interpret both visual and textual elements simultaneously.
- Problem Statement
- Dataset
- Architecture
- Demo Video
- Setup
- Methodology
- Models Implemented
- Results
- Limitations
- Conclusion and Future Work
- References
Hateful content detection, especially in multimodal formats like memes, presents significant challenges due to:
- The complexity of interpreting combined visual and textual elements
- Multilingual content that requires cross-language understanding
- Nuanced and implicit hate speech that may not be apparent in either modality alone
- The need for robust automated detection systems that can scale to social media volumes
We define a hateful meme as content containing direct or indirect attacks on people based on protected characteristics, dehumanizing speech, statements of inferiority, calls for exclusion, or mockery of hate crimes.
The project utilizes a comprehensive dataset created by merging several existing datasets:
- MET-Meme Dataset: 10,045 text-image pairs with manual annotations
  - 6,045 Chinese images
  - 4,000 English images
- CM-Offensive Meme Dataset: 4,372 Hindi-English offensive memes
- Facebook Hateful Meme Dataset: 10,000 multimodal examples specifically designed for hateful content detection
The final dataset comprises 26,432 images, with facial features extracted from 8,724 images to enrich the analysis capabilities.
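The merge step can be sketched as normalizing each source to a shared schema and concatenating; the column names and label conventions below are hypothetical stand-ins, since each dataset ships its own annotation format:

```python
import pandas as pd

# Toy per-dataset annotation frames; real column names and labels may differ.
met_meme = pd.DataFrame({"image": ["m1.jpg"], "text": ["..."], "offensive": [1]})
cm_off   = pd.DataFrame({"image": ["c1.jpg"], "text": ["..."], "label": ["offensive"]})
fb_hm    = pd.DataFrame({"img": ["f1.png"], "text": ["..."], "label": [0]})

# Normalize each source to a shared schema: (image, text, hateful, source)
met = met_meme.rename(columns={"offensive": "hateful"}).assign(source="MET-Meme")
cm = cm_off.assign(hateful=(cm_off["label"] == "offensive").astype(int),
                   source="CM-Offensive")[["image", "text", "hateful", "source"]]
fb = fb_hm.rename(columns={"img": "image", "label": "hateful"}).assign(source="FB-HM")

merged = pd.concat([met[["image", "text", "hateful", "source"]], cm, fb],
                   ignore_index=True)
print(len(merged))  # one row per meme across all three sources
```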
The system employs a two-pipeline architecture for comprehensive meme analysis:
Pipeline-1 focuses on extracting and processing features from both the textual and visual components of memes.
Pipeline-2 leverages the rich feature set created in Pipeline-1 to train multiple specialized models, which are then combined through ensemble learning techniques.
A demonstration video showcasing the system's capabilities is available below:
The video demonstrates:
- System setup and requirements
- Processing of sample memes across multiple languages
- Feature extraction visualization
- Real-time classification of hateful vs. non-hateful content
- Performance analysis and model comparison
- Explanation of prediction outcomes
The demo video is stored directly in this repository. There are several ways to view it:
- Click the badge above to open the video file in GitHub's media viewer
- Clone the repository and open the video file locally:
  ```shell
  git clone https://github.com/blackhat-coder21/Multi-Lingual-Hateful-Meme-Detection.git
  cd Multi-Lingual-Hateful-Meme-Detection/demo_video
  # Open "Multi-Lingual Hateful Meme Detection Demo Video.mp4" with your video player
  ```
- Download just the video by navigating to the demo directory in the GitHub repository and clicking on the video file, then clicking the "Download" button
To run the demo yourself, follow the installation instructions in the Setup section.
```
python>=3.8
torch>=1.9.0
transformers>=4.12.0
deepface>=0.0.79
pillow>=8.3.1
opencv-python>=4.5.3
numpy>=1.20.0
pandas>=1.3.0
scikit-learn>=0.24.2
matplotlib>=3.4.3
seaborn>=0.11.2
```
```shell
# Clone the repository
git clone https://github.com/blackhat-coder21/Multi-Lingual-Hateful-Meme-Detection.git
cd Multi-Lingual-Hateful-Meme-Detection

# Create and activate a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the Streamlit UI
streamlit run app.py

# Training: run each model's code on Kaggle, then save and download the trained model
# Evaluation: run the ensemble model code
# Inference on new memes:
streamlit run app.py
```
- Text Extraction: Optical character recognition (OCR) to capture text from meme images
- Language Normalization: Translating all extracted text into English
- Text Preprocessing: Removing stopwords, normalizing links, handling special tokens (e.g., "XD"), and processing code-mixed content
- Image Preprocessing: Rescaling, Gaussian blurring, and deskewing
- Facial Analysis: Using DeepFace to extract demographic information (age, gender, ethnicity) and emotional expressions
- Visual Question Answering: Employing BLIP model to generate contextual understanding through targeted questions
- Feature Fusion: Combining word embeddings (FastText, Word2Vec, GloVe) with visual features
- Consolidated Feature Set: Creating rich representations that incorporate textual, visual, demographic, emotional, and contextual information
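The feature-fusion step above can be sketched as a concatenation of per-meme feature vectors. All dimensions and feature layouts here are illustrative stand-ins, not the project's actual ones:

```python
import numpy as np

# Toy stand-ins for the per-meme features Pipeline-1 produces.
text_emb   = np.random.rand(300)    # e.g. averaged FastText/GloVe word vectors
visual     = np.random.rand(2048)   # pooled CNN image features
emotions   = np.array([0.1, 0.7, 0.2])   # DeepFace emotion scores (truncated set)
demograph  = np.array([1.0, 0.0, 34.0])  # gender one-hot + age
vqa_answer = np.random.rand(32)     # encoded BLIP VQA answer embedding

# Consolidated feature set: one row vector per meme
features = np.concatenate([text_emb, visual, emotions, demograph, vqa_answer])
print(features.shape)  # (2386,)
```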
- Pretrained CNNs: ResNet50, DenseNet121, Xception, VGG19, and VGG16 for deep visual feature extraction
- Traditional ML Models: Logistic Regression (LR), Support Vector Machines (SVM), and Multinomial Naive Bayes (MNB)
- Transformer Models: BERT, XLM-R, ViT, MuRIL, and Visual BERT
- Model Fine-tuning: Optimizing all models on the custom dataset derived from Pipeline-1
- Voting Ensemble: Combining predictions through majority voting to leverage each model's strengths
- Decision Fusion: Aggregating predictions across different model types to reduce error
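A minimal sketch of the hard majority vote used for decision fusion. The model names in the comments are illustrative, and with an even number of models this version breaks ties toward non-hateful, an assumption not specified by the project:

```python
import numpy as np

def majority_vote(predictions: np.ndarray) -> np.ndarray:
    """Hard voting over binary predictions of shape (n_models, n_samples)."""
    # A sample is flagged hateful when strictly more than half the models say so;
    # ties (possible with an even model count) fall to the non-hateful class.
    votes = predictions.sum(axis=0)
    return (votes * 2 > predictions.shape[0]).astype(int)

preds = np.array([
    [1, 0, 1],   # e.g. ViT-BERT
    [1, 0, 0],   # e.g. XLM-R
    [1, 1, 0],   # e.g. DenseNet121
])
print(majority_vote(preds))  # [1 0 0]
```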
**ResNet50 + BERT**
- Architecture: Combines ResNet50 for image features (2048-dim) and BERT for text features (768-dim)
- Feature Fusion: Concatenation with a multi-layer classification head
- Performance: 79.79% accuracy, 0.80 F1 score
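The concatenation-plus-classification-head fusion can be sketched as below, using stand-in tensors in place of real ResNet50 pooled features and BERT [CLS] embeddings; the hidden size and dropout rate are assumptions, not the project's actual hyperparameters:

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenates 2048-dim image and 768-dim text features, then classifies."""
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),  # fused 2816-dim input
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_feat, txt_feat):
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))

# Stand-ins for ResNet50 pooled features and BERT [CLS] embeddings
img = torch.randn(4, 2048)
txt = torch.randn(4, 768)
logits = FusionClassifier()(img, txt)
print(logits.shape)  # torch.Size([4, 2])
```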
**DenseNet121 + BERT**
- Architecture: Combines DenseNet121 for image features (1024-dim) and BERT for text features (768-dim)
- Performance: 80.86% accuracy, 0.81 F1 score
**BiLSTM**
- Architecture: Processes text sequences in both directions with an embedding layer, two BiLSTM layers, and dense layers
- Performance: 70.01% accuracy, 0.69 F1 score
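A sketch of such a BiLSTM text classifier in PyTorch; the vocabulary size, embedding dimension, and hidden dimension are assumed here, not the project's actual hyperparameters:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Embedding -> two stacked bidirectional LSTM layers -> dense head."""
    def __init__(self, vocab=5000, emb=128, hidden=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)  # forward + backward states

    def forward(self, token_ids):
        x = self.embed(token_ids)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])  # bidirectional state at the last step

tokens = torch.randint(0, 5000, (4, 20))  # batch of 4 sequences, length 20
logits = BiLSTMClassifier()(tokens)
print(logits.shape)  # torch.Size([4, 2])
```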
**MuRIL + CNN**
- Architecture: Combines MuRIL (for multilingual text) with a custom CNN for image processing
- Performance: 79.32% accuracy, 0.80 F1 score
**XLM-R**
- Architecture: Uses XLM-RoBERTa with an enhanced classification head for multilingual text analysis
- Performance: 80.42% accuracy, 0.80 F1 score
**ViT-BERT**
- Architecture: Combines a Vision Transformer (ViT) for images with BERT for text
- Performance: 81.20% accuracy, 0.81 F1 score
**Voting Ensemble**
- Architecture: Combines predictions from all six models through majority voting
- Performance: 87.00% accuracy, 0.85 F1 score, 0.86 precision
The evaluation metrics for all models are as follows:
| Model | Accuracy | F1 Score | Precision |
|---|---|---|---|
| BiLSTM | 70.01% | 0.69 | 0.68 |
| XLM-R | 80.42% | 0.80 | 0.80 |
| ViT-BERT | 81.20% | 0.81 | 0.81 |
| MuRIL | 79.32% | 0.80 | 0.81 |
| ResNet50 | 79.79% | 0.80 | 0.82 |
| DenseNet121 | 80.42% | 0.81 | 0.81 |
| Voting Ensemble | 87.00% | 0.85 | 0.86 |
The Voting Ensemble significantly outperformed individual models, demonstrating the effectiveness of combining complementary approaches for this complex task.
Despite promising results, the system has several limitations:
- Text Extraction Errors: OCR inaccuracies can affect downstream text-based models
- Translation Issues: Converting non-English text to English may introduce semantic inaccuracies
- Computational Complexity: Multiple large-scale models increase memory and processing requirements
- Dataset Imbalance: Class imbalance may bias models toward the majority class
- Interpretability Challenges: Limited transparency in explaining predictions
- Simple Voting: Equal weighting in majority voting doesn't account for model confidence
The proposed system effectively combines diverse architectures for robust hateful meme detection across multilingual and multimodal inputs. By leveraging the complementary strengths of different models through ensemble learning, we achieved significant improvements over single-model approaches. Future work will focus on:
- Replacing majority voting with trainable ensemble methods like stacking
- Expanding the dataset with more underrepresented languages
- Exploring lightweight transformer architectures for faster inference
- Improving model interpretability for better user trust
- Implementing active learning techniques to address class imbalance
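As one example of the first direction, a stacked ensemble replaces equal-weight majority voting with a trainable meta-learner fitted on the base models' out-of-fold predictions. This sketch uses scikit-learn's StackingClassifier on synthetic features; GaussianNB stands in for Multinomial Naive Bayes, which requires non-negative inputs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Toy features standing in for Pipeline-1's fused representations
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Base learners mirror the project's traditional ML models; the meta-learner
# weighs their predictions instead of counting equal votes.
stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svm", SVC(random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X, y)
preds = stack.predict(X[:5])
print(preds.shape)  # (5,)
```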
- Abdullakutty, F. and Naseem, U., Decoding Memes: A Comprehensive Analysis of Late and Early Fusion Models for Explainable Meme Analysis.
- Ma, J., Li, R., RoJiNG-CL at EXIST 2024: Leveraging Large Language Models for Multimodal Sexism Detection in Memes.
- Ji, J., Lin, X., Naseem, U., CapAlign: Improving Cross Modal Alignment via Informative Captioning for Harmful Meme Detection.
- Huang, J., Lyu, H., Pan, J., Wan, Z., Luo, J. (2024), Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Models for Hateful Meme Detection.
- Li, L., et al., VisualBERT: A Simple and Performant Baseline for Vision and Language.
- Conneau, A., et al., Unsupervised Cross-lingual Representation Learning at Scale.
- Dosovitskiy, A., et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
- Kakwani, D., et al., MuRIL: Multilingual Representations for Indian Languages.
- Schuster, M., Paliwal, K.K., Bidirectional Recurrent Neural Networks.
- He, K., Zhang, X., Ren, S., Sun, J., Deep Residual Learning for Image Recognition.
- Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q., Densely Connected Convolutional Networks.
- Real-time Object Detection using YOLOv8.
- Hansheng, Haar Cascades Classifier: A Light-weight Face Detection Technique.
- Byte Explorer, DeepFace: A Library for Face Recognition and Facial Analysis.