Nahtreom/MemeCMD


MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes



An automatically generated Chinese multi-turn dialogue dataset with contextually retrieved memes. Memes are matched to the dialogue context using role-based embeddings and weighted cosine similarity scoring.

🚀 Overview

MemeCMD is an automatically generated Chinese multi-turn dialogue dataset in which memes are retrieved from the conversational context and attached to the dialogue. It addresses the need for culturally relevant, context-aware dialogue systems in Chinese social media and messaging applications.

The project pairs meme retrieval with Chinese conversational data, using role-based embeddings and weighted cosine similarity to select the most contextually appropriate memes for each dialogue. It is intended for researchers working on:

  • Chinese conversational AI development
  • Multimodal dialogue systems
  • Cultural context understanding in AI
  • Social media content generation
  • Cross-cultural communication research

✨ Key Features

📊 Dataset Contributions

  • 🇨🇳 Chinese Multi-turn Dialogues: Authentic Chinese conversational patterns across multiple domains
  • 🎭 Role-based Scenarios: Diverse conversation roles (news-based and role-based dialogues)
  • 🖼️ Contextual Meme Integration: 6000+ carefully curated memes with contextual relevance
  • 🔄 Multiple Turn Lengths: Support for 6, 12, and 18-turn conversation flows
  • 📈 Three Selection Strategies: Random, Greedy, and Diversity-aware meme selection

🛠️ Technical Features

  • 🎯 Role-based Embedding Processing: Context-aware embedding generation
  • ⚖️ Weighted Cosine Similarity: Multi-component similarity computation with optimized weights
  • 🔍 Top-K Retrieval: Retrieves the K most relevant memes (configurable; top-3 in the current setup)
  • 📊 Similarity Visualization: Distribution analysis and plotting
  • 🔄 Batch Processing: Efficient handling of large-scale datasets
  • 🌐 Web-based Viewer: Interactive dialog visualization interface
  • 📈 Multi-metric Evaluation: Support for CLIP and GPT-4o scoring

📁 Project Structure

MemeCMD/
├── 🐍 Core Scripts
│   ├── retrieve.py           # Main retrieval engine and similarity computation
│   ├── find_figures.py       # Index-to-filename mapping utility
│   └── requirements.txt      # Project dependencies
│
├── 📊 Data & Assets
│   ├── Meme Warehouse/       # Meme embeddings and metadata
│   │   ├── EmojoPackage_processed/  # Processed meme images (6000+ files)
│   │   ├── figures.json      # Meme metadata
│   │   └── final_result.json # Processing results
│   ├── Dialogs/              # Base dialog datasets
│   ├── Dialogs_with_meme/    # Enhanced dialogs with meme annotations
│   └── Summary/              # Generated summaries and statistics
│
├── 🖼️ Examples & Visualization
│   ├── Examples/             # Sample dialog screenshots
│   ├── view-dialogs/         # Web-based dialog browser
│   └── imgs/                 # Visualization outputs
│
└── 📏 Evaluation & Metrics
    └── metric/
        ├── clip_dialog_similarity_zh.py  # CLIP-based evaluation
        ├── gpt4o_score.py                # GPT-4o scoring
        └── clip-score/                   # Evaluation results

📊 Dataset Statistics

The MemeCMD dataset provides comprehensive coverage of Chinese multi-turn dialogues with meme integration:

| Category          | Description                       | Count           |
|-------------------|-----------------------------------|-----------------|
| Total Memes       | Processed meme images             | 6,000+          |
| Dialogue Types    | News-based & Role-based scenarios | 2 types         |
| Turn Lengths      | Conversation lengths              | 6, 12, 18 turns |
| Selection Methods | Meme selection strategies         | 3 methods       |
| Total Dialogues   | Generated dialogue instances      | 18 variations   |
| Languages         | Primary language support          | Chinese (ZH)    |

Dialogue Categories

  • 📰 News-based Dialogues: Conversations centered around current events and news topics
  • 🎭 Role-based Dialogues: Scenario-driven conversations with specific character roles
  • 🔀 Selection Strategies:
    • Random: Baseline random meme selection
    • Greedy: Highest similarity score selection
    • Diversity-aware: Balanced relevance and diversity selection
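
The three strategies can be illustrated with a small sketch. This is not the repository's implementation: it assumes each turn comes with a list of `(meme_index, score)` candidates, and it models diversity-aware selection as a simple penalty on memes already used in the same dialogue. The function names and the `penalty` parameter are hypothetical.

```python
import random

# Hypothetical candidate format: (meme_index, similarity_score) pairs
# for one dialogue turn, e.g. the top-3 retrieved memes.

def select_random(candidates, rng):
    """Baseline: pick any candidate uniformly at random."""
    return rng.choice(candidates)[0]

def select_greedy(candidates):
    """Pick the candidate with the highest similarity score."""
    return max(candidates, key=lambda c: c[1])[0]

def select_diversity_aware(candidates, used_counts, penalty=0.1):
    """Trade relevance against repetition (assumed formulation).

    Each earlier use of a meme in the same dialogue lowers its score by
    `penalty`, so a slightly less similar but unused meme can win.
    """
    def adjusted(candidate):
        index, score = candidate
        return score - penalty * used_counts.get(index, 0)
    return max(candidates, key=adjusted)[0]

turn_candidates = [(17, 0.82), (42, 0.80), (5, 0.61)]
print(select_random(turn_candidates, random.Random(0)))  # one of 17, 42, 5
print(select_greedy(turn_candidates))                    # 17
print(select_diversity_aware(turn_candidates, {17: 1}))  # 42
```

In the last call, meme 17 was already used once, so its adjusted score (0.72) falls below that of meme 42 (0.80).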

📱 Examples


Each dialogue in our dataset is enriched with contextually relevant memes, selected using different strategies to enhance the conversation flow and emotional expression.

Note

See the Examples directory for all examples.

🛠️ Installation

Prerequisites

  • Python 3.7+
  • NumPy
  • Matplotlib
  • Seaborn
  • OpenAI API access (for GPT-4o evaluation)

Quick Setup

# Clone the repository
git clone <repository-url>
cd MemeCMD

# Install dependencies
pip install -r requirements.txt

🚀 Quick Start

Basic Usage

# 1. Run the core retrieval system
python retrieve.py

# 2. Map results to actual image files
python find_figures.py

# 3. Launch web viewer for results
cd view-dialogs
python -m http.server 8000
# Visit http://localhost:8000 in your browser

Advanced Configuration

The system supports various parameters for fine-tuning:

  • Embedding dimensions: Configurable based on your model
  • Similarity weights: Currently optimized as [0.3, -0.2, 0.2, 0.7]
  • Top-K selection: Adjustable retrieval count

Dataset Usage Examples

# Load a specific dialogue variant
import json

# Load news-based 12-turn dialogues with diversity-aware selection
with open('Dialogs_with_meme/news_based_12_turns_Diversity-awareSelection.json', 'r', encoding='utf-8') as f:
    dialogues = json.load(f)

# Access dialogue content
for dialogue in dialogues:
    print(f"Turns: {len(dialogue['conversation'])}")
    print(f"Memes used: {len(dialogue['memes'])}")

# Compare different selection strategies
strategies = ['Random', 'GreedySelection', 'Diversity-awareSelection']
for strategy in strategies:
    filename = f'Dialogs_with_meme/role_based_6_turns_{strategy}.json'
    # Load each strategy variant and report its size
    with open(filename, 'r', encoding='utf-8') as f:
        variant = json.load(f)
    print(f"{strategy}: {len(variant)} dialogues")
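
The 18 variations listed in the statistics correspond to 2 dialogue types × 3 turn lengths × 3 selection strategies. The sketch below enumerates them, assuming every file follows the naming pattern of the two filenames shown above. That pattern is an assumption, and `variant_filenames` is a hypothetical helper, not part of the repository.

```python
from itertools import product

DIALOGUE_TYPES = ['news_based', 'role_based']
TURN_LENGTHS = [6, 12, 18]
STRATEGIES = ['Random', 'GreedySelection', 'Diversity-awareSelection']

def variant_filenames(root='Dialogs_with_meme'):
    """List every type/length/strategy combination as a file path.

    Assumption: all files follow the '<type>_<turns>_turns_<strategy>.json'
    pattern seen in the usage example.
    """
    return [
        f'{root}/{kind}_{turns}_turns_{strategy}.json'
        for kind, turns, strategy in product(DIALOGUE_TYPES, TURN_LENGTHS, STRATEGIES)
    ]

paths = variant_filenames()
print(len(paths))  # 18
print(paths[0])    # Dialogs_with_meme/news_based_6_turns_Random.json
```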

🔧 How It Works

Algorithm Overview

  1. 📝 Context Processing: Role-based embeddings are generated from dialog contexts
  2. 🔍 Similarity Computation: Multi-component weighted cosine similarity calculation
  3. 🎯 Ranking & Selection: Top-K memes selected based on combined similarity scores
  4. 📊 Visualization: Similarity distributions plotted for analysis
  5. 🗂️ Result Mapping: Numerical indices mapped to actual image filenames
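
Steps 2 and 3 can be sketched as follows. This is a dependency-free illustration rather than the code in `retrieve.py`, which presumably works on NumPy arrays. How the four similarity components are defined is not stated here, so the sketch assumes that the context and each meme provide one embedding per component. The function names are hypothetical; the weights are the ones quoted under Advanced Configuration.

```python
import heapq
import math

WEIGHTS = [0.3, -0.2, 0.2, 0.7]  # weights quoted in this README

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    """Cosine similarity as the dot product of L2-normalized vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def weighted_similarity(context_parts, meme_parts, weights=WEIGHTS):
    """Combine per-component cosine similarities with fixed weights.

    Assumption: context and meme each provide one embedding per component,
    in the same order as `weights`.
    """
    return sum(w * cosine(c, m)
               for w, c, m in zip(weights, context_parts, meme_parts))

def top_k(context_parts, memes, k=3):
    """Return the k (meme_index, score) pairs with the highest combined score."""
    scored = ((i, weighted_similarity(context_parts, parts))
              for i, parts in enumerate(memes))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy 2-D embeddings: four components per item.
context = [[1.0, 0.0]] * 4
memes = [
    [[1.0, 0.0]] * 4,                                      # matches every component
    [[0.0, 1.0]] * 4,                                      # orthogonal everywhere
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]],      # differs on component 2
]
print([i for i, _ in top_k(context, memes, k=2)])  # [2, 0]
```

Because the second weight is negative, the meme that differs from the context on that component ranks above the meme that matches on all four.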

Technical Details

  • Embedding Normalization: All embeddings are L2-normalized before similarity computation
  • Weighted Scoring: Combines four similarity components using the weights [0.3, -0.2, 0.2, 0.7]
  • Batch Processing: Memory-efficient processing for large datasets (6000+ memes)
  • Multi-format Output: Supports NPZ, JSON, and visualization formats

📊 Evaluation Metrics

Two evaluation scripts are provided:

CLIP-based Evaluation

python metric/clip_dialog_similarity_zh.py

GPT-4o Scoring

python metric/gpt4o_score.py

📈 Output Formats

Generated Files

  • 📦 NPZ Files: Compressed arrays containing top-3 indices and similarity scores
  • 📋 JSON Files: Human-readable mappings between indices and image filenames
  • 📊 Visualization: Similarity distribution plots and statistical summaries
  • 📝 Logs: Detailed processing logs with performance metrics
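
The index-to-filename JSON mapping can be pictured with the sketch below. It is only an illustration under stated assumptions: the layout of the real NPZ and JSON files is not documented here, so the turn ids, the `filenames` list (where position `i` is taken to be the image for meme index `i`), and the helper `map_indices_to_files` are all hypothetical.

```python
import json

def map_indices_to_files(top_indices, filenames):
    """Replace each retrieved meme index with its image filename.

    `top_indices` maps a turn id to its retrieved meme indices;
    `filenames[i]` is assumed to be the image file for meme index i.
    """
    return {turn: [filenames[i] for i in indices]
            for turn, indices in top_indices.items()}

# Toy data; real indices would come from the retrieval output.
filenames = ['a.png', 'b.png', 'c.png', 'd.png']
top_indices = {'turn_0': [2, 0, 3], 'turn_1': [1, 3, 0]}

mapping = map_indices_to_files(top_indices, filenames)
# ensure_ascii=False keeps any Chinese filenames readable in the output.
print(json.dumps(mapping, ensure_ascii=False, indent=2))
```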

Web Interface

Access the interactive dialog viewer at view-dialogs/index.html to:

  • Browse generated dialogs with meme annotations
  • Compare different selection strategies (Random, Greedy, Diversity-aware)
  • Analyze conversation flows and meme relevance

🔬 Research & Citation

This work is associated with our research paper available on arXiv. If you use the MemeCMD dataset or methodology in your research, please consider citing:

@misc{wang2025memecmdautomaticallygeneratedchinese,
      title={MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes}, 
      author={Yuheng Wang and Xianhe Tang and Pufeng Huang},
      year={2025},
      eprint={2507.00891},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.00891}, 
}

Research Applications

This dataset has been designed to support research in:

  • Multimodal Dialogue Systems: Integration of text and visual elements in conversations
  • Chinese NLP: Culturally-aware language understanding and generation
  • Context-aware Information Retrieval: Semantic matching in conversational contexts
  • Human-Computer Interaction: Natural and engaging dialogue system design
  • Cross-cultural AI: Understanding cultural nuances in digital communication

🀝 Contributing

We welcome contributions! Please feel free to:

  • Report bugs and issues
  • Suggest new features
  • Submit pull requests
  • Improve documentation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Thanks to the OpenAI team for CLIP embeddings
  • Special recognition to the meme dataset contributors
  • Community feedback and testing support

📧 Contact: For questions or collaboration opportunities, please refer to the paper or open an issue.
