MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes
MemeCMD is an automatically generated Chinese multi-turn dialogue dataset in which each conversation is paired with contextually retrieved memes. It addresses the need for culturally relevant, context-aware dialogue systems in Chinese social media and messaging applications.
The project pairs generated Chinese conversations with a meme retrieval pipeline that uses role-based embeddings and a weighted cosine similarity score to select the memes that best fit each dialogue context. It is intended as a resource for researchers working on:
- Chinese conversational AI development
- Multimodal dialogue systems
- Cultural context understanding in AI
- Social media content generation
- Cross-cultural communication research
- Chinese Multi-turn Dialogues: Authentic Chinese conversational patterns across multiple domains
- Role-based Scenarios: Diverse conversation roles (news-based and role-based dialogues)
- Contextual Meme Integration: 6000+ carefully curated memes with contextual relevance
- Multiple Turn Lengths: Support for 6, 12, and 18-turn conversation flows
- Three Selection Strategies: Random, Greedy, and Diversity-aware meme selection
- Role-based Embedding Processing: Advanced context-aware embedding generation
- Weighted Cosine Similarity: Multi-component similarity computation with optimized weights
- Top-K Retrieval: Configurable top-3 most relevant meme selection
- Similarity Visualization: Comprehensive distribution analysis and plotting
- Batch Processing: Efficient handling of large-scale datasets
- Web-based Viewer: Interactive dialog visualization interface
- Multi-metric Evaluation: Support for CLIP and GPT-4o scoring
MemeCMD/
├── Core Scripts
│   ├── retrieve.py                     # Main retrieval engine and similarity computation
│   ├── find_figures.py                 # Index-to-filename mapping utility
│   └── requirements.txt                # Project dependencies
│
├── Data & Assets
│   ├── Meme Warehouse/                 # Meme embeddings and metadata
│   │   ├── EmojoPackage_processed/     # Processed meme images (6000+ files)
│   │   ├── figures.json                # Meme metadata
│   │   └── final_result.json           # Processing results
│   ├── Dialogs/                        # Base dialog datasets
│   ├── Dialogs_with_meme/              # Enhanced dialogs with meme annotations
│   └── Summary/                        # Generated summaries and statistics
│
├── Examples & Visualization
│   ├── Examples/                       # Sample dialog screenshots
│   ├── view-dialogs/                   # Web-based dialog browser
│   └── imgs/                           # Visualization outputs
│
└── Evaluation & Metrics
    └── metric/
        ├── clip_dialog_similarity_zh.py    # CLIP-based evaluation
        ├── gpt4o_score.py                  # GPT-4o scoring
        └── clip-score/                     # Evaluation results
The MemeCMD dataset provides comprehensive coverage of Chinese multi-turn dialogues with meme integration:
| Category | Description | Count |
|---|---|---|
| Total Memes | Processed meme images | 6,000+ |
| Dialogue Types | News-based & Role-based scenarios | 2 types |
| Turn Lengths | Conversation lengths | 6, 12, 18 turns |
| Selection Methods | Meme selection strategies | 3 methods |
| Dialogue Variants | Combinations of dialogue type, turn length, and selection method | 18 (2 × 3 × 3) |
| Languages | Primary language support | Chinese (ZH) |
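
The 18 variants in the table are the product of 2 dialogue types, 3 turn lengths, and 3 selection methods. The sketch below enumerates them as file paths. The naming pattern is an assumption inferred from the two example filenames used elsewhere in this README (`news_based_12_turns_Diversity-awareSelection.json` and `role_based_6_turns_{strategy}.json`); `variant_filenames` is a hypothetical helper, not part of the repository.

```python
from itertools import product

# Assumed naming pattern, inferred from the example filenames in this README.
DIALOGUE_TYPES = ["news_based", "role_based"]
TURN_LENGTHS = [6, 12, 18]
STRATEGIES = ["Random", "GreedySelection", "Diversity-awareSelection"]


def variant_filenames(root="Dialogs_with_meme"):
    """Return one path per (type, turn length, strategy) combination."""
    return [
        f"{root}/{dtype}_{turns}_turns_{strategy}.json"
        for dtype, turns, strategy in product(DIALOGUE_TYPES, TURN_LENGTHS, STRATEGIES)
    ]


paths = variant_filenames()
print(len(paths))  # 18
```

Check the contents of `Dialogs_with_meme/` before relying on these paths, since the actual filenames may deviate from the assumed pattern.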
- News-based Dialogues: Conversations centered around current events and news topics
- Role-based Dialogues: Scenario-driven conversations with specific character roles
- Selection Strategies:
  - Random: Baseline random meme selection
  - Greedy: Highest similarity score selection
  - Diversity-aware: Balanced relevance and diversity selection
Each dialogue in our dataset is enriched with contextually relevant memes, selected using different strategies to enhance the conversation flow and emotional expression.
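
To make the difference between the three strategies concrete, here is a minimal sketch that picks one meme per turn from a list of `(meme_index, similarity)` candidates. The random and greedy functions follow the descriptions above directly. The diversity-aware function is only an illustration: this README does not specify how relevance and diversity are balanced, so the repeat penalty used here (`penalty=0.5`) is a hypothetical stand-in, as are all function names.

```python
import random


def select_random(candidates, rng):
    """Baseline: pick any candidate meme uniformly at random."""
    return rng.choice(candidates)[0]


def select_greedy(candidates):
    """Pick the candidate with the highest similarity score."""
    return max(candidates, key=lambda c: c[1])[0]


def select_diverse(candidates, used, penalty=0.5):
    """Pick the best candidate after penalizing memes already used.

    The penalty scheme is a hypothetical stand-in for the dataset's
    diversity-aware strategy.
    """
    def adjusted(c):
        meme_id, score = c
        return score - penalty * used.count(meme_id)
    return max(candidates, key=adjusted)[0]


# Each turn has (meme_index, similarity) candidates, e.g. a top-3 list.
turns = [
    [(7, 0.9), (3, 0.8), (5, 0.1)],
    [(7, 0.9), (3, 0.7), (5, 0.2)],
]

greedy_picks = [select_greedy(t) for t in turns]

diverse_picks = []
for t in turns:
    diverse_picks.append(select_diverse(t, diverse_picks))

rng = random.Random(0)
random_picks = [select_random(t, rng) for t in turns]

print(greedy_picks)   # [7, 7]
print(diverse_picks)  # [7, 3]
```

In this toy example the greedy strategy repeats meme 7 in both turns, while the penalized variant falls back to meme 3 on the second turn.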
Note: All example dialog screenshots are in the Examples directory.
- Python 3.7+
- NumPy
- Matplotlib
- Seaborn
- OpenAI API access (for GPT-4o evaluation)
# Clone the repository
git clone <repository-url>
cd MemeCMD
# Install dependencies
pip install -r requirements.txt

# 1. Run the core retrieval system
python retrieve.py
# 2. Map results to actual image files
python find_figures.py
# 3. Launch web viewer for results
cd view-dialogs
python -m http.server 8000
# Visit http://localhost:8000 in your browser

The system supports various parameters for fine-tuning:
- Embedding dimensions: Configurable based on your model
- Similarity weights: Currently optimized as [0.3, -0.2, 0.2, 0.7]
- Top-K selection: Adjustable retrieval count
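
As an illustration of how the weights and the top-K setting interact, the sketch below combines four similarity matrices with the weights quoted above and keeps the three best memes per context. This README does not say what the four components represent, so the input here is simply four toy matrices; the function name `combine_and_rank` and the array layout are assumptions, not the interface of `retrieve.py`.

```python
import numpy as np

WEIGHTS = np.array([0.3, -0.2, 0.2, 0.7])  # values quoted above
TOP_K = 3


def combine_and_rank(component_sims, weights=WEIGHTS, k=TOP_K):
    """Weighted sum over components, then top-k memes per context.

    component_sims: array of shape (4, num_contexts, num_memes); what each
    component represents is an assumption of this sketch.
    """
    combined = np.tensordot(weights, component_sims, axes=1)  # (contexts, memes)
    # Sort descending and keep the k best indices per row.
    top_idx = np.argsort(-combined, axis=1)[:, :k]
    top_scores = np.take_along_axis(combined, top_idx, axis=1)
    return top_idx, top_scores


rng = np.random.default_rng(0)
sims = rng.uniform(-1.0, 1.0, size=(4, 2, 10))  # toy data: 2 contexts, 10 memes
idx, scores = combine_and_rank(sims)
print(idx.shape, scores.shape)  # (2, 3) (2, 3)
```

Note that one weight is negative, so a high similarity on that component lowers the combined score rather than raising it.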
# Load a specific dialogue variant
import json

# Load news-based 12-turn dialogues with diversity-aware selection
with open('Dialogs_with_meme/news_based_12_turns_Diversity-awareSelection.json', 'r', encoding='utf-8') as f:
    dialogues = json.load(f)

# Access dialogue content
for dialogue in dialogues:
    print(f"Turns: {len(dialogue['conversation'])}")
    print(f"Memes used: {len(dialogue['memes'])}")

# Compare different selection strategies
strategies = ['Random', 'GreedySelection', 'Diversity-awareSelection']
for strategy in strategies:
    filename = f'Dialogs_with_meme/role_based_6_turns_{strategy}.json'
    with open(filename, 'r', encoding='utf-8') as f:
        variant = json.load(f)
    # Process each strategy variant...

- Context Processing: Role-based embeddings are generated from dialog contexts
- Similarity Computation: Multi-component weighted cosine similarity calculation
- Ranking & Selection: Top-K memes selected based on combined similarity scores
- Visualization: Similarity distributions plotted for analysis
- Result Mapping: Numerical indices mapped to actual image filenames
- Embedding Normalization: All embeddings are L2-normalized before similarity computation
- Weighted Scoring: Employs a carefully tuned 4-component weighting system
- Batch Processing: Memory-efficient processing for large datasets (6000+ memes)
- Multi-format Output: Supports NPZ, JSON, and visualization formats
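
The normalization and batching points above can be sketched as follows. Once every row is L2-normalized, cosine similarity is just a dot product, so each batch of contexts needs only one matrix multiplication against the meme matrix and the full similarity matrix never has to be held in memory. The embedding size (64), batch size (256), and function names are illustrative assumptions, and the random arrays stand in for real embeddings.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length; eps guards against zero vectors."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def batched_cosine(queries, memes, batch_size=256):
    """Yield (start, similarities) for one batch of queries at a time.

    After L2 normalization, cosine similarity reduces to a dot product, so
    each batch is a single matrix multiplication against the meme matrix.
    """
    memes_n = l2_normalize(memes)
    for start in range(0, len(queries), batch_size):
        batch = l2_normalize(queries[start:start + batch_size])
        yield start, batch @ memes_n.T


rng = np.random.default_rng(0)
memes = rng.normal(size=(6000, 64))     # toy stand-in for meme embeddings
queries = rng.normal(size=(1000, 64))   # toy stand-in for context embeddings

# Keep only the best meme per query, so memory stays at batch_size x num_memes.
best = np.empty(len(queries), dtype=int)
for start, sims in batched_cosine(queries, memes):
    best[start:start + len(sims)] = sims.argmax(axis=1)
```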
The system includes comprehensive evaluation tools:
python metric/clip_dialog_similarity_zh.py
python metric/gpt4o_score.py

- NPZ Files: Compressed arrays containing top-3 indices and similarity scores
- JSON Files: Human-readable mappings between indices and image filenames
- Visualization: Similarity distribution plots and statistical summaries
- Logs: Detailed processing logs with performance metrics
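
The relationship between the NPZ and JSON outputs can be illustrated with a small round trip: store top-3 indices and scores as compressed arrays, then translate the indices into filenames and write them as JSON. The array keys (`indices`, `scores`), the file names, and the toy data are all hypothetical; inspect the files produced by `retrieve.py` and `find_figures.py` for the actual layout.

```python
import json
import os
import tempfile

import numpy as np

# Hypothetical array names and file layout; the real files may differ.
top_indices = np.array([[2, 0, 1], [1, 2, 0]])
top_scores = np.array([[0.9, 0.5, 0.1], [0.8, 0.6, 0.2]])
filenames = ["a.png", "b.png", "c.png"]  # position i holds the file for meme index i

with tempfile.TemporaryDirectory() as tmp:
    npz_path = os.path.join(tmp, "top3.npz")
    json_path = os.path.join(tmp, "top3_files.json")

    # Compressed arrays for downstream numeric work.
    np.savez_compressed(npz_path, indices=top_indices, scores=top_scores)

    # Reload and translate indices into filenames for human inspection.
    with np.load(npz_path) as data:
        loaded = data["indices"]
    mapping = [[filenames[i] for i in row] for row in loaded]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(mapping, f, ensure_ascii=False, indent=2)

    with open(json_path, "r", encoding="utf-8") as f:
        restored = json.load(f)

print(restored[0])  # ['c.png', 'a.png', 'b.png']
```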
Access the interactive dialog viewer at view-dialogs/index.html to:
- Browse generated dialogs with meme annotations
- Compare different selection strategies (Random, Greedy, Diversity-aware)
- Analyze conversation flows and meme relevance
This work is associated with our research paper available on arXiv. If you use the MemeCMD dataset or methodology in your research, please consider citing:
@misc{wang2025memecmdautomaticallygeneratedchinese,
title={MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes},
author={Yuheng Wang and Xianhe Tang and Pufeng Huang},
year={2025},
eprint={2507.00891},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.00891},
}

This dataset has been designed to support research in:
- Multimodal Dialogue Systems: Integration of text and visual elements in conversations
- Chinese NLP: Culturally-aware language understanding and generation
- Context-aware Information Retrieval: Semantic matching in conversational contexts
- Human-Computer Interaction: Natural and engaging dialogue system design
- Cross-cultural AI: Understanding cultural nuances in digital communication
We welcome contributions! Please feel free to:
- Report bugs and issues
- Suggest new features
- Submit pull requests
- Improve documentation
This project is licensed under the MIT License - see the LICENSE file for details.
- Thanks to the OpenAI team for CLIP embeddings
- Special recognition to the meme dataset contributors
- Community feedback and testing support
Contact: For questions or collaboration opportunities, please refer to the paper or open an issue.

