中文 | English
Important
Challenge: Multimodal Affect Computing isn't just one step; it's an entire fragmented pipeline. The journey from raw files to a trained model is a gauntlet of tedious data preprocessing, slow and inconsistent manual annotation, and complex model training setups.
MER-Factory: Unifies this entire workflow into one seamless factory. We automate the heavy lifting of preprocessing and annotation to generate high-quality, reason-augmented datasets, and then bridge the gap directly to model training.
Stop juggling different tools: let our factory handle the pipeline so you can focus on what matters: your research.
MER-Factory is under active development, with new features added regularly. Check our roadmap; contributions are welcome!
To view the workflow graph, run `print(app.get_graph().draw_mermaid())` (see `graph.py`).
- Action Unit (AU) Pipeline: Extracts facial Action Units (AUs) and translates them into descriptive natural language.
- Audio Analysis Pipeline: Extracts audio, transcribes speech, and performs detailed tonal analysis.
- Video Analysis Pipeline: Generates comprehensive descriptions of video content and context.
- Image Analysis Pipeline: Provides end-to-end emotion recognition for static images, complete with visual descriptions and emotional synthesis.
- Full MER Pipeline: An end-to-end multimodal pipeline that identifies peak emotional moments, analyzes all modalities (visual, audio, facial), and synthesizes a holistic emotional reasoning summary.
- Gate Agent (Experimental): An optional quality-control layer that reviews intermediate analysis results. Following the "garbage in, garbage out" principle, it rejects low-quality or conflicting outputs and prompts the sub-agents to refine their analysis before final synthesis (a conceptual sketch of this pattern follows this list). Enable it with `--use-gate-agent`.
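Conceptually, a gate agent is a review-and-retry loop wrapped around each sub-analysis. The sketch below illustrates the pattern only; the `analyze`/`review` callables are hypothetical stand-ins, not MER-Factory's actual implementation.

```python
# Conceptual sketch only -- not MER-Factory's actual gate-agent code.
from typing import Callable

def gated_analysis(
    analyze: Callable[[str], str],              # produces a result, optionally using reviewer feedback
    review: Callable[[str], tuple[bool, str]],  # returns (accepted, feedback)
    max_rounds: int = 3,
) -> str:
    feedback = ""
    result = analyze(feedback)
    for _ in range(max_rounds):
        accepted, feedback = review(result)
        if accepted:
            break
        result = analyze(feedback)  # ask the sub-agent to refine using the reviewer's feedback
    return result

# Toy example: the "reviewer" rejects results that are too short.
if __name__ == "__main__":
    attempts = iter(["ok", "The subject smiles broadly while speaking in a warm tone."])
    print(gated_analysis(lambda fb: next(attempts), lambda r: (len(r) > 20, "too short, add detail")))
```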
Check out example outputs here:
Please visit the project documentation for detailed installation and usage instructions.
Note
For Windows users, simply download the pre-built ffmpeg and OpenFace and place them as requested.
We highly recommend serving the HF model/Ollama model on Linux and running MER-Factory on Windows to reduce installation time.
But for those who love the command line (e.g., me), a complete installation example for Linux environments (including Google Colab) can be found at:
python main.py [INPUT_PATH] [OUTPUT_DIR] [OPTIONS]

# Show all supported args.
python main.py --help
# Full MER pipeline with Gemini (default)
python main.py path_to_video/ output/ --type MER --silent --threshold 0.8
# Using Sentiment Analysis task instead of MERR
python main.py path_to_video/ output/ --type MER --task "Sentiment Analysis" --silent
# Using ChatGPT models
python main.py path_to_video/ output/ --type MER --chatgpt-model gpt-4o --silent
# Using local Ollama models
python main.py path_to_video/ output/ --type MER --ollama-vision-model llava-llama3:latest --ollama-text-model llama3.2 --silent
# Using Hugging Face model
python main.py path_to_video/ output/ --type MER --huggingface-model google/gemma-3n-E4B-it --silent
# Process images instead of videos
python main.py ./images ./output --type MER

Note: Run `ollama pull llama3.2` (and similar) first if an Ollama model is needed. Ollama does not support video analysis for now.
When selecting a Hugging Face model with --huggingface-model, MER-Factory forwards all calls through a lightweight client that talks to a local/remote API server which actually hosts the HF model. This keeps your main environment clean and allows easy scaling.
- Start the HF API Server (in a separate terminal):
# Example: serve Whisper base on port 7860
python -m mer_factory.models.hf_api_server --model_id openai/whisper-base --host 0.0.0.0 --port 7860

- Run MER-Factory as usual and select the HF model by ID:
python main.py path_to_video/ output/ --type MER --huggingface-model openai/whisper-base --silent
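Before launching MER-Factory, you can confirm that the server from step 1 is actually accepting connections. The check below is a plain TCP probe, with the host and port taken from the example above (it assumes the server runs on the same machine):

```python
import socket

# Host/port from the hf_api_server example above; adjust if you changed them.
HOST, PORT = "127.0.0.1", 7860

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(3)
    try:
        sock.connect((HOST, PORT))
        print(f"HF API server is listening on {HOST}:{PORT}")
    except OSError as exc:
        print(f"Cannot reach {HOST}:{PORT} -- is hf_api_server running? ({exc})")
```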
We provide an interactive dashboard webpage to facilitate data curation and hyperparameter tuning. The dashboard allows you to test different prompts, save and run configurations, and rate the generated data.

To launch the dashboard, use the following command:
python dashboard.py

| Option | Short | Description | Default |
|---|---|---|---|
| `--type` | `-t` | Processing type (AU, audio, video, image, MER) | MER |
| `--task` | `-tk` | Analysis task type (MERR, Sentiment Analysis) | MERR |
| `--label-file` | `-l` | Path to a CSV file with 'name' and 'label' columns (see the example below the table). Optional, for ground-truth labels. | None |
| `--threshold` | `-th` | Emotion detection threshold (0.0-5.0) | 0.8 |
| `--peak_dis` | `-pd` | Steps between peak frame detection (min 8) | 15 |
| `--silent` | `-s` | Run with minimal output | False |
| `--cache` | `-ca` | Reuse existing audio/video/AU results from previous pipeline runs | False |
| `--concurrency` | `-c` | Concurrent files for async processing (min 1) | 4 |
| `--ollama-vision-model` | `-ovm` | Ollama vision model name | None |
| `--ollama-text-model` | `-otm` | Ollama text model name | None |
| `--chatgpt-model` | `-cgm` | ChatGPT model name (e.g., gpt-4o) | None |
| `--huggingface-model` | `-hfm` | Hugging Face model ID | None |
| `--use-gate-agent` | `-uga` | Enable Gate Agent for quality control (Dev Feature) | False |
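For reference, the `--label-file` option above expects a CSV with `name` and `label` columns. A minimal sketch that writes one (the file name, sample names, and labels here are hypothetical):

```python
import csv

# Hypothetical ground-truth labels; 'name' should identify the corresponding input file.
rows = [
    {"name": "sample_0001", "label": "happy"},
    {"name": "sample_0002", "label": "sad"},
]

with open("labels.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "label"])
    writer.writeheader()
    writer.writerows(rows)
```

Pass the resulting file via `--label-file labels.csv`.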
Extracts facial Action Units and generates natural language descriptions:

python main.py video.mp4 output/ --type AU

Extracts audio, transcribes speech, and analyzes tone:

python main.py video.mp4 output/ --type audio

Generates comprehensive video content descriptions:

python main.py video.mp4 output/ --type video

Runs the pipeline with image input:

python main.py ./images ./output --type image
# Note: Image files will automatically use the image pipeline regardless of the --type setting

Runs the complete multimodal emotion recognition pipeline:

python main.py video.mp4 output/ --type MER
# or simply:
python main.py video.mp4 output/

The --task option allows you to choose between different analysis tasks:
Performs detailed emotion analysis with granular emotion categories:
python main.py video.mp4 output/ --task "MERR"
# or simply omit the --task option since it's the default
python main.py video.mp4 output/

Performs sentiment-focused analysis (positive, negative, neutral):

python main.py video.mp4 output/ --task "Sentiment Analysis"

To export datasets for curation or training, use the following commands:
python export.py --output_folder "{output_folder}" --file_type {file_type.lower()} --export_path "{export_path}" --export_csv

python export.py --input_csv path/to/csv_file.csv --export_format sharegpt
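For orientation, the ShareGPT format used by LLaMA-Factory organizes each sample as a list of conversation turns. The exact fields written by export.py may differ, but a typical ShareGPT-style record looks roughly like this (illustrative values only; inspect the actual export output):

```python
# Illustrative ShareGPT-style record -- values are made up.
sample = {
    "conversations": [
        {"from": "human", "value": "<image>Describe the person's emotional state."},
        {"from": "gpt", "value": "A broad smile and relaxed brows suggest the person is happy."},
    ],
    "images": ["path/to/sample_0001.jpg"],
}
```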
MER-Factory includes a comprehensive reference-free evaluation toolkit to assess the quality of generated annotations without human ratings.

# Evaluate all samples in the output directory
python tools/evaluate.py output/ --export-csv output/evaluation_summary.csv

# Run with verbose output to see detailed failure reasons
python tools/evaluate.py output/ --export-csv output/evaluation_summary.csv --verbose

# Skip writing per-sample evaluation files
python tools/evaluate.py output/ --export-csv output/evaluation_summary.csv --no-write-per-sample

The evaluation toolkit provides multiple quality metrics:
- CLIP Image Score: Visual grounding between images and descriptions
- CLAP Audio Score: Audio-text alignment using LAION-CLAP
- AU F1 Score: Facial expression accuracy vs. OpenFace AUs
- NLI Consistency: Logical consistency across modalities
- ASR WER: Speech recognition quality vs. Whisper baseline
- Text Quality: Distinctness, repetition, and readability metrics
- Composite Score: Overall quality (0-100) combining all metrics
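The toolkit's actual weighting lives in tools/evaluate. Purely to illustrate how a 0-100 composite can combine normalized per-metric scores, here is a sketch with assumed metric names and weights (not the toolkit's real values):

```python
# Illustrative only: assumed metric names (normalized to 0-1) and assumed weights.
def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of normalized metrics, scaled to 0-100."""
    total_w = sum(weights[k] for k in metrics if k in weights)
    score = sum(metrics[k] * weights[k] for k in metrics if k in weights) / total_w
    return round(100 * score, 2)

example_metrics = {"clip": 0.31, "clap": 0.28, "au_f1": 0.62, "nli": 0.74, "text_quality": 0.80}
example_weights = {"clip": 0.25, "clap": 0.15, "au_f1": 0.20, "nli": 0.25, "text_quality": 0.15}
print(composite_score(example_metrics, example_weights))  # 54.85
```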
Evaluation outputs:
- Per-sample: `evaluation.json` files in each sample directory
- Dataset-level: `evaluation_summary.csv` with rankings and statistics
- Console: Beautiful progress bars and a table of the top-performing samples
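Since `evaluation_summary.csv` is a regular CSV, it is easy to inspect or re-rank yourself. A quick look with pandas (column names depend on the toolkit version, so check the header first):

```python
import pandas as pd

summary = pd.read_csv("output/evaluation_summary.csv")
print(summary.columns.tolist())  # inspect which metric columns were written
print(summary.head(10))          # top rows as ranked by the toolkit
```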
For detailed evaluation documentation, see tools/evaluate/README.md.
The tool supports four types of models:
- Google Gemini (default): Requires `GOOGLE_API_KEY` in `.env`
- OpenAI ChatGPT: Requires `OPENAI_API_KEY` in `.env`; specify with `--chatgpt-model`
- Ollama: Local models; specify with `--ollama-vision-model` and `--ollama-text-model`
- Hugging Face: Currently supports multimodal models like `google/gemma-3n-E4B-it`
Note: If using Hugging Face models, concurrency is automatically set to 1 for synchronous processing.
Recommended for: Image analysis, Action Unit analysis, text processing, and simple audio transcription tasks.
Benefits:
- Async support: Ollama supports asynchronous calling, making it ideal for processing large datasets efficiently (see the sketch after this list)
- Local processing: No API costs or rate limits
- Wide model selection: Visit ollama.com to explore available models
- Privacy: All processing happens locally
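A minimal sketch of the async calling pattern mentioned above, independent of MER-Factory's internals. It assumes the ollama Python package is installed, the Ollama server is running, and llama3.2 has been pulled; the prompts are hypothetical stand-ins for per-file analysis requests.

```python
import asyncio
from ollama import AsyncClient

# Hypothetical prompts standing in for per-file analysis requests.
PROMPTS = ["Describe the facial expression.", "Summarize the speaker's tone."]

async def analyze(client: AsyncClient, prompt: str) -> str:
    response = await client.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

async def main() -> None:
    client = AsyncClient()
    # Issue all requests concurrently instead of waiting for each one in turn.
    results = await asyncio.gather(*(analyze(client, p) for p in PROMPTS))
    for prompt, result in zip(PROMPTS, results):
        print(prompt, "->", result[:80])

asyncio.run(main())
```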
Example usage:
# Process images with Ollama
python main.py ./images ./output --type image --ollama-vision-model llava-llama3:latest --ollama-text-model llama3.2 --silent
# AU extraction with Ollama
python main.py video.mp4 output/ --type AU --ollama-text-model llama3.2 --silent

Recommended for: Advanced video analysis, complex multimodal reasoning, and high-quality content generation.
Benefits:
- State-of-the-art performance: The latest GPT-4o and Gemini models offer superior reasoning capabilities
- Advanced video understanding: Better support for complex video analysis and temporal reasoning
- High-quality outputs: More nuanced and detailed emotion recognition and reasoning
- Robust multimodal integration: Excellent performance across text, image, and video modalities
Example usage:
python main.py video.mp4 output/ --type MER --chatgpt-model gpt-4o --silent
python main.py video.mp4 output/ --type MER --silent

Trade-offs: API costs and rate limits, but these models typically provide the highest-quality results for complex emotion reasoning tasks.
Recommended for: When you need the latest state-of-the-art models or specific features not available in Ollama.
Custom Model Integration: If you want to use the latest HF models or features that Ollama doesn't support:
- Option 1 - Implement yourself: Navigate to `mer_factory/models/hf_models/__init__.py` to register your own model and implement the needed functions, following our existing patterns.
- Option 2 - Request support: Open an issue on our repository to let us know which model you'd like us to support, and we'll consider adding it.
Currently supported models: `google/gemma-3n-E4B-it` and others listed in the HF models directory.
This training guide will walk you through the complete end-to-end process from Data Analysis/Annotation to Launching Model Training. The process is divided into two main stages:
- Stage One: Automated Data Preparation: Use the `train.sh` script to convert the analysis output from MER-Factory into the standard dataset format required by the training framework with a single command, and automatically complete the registration.
- Stage Two: Interactive Training Launch: Start the LLaMA-Factory graphical user interface (Web UI), load the prepared dataset, and freely configure all training parameters.
Before you begin, please ensure that you have completed the following environment preparations:
- Initialize Submodules

  This project uses Git submodules to integrate LLaMA-Factory, ensuring version consistency and reproducibility of the training environment. After cloning this repository, please run the following command to initialize and download the submodules:

  git submodule update --init --recursive

- Install Dependencies

  This project and the LLaMA-Factory submodule have their own separate dependency environments, which need to be installed individually:

  # 1. Install the main dependencies for MER-Factory
  pip install -r requirements.txt
  # 2. Install the dependencies for the LLaMA-Factory submodule
  pip install -r LLaMA-Factory/requirements.txt
After you have finished analyzing the raw data using main.py, you can use the train.sh script to prepare the dataset.
The core task of this script is to automate all the tedious data preparation work. It reads the analysis results from MER-Factory, converts them into the ShareGPT format required by LLaMA-Factory, and automatically registers the dataset within LLaMA-Factory.
To ensure the traceability and consistency of experiments, we recommend naming your dataset using the following format:
RawDataset_AnalysisModel_TaskType
Process data for an MER task and name the dataset according to the convention:
# Assuming the llava and llama3.2 analysis models were used
bash train.sh --file_type "image" --dataset_name "mer2025_llava_llama3.2_MER"Process data for an audio task and name the dataset according to the convention:
# Assuming the gemini api model was used
bash train.sh --file_type "audio" --dataset_name "mer2025_gemini_audio"Process data for a video task and name the dataset according to the convention:
# Assuming the gemini api model was used
bash train.sh --file_type "video" --dataset_name "mer2025_gemini_video"Process data for an image task and name the dataset according to the convention:
# Assuming the chatgpt gpt-4o model was used
bash train.sh --file_type "mer" --dataset_name "mer2025_gpt-4o_image"After the script runs successfully, your dataset (e.g., mer2025_llava_llama3.2_MER) will be ready and registered in LLaMA-Factory's dataset_info, making it directly available for use in the next stage.
Once your dataset is ready, you can launch the LLaMA-Factory graphical interface to configure and start your training task.
- Navigate to the LLaMA-Factory Directory

  cd LLaMA-Factory

- Start the Web UI

  llamafactory-cli webui

- Configure and Train in the Web UI
If you find MER-Factory useful in your research or project, please consider giving us a ⭐! Your support helps us grow and continue improving.
Additionally, if you use MER-Factory in your work, please consider citing us using the following BibTeX entries:
@software{Lin_MER-Factory_2025,
author = {Lin, Yuxiang and Zheng, Shunchao},
doi = {10.5281/zenodo.15847351},
license = {MIT},
month = {7},
title = {{MER-Factory}},
url = {https://github.com/Lum1104/MER-Factory},
version = {0.1.0},
year = {2025}
}
@inproceedings{NEURIPS2024_c7f43ada,
author = {Cheng, Zebang and Cheng, Zhi-Qi and He, Jun-Yan and Wang, Kai and Lin, Yuxiang and Lian, Zheng and Peng, Xiaojiang and Hauptmann, Alexander},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
pages = {110805--110853},
publisher = {Curran Associates, Inc.},
title = {Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning},
url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/c7f43ada17acc234f568dc66da527418-Paper-Conference.pdf},
volume = {37},
year = {2024}
}