ES-MemEval

This repository contains the dataset and code for our paper: "ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support".

Environment Setup

Dependency Installation

To execute the code in this repository, please install the required dependencies using Anaconda with the following commands:

conda create -n ES_MemEval python=3.13.5
conda activate ES_MemEval
pip install -r requirements.txt

The experiments were conducted on the following environment and have been verified to run on several configurations:

  • Operating System: Ubuntu 18.04.5 LTS (also tested on Ubuntu 24.04.3 LTS)

  • GPU: NVIDIA TITAN RTX / GeForce RTX 3080 / A100

  • CUDA Version: 12.4 (also verified with CUDA 13.0)

Python Configuration

The PYTHONPATH environment variable should be configured to point to the src directory to ensure correct module imports:

export PYTHONPATH="./src"    # Replace with an absolute path unless the working directory is fixed to the repository root. (VS Code configures this automatically.)
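If exporting PYTHONPATH is inconvenient, the same effect can be achieved programmatically. The following is a minimal sketch (REPO_ROOT is an assumption; point it at your checkout of the repository) that prepends the src directory to the import path before project modules are imported:

```python
import os
import sys

# Programmatic alternative to exporting PYTHONPATH: prepend the repository's
# src directory to sys.path before importing any project modules.
REPO_ROOT = os.getcwd()  # assumption: the script is run from the repo root
SRC_DIR = os.path.join(REPO_ROOT, "src")
if SRC_DIR not in sys.path:
    sys.path.insert(0, SRC_DIR)
```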

Project Configuration

All project-level configurations are defined in src/exe/common_configurations.py, which specifies key parameters such as API endpoints for large language models (LLMs), data and output paths, and other global settings.

LLM Configuration

By default, the OpenAI API key is loaded from ./secrets/open_ai_api_key.txt (a relative path). To modify this behavior, locate and update the following expression in common_configurations.py:

_csfile.read_all_text("./secrets/open_ai_api_key.txt")
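For reference, a plain-Python stand-in for the `_csfile.read_all_text` helper could look like the sketch below (the helper's exact behavior is an assumption; stripping surrounding whitespace is what an API-key file typically needs):

```python
from pathlib import Path

def read_all_text(path: str) -> str:
    """Plain-Python stand-in for the repository's _csfile.read_all_text
    helper (exact behavior is an assumption): return the file's contents
    as UTF-8 text with surrounding whitespace stripped."""
    return Path(path).read_text(encoding="utf-8").strip()
```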

Likewise, the default configuration defines the other LLM endpoints as locally hosted on _vllm_host and exposed through OpenAI-compatible APIs on ports 8911–8913.

These parameters can be modified in common_configurations.py if alternative hosts or ports are required.
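The endpoint layout described above can be sketched as follows. VLLM_HOST stands in for the repository's _vllm_host setting (the name and value here are assumptions), and the model-to-port mapping matches the vLLM serve commands below:

```python
# Sketch of the local endpoint layout: each model is served on its own port
# behind an OpenAI-compatible API. VLLM_HOST is an assumption standing in
# for the repository's _vllm_host setting.
VLLM_HOST = "localhost"
MODEL_PORTS = {
    "mistralai/Mistral-Small-3.1-24B-Instruct-2503": 8911,
    "mistralai/Ministral-8B-Instruct-2410": 8912,
    "microsoft/Phi-3-medium-128k-instruct": 8913,
}

def base_url(model: str) -> str:
    # OpenAI-compatible servers such as vLLM's expose their API under /v1.
    return f"http://{VLLM_HOST}:{MODEL_PORTS[model]}/v1"
```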

To deploy equivalent services locally, the models can be launched using vLLM as illustrated below:

# Create a vLLM environment
conda create -n ES_MemEval_vLLM python=3.12
conda activate ES_MemEval_vLLM
pip install uv
uv pip install vllm --torch-backend=auto

# Some models may require the signing of agreements
hf auth login

# Download and serve the models
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --port 8911
vllm serve mistralai/Ministral-8B-Instruct-2410 --port 8912
vllm serve microsoft/Phi-3-medium-128k-instruct --port 8913

Multiprocessing

All experimental scripts support multiprocessing. Setting multiprocessing_workers to a value greater than zero enables parallel execution. However, during the first run, model downloads may occur; therefore, it is recommended to first run with multiprocessing_workers = 0 to verify correctness, and then restart the experiment with multiprocessing enabled.

To avoid console clutter caused by simultaneous process outputs, the standard output is suppressed during execution. Only summary messages are displayed before and after the experiment. Some models may print progress logs during their initial download phase, which can temporarily dominate the console output. If the model download has completed but no new logs appear, the experiment is likely already in progress.
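The workers switch described above can be sketched like this (names are illustrative, not the repository's actual API): zero workers runs tasks serially in-process, which is easier to debug and lets first-run model downloads log normally, while a positive count uses a process pool.

```python
import multiprocessing as mp

def run_tasks(tasks, worker, multiprocessing_workers=0):
    """Sketch of the multiprocessing switch (illustrative names):
    0 workers runs tasks serially in-process; > 0 uses a process pool."""
    if multiprocessing_workers == 0:
        return [worker(t) for t in tasks]
    with mp.Pool(processes=multiprocessing_workers) as pool:
        return pool.map(worker, tasks)
```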

Dataset Access

The EvoEmo dataset is provided in this repository at data/evo_emo.json.

For external Python projects, data loading utilities are available in src/lib/shared/data_provider.
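For a quick look at the dataset without pulling in the repository's utilities, a minimal loader is enough, since the file is plain JSON (this sketch makes no assumption about the JSON's internal schema; prefer the utilities in src/lib/shared/data_provider for real use):

```python
import json

def load_evo_emo(path: str = "data/evo_emo.json"):
    """Minimal stand-alone loader for the EvoEmo dataset file. The JSON's
    internal schema is not assumed here; this simply parses the file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```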

Running the Code

After completing the environment setup, experiments can be executed by running the corresponding scripts located in the exe directory. For example, to evaluate the question-answering task using the Mistral-8B model with full dialogue history, run:

python ./src/exe/qa/qa_mistral8b_full.py

By default, a directory named with the current datetime is created under ./outputs/qa_mistral8b_full. The results for each seeker are saved as a CSV file in a sub-directory named with the seeker's ID, and that file is updated in real time to show the current progress. After the run, check exception.csv to see whether any exceptions occurred in each process; the final results of all seekers are merged into result.csv.
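The final merge step can be sketched as below. The exact per-seeker file layout is an assumption; this simply concatenates the rows of every seeker's CSV under a run directory, keeping one copy of the header:

```python
import csv
import glob
import os

def merge_seeker_results(run_dir):
    """Sketch of the merge step (per-seeker file layout is an assumption):
    collect the rows of every seeker sub-directory's CSV files under
    run_dir into a single header plus a combined row list."""
    header, rows = None, []
    for path in sorted(glob.glob(os.path.join(run_dir, "*", "*.csv"))):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            file_header = next(reader, None)
            if header is None:
                header = file_header
            rows.extend(reader)
    return header, rows
```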

Script Configuration

In addition to common_configurations.py, every executable script defines its own Config class so that its configuration can be modified independently of the other scripts; its scope is limited to that script.
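A script-local Config class might look like the following sketch; the field names here are illustrative, not the repository's actual settings:

```python
# Hypothetical shape of a script-local Config class; field names are
# illustrative, not the repository's actual settings.
class Config:
    model_name = "mistralai/Ministral-8B-Instruct-2410"
    multiprocessing_workers = 0  # raise above 0 once a serial run succeeds
    output_dir = "./outputs/qa_mistral8b_full"
```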

Experiment List

The tables below map each executable script to the corresponding results in our manuscript.

Table 3

Reported metrics: F1 Score (%) ↑, BERTScore (%) ↑, and LLM-as-Judge (0-2) ↑, each broken down by IE / TR / CD / Abs / UM / All.

| Category | Model | Script |
| --- | --- | --- |
| Base | Mistral-8B | exe/qa/qa_mistral8b_full.py |
| Base | Phi-3-Medium | exe/qa/qa_phi3_full.py |
| Base | Mistral-24B | exe/qa/qa_mistral24b_full.py |
| Base + RAG | Mistral-8B + RAG | exe/qa/qa_mistral8b_rag.py |
| Base + RAG | Phi-3-Medium + RAG | exe/qa/qa_phi3_rag.py |
| Base + RAG | Mistral-24B + RAG | exe/qa/qa_mistral24b_rag.py |
| Commercial | GPT-3.5-turbo (4K) | exe/qa/qa_gpt35turbo_full.py |
| Commercial | GPT-4o (16K) | exe/qa/qa_gpt4o_full.py |
| Commercial + RAG | GPT-3.5-turbo + RAG | exe/qa/qa_gpt35turbo_rag.py |
| Commercial + RAG | GPT-4o + RAG | exe/qa/qa_gpt4o_rag.py |

Table 4

Reported metrics: answer prediction (F1 Score (%) ↑, BERTScore (%) ↑, LLM-as-Judge (0-2) ↑) and retrieval accuracy (R@k (%) ↑, NDCG@k (0-2) ↑).

| Retrieval Granularity | Top-k | Answer Prediction | Retrieval Accuracy |
| --- | --- | --- | --- |
| Turn-level | 10 | exe/qa/qa_mistral24b_rag_turn_10.py | exe/qa_retrieval/qa_retrieval_turn.py |
| Turn-level | 20 | exe/qa/qa_mistral24b_rag_turn_20.py | |
| Turn-level | 30 | exe/qa/qa_mistral24b_rag_turn_30.py | |
| Round-level | 5 | exe/qa/qa_mistral24b_rag_round_5.py | exe/qa_retrieval/qa_retrieval_round.py |
| Round-level | 10 | exe/qa/qa_mistral24b_rag_round_10.py | |
| Round-level | 15 | exe/qa/qa_mistral24b_rag_round_15.py | |
| Session-level | 2 | exe/qa/qa_mistral24b_rag_session_2.py | exe/qa_retrieval/qa_retrieval_session.py |
| Session-level | 4 | exe/qa/qa_mistral24b_rag.py | |
| Session-level | 8 | exe/qa/qa_mistral24b_rag_session_8.py | |

Table 5

Reported metrics: F1 Score ↑, BERTScore ↑, LLM-as-Judge ↑.

| Model | Context | Script |
| --- | --- | --- |
| Mistral-8B | 2K | exe/qa/qa_mistral8b_full_2k.py |
| Mistral-8B | 4K | exe/qa/qa_mistral8b_full_4k.py |
| Mistral-8B | 8K | exe/qa/qa_mistral8b_full_8k.py |
| Mistral-8B | 20K | exe/qa/qa_mistral8b_full.py |
| Mistral-24B | 2K | exe/qa/qa_mistral24b_full_2k.py |
| Mistral-24B | 4K | exe/qa/qa_mistral24b_full_4k.py |
| Mistral-24B | 8K | exe/qa/qa_mistral24b_full_8k.py |
| Mistral-24B | 20K | exe/qa/qa_mistral24b_full.py |

Table 6

Reported metrics: ROUGE-1/2/L (%) ↑, event-based Precision/Recall/F1 (%) ↑, and LLM Score (0-5) ↑.

| Category | Model | Script |
| --- | --- | --- |
| Base | Mistral-8B | exe/sum/sum_mistral8b_full.py |
| Base | Phi-3-Medium | exe/sum/sum_phi3_full.py |
| Base | Mistral-24B | exe/sum/sum_mistral24b_full.py |
| Base + RAG | Mistral-8B + RAG | exe/sum/sum_mistral8b_rag.py |
| Base + RAG | Phi-3-Medium + RAG | exe/sum/sum_phi3_rag.py |
| Base + RAG | Mistral-24B + RAG | exe/sum/sum_mistral24b_rag.py |
| Commercial | GPT-3.5-turbo (4K) | exe/sum/sum_gpt35turbo_full.py |
| Commercial | GPT-4o (16K) | exe/sum/sum_gpt4o_full.py |
| Commercial + RAG | GPT-3.5-turbo + RAG | exe/sum/sum_gpt35turbo_rag.py |
| Commercial + RAG | GPT-4o + RAG | exe/sum/sum_gpt4o_rag.py |

Table 7

Reported metrics: Recall ↑ and Weighted Score ↑.

| Memory Setting | Model | Script |
| --- | --- | --- |
| No-Mem. | Mistral-8B | exe/dg/dg_mistral8b.py |
| No-Mem. | Phi-3-Medium | exe/dg/dg_phi3.py |
| No-Mem. | Mistral-24B | exe/dg/dg_mistral24b.py |
| No-Mem. | GPT-3.5-turbo | exe/dg/dg_gpt35turbo.py |
| No-Mem. | GPT-4o | exe/dg/dg_gpt4o.py |
| Full-Hist. | Mistral-8B | exe/dg/dg_mistral8b_full.py |
| Full-Hist. | Phi-3-Medium | exe/dg/dg_phi3_full.py |
| Full-Hist. | Mistral-24B | exe/dg/dg_mistral24b_full.py |
| Full-Hist. | GPT-3.5-turbo | exe/dg/dg_gpt35turbo_full.py |
| Full-Hist. | GPT-4o | exe/dg/dg_gpt4o_full.py |
| RAG | Mistral-8B | exe/dg/dg_mistral8b_rag.py |
| RAG | Phi-3-Medium | exe/dg/dg_phi3_rag.py |
| RAG | Mistral-24B | exe/dg/dg_mistral24b_rag.py |
| RAG | GPT-3.5-turbo | exe/dg/dg_gpt35turbo_rag.py |
| RAG | GPT-4o | exe/dg/dg_gpt4o_rag.py |

Table 8

Reported metrics: LT-Mem. ↑, Pers. ↑, ES ↑.

| Memory Setting | Model | Script |
| --- | --- | --- |
| No-Mem. | Mistral-8B | exe/dg/dg_mistral8b.py |
| No-Mem. | Phi-3-Medium | exe/dg/dg_phi3.py |
| No-Mem. | Mistral-24B | exe/dg/dg_mistral24b.py |
| No-Mem. | GPT-3.5-turbo | exe/dg/dg_gpt35turbo.py |
| No-Mem. | GPT-4o | exe/dg/dg_gpt4o.py |
| Full-Hist. | Mistral-8B | exe/dg/dg_mistral8b_full.py |
| Full-Hist. | Phi-3-Medium | exe/dg/dg_phi3_full.py |
| Full-Hist. | Mistral-24B | exe/dg/dg_mistral24b_full.py |
| Full-Hist. | GPT-3.5-turbo | exe/dg/dg_gpt35turbo_full.py |
| Full-Hist. | GPT-4o | exe/dg/dg_gpt4o_full.py |
| RAG | Mistral-8B | exe/dg/dg_mistral8b_rag.py |
| RAG | Phi-3-Medium | exe/dg/dg_phi3_rag.py |
| RAG | Mistral-24B | exe/dg/dg_mistral24b_rag.py |
| RAG | GPT-3.5-turbo | exe/dg/dg_gpt35turbo_rag.py |
| RAG | GPT-4o | exe/dg/dg_gpt4o_rag.py |
