This repository contains the dataset and code for our paper: "ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support".
To execute the code in this repository, please install the required dependencies using Anaconda with the following commands:
conda create -n ES_MemEval python=3.13.5
conda activate ES_MemEval
pip install -r requirements.txt

The experiments were conducted on the following environments and have been verified to run successfully on various configurations:

- Operating System: Ubuntu 18.04.5 LTS (also tested on Ubuntu 24.04.3 LTS)
- GPU: NVIDIA TITAN RTX / GeForce RTX 3080 / A100
- CUDA Version: 12.4 (also verified with CUDA 13.0)
The PYTHONPATH environment variable should be configured to point to the src directory to ensure correct module imports:

export PYTHONPATH="./src" # Replace with an absolute path unless the working directory is fixed to the repository root. (In VSCode, this is configured automatically.)

All project-level configuration is defined in src/exe/common_configurations.py, which specifies key parameters such as API endpoints for large language models (LLMs), data and output paths, and other global settings.
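For example, when working from the repository root, the relative path can be resolved to an absolute one before exporting:

```shell
# Resolve ./src to an absolute path so imports work from any working directory.
export PYTHONPATH="$(pwd)/src"
```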
By default, the OpenAI API key is loaded from ./secrets/open_ai_api_key.txt (a relative path). To modify this behavior, locate and update the following expression in common_configurations.py:

_csfile.read_all_text("./secrets/open_ai_api_key.txt")

Likewise, the default configuration defines the other LLM endpoints as locally hosted on _vllm_host, exposed through OpenAI-compatible APIs on ports 8911-8913.
These parameters can be modified in common_configurations.py if alternative hosts or ports are required.
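As an illustration of how such an override might look, the key-loading expression could be replaced with a helper that prefers an environment variable and falls back to the key file. This helper is not part of the repository; it is a hedged sketch assuming _csfile.read_all_text is a plain file read:

```python
import os
from pathlib import Path

def load_openai_api_key(key_file="./secrets/open_ai_api_key.txt"):
    # Hypothetical replacement for the key-loading expression: prefer the
    # OPENAI_API_KEY environment variable, then fall back to the key file.
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if key:
        return key
    return Path(key_file).read_text(encoding="utf-8").strip()
```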
To deploy equivalent services locally, the models can be launched using vLLM as illustrated below:
# Create a vLLM environment
conda create -n ES_MemEval_vLLM python=3.12
conda activate ES_MemEval_vLLM
pip install uv
uv pip install vllm --torch-backend=auto
# Some models may require the signing of agreements
hf auth login
# Download and serve the models
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --port 8911
vllm serve mistralai/Ministral-8B-Instruct-2410 --port 8912
vllm serve microsoft/Phi-3-medium-128k-instruct --port 8913

All experimental scripts support multiprocessing. Setting multiprocessing_workers to a value greater than zero enables parallel execution. However, model downloads may occur during the first run; it is therefore recommended to run with multiprocessing_workers = 0 first to verify correctness, and then restart the experiment with multiprocessing enabled.
To avoid console clutter caused by simultaneous process outputs, the standard output is suppressed during execution. Only summary messages are displayed before and after the experiment. Some models may print progress logs during their initial download phase, which can temporarily dominate the console output. If the model download has completed but no new logs appear, the experiment is likely already in progress.
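The effect of the workers setting can be pictured with the following minimal sketch; the function names and return values are illustrative, not taken from the repository:

```python
from multiprocessing import Pool

def run_seeker(seeker_id):
    # Stand-in for per-seeker evaluation; the real scripts write one CSV per seeker.
    return seeker_id, "done"

def run_experiment(seeker_ids, multiprocessing_workers=0):
    # 0 workers: run sequentially (recommended for the first run, while
    # models may still be downloading); > 0: run in a process pool.
    if multiprocessing_workers == 0:
        return [run_seeker(s) for s in seeker_ids]
    with Pool(multiprocessing_workers) as pool:
        return pool.map(run_seeker, seeker_ids)
```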
The EvoEmo dataset is provided in this repository at data/evo_emo.json.
For external Python projects, data loading utilities are available in src/lib/shared/data_provider.
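For quick inspection outside the repository, a minimal standalone loader might look as follows, assuming the file is plain JSON; the repository's own utilities in src/lib/shared/data_provider should be preferred:

```python
import json
from pathlib import Path

def load_evo_emo(path="data/evo_emo.json"):
    # Load the EvoEmo dataset from its JSON file.
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```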
After completing the environment setup, experiments can be executed by running the corresponding scripts located in the exe directory. For example, to evaluate the question-answering task using the Mistral-8B model with full dialogue history, run:
python ./src/exe/qa/qa_mistral8b_full.py

By default, a directory named with the current datetime will be created under ./outputs/qa_mistral8b_full. The result for each seeker is saved as a CSV file in a sub-directory named with the seeker's id, and its contents are updated in real time to show the current progress. After the run, check exception.csv to see whether any exceptions occurred in each process; the final results of all seekers are merged into result.csv.
Besides common_configurations.py, every executable script contains its own Config class so that its configuration can be modified independently of other scripts. Its scope is limited to that script.
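Such a script-local Config class might look like the following sketch; the field names are illustrative, not taken from the repository:

```python
class Config:
    # Script-local settings; editing these does not affect other scripts.
    model = "mistralai/Ministral-8B-Instruct-2410"  # illustrative field
    output_root = "./outputs/qa_mistral8b_full"     # illustrative field
    multiprocessing_workers = 0                     # 0 = sequential first run
```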
The tables below list the executable scripts and their relationship to the tables in our manuscript.
Main QA results, reporting F1 Score (%) ↑, BERTScore (%) ↑, and LLM-as-Judge (0-2) ↑, each over the IE, TR, CD, Abs, UM, and All categories:

| Category | Model | Script |
|---|---|---|
| Base | Mistral-8B | exe/qa/qa_mistral8b_full.py |
| Base | Phi-3-Medium | exe/qa/qa_phi3_full.py |
| Base | Mistral-24B | exe/qa/qa_mistral24b_full.py |
| Base + RAG | Mistral-8B + RAG | exe/qa/qa_mistral8b_rag.py |
| Base + RAG | Phi-3-Medium + RAG | exe/qa/qa_phi3_rag.py |
| Base + RAG | Mistral-24B + RAG | exe/qa/qa_mistral24b_rag.py |
| Commercial | GPT-3.5-turbo (4K) | exe/qa/qa_gpt35turbo_full.py |
| Commercial | GPT-4o (16K) | exe/qa/qa_gpt4o_full.py |
| Commercial + RAG | GPT-3.5-turbo + RAG | exe/qa/qa_gpt35turbo_rag.py |
| Commercial + RAG | GPT-4o + RAG | exe/qa/qa_gpt4o_rag.py |
Retrieval granularity results, reporting Answer Prediction (F1 Score (%) ↑, BERTScore (%) ↑, LLM-as-Judge (0-2) ↑) and Retrieval Accuracy (R@k (%) ↑, NDCG@k (0-2) ↑):

| Retrieval Granularity | Top-k | Answer Prediction Script | Retrieval Accuracy Script |
|---|---|---|---|
| Turn-level | 10 | exe/qa/qa_mistral24b_rag_turn_10.py | exe/qa_retrieval/qa_retrieval_turn.py |
| Turn-level | 20 | exe/qa/qa_mistral24b_rag_turn_20.py | |
| Turn-level | 30 | exe/qa/qa_mistral24b_rag_turn_30.py | |
| Round-level | 5 | exe/qa/qa_mistral24b_rag_round_5.py | exe/qa_retrieval/qa_retrieval_round.py |
| Round-level | 10 | exe/qa/qa_mistral24b_rag_round_10.py | |
| Round-level | 15 | exe/qa/qa_mistral24b_rag_round_15.py | |
| Session-level | 2 | exe/qa/qa_mistral24b_rag_session_2.py | exe/qa_retrieval/qa_retrieval_session.py |
| Session-level | 4 | exe/qa/qa_mistral24b_rag.py | |
| Session-level | 8 | exe/qa/qa_mistral24b_rag_session_8.py | |
Context-length results, reporting F1 Score ↑, BERTScore ↑, and LLM-as-Judge ↑:

| Model | Context | Script |
|---|---|---|
| Mistral-8B | 2K | exe/qa/qa_mistral8b_full_2k.py |
| Mistral-8B | 4K | exe/qa/qa_mistral8b_full_4k.py |
| Mistral-8B | 8K | exe/qa/qa_mistral8b_full_8k.py |
| Mistral-8B | 20K | exe/qa/qa_mistral8b_full.py |
| Mistral-24B | 2K | exe/qa/qa_mistral24b_full_2k.py |
| Mistral-24B | 4K | exe/qa/qa_mistral24b_full_4k.py |
| Mistral-24B | 8K | exe/qa/qa_mistral24b_full_8k.py |
| Mistral-24B | 20K | exe/qa/qa_mistral24b_full.py |
Summarization results, reporting ROUGE (%) ↑ (ROUGE-1, ROUGE-2, ROUGE-L), Event-based Metrics (%) ↑ (Precision, Recall, F1), and LLM Score (0-5) ↑:

| Category | Model | Script |
|---|---|---|
| Base | Mistral-8B | exe/sum/sum_mistral8b_full.py |
| Base | Phi-3-Medium | exe/sum/sum_phi3_full.py |
| Base | Mistral-24B | exe/sum/sum_mistral24b_full.py |
| Base + RAG | Mistral-8B + RAG | exe/sum/sum_mistral8b_rag.py |
| Base + RAG | Phi-3-Medium + RAG | exe/sum/sum_phi3_rag.py |
| Base + RAG | Mistral-24B + RAG | exe/sum/sum_mistral24b_rag.py |
| Commercial | GPT-3.5-turbo (4K) | exe/sum/sum_gpt35turbo_full.py |
| Commercial | GPT-4o (16K) | exe/sum/sum_gpt4o_full.py |
| Commercial + RAG | GPT-3.5-turbo + RAG | exe/sum/sum_gpt35turbo_rag.py |
| Commercial + RAG | GPT-4o + RAG | exe/sum/sum_gpt4o_rag.py |
Dialogue generation results, reporting Recall ↑ and Weighted Score ↑:

| Memory Setting | Model | Script |
|---|---|---|
| No-Mem. | Mistral-8B | exe/dg/dg_mistral8b.py |
| No-Mem. | Phi-3-Medium | exe/dg/dg_phi3.py |
| No-Mem. | Mistral-24B | exe/dg/dg_mistral24b.py |
| No-Mem. | GPT-3.5-turbo | exe/dg/dg_gpt35turbo.py |
| No-Mem. | GPT-4o | exe/dg/dg_gpt4o.py |
| Full-Hist. | Mistral-8B | exe/dg/dg_mistral8b_full.py |
| Full-Hist. | Phi-3-Medium | exe/dg/dg_phi3_full.py |
| Full-Hist. | Mistral-24B | exe/dg/dg_mistral24b_full.py |
| Full-Hist. | GPT-3.5-turbo | exe/dg/dg_gpt35turbo_full.py |
| Full-Hist. | GPT-4o | exe/dg/dg_gpt4o_full.py |
| RAG | Mistral-8B | exe/dg/dg_mistral8b_rag.py |
| RAG | Phi-3-Medium | exe/dg/dg_phi3_rag.py |
| RAG | Mistral-24B | exe/dg/dg_mistral24b_rag.py |
| RAG | GPT-3.5-turbo | exe/dg/dg_gpt35turbo_rag.py |
| RAG | GPT-4o | exe/dg/dg_gpt4o_rag.py |
Dialogue generation results, reporting LT-Mem. ↑, Pers. ↑, and ES ↑ (produced by the same scripts as the previous table):

| Memory Setting | Model | Script |
|---|---|---|
| No-Mem. | Mistral-8B | exe/dg/dg_mistral8b.py |
| No-Mem. | Phi-3-Medium | exe/dg/dg_phi3.py |
| No-Mem. | Mistral-24B | exe/dg/dg_mistral24b.py |
| No-Mem. | GPT-3.5-turbo | exe/dg/dg_gpt35turbo.py |
| No-Mem. | GPT-4o | exe/dg/dg_gpt4o.py |
| Full-Hist. | Mistral-8B | exe/dg/dg_mistral8b_full.py |
| Full-Hist. | Phi-3-Medium | exe/dg/dg_phi3_full.py |
| Full-Hist. | Mistral-24B | exe/dg/dg_mistral24b_full.py |
| Full-Hist. | GPT-3.5-turbo | exe/dg/dg_gpt35turbo_full.py |
| Full-Hist. | GPT-4o | exe/dg/dg_gpt4o_full.py |
| RAG | Mistral-8B | exe/dg/dg_mistral8b_rag.py |
| RAG | Phi-3-Medium | exe/dg/dg_phi3_rag.py |
| RAG | Mistral-24B | exe/dg/dg_mistral24b_rag.py |
| RAG | GPT-3.5-turbo | exe/dg/dg_gpt35turbo_rag.py |
| RAG | GPT-4o | exe/dg/dg_gpt4o_rag.py |