This repository contains the dataset and code for our paper: "ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support".
To execute the code in this repository, please install the required dependencies using Anaconda with the following commands:
conda create -n ES_MemEval python=3.13.5
conda activate ES_MemEval
pip install -r requirements.txt

The experiments were conducted on the following environments and have been verified to run successfully on various configurations:

- Operating System: Ubuntu 18.04.5 LTS (also tested on Ubuntu 24.04.3 LTS)
- GPU: NVIDIA TITAN RTX / GeForce RTX 3080 / A100
- CUDA Version: 12.4 (also verified with CUDA 13.0)
The PYTHONPATH environment variable should be configured to point to the src directory to ensure correct module imports:

export PYTHONPATH="./src" # Replace with an absolute path unless the working directory is fixed to the repository root. (In VSCode, this is configured automatically.)

All project-level configuration is defined in src/exe/common_configurations.py, which specifies key parameters such as API endpoints for large language models (LLMs), data and output paths, and other global settings.
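For example, when working from the repository root, the relative path can be resolved to an absolute one before exporting:

```shell
# Resolve ./src to an absolute path so imports work from any working directory.
export PYTHONPATH="$(pwd)/src"
```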
By default, the OpenAI API key is loaded from ./secrets/open_ai_api_key.txt (a relative path). To modify this behavior, locate and update the following expression in common_configurations.py:

_csfile.read_all_text("./secrets/open_ai_api_key.txt")

Likewise, the default configuration defines the other LLM endpoints as locally hosted on _vllm_host, exposed through OpenAI-compatible APIs on ports 8911-8913.
These parameters can be modified in common_configurations.py if alternative hosts or ports are required.
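As an illustration of how such an override might look, the key-loading expression could be replaced with a helper that prefers an environment variable and falls back to the key file. This helper is not part of the repository; it is a hedged sketch assuming _csfile.read_all_text is a plain file read:

```python
import os
from pathlib import Path

def load_openai_api_key(key_file="./secrets/open_ai_api_key.txt"):
    # Hypothetical replacement for the key-loading expression: prefer the
    # OPENAI_API_KEY environment variable, then fall back to the key file.
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if key:
        return key
    return Path(key_file).read_text(encoding="utf-8").strip()
```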
To deploy equivalent services locally, the models can be launched using vLLM as illustrated below:
# Create a vLLM environment
conda create -n ES_MemEval_vLLM python=3.12
conda activate ES_MemEval_vLLM
pip install uv
uv pip install vllm --torch-backend=auto
# Some models may require the signing of agreements
hf auth login
# Download and serve the models
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --port 8911
vllm serve mistralai/Ministral-8B-Instruct-2410 --port 8912
vllm serve microsoft/Phi-3-medium-128k-instruct --port 8913

All experimental scripts support multiprocessing. Setting multiprocessing_workers to a value greater than zero enables parallel execution. However, model downloads may occur during the first run; it is therefore recommended to run with multiprocessing_workers = 0 first to verify correctness, and then restart the experiment with multiprocessing enabled.
To avoid console clutter caused by simultaneous process outputs, the standard output is suppressed during execution. Only summary messages are displayed before and after the experiment. Some models may print progress logs during their initial download phase, which can temporarily dominate the console output. If the model download has completed but no new logs appear, the experiment is likely already in progress.
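The effect of the workers setting can be pictured with the following minimal sketch; the function names and return values are illustrative, not taken from the repository:

```python
from multiprocessing import Pool

def run_seeker(seeker_id):
    # Stand-in for per-seeker evaluation; the real scripts write one CSV per seeker.
    return seeker_id, "done"

def run_experiment(seeker_ids, multiprocessing_workers=0):
    # 0 workers: run sequentially (recommended for the first run, while
    # models may still be downloading); > 0: run in a process pool.
    if multiprocessing_workers == 0:
        return [run_seeker(s) for s in seeker_ids]
    with Pool(multiprocessing_workers) as pool:
        return pool.map(run_seeker, seeker_ids)
```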
The EvoEmo dataset is provided in this repository at data/evo_emo.json.
For external Python projects, data loading utilities are available in src/lib/shared/data_provider.
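For quick inspection outside the repository, a minimal standalone loader might look as follows, assuming the file is plain JSON; the repository's own utilities in src/lib/shared/data_provider should be preferred:

```python
import json
from pathlib import Path

def load_evo_emo(path="data/evo_emo.json"):
    # Load the EvoEmo dataset from its JSON file.
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```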
After completing the environment setup, experiments can be executed by running the corresponding scripts located in the exe directory. For example, to evaluate the question-answering task using the Mistral-8B model with full dialogue history, run:
python ./src/exe/qa/qa_mistral8b_full.py

By default, a directory named with the current datetime will be created under ./outputs/qa_mistral8b_full. The result for each seeker is saved as a CSV file in a sub-directory named with the seeker's id, and its contents are updated in real time to show the current progress. After the run, check exception.csv to see whether any exceptions occurred in each process; the final results of all seekers are merged into result.csv.
Besides common_configurations.py, every executable script contains its own Config class so that its configuration can be modified independently of other scripts. Its scope is limited to that script.
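Such a script-local Config class might look like the following sketch; the field names are illustrative, not taken from the repository:

```python
class Config:
    # Script-local settings; editing these does not affect other scripts.
    model = "mistralai/Ministral-8B-Instruct-2410"  # illustrative field
    output_root = "./outputs/qa_mistral8b_full"     # illustrative field
    multiprocessing_workers = 0                     # 0 = sequential first run
```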
The tables below list the executable scripts and their relationship to the tables in our manuscript.
Main QA results, reporting F1 Score (%) ↑, BERTScore (%) ↑, and LLM-as-Judge (0-2) ↑, each over the IE, TR, CD, Abs, UM, and All categories:

| Category | Model | Script |
|---|---|---|
| Base | Mistral-8B | exe/qa/qa_mistral8b_full.py |
| Base | Phi-3-Medium | exe/qa/qa_phi3_full.py |
| Base | Mistral-24B | exe/qa/qa_mistral24b_full.py |
| Base + RAG | Mistral-8B + RAG | exe/qa/qa_mistral8b_rag.py |
| Base + RAG | Phi-3-Medium + RAG | exe/qa/qa_phi3_rag.py |
| Base + RAG | Mistral-24B + RAG | exe/qa/qa_mistral24b_rag.py |
| Commercial | GPT-3.5-turbo (4K) | exe/qa/qa_gpt35turbo_full.py |
| Commercial | GPT-4o (16K) | exe/qa/qa_gpt4o_full.py |
| Commercial + RAG | GPT-3.5-turbo + RAG | exe/qa/qa_gpt35turbo_rag.py |
| Commercial + RAG | GPT-4o + RAG | exe/qa/qa_gpt4o_rag.py |
Retrieval granularity results, reporting Answer Prediction (F1 Score (%) ↑, BERTScore (%) ↑, LLM-as-Judge (0-2) ↑) and Retrieval Accuracy (R@k (%) ↑, NDCG@k (0-2) ↑):

| Retrieval Granularity | Top-k | Answer Prediction Script | Retrieval Accuracy Script |
|---|---|---|---|
| Turn-level | 10 | exe/qa/qa_mistral24b_rag_turn_10.py | exe/qa_retrieval/qa_retrieval_turn.py |
| Turn-level | 20 | exe/qa/qa_mistral24b_rag_turn_20.py | |
| Turn-level | 30 | exe/qa/qa_mistral24b_rag_turn_30.py | |
| Round-level | 5 | exe/qa/qa_mistral24b_rag_round_5.py | exe/qa_retrieval/qa_retrieval_round.py |
| Round-level | 10 | exe/qa/qa_mistral24b_rag_round_10.py | |
| Round-level | 15 | exe/qa/qa_mistral24b_rag_round_15.py | |
| Session-level | 2 | exe/qa/qa_mistral24b_rag_session_2.py | exe/qa_retrieval/qa_retrieval_session.py |
| Session-level | 4 | exe/qa/qa_mistral24b_rag.py | |
| Session-level | 8 | exe/qa/qa_mistral24b_rag_session_8.py | |
Context-length results, reporting F1 Score ↑, BERTScore ↑, and LLM-as-Judge ↑:

| Model | Context | Script |
|---|---|---|
| Mistral-8B | 2K | exe/qa/qa_mistral8b_full_2k.py |
| Mistral-8B | 4K | exe/qa/qa_mistral8b_full_4k.py |
| Mistral-8B | 8K | exe/qa/qa_mistral8b_full_8k.py |
| Mistral-8B | 20K | exe/qa/qa_mistral8b_full.py |
| Mistral-24B | 2K | exe/qa/qa_mistral24b_full_2k.py |
| Mistral-24B | 4K | exe/qa/qa_mistral24b_full_4k.py |
| Mistral-24B | 8K | exe/qa/qa_mistral24b_full_8k.py |
| Mistral-24B | 20K | exe/qa/qa_mistral24b_full.py |
Summarization results, reporting ROUGE (%) ↑ (ROUGE-1, ROUGE-2, ROUGE-L), Event-based Metrics (%) ↑ (Precision, Recall, F1), and LLM Score (0-5) ↑:

| Category | Model | Script |
|---|---|---|
| Base | Mistral-8B | exe/sum/sum_mistral8b_full.py |
| Base | Phi-3-Medium | exe/sum/sum_phi3_full.py |
| Base | Mistral-24B | exe/sum/sum_mistral24b_full.py |
| Base + RAG | Mistral-8B + RAG | exe/sum/sum_mistral8b_rag.py |
| Base + RAG | Phi-3-Medium + RAG | exe/sum/sum_phi3_rag.py |
| Base + RAG | Mistral-24B + RAG | exe/sum/sum_mistral24b_rag.py |
| Commercial | GPT-3.5-turbo (4K) | exe/sum/sum_gpt35turbo_full.py |
| Commercial | GPT-4o (16K) | exe/sum/sum_gpt4o_full.py |
| Commercial + RAG | GPT-3.5-turbo + RAG | exe/sum/sum_gpt35turbo_rag.py |
| Commercial + RAG | GPT-4o + RAG | exe/sum/sum_gpt4o_rag.py |
Dialogue generation results, reporting Recall ↑ and Weighted Score ↑:

| Memory Setting | Model | Script |
|---|---|---|
| No-Mem. | Mistral-8B | exe/dg/dg_mistral8b.py |
| No-Mem. | Phi-3-Medium | exe/dg/dg_phi3.py |
| No-Mem. | Mistral-24B | exe/dg/dg_mistral24b.py |
| No-Mem. | GPT-3.5-turbo | exe/dg/dg_gpt35turbo.py |
| No-Mem. | GPT-4o | exe/dg/dg_gpt4o.py |
| Full-Hist. | Mistral-8B | exe/dg/dg_mistral8b_full.py |
| Full-Hist. | Phi-3-Medium | exe/dg/dg_phi3_full.py |
| Full-Hist. | Mistral-24B | exe/dg/dg_mistral24b_full.py |
| Full-Hist. | GPT-3.5-turbo | exe/dg/dg_gpt35turbo_full.py |
| Full-Hist. | GPT-4o | exe/dg/dg_gpt4o_full.py |
| RAG | Mistral-8B | exe/dg/dg_mistral8b_rag.py |
| RAG | Phi-3-Medium | exe/dg/dg_phi3_rag.py |
| RAG | Mistral-24B | exe/dg/dg_mistral24b_rag.py |
| RAG | GPT-3.5-turbo | exe/dg/dg_gpt35turbo_rag.py |
| RAG | GPT-4o | exe/dg/dg_gpt4o_rag.py |
Dialogue generation results, reporting LT-Mem. ↑, Pers. ↑, and ES ↑ (produced by the same scripts as the previous table):

| Memory Setting | Model | Script |
|---|---|---|
| No-Mem. | Mistral-8B | exe/dg/dg_mistral8b.py |
| No-Mem. | Phi-3-Medium | exe/dg/dg_phi3.py |
| No-Mem. | Mistral-24B | exe/dg/dg_mistral24b.py |
| No-Mem. | GPT-3.5-turbo | exe/dg/dg_gpt35turbo.py |
| No-Mem. | GPT-4o | exe/dg/dg_gpt4o.py |
| Full-Hist. | Mistral-8B | exe/dg/dg_mistral8b_full.py |
| Full-Hist. | Phi-3-Medium | exe/dg/dg_phi3_full.py |
| Full-Hist. | Mistral-24B | exe/dg/dg_mistral24b_full.py |
| Full-Hist. | GPT-3.5-turbo | exe/dg/dg_gpt35turbo_full.py |
| Full-Hist. | GPT-4o | exe/dg/dg_gpt4o_full.py |
| RAG | Mistral-8B | exe/dg/dg_mistral8b_rag.py |
| RAG | Phi-3-Medium | exe/dg/dg_phi3_rag.py |
| RAG | Mistral-24B | exe/dg/dg_mistral24b_rag.py |
| RAG | GPT-3.5-turbo | exe/dg/dg_gpt35turbo_rag.py |
| RAG | GPT-4o | exe/dg/dg_gpt4o_rag.py |