This repository contains the code and dataset for the paper "Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models", presented at EMNLP 2024.
- Requirements
- Perception-Augmented Theory of Mind Datasets
- Evaluating Language Models on Perception-Augmented ToM Datasets
This project has been tested with Python 3.10.12. To install the required Python packages, run:
```bash
pip install -r requirements.txt
```

- File: `dataset/Percept-ToMi.csv`
| Column | Description |
|---|---|
| `index` | Index of each story |
| `qTypeRaw` | Type of the question paired with the story (from the original annotation of the ToMi dataset) |
| `qType` | Type of the question paired with the story (with '0' and '1' removed from `qTypeRaw`) |
| `story` | Text of the story |
| `question` | Text of the question |
| `char1` | The character whose first-order belief the question is asking about |
| `char2` | The character whose second-order belief the question is asking about |
| `perceivers` | Perceivers of each scene in the story |
| `story_with_perceivers` | Scene descriptions in the story paired with their corresponding perceivers, formatted as JSON to provide ground-truth perception information to the language model |
| `persp_ctx` | Perspective context corresponding to the story and the question |
| `cands` | Candidate answer choices |
| `answer` | Answer to the question |
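For a quick look at the data, the CSV can be loaded and one annotated story inspected. This is a minimal sketch assuming pandas is installed; the exact JSON layout of `story_with_perceivers` is an assumption, so adjust the access pattern to the actual format.

```python
import json
import pandas as pd

# Load the Percept-ToMi dataset
df = pd.read_csv("dataset/Percept-ToMi.csv")

# Overview of question types
print(df["qType"].value_counts())

# Inspect one story together with its question, answer candidates, and
# ground-truth perceivers. The exact JSON structure of
# `story_with_perceivers` is an assumption here.
row = df.iloc[0]
print(row["story"])
print(row["question"], row["cands"], row["answer"])
scenes = json.loads(row["story_with_perceivers"])
print(scenes)
```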
- File: `dataset/Percept_FANToM/Percept-FANToM.csv`
| Column | Description |
|---|---|
| `part_id` | ID of each conversation part in FANToM |
| `utterance` | A single utterance in the conversation part |
| `utterance_symbol_removed` | The utterance with symbols removed |
| `perceivers` | Perceivers of the utterance |
- File: `dataset/Percept_FANToM/Percept-FANToM-conv_with_perceivers.csv`
| Column | Description |
|---|---|
| `part_id` | ID of each conversation part in FANToM |
| `conversation_with_perceivers` | Utterances in the conversation part paired with their corresponding perceivers, formatted as JSON to provide ground-truth perception information to the language model |
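Since both files share the `part_id` key, they can be joined to view utterance-level and part-level annotations together. This is a minimal sketch assuming pandas; the exact JSON layout of `conversation_with_perceivers` is an assumption.

```python
import json
import pandas as pd

# Utterance-level annotations and part-level JSON-formatted conversations
utterances = pd.read_csv("dataset/Percept_FANToM/Percept-FANToM.csv")
conversations = pd.read_csv("dataset/Percept_FANToM/Percept-FANToM-conv_with_perceivers.csv")

# Join the two files on the shared `part_id` key
merged = utterances.merge(conversations, on="part_id", how="left")

# Inspect one conversation part and its ground-truth perceivers.
# The JSON structure of `conversation_with_perceivers` is an assumption here.
part = merged[merged["part_id"] == merged["part_id"].iloc[0]]
print(part[["utterance", "perceivers"]])
print(json.loads(part["conversation_with_perceivers"].iloc[0]))
```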
The experimental implementations for Percept-ToMi and Percept-FANToM are based on the implementations of SymbolicToM and FANToM, respectively. We acknowledge and appreciate the work of the authors of these papers, which served as the foundation for our extensions.
The implemented methods include the following:
- `vanilla`: Vanilla language model
- `cot`: Chain of Thought
- `s2a`: System 2 Attention
- `perc_inf`: Perception inference
- `perceptom`: PercepToM
- `ptob_inf`: Perception-to-belief inference
- `simtom`: SimToM (implemented only for Percept-ToMi)
We acknowledge that the evaluation results for Percept-ToMi exhibit variability across runs. Therefore, we recommend conducting multiple evaluations and using the average performance as a more reliable measure.
Execute the Percept-ToMi evaluation script with the desired model and method:
```bash
python eval_percept_tomi.py --model {model_name} --method {method_name}
```

Result files are saved in `results/Percept-ToMi/{model_name}-{method_name}/`:
- `acc.csv`: Question answering accuracies for the four qTypes: first order / second order, true belief (no ToM) / false belief (ToM)
- `acc_raw.csv`: Question answering accuracies for the eight qTypeRaws
- `result.csv`: Evaluated results for each datapoint, including prompts, model responses, and correctness
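Because Percept-ToMi results vary across runs, averaging `acc.csv` over several runs gives a more reliable number. This is a minimal sketch assuming pandas, that each run's output was copied into a hypothetical `run_*/` subdirectory (repeated runs otherwise overwrite the same files), and that `acc.csv` is a small table of accuracies; adjust parsing to the actual format.

```python
import glob
import pandas as pd

# Hypothetical layout: results from repeated runs copied into run_*/ subdirectories,
# e.g. results/Percept-ToMi/{model_name}-{method_name}/run_1/acc.csv, run_2/acc.csv, ...
# Replace the placeholders with the actual model and method names.
paths = glob.glob("results/Percept-ToMi/{model_name}-{method_name}/run_*/acc.csv")

# Average the per-qType accuracies across runs (index_col=0 is an assumption
# about the acc.csv layout).
runs = [pd.read_csv(p, index_col=0) for p in paths]
mean_acc = pd.concat(runs).groupby(level=0).mean(numeric_only=True)
print(mean_acc)
```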
Notes:
- For all methods except perception inference (`perc_inf`), running this script performs language model inference and evaluation, producing all of the evaluation results.
- The perception inference method (`perc_inf`) does not perform question answering. Running the script with this method only performs language model inference and produces `result.csv`.
To evaluate perception inference results, run the following script:
```bash
python eval_percept_tomi-perc_inf.py --model {model_name} --method {method_name}
```

Result files are saved in `results/Percept-ToMi/{model_name}-{method_name}/` (the same directory as step 1):
- `scene_level_eval.csv`: Model predictions and ground truth of scene-level perceivers
- `ctx_level_eval.csv`: Model predictions and ground truth of context (story)-level perceivers
- `perc_inf_qType_acc.csv`: Perception inference accuracies for the four qTypes: first order / second order, true belief (no ToM) / false belief (ToM)
- `perc_inf_scenario_acc.csv`: Perception inference accuracies for the two scenarios: true belief (no ToM) / false belief (ToM)
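Beyond the aggregate accuracies, the scene-level file can be inspected directly for error analysis. This is a minimal sketch assuming pandas; the column names for predicted and ground-truth perceivers are hypothetical, so match them to the actual header of `scene_level_eval.csv`.

```python
import pandas as pd

# Replace the placeholders with the evaluated model and method names
eval_df = pd.read_csv("results/Percept-ToMi/{model_name}-{method_name}/scene_level_eval.csv")

print(eval_df.columns.tolist())  # check the actual column names first

# Hypothetical column names for predicted vs. ground-truth perceivers
mismatches = eval_df[eval_df["predicted_perceivers"] != eval_df["gt_perceivers"]]
print(f"{len(mismatches)} / {len(eval_df)} scenes with incorrect perceiver predictions")
print(mismatches.head())
```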
Execute the Percept-FANToM evaluation script with the desired model and method:
```bash
python eval_percept_fantom.py --model {model_name} --method {method_name}
```

Result files are saved in `results/Percept-FANToM/{model_name}-{method_name}/`:
- `REPORT_short_input_{model_name}.json`: Question answering accuracies for the different FANToM question types, for contexts from false belief scenarios
- `control_task_report_short_input_{model_name}.json`: Question answering accuracies for the different FANToM question types, for contexts from true belief scenarios
- `evaluated_responses_short_input_{model_name}.json/csv`: Evaluated results for each datapoint, including prompts, model responses, and correctness
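To compare false belief and true belief scenarios side by side, the two reports can be loaded directly. This is a minimal sketch; the model and method names are hypothetical, and the keys inside the JSON reports are assumptions, so adjust the access pattern to the actual structure.

```python
import json

model_name = "gpt-4"  # hypothetical model name; use the one you evaluated
base = f"results/Percept-FANToM/{model_name}-vanilla"  # hypothetical method: vanilla

# False belief (ToM) report and true belief (control) report
with open(f"{base}/REPORT_short_input_{model_name}.json") as f:
    fb_report = json.load(f)
with open(f"{base}/control_task_report_short_input_{model_name}.json") as f:
    tb_report = json.load(f)

# Print each metric from both reports for a side-by-side comparison
# (the reports are assumed to map question types to accuracies).
for key in fb_report:
    print(key, fb_report[key], tb_report.get(key))
```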
Notes:
- For all methods except perception inference (`perc_inf`), running this script performs language model inference and evaluation, producing all of the evaluation results.
- The perception inference method (`perc_inf`) does not include the question answering step. Running the script with this method only performs language model inference and produces `evaluated_responses_short_input_{model_name}.json/csv`.
To evaluate perception inference results, run the following script:
```bash
python eval_percept_fantom-perc_inf.py --model {model_name} --method {method_name}
```

Result files are saved in `results/Percept-FANToM/{model_name}-{method_name}/`:
- `utter_level_eval.csv`: Model predictions and ground truth of utterance-level perceivers
- `ctx_level_eval.csv`: Model predictions and ground truth of context (part of conversation)-level perceivers
- `perc_inf_acc.csv`: Perception inference accuracies for the two scenarios: true belief (no ToM) / false belief (ToM)