This repository contains the code and dataset for the paper "Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models", presented at EMNLP 2024.
- Requirements
- Perception-Augmented Theory of Mind Datasets
- Evaluating Language Models on Perception-Augmented ToM Datasets
This project has been tested with Python 3.10.12. To install the required Python packages, run:
```bash
pip install -r requirements.txt
```

- File: `dataset/Percept-ToMi.csv`
| Column | Description |
|---|---|
| `index` | Index of each story |
| `qTypeRaw` | Type of the question paired with the story (from the original annotation of the ToMi dataset) |
| `qType` | Type of the question paired with the story (with '0' and '1' removed from `qTypeRaw`) |
| `story` | Text of the story |
| `question` | Text of the question |
| `char1` | The character whose first-order belief the question is asking about |
| `char2` | The character whose second-order belief the question is asking about |
| `perceivers` | Perceivers of each scene in the story |
| `story_with_perceivers` | Scene descriptions in the story paired with their corresponding perceivers, formatted as JSON to provide ground-truth perception information to the language model |
| `persp_ctx` | Perspective context corresponding to the story and the question |
| `cands` | Candidate answer choices |
| `answer` | Answer to the question |
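For a quick look at the data, the CSV can be loaded and one annotated story inspected. This is a minimal sketch assuming pandas is installed; the exact JSON layout of `story_with_perceivers` is an assumption, so adjust the access pattern to the actual format.

```python
import json
import pandas as pd

# Load the Percept-ToMi dataset
df = pd.read_csv("dataset/Percept-ToMi.csv")

# Overview of question types
print(df["qType"].value_counts())

# Inspect one story together with its question, answer candidates, and
# ground-truth perceivers. The exact JSON structure of
# `story_with_perceivers` is an assumption here.
row = df.iloc[0]
print(row["story"])
print(row["question"], row["cands"], row["answer"])
scenes = json.loads(row["story_with_perceivers"])
print(scenes)
```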
- File: `dataset/Percept_FANToM/Percept-FANToM.csv`
| Column | Description |
|---|---|
| `part_id` | ID of each conversation part in FANToM |
| `utterance` | A single utterance in the conversation part |
| `utterance_symbol_removed` | The utterance with symbols removed |
| `perceivers` | Perceivers of the utterance |
- File: `dataset/Percept_FANToM/Percept-FANToM-conv_with_perceivers.csv`
| Column | Description |
|---|---|
| `part_id` | ID of each conversation part in FANToM |
| `conversation_with_perceivers` | Utterances in the conversation part paired with their corresponding perceivers, formatted as JSON to provide ground-truth perception information to the language model |
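Since both files share the `part_id` key, they can be joined to view utterance-level and part-level annotations together. This is a minimal sketch assuming pandas; the exact JSON layout of `conversation_with_perceivers` is an assumption.

```python
import json
import pandas as pd

# Utterance-level annotations and part-level JSON-formatted conversations
utterances = pd.read_csv("dataset/Percept_FANToM/Percept-FANToM.csv")
conversations = pd.read_csv("dataset/Percept_FANToM/Percept-FANToM-conv_with_perceivers.csv")

# Join the two files on the shared `part_id` key
merged = utterances.merge(conversations, on="part_id", how="left")

# Inspect one conversation part and its ground-truth perceivers.
# The JSON structure of `conversation_with_perceivers` is an assumption here.
part = merged[merged["part_id"] == merged["part_id"].iloc[0]]
print(part[["utterance", "perceivers"]])
print(json.loads(part["conversation_with_perceivers"].iloc[0]))
```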
The experimental implementations for Percept-ToMi and Percept-FANToM are based on the implementations of SymbolicToM and FANToM, respectively. We acknowledge and appreciate the work of the authors of these papers, which served as the foundation for our extensions.
The implemented methods include the following:
- `vanilla`: Vanilla language model
- `cot`: Chain of Thought
- `s2a`: System 2 Attention
- `perc_inf`: Perception inference
- `perceptom`: PercepToM
- `ptob_inf`: Perception-to-belief inference
- `simtom`: SimToM (implemented only for Percept-ToMi)
We acknowledge that the evaluation results for Percept-ToMi exhibit variability across runs. Therefore, we recommend conducting multiple evaluations and using the average performance as a more reliable measure.
Execute the Percept-ToMi evaluation script with the desired model and method:
```bash
python eval_percept_tomi.py --model {model_name} --method {method_name}
```

Result files are saved in `results/Percept-ToMi/{model_name}-{method_name}/`:
- `acc.csv`: Question answering accuracies for the four qTypes: first order / second order, true belief (no ToM) / false belief (ToM)
- `acc_raw.csv`: Question answering accuracies for the eight qTypeRaws
- `result.csv`: Evaluated results for each datapoint, including prompts, model responses, and correctness
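Because Percept-ToMi results vary across runs, averaging `acc.csv` over several runs gives a more reliable number. This is a minimal sketch assuming pandas, that each run's output was copied into a hypothetical `run_*/` subdirectory (repeated runs otherwise overwrite the same files), and that `acc.csv` is a small table of accuracies; adjust parsing to the actual format.

```python
import glob
import pandas as pd

# Hypothetical layout: results from repeated runs copied into run_*/ subdirectories,
# e.g. results/Percept-ToMi/{model_name}-{method_name}/run_1/acc.csv, run_2/acc.csv, ...
# Replace the placeholders with the actual model and method names.
paths = glob.glob("results/Percept-ToMi/{model_name}-{method_name}/run_*/acc.csv")

# Average the per-qType accuracies across runs (index_col=0 is an assumption
# about the acc.csv layout).
runs = [pd.read_csv(p, index_col=0) for p in paths]
mean_acc = pd.concat(runs).groupby(level=0).mean(numeric_only=True)
print(mean_acc)
```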
Notes:
- For all methods except perception inference (`perc_inf`), running this script performs language model inference and evaluation, producing all of the evaluation results.
- The perception inference method (`perc_inf`) does not perform question answering. Running the script with this method only performs language model inference and produces `result.csv`.
To evaluate perception inference results, run the following script:
```bash
python eval_percept_tomi-perc_inf.py --model {model_name} --method {method_name}
```

Result files are saved in `results/Percept-ToMi/{model_name}-{method_name}/` (the same directory as step 1):
- `scene_level_eval.csv`: Model predictions and ground truth of scene-level perceivers
- `ctx_level_eval.csv`: Model predictions and ground truth of context (story)-level perceivers
- `perc_inf_qType_acc.csv`: Perception inference accuracies for the four qTypes: first order / second order, true belief (no ToM) / false belief (ToM)
- `perc_inf_scenario_acc.csv`: Perception inference accuracies for the two scenarios: true belief (no ToM) / false belief (ToM)
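Beyond the aggregate accuracies, the scene-level file can be inspected directly for error analysis. This is a minimal sketch assuming pandas; the column names for predicted and ground-truth perceivers are hypothetical, so match them to the actual header of `scene_level_eval.csv`.

```python
import pandas as pd

# Replace the placeholders with the evaluated model and method names
eval_df = pd.read_csv("results/Percept-ToMi/{model_name}-{method_name}/scene_level_eval.csv")

print(eval_df.columns.tolist())  # check the actual column names first

# Hypothetical column names for predicted vs. ground-truth perceivers
mismatches = eval_df[eval_df["predicted_perceivers"] != eval_df["gt_perceivers"]]
print(f"{len(mismatches)} / {len(eval_df)} scenes with incorrect perceiver predictions")
print(mismatches.head())
```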
Execute the Percept-FANToM evaluation script with the desired model and method:
```bash
python eval_percept_fantom.py --model {model_name} --method {method_name}
```

Result files are saved in `results/Percept-FANToM/{model_name}-{method_name}/`:
- `REPORT_short_input_{model_name}.json`: Question answering accuracies for the different FANToM question types, for contexts from false belief scenarios
- `control_task_report_short_input_{model_name}.json`: Question answering accuracies for the different FANToM question types, for contexts from true belief scenarios
- `evaluated_responses_short_input_{model_name}.json/csv`: Evaluated results for each datapoint, including prompts, model responses, and correctness
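To compare false belief and true belief scenarios side by side, the two reports can be loaded directly. This is a minimal sketch; the model and method names are hypothetical, and the keys inside the JSON reports are assumptions, so adjust the access pattern to the actual structure.

```python
import json

model_name = "gpt-4"  # hypothetical model name; use the one you evaluated
base = f"results/Percept-FANToM/{model_name}-vanilla"  # hypothetical method: vanilla

# False belief (ToM) report and true belief (control) report
with open(f"{base}/REPORT_short_input_{model_name}.json") as f:
    fb_report = json.load(f)
with open(f"{base}/control_task_report_short_input_{model_name}.json") as f:
    tb_report = json.load(f)

# Print each metric from both reports for a side-by-side comparison
# (the reports are assumed to map question types to accuracies).
for key in fb_report:
    print(key, fb_report[key], tb_report.get(key))
```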
Notes:
- For all methods except perception inference (`perc_inf`), running this script performs language model inference and evaluation, producing all of the evaluation results.
- The perception inference method (`perc_inf`) does not include the question answering step. Running the script with this method only performs language model inference and produces `evaluated_responses_short_input_{model_name}.json/csv`.
To evaluate perception inference results, run the following script:
```bash
python eval_percept_fantom-perc_inf.py --model {model_name} --method {method_name}
```

Result files are saved in `results/Percept-FANToM/{model_name}-{method_name}/`:
- `utter_level_eval.csv`: Model predictions and ground truth of utterance-level perceivers
- `ctx_level_eval.csv`: Model predictions and ground truth of context (part of conversation)-level perceivers
- `perc_inf_acc.csv`: Perception inference accuracies for the two scenarios: true belief (no ToM) / false belief (ToM)