Description
gpt-4.1-mini/quality/Dalk: Accuracy = 0.3936
gpt-4.1-mini/Popqa/RAPTOR: Acc = 0.0222, EM = 0.0000, F1 = 0.0189, P = 0.0143, R = 0.0556
gpt-4o/multihop-rag/RAPTOR: Acc = 0.5814, EM = 0.0012, F1 = 0.0263, P = 0.0146, R = 0.3602
gpt-4o/multihop-rag/Dalk: Acc = 0.6491, EM = 0.0814, F1 = 0.1258, P = 0.1080, R = 0.3347
gpt-4o/multihop-rag/HippoRAG: Acc = 0.6463, EM = 0.0000, F1 = 0.0213, P = 0.0111, R = 0.3550
gpt-4o/quality/RAPTOR: Accuracy = 0.4752
gpt-4-turbo/multihop-rag/default: Acc = 0.4730, EM = 0.1166, F1 = 0.1752, P = 0.1585, R = 0.2452
Here are some results I have. Acc and Recall are close to the values in the paper's table, but EM and Precision are much lower.
Do you have any suggestions for this issue? For example, do we need to limit the output length in the prompts, since the generated answers are currently redundant?
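
To illustrate why redundant answers could produce this pattern, here is a minimal sketch assuming the metrics are SQuAD-style token-overlap scores (I have not confirmed this matches the repo's evaluator). The names `normalize` and `token_scores` are hypothetical helpers written for this example, not functions from the codebase.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_scores(prediction, gold):
    """Token-level EM, precision, recall and F1 (assumed SQuAD-style scoring)."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both the prediction and the gold answer.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    em = float(pred_tokens == gold_tokens)
    if overlap == 0:
        return {"em": em, "precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)  # shrinks as the answer gets longer
    recall = overlap / len(gold_tokens)     # unaffected by extra tokens
    f1 = 2 * precision * recall / (precision + recall)
    return {"em": em, "precision": precision, "recall": recall, "f1": f1}


# Made-up example: same correct fact, stated briefly and verbosely.
gold = "Paris"
short = token_scores("Paris", gold)
verbose = token_scores(
    "The capital of France is Paris, which is located on the Seine.", gold
)
print("short:  ", short)
print("verbose:", verbose)
```

Under this assumption, the verbose answer keeps full recall but loses exact match and most of its precision, which is the same shape as the numbers above: Acc and R look reasonable while EM, P and F1 collapse.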