When using convert_result_to_excel to export results, the question, expected answer, model output, and error_reason columns are misaligned in the exported Excel file, so within a row the model output appears unrelated to its question and ground truth.
The root cause is that the code aligns the error information by line index instead of by sample id. The score file (*_score.json) stores the summary on its first line and then only the wrong samples afterward, so line index → dataset index is not a valid mapping. As a result, flag and error_reason are written to the wrong rows.
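For illustration, a hypothetical *_score.json for a 10-sample run with two wrong samples would look roughly like the following (the concrete values, and any fields beyond id and error_reason, are invented for this example):

```json
{"accuracy": 0.8, "correct_count": 8, "total_count": 10}
{"id": 3, "error_reason": "answer mismatch"}
{"id": 7, "error_reason": "answer mismatch"}
```

Walking this file with a line counter attaches the two error entries to the first two data rows of the Excel sheet rather than to the rows that hold samples 3 and 7.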
The evaluation metrics (accuracy, correct_count, total_count) are correct; the bug affects only the Excel visualization. A proper fix would be to build a mapping from id to row index in prompt_list and then, when reading score_file, use data["id"] to locate the correct row before setting flag and error_reason, as sketched below.
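A minimal sketch of that fix, assuming prompt_list is a list of dicts that carries the same id field the score file uses and that flag and error_reason are plain per-row columns (the helper name and surrounding structure are assumptions, not the project's actual code):

```python
import json

def apply_score_flags(prompt_list, score_file):
    """Mark wrong samples in prompt_list by sample id instead of by line index."""
    # Map each sample id to its row index in prompt_list.
    id_to_row = {sample["id"]: row for row, sample in enumerate(prompt_list)}

    with open(score_file, "r", encoding="utf-8") as f:
        lines = f.readlines()

    # The first line holds only the summary (accuracy, correct_count, total_count).
    for line in lines[1:]:
        data = json.loads(line)
        row = id_to_row.get(data["id"])
        if row is None:
            continue  # unknown id; leave the row untouched
        prompt_list[row]["flag"] = False
        prompt_list[row]["error_reason"] = data.get("error_reason", "")
```

With an id-based lookup like this, correct samples keep their default flag and an empty error_reason, and each wrong sample lands on the row that actually contains its question and expected answer.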