Excessive generation length of correct/incorrect answers, and ASR evaluation criteria #25

@KennyCaty

Description

Hi, the attack method in your work is very interesting, but I've noticed that when I run gen_adv.py from scratch, the generated correct and incorrect answers are quite long. This appears inconsistent with the results you provided in adv_targeted_results. Could you clarify whether you applied manual review or additional prompt constraints?

The LLM I am using is gpt-4o-mini, but I saw in the issues that someone using gpt-4 ran into the same problem.

In addition, the correct-answer generation step in gen_adv.py calls the LLM twice (once with the query alone, once with the ground-truth document included) and compares the two answers via string matching. Because of the length issue above, a large fraction of queries fail the match and are skipped.
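To illustrate what I mean, here is a minimal sketch (not the repo's actual code; all names below are hypothetical) of constraining answer length via the prompt and relaxing exact string equality to normalized containment, which should let fewer queries be skipped:

```python
# Hypothetical sketch, NOT code from gen_adv.py: a prompt suffix that
# asks for short answers, plus a tolerant comparison of the two answers.

# Could be appended to both queries to keep answers short.
SHORT_ANSWER_SUFFIX = (
    "Answer with the shortest possible span (a few words), "
    "with no explanation or punctuation."
)

def normalize(ans: str) -> str:
    """Lowercase and drop punctuation before comparing."""
    return "".join(c for c in ans.lower() if c.isalnum() or c.isspace()).strip()

def answers_match(direct_answer: str, grounded_answer: str) -> bool:
    """Treat the pair as consistent if one normalized answer contains
    the other, so extra verbiage does not cause a spurious mismatch."""
    a, b = normalize(direct_answer), normalize(grounded_answer)
    return a in b or b in a
```

With this, a pair like "Paris." vs. "The answer is Paris" would still count as a match instead of being skipped.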

The overly long generated answers also cause the string-matching-based ASR evaluation to fail, resulting in a significant drop in the ASR score.
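For comparison, a sketch of a more tolerant ASR check (again hypothetical, not the repo's evaluation code) that tests for the target answer as a normalized substring of the model output, so verbose outputs are not penalized for extra text:

```python
# Hypothetical sketch, NOT the repo's ASR evaluation: count an attack as
# successful if the normalized target answer appears anywhere in the
# normalized model output, rather than requiring an exact string match.

def attack_success(model_output: str, target_answer: str) -> bool:
    def norm(s: str) -> str:
        # Lowercase, strip punctuation, collapse whitespace.
        kept = "".join(c for c in s.lower() if c.isalnum() or c.isspace())
        return " ".join(kept.split())
    return norm(target_answer) in norm(model_output)
```

Under exact matching, a long output that contains the target answer would be scored as a failure; under this check it would not.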

Could you please provide suggestions for reproducing the experiment and the results you reported?
