[Question] Details on negative-duration silence for barge-in/interruption simulation in training data #67

@jungwon-choi

Due diligence

  • I have done my due diligence in trying to find the answer myself.

Topic

The paper

Question

Hi, thanks for the great work on PersonaPlex.

I'm studying the training data construction described in Section 3.2.2 of the paper, and I have a question about the following statement:

"When combining the 'user' and 'agent' dialogue turns, we can choose to add additional silence padding to simulate natural turn-taking. We observe that inserting negative-duration silence instead simulates barge-in and interruption. Prior work validates our methodology [12]."

The referenced prior work, SALM-Duplex [12], uses a fixed 0.64 s positive silence between turns and a cutoff-based approach (keeping 0.64 s of agent speech after the cutoff) to simulate barge-in. However, SALM-Duplex itself does not appear to use "negative-duration silence" (i.e., temporal overlap between speakers).
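
To make the comparison concrete, this is roughly how I read the cutoff-based approach. It is only a sketch of my understanding, not code from either paper: the function name `cut_agent_turn`, the 16 kHz sample rate, and the use of plain Python lists for audio are my own assumptions.

```python
def cut_agent_turn(agent_samples, barge_in_sec, keep_sec=0.64, sample_rate=16_000):
    """Truncate an agent turn `keep_sec` seconds after the user barges in.

    agent_samples: agent waveform as a list of floats (hypothetical format).
    barge_in_sec: time, relative to the start of the agent turn, at which
        the user starts speaking.
    keep_sec: how much agent speech survives after the cutoff (0.64 s is
        the value quoted above; everything else here is assumed).
    """
    cutoff = round((barge_in_sec + keep_sec) * sample_rate)
    # Slicing past the end simply returns the whole turn.
    return agent_samples[:max(0, cutoff)]


agent = [1.0] * (3 * 16_000)          # a 3-second agent turn
shortened = cut_agent_turn(agent, barge_in_sec=1.0)
print(len(shortened) / 16_000)        # duration in seconds after truncation
```

In this reading the agent turn is shortened, but the two speakers are never placed on the timeline with an explicit negative gap.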

I would appreciate clarification on the following:

  1. Overlap criteria: What range or distribution of negative silence durations was used when stitching user and agent turns? (e.g., uniform sampling from [-x, 0] seconds, or a fixed overlap duration?)

  2. Mixing positive and negative: Was the training data a mixture of both positive silence (gap) and negative silence (overlap) between turns, or was negative silence applied uniformly?
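
For reference, here is a minimal sketch of what I currently assume "negative-duration silence" means when stitching turns onto two channels. This is my guess, not the PersonaPlex pipeline: `stitch_turns`, `sample_gap`, the 16 kHz sample rate, and every number in `sample_gap` (mixture probability, ranges) are hypothetical placeholders for exactly the details I am asking about.

```python
import random

SAMPLE_RATE = 16_000  # Hz; assumed for illustration only


def sample_gap(rng, p_overlap=0.5, max_overlap=1.0, max_pause=1.0):
    """Hypothetical mixture: overlap with prob. p_overlap, otherwise a pause."""
    if rng.random() < p_overlap:
        return -rng.uniform(0.0, max_overlap)  # negative gap = overlap
    return rng.uniform(0.0, max_pause)         # positive gap = silence


def stitch_turns(turns, gaps_sec, sample_rate=SAMPLE_RATE):
    """Place alternating turns on a two-channel timeline.

    turns: non-empty list of (speaker, samples), speaker in {"user", "agent"}.
    gaps_sec: gap before each turn after the first; a negative value starts
        the turn before the previous one has ended.
    Returns (user_channel, agent_channel, start_indices).
    """
    starts = []
    cursor = 0  # sample index at which the previous turn ended
    for i, (_, samples) in enumerate(turns):
        gap = 0 if i == 0 else round(gaps_sec[i - 1] * sample_rate)
        start = max(0, cursor + gap)  # never start before t = 0
        starts.append(start)
        cursor = start + len(samples)

    total = max(s + len(x) for s, (_, x) in zip(starts, turns))
    channels = {"user": [0.0] * total, "agent": [0.0] * total}
    for start, (speaker, samples) in zip(starts, turns):
        channels[speaker][start:start + len(samples)] = samples
    return channels["user"], channels["agent"], starts


turns = [("user", [1.0] * SAMPLE_RATE), ("agent", [2.0] * SAMPLE_RATE)]
user, agent, starts = stitch_turns(turns, gaps_sec=[-0.25])
overlap = sum(1 for u, a in zip(user, agent) if u != 0.0 and a != 0.0)
print(starts, overlap / SAMPLE_RATE)  # start indices and overlap in seconds
```

If the actual procedure differs from this (for example, if overlap is combined with truncating the interrupted turn), that is the kind of detail I would like to understand.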

Any pointers to the data preparation pipeline or additional details would be very helpful.

Thanks!

Labels: question