[Question] Details on negative-duration silence for barge-in/interruption simulation in training data #67
Description
Due diligence
- I have done my due diligence in trying to find the answer myself.
Topic
The paper
Question
Hi, thanks for the great work on PersonaPlex.
I'm studying the training data construction described in Section 3.2.2 of the paper, and I have a question about the following statement:
"When combining the 'user' and 'agent' dialogue turns, we can choose to add additional silence padding to simulate natural turn-taking. We observe that inserting negative-duration silence instead simulates barge-in and interruption. Prior work validates our methodology [12]."
The referenced prior work, SALM-Duplex [12], inserts a fixed 0.64 s of positive silence between turns and simulates barge-in with a cutoff-based approach (keeping 0.64 s of agent speech after the cutoff). However, SALM-Duplex itself does not appear to use "negative-duration silence" (i.e., temporal overlap between speakers).
I would appreciate clarification on the following:
- Overlap criteria: What range or distribution of negative silence durations was used when stitching user and agent turns? (e.g., uniform sampling from [-x, 0] seconds, or a fixed overlap duration?)
- Mixing positive and negative: Was the training data a mixture of both positive silence (gap) and negative silence (overlap) between turns, or was negative silence applied uniformly?
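
To make the two questions above concrete, here is a minimal sketch of how I currently read "negative-duration silence": each turn is placed on a shared timeline, and a signed gap decides whether the next turn starts after a pause (positive) or before the previous turn has ended (negative). Everything in it is my own assumption rather than something taken from the paper or the repo: the names `Turn`, `schedule_turns`, and `sample_gap` are hypothetical, and the values `p_overlap`, `max_overlap`, and `max_pause` are placeholders standing in for exactly the numbers I am asking about.

```python
import random
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str      # "user" or "agent"
    duration: float   # seconds of speech in this turn


def schedule_turns(turns, gaps):
    """Place turns on a shared timeline (hypothetical helper).

    gaps[i] is the signed silence between turn i and turn i+1:
    positive -> pause, negative -> turn i+1 starts before turn i ends.
    Returns a list of (speaker, start, end) tuples in seconds.
    """
    if len(gaps) != max(len(turns) - 1, 0):
        raise ValueError("need exactly one gap between consecutive turns")
    timeline = []
    start = 0.0
    for i, turn in enumerate(turns):
        end = start + turn.duration
        timeline.append((turn.speaker, start, end))
        if i < len(gaps):
            # Clamp so a large negative gap never starts before time zero.
            start = max(0.0, end + gaps[i])
    return timeline


def sample_gap(rng, p_overlap=0.3, max_overlap=1.0, max_pause=0.64):
    """Assumed mixture: overlap with probability p_overlap, else a pause.

    All three defaults are placeholders, not values from the paper.
    """
    if rng.random() < p_overlap:
        return -rng.uniform(0.0, max_overlap)
    return rng.uniform(0.0, max_pause)


# Example: a 0.5 s pause, then the user barges in 1.0 s before the agent ends.
turns = [Turn("user", 2.0), Turn("agent", 3.0), Turn("user", 1.5)]
timeline = schedule_turns(turns, gaps=[0.5, -1.0])
# user 0.0-2.0, agent 2.5-5.5, user 4.5-6.0

# Example: draw one signed gap per turn boundary from the assumed mixture.
rng = random.Random(0)
sampled = [sample_gap(rng) for _ in range(len(turns) - 1)]
```

In this reading, the two speakers are kept on separate channels, so an overlap simply means both channels carry speech at the same time. If the actual pipeline instead truncates the agent turn at the moment the user starts (closer to the cutoff approach in [12]), that would be useful to know as well.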
Any pointers to the data preparation pipeline or additional details would be very helpful.
Thanks!