Bug: Cannot correctly pronounce Vietnamese in speak() method when using TaskType.REPEAT #70

@blhai-rd

Description

Problem Description

I'm experiencing an issue when trying to make the avatar correctly pronounce Vietnamese text using the speak() method. Specifically, I'm integrating the StreamingAvatar SDK with OpenAI Assistant to respond to users in Vietnamese (following the guide at https://docs.heygen.com/docs/integrate-with-opeanai-assistant), but I'm facing a dilemma between two incompatible modes:

Current Behavior

  1. When initializing the avatar with language: "vi":

    • Initial greeting: The avatar speaks in Vietnamese with proper intonation (works well)
  2. When sending responses from OpenAI Assistant with speak():

    • Using TaskType.TALK: The avatar understands it's Vietnamese, but it CREATES NEW CONTENT instead of reading exactly what I provided
    • Using TaskType.REPEAT: The avatar reads the EXACT content, but pronounces it as if an English speaker is reading Vietnamese text (incorrect intonation and pronunciation)

Root Cause

After analyzing the SDK source code, I noticed:

  1. The language parameter is only set in createStartAvatar() and sent to the /v1/streaming.new endpoint
  2. When calling the speak() method, the SpeakRequest interface doesn't have a language parameter
  3. The request to /v1/streaming.task doesn't pass language information when using TaskType.REPEAT
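To make the asymmetry concrete, here is a minimal TypeScript sketch. The payload shapes below are illustrative assumptions for discussion, not the SDK's actual internal types; the point is that the start request carries a language field while the task request has nowhere to put one:

```typescript
// Illustrative payload shapes -- assumptions, not the SDK's real internals.

// Sent to /v1/streaming.new by createStartAvatar(): carries a language.
interface StartAvatarPayload {
  avatar_name: string;
  language?: string;
}

// Sent to /v1/streaming.task by speak(): no language field, so the TTS
// engine gets no hint that a TaskType.REPEAT text is Vietnamese.
interface StreamingTaskPayload {
  session_id: string;
  text: string;
  task_type: "talk" | "repeat";
}

const startPayload: StartAvatarPayload = {
  avatar_name: "your-avatar",
  language: "vi",
};

const taskPayload: StreamingTaskPayload = {
  session_id: "session-id",
  text: "Xin chào, tôi có thể giúp gì cho bạn?",
  task_type: "repeat",
};
```

Because the task payload never mentions "vi", a repeat task is rendered with the voice's default (English-like) pronunciation, which matches the behavior described above.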

Steps to Reproduce


  1. Set up OpenAI Assistant and HeyGen integration according to the guide at https://docs.heygen.com/docs/integrate-with-opeanai-assistant

  2. Initialize the avatar with Vietnamese language:

await avatar.createStartAvatar({
  avatarName: "your-avatar",
  language: "vi",
  // (Other parameters)
});
  3. Get a response from OpenAI in Vietnamese and send it to the avatar:
const openAIResponse = await getOpenAIResponse(userMessage); // Vietnamese response from OpenAI
await avatar.speak({
  text: openAIResponse,
  taskType: TaskType.REPEAT
});
  4. Listen to the avatar's speech: the Vietnamese content is read exactly as provided, but with English pronunciation and intonation

Expected Behavior

The avatar should read the Vietnamese text accurately with natural Vietnamese intonation, especially when the language has been set to "vi" during initialization.

Environment

  • SDK Version: 2.0.14
  • Browser: Chrome 124.0.6367.60
  • Tested with the eleven_multilingual_v2 model

Proposed Solutions

I suggest one of the following changes:

  1. Add a language parameter to the SpeakRequest interface and pass it to the /v1/streaming.task endpoint
export interface SpeakRequest {
  text: string;
  taskType?: TaskType;
  taskMode?: TaskMode;
  language?: string; // Add this parameter
}
  2. Or store and use the language value from createStartAvatar() for all subsequent speak() requests

  3. Or provide a new method that combines both features: reading exact content AND correct language pronunciation
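Option 2 above could be sketched as a thin wrapper that remembers the session language and attaches it to every speak() call. The class, method, and callback names here are hypothetical illustrations, not part of the SDK:

```typescript
// Sketch of proposed solution 2: remember the language passed to
// createStartAvatar() and fall back to it in later speak() calls.
// LanguageAwareAvatar and sendTask are hypothetical names.

interface SpeakRequest {
  text: string;
  taskType?: string;
  language?: string; // the proposed new field
}

class LanguageAwareAvatar {
  private language?: string;

  // sendTask stands in for whatever issues the /v1/streaming.task request.
  constructor(private sendTask: (req: SpeakRequest) => Promise<void> | void) {}

  async createStartAvatar(opts: { avatarName: string; language?: string }) {
    this.language = opts.language; // remembered for later speak() calls
  }

  async speak(req: SpeakRequest) {
    // The session language is the default; an explicit req.language wins.
    return this.sendTask({ language: this.language, ...req });
  }
}
```

With this in place, a caller who initialized the session with language: "vi" gets "vi" forwarded on every repeat task without changing any call sites.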

This would be especially important when integrating with APIs like OpenAI Assistant, where we need the avatar to read AI responses accurately with the proper intonation of that language.

Thank you for considering this issue. I'm available to provide more information or assist in testing any solutions.
