@jwstanly

For implementing infrastructure for multiple speakers, there are two main "pieces": diarization and dubbing.

  • Diarization is implemented by running the external library pyAudioAnalysis. After sentences have been transcribed through STT, pyAudioAnalysis analyzes the audio and determines "who spoke when" throughout the video. Speaker information is then stored in each sentence. Additionally, each speaker has a gender, and each sentence now stores this gender information as well.
  • Dubbing is implemented by modifying the original code base to create target voices for each new speaker, in addition to each new language. Each sentence is therefore dubbed with a voice specific to its speaker and that speaker's gender.

For a more technical overview of the changes in this pull request, please refer to these bullet points below.

  • Sentence instances now have a speaker and gender instance variable. Speaker and gender are both unsigned integers, with indexing starting at 1. For speakers, 1 corresponds to the first speaker in the video, 2 to the second speaker, and so forth. For gender, 1 represents male, 2 represents female, and 3 represents unknown. (A small sketch of this data model appears after this list.)
  • Client's get_audio_chunk_for_sentence() and get_target_voice() methods now have speaker and gender parameters that are specific to the sentence they are processing.
  • Client's target_voices dictionary is now keyed by a concatenation of lang code and speaker number. This way, either a new lang code OR a new speaker triggers a new target voice to be created.
  • The best voices JSON has been restructured to support different speakers and genders, and to improve human readability. This is implemented through changes to client's save_best_voices() and load_best_voices() methods. Voices are now saved through save_best_voices() with params specific to the speaker and gender, and load_best_voices() structures the voices dictionary as voices[platform][speaker][speaker_data], where the nested speaker key holds the speaker number and the innermost speaker_data dictionary explicitly stores gender, locale, and voice_name. (An example of this structure appears after this list.)
  • The config no longer supports a gender variable. This is because gender is now dependent on each speaker and their sentences, not the overall pipeline. Therefore, client also no longer has a gender instance variable.
  • Additionally, the config file now asks for a num_of_speakers unsigned integer, which is the total number of unique speakers in the video and determines the behavior of the project's speech diarization. If num_of_speakers is set to 1, speech diarization will not run (as it is not needed). If set to an integer above 1, speech diarization will run with the number of speakers predetermined. If the number of speakers is unknown, setting num_of_speakers to 0 lets diarization estimate the number of speakers itself, though this leads to worse results. Thus, it is ideal for num_of_speakers to be set in the config before runtime. (See the config sketch after this list.)
  • constants.py carries a new constant, PY_AUDIO_ANALYSIS_DATA_DIRECTORY, which is the directory where the WAV audio file is stored for pyAudioAnalysis. Notably, this directory is within the scope of pyAudioAnalysis (which is inside /src), not /media.
  • Project now has two new methods that run during transcription, after project.transcribe_sentences() but before project.save_sentences(). Two new project instance variables were also added: an unsigned int num_of_speakers from the config, and a dictionary called speaker_genders for storing the gender of each speaker.
  • The first new project method, project.diarize_sentences(), traverses each sentence and assigns the speaker and gender for that sentence. The method first uses FFmpeg through os.system() to convert the FLAC into a WAV for processing. This is where the external pyAudioAnalysis library comes into play. After running the audio file through pyAudioAnalysis' speaker_diarization() method, a numpy array of unsigned integers is returned. These ints represent the speaker at that time, where the first element is at 0.1 seconds and each subsequent element is 0.2 seconds later (f(x) = 0.2x + 0.1). Notably, this numpy array of speakers is indexed from 0, but the values are later shifted so they are indexed from 1, aligning with the sentence's speaker instance variable. After the program asks for each speaker's gender, project.diarize_sentences() traverses each sentence, finding the speakers inside the time interval of the sentence. The sentence's speaker and gender are then assigned to whichever speaker is most prevalent within that time frame. (A sketch of this flow appears after this list.)
  • The second new project method, project.verify_diarization(), prints each sentence with its diarization results and asks the user to replace the diarization predictions with who actually spoke.
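
To make the data model above concrete, here is a minimal sketch of how a sentence could carry its speaker and gender, and how a per-language, per-speaker key for target_voices could be built. The class layout and the exact key format are illustrative assumptions, not the code in this pull request.

```python
# Illustrative sketch only; names and key format are assumptions, not the
# exact implementation in this pull request.
MALE, FEMALE, UNKNOWN = 1, 2, 3   # gender codes, 1-indexed like speakers


class Sentence:
    def __init__(self, text, start_time, end_time, speaker=1, gender=UNKNOWN):
        self.text = text
        self.start_time = start_time   # seconds
        self.end_time = end_time       # seconds
        self.speaker = speaker         # 1 = first speaker in the video, 2 = second, ...
        self.gender = gender           # 1 = male, 2 = female, 3 = unknown


def target_voice_key(lang_code, speaker):
    """Key target_voices by language AND speaker, so either a new lang code
    or a new speaker triggers creation of a new target voice."""
    return f"{lang_code}{speaker}"


# e.g. target_voices[target_voice_key("es", 2)] holds the Spanish voice for speaker 2
```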
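The restructured best-voices data could then look roughly like the dictionary below, shown as the structure load_best_voices() builds. The platform name and voice names are placeholders, and the JSON file itself would store the speaker numbers as strings.

```python
# Hypothetical example of the voices[platform][speaker][speaker_data] layout;
# the platform and voice names are placeholders.
voices = {
    "azure": {
        1: {"gender": 1, "locale": "es-ES", "voice_name": "es-ES-AlvaroNeural"},
        2: {"gender": 2, "locale": "es-ES", "voice_name": "es-ES-ElviraNeural"},
    },
}

# Looking up the saved voice for speaker 2 on that platform:
speaker_data = voices["azure"][2]
print(speaker_data["voice_name"], speaker_data["gender"])
```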
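The num_of_speakers setting might then drive diarization roughly as follows. This is a sketch assuming the config is exposed as a dictionary; diarize() is a hypothetical stand-in for the project's diarization step, not a real method name.

```python
# Sketch of how num_of_speakers controls diarization; diarize() is a
# hypothetical stand-in for the project's diarization step.
def run_diarization(config, wav_path):
    num_of_speakers = config["num_of_speakers"]
    if num_of_speakers == 1:
        return                             # one speaker: diarization is skipped
    # A value above 1 fixes the speaker count ahead of time; 0 lets
    # pyAudioAnalysis estimate the count itself, which tends to be less accurate.
    diarize(wav_path, num_of_speakers)
```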
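Finally, a minimal sketch of the diarization flow inside project.diarize_sentences(). The paths, the sentences list (objects like the Sentence sketch above), the speaker_genders mapping, and the num_of_speakers value are stand-ins for the project's real state; speaker_diarization() is the pyAudioAnalysis call named in this pull request, and per the description above it returns a numpy array of 0-indexed speaker labels sampled every 0.2 seconds starting at 0.1 seconds.

```python
# Sketch only: the paths, sentences, speaker_genders, and num_of_speakers
# below are stand-ins for the project's real state.
import os
from collections import Counter

from pyAudioAnalysis.audioSegmentation import speaker_diarization

flac_path = "media/audio.flac"                   # illustrative paths
wav_path = "src/pyAudioAnalysis/data/audio.wav"  # under PY_AUDIO_ANALYSIS_DATA_DIRECTORY
num_of_speakers = 2                              # from the config
speaker_genders = {1: 1, 2: 2}                   # speaker number -> gender code
sentences = []                                   # Sentence objects from transcription

# 1. Convert the FLAC to WAV for pyAudioAnalysis, via FFmpeg.
os.system(f"ffmpeg -i {flac_path} {wav_path}")

# 2. Run diarization; per this PR it yields a numpy array of speaker labels,
#    where element i describes the audio at time 0.2 * i + 0.1 seconds.
labels = speaker_diarization(wav_path, num_of_speakers)

# 3. Assign each sentence the most prevalent speaker in its time interval,
#    shifting labels from 0-indexed to 1-indexed along the way.
for sentence in sentences:
    in_window = [
        int(label) + 1
        for i, label in enumerate(labels)
        if sentence.start_time <= 0.2 * i + 0.1 <= sentence.end_time
    ]
    if in_window:
        sentence.speaker = Counter(in_window).most_common(1)[0][0]
        sentence.gender = speaker_genders[sentence.speaker]
```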

If anyone has any questions about this pull request, please feel free to reach out to me! Many aspects of oratio had to be changed, so I want to make sure my descriptions of the changes are as clear as possible.
