added infrastructure for multiple speakers #15
For implementing infrastructure for multiple speakers, there are two main "pieces": diarization and dubbing.
For a more technical overview of the changes in this pull request, please refer to the bullet points below.
- Sentences now carry `speaker` and `gender` instance variables. Speaker and gender are both unsigned integers, with their indexing starting at 1. For speakers, `1` correlates to the first speaker in the video, `2` to the second speaker, and so forth. For gender, `1` represents male, `2` is female, and `3` is unknown.
- The `get_audio_chunk_for_sentence()` and `get_target_voice()` methods now have speaker and gender parameters that are specific to the sentence they are processing.
- The `target_voices` dictionary is now keyed by a concatenation of lang code and speaker number. This way, either a new lang code OR a new speaker triggers a new target voice to be created (a rough sketch of this keying appears at the end of this description).
- The `save_best_voices()` and `load_best_voices()` methods have changed. Voices are now saved through `save_best_voices()` with params specific to the speaker and gender. `load_best_voices()` now structures the voices dictionary like `voices[platform][speaker][speaker_data]`, where the nested dictionary `speaker` stores the speaker number, and the innermost nested dictionary `speaker_data` now stores `gender`, `locale`, and `voice_name` explicitly (an example of this shape is sketched at the end of this description). `client` also no longer has a `gender` instance variable.
- The config carries a new `num_of_speakers` unsigned integer, which is the total number of unique speakers in the video and determines the behavior of the project's speech diarization. If `num_of_speakers` is set to `1`, speech diarization will not run (as it's not needed). If set to an integer above `1`, speech diarization will run with the number of speakers predetermined. However, if the number of speakers is unknown, setting `num_of_speakers` to `0` makes diarization calculate the number of speakers itself (though this leads to worse results). Thus, it's ideal that `num_of_speakers` is set in the config before runtime.
- `constants.py` carries a new constant, `PY_AUDIO_ANALYSIS_DATA_DIRECTORY`, which is the directory where the wav audio file is stored for pyAudioAnalysis. Noticeably, this directory is within the scope of pyAudioAnalysis (which is inside `/src`), not `/media`.
- Speech diarization runs after `project.transcribe_sentences()` but before `project.save_sentences()`. Two new project instance variables were also added: an unsigned int `num_of_speakers` from the config, and a dictionary called `speaker_genders` for storing the gender of each speaker.
- A new method, `project.diarize_sentences()`, traverses through each sentence and assigns the speaker and gender for that sentence. The method first uses FFmpeg through `os.system()` to convert the flac into a wav for processing. This is where the external pyAudioAnalysis library comes into play. After running the audio file through pyAudioAnalysis' `speaker_diarization()` method, a numpy array of unsigned integers is returned. These ints represent the speaker at that time, where the first element is at 0.1 seconds and each subsequent element is 0.2 seconds later (f(x) = 0.2x + 0.1). Noticeably, this numpy array of speakers is indexed from 0, but it is later shifted so the speakers are indexed from 1, aligning with the sentence's `speaker` instance variable. After the program asks for each speaker's gender, `project.diarize_sentences()` traverses through each sentence, finding the speakers inside the time interval of the sentence. Whichever speaker is the most prevalent in the sentence's time frame, the sentence's `speaker` and `gender` are assigned to that speaker (a rough sketch of this flow appears at the end of this description).
- `verify_diarization()` prints each sentence with its diarization results and asks the user to replace the diarization predictions with who actually spoke.

If anyone has any questions about this pull request, please feel free to reach out to me! Many of oratio's aspects had to be changed, so I want to make sure my descriptions of my changes are as clear as possible.
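To make the `target_voices` keying concrete, here is a minimal sketch of the idea only; the key format and the `create_target_voice()` helper are assumptions for illustration, not the code in this PR.

```python
# Minimal sketch (not the PR's actual code) of keying target voices by
# both language code and speaker number.
target_voices = {}

def create_target_voice(lang_code, speaker, gender):
    # Stand-in for whatever the dubbing platform does to pick or build a voice.
    return f"voice({lang_code}, speaker={speaker}, gender={gender})"

def get_target_voice(lang_code, speaker, gender):
    # Hypothetical composite key: a new lang code OR a new speaker both miss
    # the cache, so either one triggers creation of a new target voice.
    key = f"{lang_code}{speaker}"
    if key not in target_voices:
        target_voices[key] = create_target_voice(lang_code, speaker, gender)
    return target_voices[key]
```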
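Similarly, here is a hypothetical example of the dictionary shape `load_best_voices()` is described as producing; the platform name, locale, and voice names are made up for illustration.

```python
# Hypothetical illustration of voices[platform][speaker] -> speaker_data.
# The platform and voice names are invented examples, not values from the PR.
voices = {
    "some_tts_platform": {
        1: {"gender": 2, "locale": "es-ES", "voice_name": "example-voice-a"},
        2: {"gender": 1, "locale": "es-ES", "voice_name": "example-voice-b"},
    },
}

# e.g. looking up the first speaker's voice name on that platform:
voice_name = voices["some_tts_platform"][1]["voice_name"]
```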
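Finally, a rough sketch of the diarization flow described above, under the assumptions that sentences expose `start_time`/`end_time` attributes (hypothetical names) and that pyAudioAnalysis' `speaker_diarization()` returns a per-window label array; its exact signature and return value can differ between library versions, and every other name here is illustrative rather than the PR's actual code.

```python
import os
from collections import Counter

from pyAudioAnalysis.audioSegmentation import speaker_diarization

def diarize(flac_path, wav_path, num_of_speakers, sentences, speaker_genders):
    # Convert the flac into a wav for pyAudioAnalysis, as the PR does via os.system().
    os.system(f"ffmpeg -y -i {flac_path} {wav_path}")

    # Per-window speaker labels. Per the PR description these are 0-indexed,
    # with the first window at 0.1 s and each subsequent window 0.2 s later:
    # time(i) = 0.2 * i + 0.1
    labels = speaker_diarization(wav_path, num_of_speakers)

    for sentence in sentences:
        # Collect the labels whose timestamps fall inside this sentence's interval.
        in_sentence = [
            int(label)
            for i, label in enumerate(labels)
            if sentence.start_time <= 0.2 * i + 0.1 <= sentence.end_time
        ]
        if in_sentence:
            # The most prevalent speaker wins; shift from 0-indexed labels to the
            # 1-indexed convention of the sentence's `speaker` variable.
            speaker = Counter(in_sentence).most_common(1)[0][0] + 1
            sentence.speaker = speaker
            sentence.gender = speaker_genders[speaker]
```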