I need to upload an audio file in which two or more speakers are having a conversation, and at times their speech overlaps. The requirement is to segment the audio into distinct chunks, each clearly attributed to a single speaker (i.e., which speaker said what).
I have used speaker diarization for this task, which relies on pitch as one of the features for distinguishing speakers. However, when speakers have a similar pitch, the model fails to separate them and treats them as the same person. What additional techniques or features can I incorporate to improve diarization accuracy and handle overlapping speech more effectively?
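For example, would it help to extract a speaker embedding for each diarized segment and re-cluster the segments by embedding distance instead of relying on pitch alone? Below is a rough sketch of what I mean. It assumes SpeechBrain's pretrained spkrec-ecapa-voxceleb model and scikit-learn clustering; the helper names and the fixed speaker count are illustrative, not something I have working:

import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier
from sklearn.cluster import AgglomerativeClustering

# Pretrained ECAPA-TDNN speaker encoder (expects 16 kHz mono audio)
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed_segments(audio_path, wav_splits):
    waveform, sr = torchaudio.load(audio_path)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
        sr = 16000
    embeddings = []
    for start, end in wav_splits:
        chunk = waveform[:, int(start * sr):int(end * sr)]
        emb = encoder.encode_batch(chunk).squeeze()  # one embedding per segment
        embeddings.append(emb.detach().numpy())
    return embeddings

def recluster(embeddings, n_speakers=2):
    # Group segments by cosine distance between embeddings; n_speakers is
    # assumed known here, which is a simplification.
    # (Older scikit-learn versions use affinity= instead of metric=.)
    clustering = AgglomerativeClustering(n_clusters=n_speakers, metric="cosine", linkage="average")
    return clustering.fit_predict(embeddings)

Would re-labeling my diarization segments with clusters like these be a sensible way to separate same-pitch speakers?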
Here is my current diarization code:
import os
from pyannote.audio import Pipeline
from pydub import AudioSegment

def run_speaker_diarization(audio_path):
    print("inside speaker diarization method")
    # Pretrained diarization pipeline from Hugging Face
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="huggingface_token")
    diarization = pipeline(audio_path)
    audio = AudioSegment.from_wav(audio_path)

    # Output values
    wav_splits = []
    labels = []

    # Optional: save chunks next to the input file
    base_dir = os.path.dirname(audio_path)
    base_name = os.path.splitext(os.path.basename(audio_path))[0]
    output_dir = os.path.join(base_dir, base_name + "_sliced_by_pyannote")
    os.makedirs(output_dir, exist_ok=True)

    for i, turn in enumerate(diarization.itertracks(yield_label=True)):
        segment, _, speaker = turn
        start = segment.start
        end = segment.end
        wav_splits.append((start, end))  # In seconds
        labels.append(speaker)

        # Optional: save chunk to file (pydub slices in milliseconds)
        chunk = audio[int(start * 1000):int(end * 1000)]
        chunk_filename = f"{base_name}_Speaker{speaker}_chunk{i+1}.wav"
        chunk_path = os.path.join(output_dir, chunk_filename)
        chunk.export(chunk_path, format="wav")
        print(f"Saved: {chunk_filename}")

    print(f"\nAll speaker segments saved to: {output_dir}")
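One other idea I have been considering is running a separate overlapped-speech-detection pass first, so that regions where two people talk at once can be flagged and handled separately (e.g., excluded from the per-speaker chunks or sent to a source-separation step). A minimal sketch of that idea, assuming the pyannote/overlapped-speech-detection pretrained pipeline is available with the same token setup:

from pyannote.audio import Pipeline

def find_overlap_regions(audio_path):
    # Flags regions where more than one speaker is active at the same time
    osd = Pipeline.from_pretrained("pyannote/overlapped-speech-detection", use_auth_token="huggingface_token")
    annotation = osd(audio_path)
    # support() merges adjacent/overlapping regions into a clean timeline
    return [(speech.start, speech.end) for speech in annotation.get_timeline().support()]

Is combining the diarization output with an overlap timeline like this a reasonable approach, or is there a better way to handle overlapping speech?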
-
Did you just provide us your authentication token? Do not just edit this question, go delete it right now on Hugging Face. – NiziL, May 14, 2025 at 13:07