
I have a set of audio samples, each matched with a specific speaker, like:

nick_sample1.mp3 nick_sample2.mp3 ... nick_sampleN.mp3

john_sample1.mp3 john_sample2.mp3 ... john_sampleK.mp3

The task is to match a given sampleX.mp3 with one of the known speakers (or none of them). sampleX.mp3 is itself the result of a diarization process, so in my case it most likely contains a single speaker. My current idea is to break the known samples into fragments of equal length and calculate embeddings (pyannote). Then train a classifier for each speaker (not sure which one to use at the moment). The classifier would give the likelihood that a given embedding belongs to, say, Nick.

So the identification process is the following:

  1. break sampleX.mp3 into fragments
  2. run each fragment's embedding through each classifier
  3. calculate a likelihood score for each speaker; the largest wins and is considered the speaker in sampleX
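The scoring step (3) can be sketched with plain numpy. The per-fragment scores below are made-up placeholders standing in for real classifier outputs, and the 0.5 threshold for rejecting all known speakers is purely illustrative:

```python
import numpy as np

# Hypothetical per-speaker scores: rows = fragments, columns = speakers.
# In practice each entry would be a per-speaker classifier's predicted
# probability for one fragment embedding.
speakers = ["nick", "john"]
fragment_scores = np.array([
    [0.91, 0.12],
    [0.85, 0.20],
    [0.88, 0.15],
])

# Average the likelihoods over fragments; the largest mean wins.
mean_scores = fragment_scores.mean(axis=0)
best = int(np.argmax(mean_scores))
threshold = 0.5  # below this, report "none of the known speakers"
predicted = speakers[best] if mean_scores[best] >= threshold else None
print(predicted)  # nick
```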

Questions:

  1. How should I break sampleX.mp3 into fragments? Is there a guideline or something?
  2. What is the best option for the classifier?
asked Jul 2, 2024 at 8:53

1 Answer

Your overall approach is good. However, for speaker identification using embeddings, the standard approach is to use a distance function on the vectors rather than a classifier model. A commonly used distance function for embeddings is cosine distance. The procedure: compute the distance from the query embedding to all known speaker samples. If no match falls below a certain distance threshold, return 'unknown'; otherwise, return the closest match.
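That procedure is a few lines of numpy. The toy 3-d vectors and the 0.4 threshold below are stand-ins; real pyannote embeddings are much higher-dimensional and the threshold should be tuned on held-out data:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means identical direction.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def identify(query, enrolled, threshold=0.4):
    """Return the closest enrolled speaker, or None if all are too far.

    enrolled: dict mapping speaker name -> list of embedding vectors.
    threshold: maximum cosine distance to accept a match.
    """
    best_name, best_dist = None, float("inf")
    for name, embeddings in enrolled.items():
        for emb in embeddings:
            d = cosine_distance(query, emb)
            if d < best_dist:
                best_name, best_dist = name, d
    return best_name if best_dist <= threshold else None

# Toy 3-d vectors standing in for real speaker embeddings.
enrolled = {
    "nick": [np.array([1.0, 0.1, 0.0])],
    "john": [np.array([0.0, 1.0, 0.2])],
}
print(identify(np.array([0.9, 0.2, 0.05]), enrolled))  # nick
print(identify(np.array([0.1, 0.1, 1.0]), enrolled))   # None
```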

For splitting into fragments, you simply cut the audio, which is usually represented as a numpy array. Compute the start and end sample indices by taking a time in seconds, multiplying by the sample rate, and converting to an integer. You may want to use overlapping fragments.
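A minimal sketch of that slicing, assuming the audio is already loaded as a 1-D numpy array (the 2 s window and 1 s hop are example values, not recommendations):

```python
import numpy as np

def split_fragments(audio, samplerate, window_s=2.0, hop_s=1.0):
    """Cut a 1-D audio array into fragments.

    window_s: fragment length in seconds.
    hop_s: step between fragment starts; hop_s < window_s gives overlap.
    """
    window = int(window_s * samplerate)
    hop = int(hop_s * samplerate)
    fragments = []
    for start in range(0, len(audio) - window + 1, hop):
        fragments.append(audio[start:start + window])
    return fragments

# 5 seconds of silence at 16 kHz -> 2 s windows with 1 s hop: 4 fragments.
audio = np.zeros(5 * 16000)
frags = split_fragments(audio, 16000)
print(len(frags), len(frags[0]))  # 4 32000
```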

answered Jul 26, 2024 at 11:25

2 Comments

Thanks, what would you consider a good length for a fragment? I mean 0.5 sec or less, or more? Is there some rule for it?
"Speaker" is a property that does not change over time, so you want to keep the fragment window as long as you can while keeping the probability of multiple people speaking at the same time acceptably low. Best is to try a couple of values and check the results on a validation set. Use overlapping fragments to keep sufficient time resolution in your output.
