
I have a set of audio samples, each matched with a specific speaker, like:

nick_sample1.mp3 nick_sample2.mp3 ... nick_sampleN.mp3

john_sample1.mp3 john_sample2.mp3 ... john_sampleK.mp3

The task is to match a given sampleX.mp3 with one of the known speakers (or none of them). sampleX.mp3 is itself the result of a diarization process, so in my case it most likely contains a single speaker. My current idea is to break the known samples into fragments of equal length and calculate embeddings (pyannote). Then train a classifier for each speaker (not sure which one to use at the moment). The classifier would give the likelihood that a given embedding belongs to, say, Nick.

So the identification process is the following:

  1. break sampleX.mp3 into fragments
  2. run each fragment's embedding through each classifier
  3. calculate a likelihood score for each speaker; the largest wins and is considered the speaker in sampleX
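The scoring step (3) can be sketched with plain numpy. The per-fragment scores below are made-up placeholders standing in for real classifier outputs, and the 0.5 threshold for rejecting all known speakers is purely illustrative:

```python
import numpy as np

# Hypothetical per-speaker scores: rows = fragments, columns = speakers.
# In practice each entry would be a per-speaker classifier's predicted
# probability for one fragment embedding.
speakers = ["nick", "john"]
fragment_scores = np.array([
    [0.91, 0.12],
    [0.85, 0.20],
    [0.88, 0.15],
])

# Average the likelihoods over fragments; the largest mean wins.
mean_scores = fragment_scores.mean(axis=0)
best = int(np.argmax(mean_scores))
threshold = 0.5  # below this, report "none of the known speakers"
predicted = speakers[best] if mean_scores[best] >= threshold else None
print(predicted)  # nick
```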

Questions:

  1. How should I break sampleX.mp3 into fragments? Is there a guideline or something?
  2. What is the best option for the classifier?
asked Jul 2, 2024 at 8:53

1 Answer

Your overall approach is good. However, for speaker identification using embeddings, the standard approach is to use a distance function on the vectors rather than a classifier model. A commonly used distance function for embeddings is cosine distance. The procedure: compute the distance from the query embedding to all known speaker samples. If no match falls below a certain distance threshold, return 'unknown'; otherwise, return the closest match.
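That procedure is a few lines of numpy. The toy 3-d vectors and the 0.4 threshold below are stand-ins; real pyannote embeddings are much higher-dimensional and the threshold should be tuned on held-out data:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means identical direction.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def identify(query, enrolled, threshold=0.4):
    """Return the closest enrolled speaker, or None if all are too far.

    enrolled: dict mapping speaker name -> list of embedding vectors.
    threshold: maximum cosine distance to accept a match.
    """
    best_name, best_dist = None, float("inf")
    for name, embeddings in enrolled.items():
        for emb in embeddings:
            d = cosine_distance(query, emb)
            if d < best_dist:
                best_name, best_dist = name, d
    return best_name if best_dist <= threshold else None

# Toy 3-d vectors standing in for real speaker embeddings.
enrolled = {
    "nick": [np.array([1.0, 0.1, 0.0])],
    "john": [np.array([0.0, 1.0, 0.2])],
}
print(identify(np.array([0.9, 0.2, 0.05]), enrolled))  # nick
print(identify(np.array([0.1, 0.1, 1.0]), enrolled))   # None
```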

For splitting into fragments, you simply cut the audio, which is usually represented as a numpy array. Compute the start and end sample indices by taking a time in seconds, multiplying by the sample rate, and converting to an integer. You may want to use overlapping fragments.
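A minimal sketch of that slicing, assuming the audio is already loaded as a 1-D numpy array (the 2 s window and 1 s hop are example values, not recommendations):

```python
import numpy as np

def split_fragments(audio, samplerate, window_s=2.0, hop_s=1.0):
    """Cut a 1-D audio array into fragments.

    window_s: fragment length in seconds.
    hop_s: step between fragment starts; hop_s < window_s gives overlap.
    """
    window = int(window_s * samplerate)
    hop = int(hop_s * samplerate)
    fragments = []
    for start in range(0, len(audio) - window + 1, hop):
        fragments.append(audio[start:start + window])
    return fragments

# 5 seconds of silence at 16 kHz -> 2 s windows with 1 s hop: 4 fragments.
audio = np.zeros(5 * 16000)
frags = split_fragments(audio, 16000)
print(len(frags), len(frags[0]))  # 4 32000
```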

answered Jul 26, 2024 at 11:25

2 Comments

Thanks, what would you consider a good length for a fragment? I mean 0.5 sec or less, or more? Is there some rule for it?
"Speaker" is a property that does not change over time, so you want to keep the fragment window as long as you can while keeping the probability of multiple people speaking at the same time acceptably low. Best is to try a couple of values and check the results on a validation set. Use overlapping fragments to keep sufficient time resolution in your output.
