
In the transcribe stage, why do we subtract N_FRAMES from the mel length, and why does the for loop over mel_segment skip the last segment if it is shorter than 3000 frames? Suppose the mel is [80, 4100]: the first mel segment will be [80, 3000], leaving [80, 1100]. The model transcribes the first segment [80, 3000], but it does nothing with the remaining [80, 1100].


# Pad 30-seconds of silence to the input audio, for slicing
mel = log_mel_spectrogram(audio, model.dims.n_mels, padding=N_SAMPLES)
content_frames = mel.shape[-1] - N_FRAMES  # N_FRAMES = 3000
content_duration = float(content_frames * HOP_LENGTH / SAMPLE_RATE)
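
The subtraction makes more sense once you notice that padding=N_SAMPLES has already appended 30 seconds (3000 frames) of silence to the audio, so content_frames counts only the real audio, while a slice taken from the padded tensor can always reach a full 3000-frame window. Here is a rough numeric sketch of the question's [80, 4100] example under that assumption (the real loop in transcribe.py advances seek by however much the decoder actually consumed, which is simplified away here):

# Constants from whisper/audio.py
SAMPLE_RATE = 16000
HOP_LENGTH = 160
N_SAMPLES = 480000                    # 30 s of audio
N_FRAMES = N_SAMPLES // HOP_LENGTH    # 3000 mel frames

# Hypothetical input: the real audio yields 4100 mel frames (~41 s).
real_frames = 4100
padded_frames = real_frames + N_FRAMES       # 7100 after padding=N_SAMPLES
content_frames = padded_frames - N_FRAMES    # 4100 = frames of real audio

seek = 0
while seek < content_frames:
    # The slice end never runs past the padded tensor, so every
    # segment is a full window; its tail is the appended silence.
    start, end = seek, seek + N_FRAMES
    print(f"segment covers frames {start}:{end} of {padded_frames}")
    seek += N_FRAMES   # simplified; see note above about seek
# -> 0:3000 (all real audio), then 3000:6000 (1100 real + 1900 silence)

So the [80, 1100] remainder is not dropped; it is transcribed as part of a window whose tail is silence.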
  • Maybe because the model requires a minimum number of frames, or sufficient context, to generate accurate transcriptions; this likely helps maintain transcription quality. If the shorter last segment were transcribed as-is, it might not provide enough context, leading to less reliable results. Commented Feb 8, 2024 at 21:13
  • You need to pad the last frames with silence. Commented Apr 8, 2024 at 5:40
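
That second comment describes what Whisper's own loop does: the trailing chunk is never fed to the model at its short length, but is zero-padded up to a full 3000-frame window (in the library this is done by pad_or_trim on the mel segment). A minimal sketch of that step, where pad_segment is a hypothetical stand-in for pad_or_trim and zeros stand in for silence:

import torch
import torch.nn.functional as F

N_FRAMES = 3000

def pad_segment(mel_segment: torch.Tensor, length: int = N_FRAMES) -> torch.Tensor:
    # Right-pad a (n_mels, T) segment with zeros up to `length` frames,
    # or trim it if it is longer.
    if mel_segment.shape[-1] >= length:
        return mel_segment[..., :length]
    return F.pad(mel_segment, (0, length - mel_segment.shape[-1]))

short = torch.randn(80, 1100)    # the leftover [80, 1100] chunk from the question
print(pad_segment(short).shape)  # torch.Size([80, 3000])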
