In the transcribe stage, why do we subtract N_FRAMES from the mel length, and why doesn't the loop over mel segments handle the last segment when it is shorter than 3000 frames? Suppose mel = [80, 4100]: the first segment would be [80, 3000], leaving [80, 1100]. The model transcribes the first segment [80, 3000], but it seems to do nothing with the remaining [80, 1100].
from whisper.audio import (
    HOP_LENGTH,
    N_FRAMES,
    N_SAMPLES,
    SAMPLE_RATE,
    log_mel_spectrogram,
)

# Pad 30 seconds of silence to the input audio, for slicing
mel = log_mel_spectrogram(audio, model.dims.n_mels, padding=N_SAMPLES)
content_frames = mel.shape[-1] - N_FRAMES  # N_FRAMES = 3000 (30 s at 100 frames/s)
content_duration = float(content_frames * HOP_LENGTH / SAMPLE_RATE)
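For context, the segment loop the question refers to looks roughly like this (an abridged sketch; the exact code in whisper/transcribe.py varies by version):

seek = 0
while seek < content_frames:
    # Slice a 30-second window; near the end of the audio this slice may
    # contain fewer content frames, and pad_or_trim fills it to N_FRAMES.
    mel_segment = mel[:, seek : seek + N_FRAMES]
    segment_size = min(N_FRAMES, content_frames - seek)
    mel_segment = pad_or_trim(mel_segment, N_FRAMES).to(model.device).to(dtype)
    # ... decode mel_segment, then advance seek (by segment_size, or to the
    # last timestamp the decoder produced)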
asked Feb 8, 2024 at 20:53 by AbdElRhaman Fakhry
-
Maybe the model requires a minimum number of frames to generate accurate transcriptions, or this ensures there is sufficient context for it to process, which would help maintain the quality of the output. If the last, shorter segment were transcribed as-is, it might not provide enough context for the model, leading to less reliable results. – Milos Stojanovic, Feb 8, 2024 at 21:13
-
You need to pad the last frames with silence. – anon, Apr 8, 2024 at 5:40
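That is in fact what the code above already arranges: because log_mel_spectrogram is called with padding=N_SAMPLES, the spectrogram ends with 3000 extra frames of silence, so the slice for the final segment is still a full 30-second window. A minimal NumPy sketch of this behavior (sizes borrowed from the question's example, not from real audio):

import numpy as np

N_FRAMES = 3000
content_frames = 4100  # frames of real audio, as in the question
# padding=N_SAMPLES appends 3000 silent frames, widening the padded mel
mel = np.zeros((80, content_frames + N_FRAMES))

seek = 0
while seek < content_frames:
    segment = mel[:, seek : seek + N_FRAMES]
    print(seek, segment.shape)  # (80, 3000) both times, thanks to the padding
    seek += min(N_FRAMES, content_frames - seek)
# Prints: 0 (80, 3000) and 3000 (80, 3000) -- the last 1100 content frames
# are decoded inside a full window whose tail is silence, not skipped.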