whisper-cli : align token timestamps with VAD ts #3218
chriswang-
commented
Jun 16, 2025
Has this issue been resolved? It does not seem to have been merged into the main branch. Or has it already been fixed in the branch (vad-token-timestamp-alignment), so that I can use that?
Has this issue been resolved?
No, it has not been resolved yet. I changed it to a draft (which might have sent a notification) as I noticed the token level timestamps are still not correct and I need to revisit this.
@chriswang- It would be great if you could try this out with the audio sample in your original issue report.
chriswang-
commented
Jun 16, 2025
@danbev Sorry, the issue was not submitted by me, but I can try to verify it.
@chriswang- Ah my bad, I should have checked to be sure and not just assumed.
chriswang-
commented
Jun 16, 2025
subtitle-master-with-vad.json
subtitle-master-without-vad.json
subtitle-PR.json
@danbev
I have uploaded three files: one with the master branch result with the VAD feature enabled, one without the VAD feature, and a third transcribed using your PR. After a brief comparison, it seems the issue has been resolved. However, I have forgotten the testing context and related details from when I first discovered the bug. I only noticed the bug and confirmed that the same issue had already been reported on GitHub.
This commit aligns the token timestamps with the VAD timestamps when VAD is enabled.
The motivation for this is that currently the token timestamps reported in the full JSON output are the timestamps that whisper sees after the VAD has processed the audio. This means that whisper only sees possibly filtered audio, so the token timestamps are relative to the filtered audio, not the original audio. For the segment timestamps we map/align them with the original timestamps, but this is not currently done for the token timestamps, which is what this commit aims to address.
Resolves: ggml-org#3174
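To illustrate the kind of mapping described above, here is a minimal sketch that converts a timestamp on the VAD-filtered timeline back to the original audio timeline. It is not the code from this PR: the `SpeechSpan` struct, the `to_original_time` function, and the choice of centiseconds as the unit are all assumptions made for the example. It assumes the VAD step has recorded, for each kept speech span, where it sits in both the original and the filtered audio.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record (not a whisper.cpp type): one speech span kept by the
// VAD, with its position in the original and in the filtered audio.
// All times are assumed to be in centiseconds.
struct SpeechSpan {
    int64_t orig_start;
    int64_t orig_end;
    int64_t filt_start;
    int64_t filt_end;
};

// Map a timestamp `t` on the filtered timeline to the original timeline.
// `spans` is assumed to be sorted by filt_start and non-overlapping.
int64_t to_original_time(const std::vector<SpeechSpan> & spans, int64_t t) {
    if (spans.empty()) {
        return t; // nothing was filtered, timelines are identical
    }
    if (t <= spans.front().filt_start) {
        return spans.front().orig_start;
    }
    for (const auto & s : spans) {
        if (t <= s.filt_end) {
            if (t < s.filt_start) {
                // t falls between two spans on the filtered timeline:
                // snap to the start of the next span.
                return s.orig_start;
            }
            const int64_t filt_len = s.filt_end - s.filt_start;
            if (filt_len <= 0) {
                return s.orig_start;
            }
            const int64_t orig_len = s.orig_end - s.orig_start;
            // Linear interpolation inside the span.
            return s.orig_start + (t - s.filt_start) * orig_len / filt_len;
        }
    }
    // Past the last span: clamp to the end of the last speech span.
    return spans.back().orig_end;
}
```

With two spans, `{100, 300, 0, 200}` and `{800, 1000, 200, 400}`, a token at filtered time 250 would map to original time 850 under these assumptions, since the silence between 300 and 800 that the VAD removed is added back. The same function could be applied to each token's start and end time, in the same way the segment timestamps are already mapped.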
Example of token level timestamps prior to this PR:
And with this PR: