whisper-cli : align token timestamps with VAD ts #3218


Open

danbev wants to merge 1 commit into ggml-org:master

from danbev:vad-token-timestamp-alignment


danbev (Member) commented Jun 2, 2025 • edited

This commit aligns the token timestamps with the VAD timestamps when VAD is enabled.

The motivation is that the token timestamps currently reported in the full JSON output are the timestamps that whisper sees after the VAD has processed the audio. Whisper only sees the (possibly filtered) audio, so the token timestamps are relative to the filtered audio, not the original audio. The segment timestamps are already mapped/aligned to the original timestamps, but this is not currently done for the token timestamps, which is what this commit aims to address.

Resolves: #3174
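To illustrate the general idea of mapping a time on the filtered (post-VAD) timeline back to the original audio, here is a minimal sketch. It is not the code from this PR: the `SpeechSpan` struct, the `map_to_original` function, and the choice of linear interpolation inside a speech span are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one speech span kept by the VAD.
// All times are in milliseconds. These names are illustrative only and
// are not taken from whisper.cpp.
struct SpeechSpan {
    int64_t orig_start; // start of the speech in the original audio
    int64_t orig_end;   // end of the speech in the original audio
    int64_t vad_start;  // start of the same speech in the filtered audio
    int64_t vad_end;    // end of the same speech in the filtered audio
};

// Map a time on the filtered timeline back to the original timeline.
// Assumes spans are sorted by vad_start and do not overlap.
int64_t map_to_original(const std::vector<SpeechSpan> & spans, int64_t t_vad) {
    if (spans.empty()) {
        return t_vad; // no VAD information, nothing to remap
    }
    for (const SpeechSpan & s : spans) {
        assert(s.vad_end >= s.vad_start && s.orig_end >= s.orig_start);
        if (t_vad < s.vad_start) {
            // Time falls before this span (e.g. in padding): snap to its start.
            return s.orig_start;
        }
        if (t_vad <= s.vad_end) {
            const int64_t vad_len  = s.vad_end  - s.vad_start;
            const int64_t orig_len = s.orig_end - s.orig_start;
            if (vad_len == 0) {
                return s.orig_start;
            }
            // Linear interpolation inside the span (an assumption of this sketch).
            return s.orig_start + (t_vad - s.vad_start) * orig_len / vad_len;
        }
    }
    // Past the last span: clamp to the end of the last span.
    return spans.back().orig_end;
}
```

With a hypothetical span list such as `{{1000, 3000, 0, 2000}, {5000, 6000, 2100, 3100}}`, a token at 500 ms in the filtered audio would be reported at 1500 ms in the original audio.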

Example of token level timestamps prior to this PR:

$ ./build/bin/whisper-cli -m models/ggml-medium.en.bin -f samples/gb1.ogg --vad -vm models/for-tests-silero-v5.1.2-ggml.bin -ojf -of gb1
...
[00:00:00.990 --> 00:00:07.800] My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:07.800 --> 00:00:15.860] At 9 o'clock this morning, Mission Control in Houston lost contact with our space shuttle
...
 "transcription": [ 
 { 
 "timestamps": { 
 "from": "00:00:00,990", 
 "to": "00:00:07,800" 
 }, 
 "offsets": { 
 "from": 990, 
 "to": 7800 
 }, 
 "text": " My fellow Americans, this day has brought terrible news and great sadness to our country.",
 "tokens": [ 
 { 
 "text": "[_BEG_]", 
 "timestamps": { 
 "from": "00:00:00,000", 
 "to": "00:00:00,000" 
 }, 
 "offsets": { 
 "from": 0, 
 "to": 0 
 }, 
 "id": 50363, 
 "p": 0.994401, 
 "t_dtw": -1 
 }, 
 { 
 "text": " My", 
 "timestamps": { 
 "from": "00:00:00,020", 
 "to": "00:00:00,100" 
 }, 
 "offsets": { 
 "from": 20, 
 "to": 100 
 }, 
 "id": 2011, 
 "p": 0.883255, 
 "t_dtw": -1 
 }, 
 { 
 "text": " fellow", 
 "timestamps": { 
 "from": "00:00:00,170", 
 "to": "00:00:00,610" 
 }, 
 "offsets": { 
 "from": 170, 
 "to": 610 
 }, 
 "id": 5891, 
 "p": 0.989602, 
 "t_dtw": -1 
 }, 
 ....

And with this PR:

[00:00:00.990 --> 00:00:07.800] My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:07.800 --> 00:00:15.860] At 9 o'clock this morning, Mission Control in Houston lost contact with our space shuttle
[00:00:15.860 --> 00:00:18.510] Columbia.
...
 "transcription": [ 
 { 
 "timestamps": { 
 "from": "00:00:00,990", 
 "to": "00:00:07,800" 
 }, 
 "offsets": { 
 "from": 990, 
 "to": 7800 
 }, 
 "text": " My fellow Americans, this day has brought terrible news and great sadness to our country.",
 "tokens": [ 
 { 
 "text": "[_BEG_]", 
 "timestamps": { 
 "from": "00:00:00,990", 
 "to": "00:00:00,990" 
 }, 
 "offsets": { 
 "from": 990, 
 "to": 990 
 }, 
 "id": 50363, 
 "p": 0.994401, 
 "t_dtw": -1 
 }, 
 { 
 "text": " My", 
 "timestamps": { 
 "from": "00:00:01,000", 
 "to": "00:00:01,080" 
 }, 
 "offsets": { 
 "from": 1000, 
 "to": 1080 
 }, 
 "id": 2011, 
 "p": 0.883255, 
 "t_dtw": -1 
 }, 
 { 
 "text": " fellow", 
 "timestamps": { 
 "from": "00:00:01,140", 
 "to": "00:00:01,540" 
 }, 
 "offsets": { 
 "from": 1140, 
 "to": 1540 
 }, 
 "id": 5891, 
 "p": 0.989602, 
 "t_dtw": -1 
 },
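
One way to sanity-check output like the above is to verify that every token's offsets fall inside the offsets of the segment that contains it. The sketch below is only an illustration under that assumption: the `Offsets` struct and both functions are hypothetical names, and the numbers are copied from the two outputs shown in this description.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Millisecond range, mirroring the "offsets" objects in the JSON output.
// Hypothetical helper type, not part of whisper.cpp.
struct Offsets {
    int64_t from;
    int64_t to;
};

// Returns true if every token range is well formed and lies inside the
// segment range.
bool tokens_within_segment(const Offsets & segment, const std::vector<Offsets> & tokens) {
    for (const Offsets & t : tokens) {
        if (t.from > t.to || t.from < segment.from || t.to > segment.to) {
            return false;
        }
    }
    return true;
}

// Applies the check to the first three tokens from the outputs above.
void check_pr_example() {
    const Offsets segment = {990, 7800};
    const std::vector<Offsets> before = {{0, 0}, {20, 100}, {170, 610}};
    const std::vector<Offsets> after  = {{990, 990}, {1000, 1080}, {1140, 1540}};
    assert(!tokens_within_segment(segment, before)); // tokens start before the segment
    assert(tokens_within_segment(segment, after));   // tokens aligned with the segment
}
```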

danbev force-pushed the vad-token-timestamp-alignment branch from b23c671 to 75db936

June 2, 2025 14:47

danbev marked this pull request as ready for review

June 3, 2025 04:28

danbev marked this pull request as draft

June 16, 2025 05:29

chriswang- commented Jun 16, 2025

Has this issue been resolved? It does not seem to have been merged into the main branch. Or has it already been fixed in the branch (vad-token-timestamp-alignment), so that I can use that?

danbev (Member, Author) commented Jun 16, 2025

> Has this issue been resolved?

No, it has not been resolved yet. I changed it to a draft (which might have sent a notification) as I noticed the token level timestamps are still not correct and I need to revisit this.

danbev force-pushed the vad-token-timestamp-alignment branch from 75db936 to 12e44a1

June 16, 2025 11:42

danbev marked this pull request as ready for review

June 16, 2025 11:42

danbev (Member, Author) commented Jun 16, 2025

@chriswang- It would be great if you could try this out with the audio sample in your original issue report.

chriswang- commented Jun 16, 2025

@danbev Sorry, the issue was not submitted by me, but I can try to verify it.

danbev (Member, Author) commented Jun 16, 2025

@chriswang- Ah my bad, I should have checked to be sure and not just assumed.

chriswang- commented Jun 16, 2025

subtitle-master-with-vad.json
subtitle-master-without-vad.json
subtitle-PR.json

@danbev
I have uploaded three files: one with the Master branch result that includes the VAD feature, one without the VAD feature, and the third using your PR for transcription. After a brief comparison, it seems the issue has been resolved. However, I really forgot all the testing context and related information from when I first discovered the bug. I only noticed the bug and confirmed that the same issue exists on GitHub.


 whisper-cli : align token timestamps with VAD ts

c5e33f4

This commit aligns the token timestamps with the VAD timestamps when VAD
is enabled.
The motivation of this is that currently the token timestamps that are
reported in the full json output are the timestamps that whisper sees
after the VAD has processed the audio. This means that whisper only sees
possibly filtered audio and the token timestamps are related to the
filtered audio, not the original audio. For the segment timestamps we
map/align them with original timestamps but this is not currently done
for the token timestamps which is what this commit aims to address.
Resolves: ggml-org#3174

danbev force-pushed the vad-token-timestamp-alignment branch from 12e44a1 to c5e33f4

June 24, 2025 11:17

accessiblepixel added a commit to accessiblepixel/whisper.cpp that referenced this pull request

Jul 5, 2025


 Add vad corrections as per ggml-org#3218 to my own branch

50739c6

2 participants

@danbev @chriswang-
