
Voice assistant example - the "command" tool #190

ggerganov started this conversation in Show and tell

There seems to be significant interest in a voice assistant application of Whisper, similar to "Ok, Google", "Hey Siri", "Alexa", etc. The existing stream tool is not well suited to this use case, because voice assistant commands are usually short (i.e. play some music, turn on the TV, kill all humans, feed the baby, etc.), while stream expects a continuous stream of speech.

Therefore, implement a basic command-line tool called command that does the following:

  • Upon start, ask the person to say a "key phrase". The phrase should be an average sentence that normally takes 2-3 seconds to pronounce, so that we have enough "training" data of the person's voice
  • If the transcribed text matches the expected phrase, then we "remember" this audio and use it later. Else, we ask them to say it again until we succeed
  • We start listening continuously for voice activity using my VAD detector that I implemented for talk.wasm - I think it works very well given its simplicity
  • When we detect speech, we prepend the recorded key phrase to the last 2-3 seconds of the live audio and transcribe
  • The result should be: [key phrase][command], so by knowing the key phrase we can extract only the [command]

This should work on the Web and on Raspberry Pi, and thanks to the VAD it will be energy efficient.
It should be a good starting example for creating a voice assistant.
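The extraction step in the last bullet can be sketched in Python. This is only an illustration of the idea (the actual tool is C++ in examples/command), and the helper names here are made up:

```python
def normalize(text):
    # lowercase, drop punctuation, split into words
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()

def extract_command(transcript, key_phrase):
    """Return the text after the key phrase, or None if the prefix doesn't match."""
    words = normalize(transcript)
    key = normalize(key_phrase)
    if words[:len(key)] == key:
        return " ".join(words[len(key):])
    return None
```

So for a transcript like "Ok Whisper, turn on the TV." with key phrase "Ok Whisper", only "turn on the tv" is kept.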


Replies: 13 comments 16 replies


ggerganov
Nov 25, 2022
Maintainer Author

This is now fully functional:

command-0.mp4

Code is in examples/command

Web version: examples/command.wasm

0 replies

This comment has been hidden.

This comment has been hidden.

This comment has been hidden.

The command tool seems to work great, but the keyword not so much; I will give it another try.
It also made me wonder: how good are the timestamps from Whisper, and are they good enough to use it as a forced aligner?
I could try to find 'Whisper' keywords and build a 'Hey Whisper' KWS.

1 reply

ggerganov Nov 27, 2022
Maintainer Author

> The command seems to work great but the keyword not so great, but will give it another try.

Do you mean that it does not always recognise the "key phrase" at the start, but after it recognises it, then the commands are processed OK?

The "key phrase" can be easily changed in the source code:

https://github.com/ggerganov/whisper.cpp/blob/4698dcdb5238748951a087a5b26309c6b2826cc0/examples/command/command.cpp#L545-L547

The implemented approach does not depend on the timestamps.
It simply matches the transcribed text and picks everything after the "key phrase".


No, I was wondering whether the timestamps are at all accurate, and whether I could use Whisper as a forced aligner to extract the keyword, or whether I should use another tool.

0 replies

I like this, but maybe I am missing something. Should the behaviour be to activate and then never turn off, or should it be more like Alexa/Siri, with a 'wake word' followed by a command?

i.e. you say: "Hey Whisper, what is the time?", and it outputs something like:
[timestamp] ['what is the time']

Then, until you say a command with "Hey Whisper" at the beginning, nothing happens with normal speech.

8 replies

I'm wondering which approach makes more sense, as using Whisper with a 30-second window seems like a sledgehammer for simple wake-word detection.
Two options could be:

  1. just use an off-the-shelf wake-word system to start Whisper until the VAD detects a pause, or
  2. use streaming plus the VAD detection as an extra token to help parse the text. i.e. we use Whisper's streaming mode and push text to a file, but when the VAD detects a pause, we add a token to the text, like a ';'.
    It's pretty trivial to then parse the text in Python, splitting on the VAD token and checking whether the first words after a split are the wake word/phrase.
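Option 2 can be sketched like this; the ';' token and the wake phrase are example choices for illustration, not anything defined by whisper.cpp:

```python
VAD_TOKEN = ";"          # marker the streaming side inserts at each VAD pause
WAKE = "hey whisper"     # illustrative wake phrase

def commands_from_stream(text):
    """Split the streamed transcript on the VAD token and keep wake-phrase segments."""
    commands = []
    for segment in text.split(VAD_TOKEN):
        segment = segment.strip().lower()
        if segment.startswith(WAKE):
            commands.append(segment[len(WAKE):].strip())
    return commands
```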

Does it not need a minimum probability-threshold parameter?

The problem is that an off-the-shelf wake-word system doesn't really exist. There are quite a few, but with fixed keywords, whilst it's great to be able to choose your own without commercial limits such as Picovoice's.
The beam search adds quite a bit of latency for just a command. Also, before I forget: in commands.txt, should there be a whitespace-separated minimum probability threshold paired with each command, or just an overall parameter, e.g. ./command -prob 0.6?

I like how it stems and breaks up the commands with a probability matrix, but for simple commands the latency of the beam search is a bit much.
I have been thinking of late that you could drop KWS and have remote VAD feeding a central Whisper instance that uses the keyword to authenticate a command.
Also, generally a command has a predicate and a subject, and often a level.
I will get back to distributed streaming VAD later, but 'Hey Whisper, Turn on the Light' is an example of keyword, predicate, level and subject, and it fills up the beam search, so the latency is less of an issue.
The only single-word command I can think of is 'Stop', as the subject is whatever it is already doing.

So yeah, I like some elements of what has been done, but because of the latency I am thinking a hybrid would be great. The keyword is an audio authentication prefix to a command and should always be there, to stop third-party spoken words (such as from TV or media) causing havoc.

Going back to VAD: it's such a simple model. I get great results with a CRNN KWS where the noise classification is actually really accurate as an inverse VAD: if noise is not the current argmax, the input is spoken.
Even a simple CNN could make a great VAD, and you can capture spoken words and deliberately overfit to the user(s)' voice, or even use a speaker diarisation model just to kick-start broadcast, where the whisper.cpp VAD will work for end-of-sentence detection.
https://github.com/pirxus/personalVAD

Then you end up with a targeted voice broadcast, keyword authentication and a probability-stemmed command sentence, where the latency of the overall sentence is more tolerable.


OK, this will take me some time, as I'm a Python programmer and my C++ is very, very rusty...

But here is the plan:

You pass the wake phrase as an argument:
./command --wake_phrase 'hey whisper'

Count the number of words in the wake phrase:
phrase_length = len(wake_phrase.split())

Use "process_general_transcription" in "have_prompt" mode to get a block of text, separated by the VAD, something like:
sample = "hey, whisper, what's the time?"

Extract the first words, with the same length as the wake phrase:
potential_wake_phrase = ' '.join(sample.split()[:phrase_length])

Compute the distance between the two strings using Levenshtein (note: a smaller distance means a closer match):
distance = levenshtein(wake_phrase, potential_wake_phrase)

If the distance is small enough, output the sample minus the initial wake phrase:

if distance < threshold:
    print(' '.join(sample.split()[phrase_length:]))

I think that would fit my use case.
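Assembled into runnable Python, the plan might look like this. The levenshtein helper is hand-rolled since it is not in the standard library, and the threshold of 3 is just a guess that would need tuning:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance between two strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def match_wake(sample, wake_phrase, threshold=3):
    """Return the command after the wake phrase, or None if it doesn't match."""
    n = len(wake_phrase.split())
    words = sample.lower().replace(",", "").split()
    candidate = " ".join(words[:n])
    if levenshtein(wake_phrase, candidate) <= threshold:
        return " ".join(words[n:])
    return None
```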


EDIT: OK, using ChatGPT makes this much easier than I expected, should be done today!

(Screenshot: 2023-01-07 at 13:28:02)


All done, my first C++ code in 6 years. Thanks, ChatGPT! Pull request made!


My application needs to handle 1 or 2 word commands. In the future, maybe 3. There are 3 formats:

  1. word
  2. word word
  3. word numeric

Presently all is working pretty well. I built an error dictionary for the words, to correct Whisper's errors. Numeric input works very well. The issue is that Whisper has trouble with certain words, like "fur" and "crib". The error dictionary takes care of most of the errors.

I just implemented your command example, and I filled allowed_commands with all the words I need for commands. This is working quite well! The issue is that to use the code like your command example, you need to set max_tokens=1. Unfortunately, this breaks my little scheme of setting max_tokens=3 so I can read in a three-second chunk of audio that contains the two words. The ugly hack I thought of is to run whisper_full twice: the first time with max_tokens=1 to get the first word, and then again with max_tokens=3 to get the second word (which can be numeric or a regular word). This doubles the processing time and is not a good solution IMHO.

Is there an easy solution to this?

4 replies

ggerganov Jan 24, 2023
Maintainer Author

I am planning to improve the guided mode of the command example to support 2, 3 and more words.
You would be able to specify a set of allowed words for the first word, another set of words for the second word, etc.
This can be implemented with a single Encoder pass and then using a proper decoding strategy that conditions the token probabilities based on the allowed words.

I plan to demonstrate this with a simple Chess application that allows you to input moves with voice (#428):

"pawn to d4"
"bishop to e5"
"rook takes knight"
etc.

I think this is very similar to your use case, so keep an eye on this development.
No ETA for the moment.
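The decoding strategy described above (one set of allowed words per position, with token probabilities conditioned on them) can be illustrated with a toy sketch. This is not whisper.cpp code; it treats whole words as tokens, whereas the real decoder works on subword tokens:

```python
def constrained_decode(probs_per_step, allowed_per_position):
    """Greedy decode where each position only considers its allowed words.

    probs_per_step: one {word: probability} dict per output position.
    allowed_per_position: one set of permitted words per output position.
    """
    out = []
    for probs, allowed in zip(probs_per_step, allowed_per_position):
        masked = {w: p for w, p in probs.items() if w in allowed}
        out.append(max(masked, key=masked.get))  # pick the best allowed word
    return out
```

With the chess example, even if the raw probabilities favour "prawn" over "pawn", the constraint forces a legal move word.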


Awesome!


Hello @ggerganov, this is a great starting point and I have reviewed the current code in the main branch.
I have some questions:
a) There is the initial sentence for the activation: if one wants to use a single word, how is performance usually affected? Imagine playing a videogame which requires some fast actions; I don't really want to say an entire sentence just to issue a command. What if we allow the user to choose a single but low-probability word? This could be, for example, a Latin word like "vox".
b) For multiple keywords, can we have a simple BNF notation or similar, to allow more generic things like:
move [player|troops] [left|right|north|south]
drop (all) inventory
jump on *
So basically some form of simple regex to allow: a set of possible words, optional words, any words.
I think most of the logic is in this function: https://github.com/ggerganov/whisper.cpp/blob/e27fd6f0c0c14d51ff7035499c2c94d91e090f4d/examples/command/command.cpp#L254 right?
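For illustration, the [choice], (optional) and * forms above could be compiled to regular expressions roughly like this; a hypothetical sketch, not how whisper.cpp's grammar support is implemented:

```python
import re

def compile_pattern(pattern):
    """Compile a 'move [player|troops] [left|right]' style pattern to a regex."""
    out = []
    for tok in pattern.split():
        if tok.startswith("[") and tok.endswith("]"):
            alts = "|".join(map(re.escape, tok[1:-1].split("|")))
            out.append(f"(?:{alts}) ")                    # one word from a set
        elif tok.startswith("(") and tok.endswith(")"):
            out.append(f"(?:{re.escape(tok[1:-1])} )?")   # optional word
        elif tok == "*":
            out.append(".+ ")                             # any words
        else:
            out.append(re.escape(tok) + " ")              # literal word
    return re.compile("^" + "".join(out).rstrip() + "$")

def matches(pattern, command):
    return bool(compile_pattern(pattern).match(" ".join(command.split())))
```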


ggerganov Mar 19, 2025
Maintainer Author

BNF grammar is already supported. See the grammar-based examples and the wchess example.


I went ahead and implemented the "two-pass" method I alluded to above, except I switched the order. First I call whisper_full with max_tokens=3. Then, if word 1 is not found after running through the error dictionary, I call whisper_full again with max_tokens=1 and use the command-mode code to find a match.

There is a problem, though. Using command mode will always return a result. I need to look at the probability value to decide whether the result is accepted; otherwise I get "false positives" (incorrect but accepted commands), which I really do not want. Using Whisper in non-command mode will very rarely decide on a wrong command; it will just fail, which is preferable to a wrong command. I am now struggling with a probability threshold: if it's set too high, the command pass misses many corrections; if it's set too low, I get false positives.
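One way to structure that accept/reject decision is sketched below. The margin test against the runner-up is a suggestion for cutting false positives, not something the command example does, and both numbers would need tuning on real recordings:

```python
def accept(candidates, threshold=0.7, margin=0.2):
    """candidates: {command: probability} from the command-mode pass.

    Returns the best command only if it clears an absolute threshold AND
    beats the runner-up by a margin; otherwise returns None (prefer failing
    over acting on a wrong command).
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    best, p1 = ranked[0]
    p2 = ranked[1][1] if len(ranked) > 1 else 0.0
    if p1 >= threshold and (p1 - p2) >= margin:
        return best
    return None
```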

0 replies

Hi, can I get the output in a JSON file?

0 replies

I have succeeded in building main.exe for Windows with VS2022.
But could anybody help me: how can I build command.exe?

2 replies

Thank you for your assistance, but there is nothing there about building for Windows.


Hey, I tried command mode and it's pretty great; it takes a lot fewer resources on my old Intel Mac than regular stream does.

I have a bunch of questions and suggestions at the same time

Suggestions:

  1. complex commands - commands that MUST start with a certain word but the rest can be any text
  2. parameterized commands - commands that must take an argument with a certain pattern
  3. vocabulary - for specific domains or for command accuracy - I want to add new words that will be "prioritized" during inference

Questions:

  1. Can I make it allow more pauses in speech? Like X seconds of quiet to submit? Also, I see that Whisper "understands" unfinished sentences by putting ... or -- at the end; can we leverage that to extend the quiet-pause time? I think it would result in a great experience.
  2. What is the actual difference between command mode and waiting-for-command mode? I'm currently just parsing out what's after "Heard" and not even bothering with a wake word, but I'm not even sure what the difference is with regard to capturing the spoken text (if there's no commands.txt file).

Thanks!

p.s. if anyone got cool examples of how they're using this feature, please share!

0 replies

How would I build on Windows?

1 reply

Have you already successfully built the basic whisper.cpp package?

It worked for me using CMake and the MSVC build, with the community version of the Microsoft compiler and tools. You will probably need CMake installed and on the PATH.

Follow the quickstart instructions: https://github.com/ggerganov/whisper.cpp/tree/master

There's a download-ggml-model.cmd for downloading the model, instead of the sh script.

If this works, you can then try to build the command tools. These require the SDL2 library. I use vcpkg under Windows to make this and many other ported libraries available to MSVC (install vcpkg, then use it to install SDL2).

For some unknown reason, CMake's find_package(SDL2) did not work for me. I had to manually tell CMake where SDL2 is installed with the command:

setx SDL2_DIR "C:\prog\vcpkg\installed\x64-windows\share\sdl2\"

Change the path to suit your installation.

Then:

cmake -B build -DWHISPER_SDL2=ON
cmake --build build --config Release

worked fine for me (quite a few compiler warnings, though). The binaries are then available at ./build/bin/Release/.

Labels: ideas (Interesting ideas for experimentation)

This discussion was converted from issue #171 on November 27, 2022 09:36.
