
Voice assistant example - the "command" tool #190

ggerganov started this conversation in Show and tell

There seems to be significant interest in a voice assistant application of Whisper, similar to "Ok, Google", "Hey Siri", "Alexa", etc. The existing stream tool is not well suited to this use case, because voice assistant commands are usually short (i.e. play some music, turn on the TV, kill all humans, feed the baby, etc.), while stream expects a continuous stream of speech.

Therefore, implement a basic command-line tool called command that does the following:

  • Upon start, ask the person to say a "key phrase". The phrase should be an average sentence that normally takes 2-3 seconds to pronounce, so that we have enough "training" data of the person's voice
  • If the transcribed text matches the expected phrase, then we "remember" this audio and use it later. Else, we ask them to say it again until we succeed
  • We start listening continuously for voice activity using my VAD detector that I implemented for talk.wasm - I think it works very well given its simplicity
  • When we detect speech, we prepend the recorded key phrase to the last 2-3 seconds of the live audio and transcribe
  • The result should be: [key phrase][command], so by knowing the key phrase we can extract only the [command]

This should work on the Web and on Raspberry Pi, and thanks to the VAD it will be energy efficient.
It should be a good starting example for creating a voice assistant.
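The extraction step in the last bullet can be sketched in Python. This is only an illustration of the idea (the actual tool is C++ in examples/command), and the helper names here are made up:

```python
def normalize(text):
    # lowercase, drop punctuation, split into words
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()

def extract_command(transcript, key_phrase):
    """Return the text after the key phrase, or None if the prefix doesn't match."""
    words = normalize(transcript)
    key = normalize(key_phrase)
    if words[:len(key)] == key:
        return " ".join(words[len(key):])
    return None
```

So for a transcript like "Ok Whisper, turn on the TV." with key phrase "Ok Whisper", only "turn on the tv" is kept.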


Replies: 13 comments 16 replies


ggerganov
Nov 25, 2022
Maintainer Author

This is now fully functional:

command-0.mp4

Code is in examples/command

Web version: examples/command.wasm

0 replies

This comment has been hidden.

This comment has been hidden.

This comment has been hidden.

The command tool seems to work great, but the keyword not so much; I will give it another try.
It also made me wonder: how good are the timestamps from Whisper, and are they good enough to use it as a forced aligner?
I could try to find 'Whisper' keywords and build a 'Hey Whisper' KWS.

1 reply

ggerganov Nov 27, 2022
Maintainer Author

> The command seems to work great but the keyword not so great, but will give it another try.

Do you mean that it does not always recognise the "key phrase" at the start, but after it recognises it, then the commands are processed OK?

The "key phrase" can be easily changed in the source code:

https://github.com/ggerganov/whisper.cpp/blob/4698dcdb5238748951a087a5b26309c6b2826cc0/examples/command/command.cpp#L545-L547

The implemented approach does not depend on the timestamps.
It simply matches the transcribed text and picks everything after the "key phrase".


No, I was wondering whether the timestamps are at all accurate, and whether I could use Whisper as a forced aligner to extract the keyword, or whether I should use another tool.

0 replies

I like this, but maybe I am missing something. Should the behaviour be to activate and then never turn off, or should it be more like Alexa/Siri, with a 'wake word' followed by a command?

i.e. you say: "Hey Whisper, what is the time?", and it outputs something like:
[timestamp] ['what is the time']

Then, until you say a command with "Hey Whisper" at the beginning, nothing happens with normal speech.

8 replies

I'm wondering which approach makes more sense, as using Whisper with a 30-second window seems like a sledgehammer for simple wake-word detection.
Two options could be:

  1. just use an off-the-shelf wake-word system to start Whisper until the VAD detects a pause, or
  2. use streaming plus the VAD detection as an extra token to help parse the text. i.e. we use Whisper's streaming mode and push text to a file, but when the VAD detects a pause, we add a token to the text, like a ';'.
    It's pretty trivial to then parse the text in Python, splitting on the VAD token and checking whether the first words after a split are the wake word/phrase.
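Option 2 can be sketched like this; the ';' token and the wake phrase are example choices for illustration, not anything defined by whisper.cpp:

```python
VAD_TOKEN = ";"          # marker the streaming side inserts at each VAD pause
WAKE = "hey whisper"     # illustrative wake phrase

def commands_from_stream(text):
    """Split the streamed transcript on the VAD token and keep wake-phrase segments."""
    commands = []
    for segment in text.split(VAD_TOKEN):
        segment = segment.strip().lower()
        if segment.startswith(WAKE):
            commands.append(segment[len(WAKE):].strip())
    return commands
```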

Does it not need a minimum probability-threshold parameter?

The problem is that an off-the-shelf wake-word system doesn't really exist. There are quite a few, but with fixed keywords, whilst it's great to be able to choose your own without commercial limits such as Picovoice's.
The beam search adds quite a bit of latency for just a command. Also, before I forget: in commands.txt, should there be a whitespace-separated minimum probability threshold paired with each command, or just an overall parameter, e.g. ./command -prob 0.6?

I like how it stems and breaks up the commands with a probability matrix, but for simple commands the latency of the beam search is a bit much.
I have been thinking of late that you could drop KWS and have remote VAD feeding a central Whisper instance that uses the keyword to authenticate a command.
Also, generally a command has a predicate and a subject, and often a level.
I will get back to distributed streaming VAD later, but 'Hey Whisper, Turn on the Light' is an example of keyword, predicate, level and subject, and it fills up the beam search, so the latency is less of an issue.
The only single-word command I can think of is 'Stop', as the subject is whatever it is already doing.

So yeah, I like some elements of what has been done, but because of the latency I am thinking a hybrid would be great. The keyword is an audio authentication prefix to a command and should always be there, to stop third-party spoken words (such as from TV or media) causing havoc.

Going back to VAD: it's such a simple model. I get great results with a CRNN KWS where the noise classification is actually really accurate as an inverse VAD: if noise is not the current argmax, the input is spoken.
Even a simple CNN could make a great VAD, and you can capture spoken words and deliberately overfit to the user(s)' voice, or even use a speaker diarisation model just to kick-start broadcast, where the whisper.cpp VAD will work for end-of-sentence detection.
https://github.com/pirxus/personalVAD

Then you end up with a targeted voice broadcast, keyword authentication and a probability-stemmed command sentence, where the latency of the overall sentence is more tolerable.


OK, this will take me some time, as I'm a Python programmer and my C++ is very, very rusty...

But here is the plan:

You pass the wake phrase as an argument:
./command --wake_phrase 'hey whisper'

Count the number of words in the wake phrase:
phrase_length = len(wake_phrase.split())

Use "process_general_transcription" in "have_prompt" mode to get a block of text, separated by the VAD, something like:
sample = "hey, whisper, what's the time?"

Extract the first words, with the same length as the wake phrase:
potential_wake_phrase = ' '.join(sample.split()[:phrase_length])

Compute the distance between the two strings using Levenshtein (note: a smaller distance means a closer match):
distance = levenshtein(wake_phrase, potential_wake_phrase)

If the distance is small enough, output the sample minus the initial wake phrase:

if distance < threshold:
    print(' '.join(sample.split()[phrase_length:]))

I think that would fit my use case.
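Assembled into runnable Python, the plan might look like this. The levenshtein helper is hand-rolled since it is not in the standard library, and the threshold of 3 is just a guess that would need tuning:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance between two strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def match_wake(sample, wake_phrase, threshold=3):
    """Return the command after the wake phrase, or None if it doesn't match."""
    n = len(wake_phrase.split())
    words = sample.lower().replace(",", "").split()
    candidate = " ".join(words[:n])
    if levenshtein(wake_phrase, candidate) <= threshold:
        return " ".join(words[n:])
    return None
```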


EDIT: OK, using ChatGPT makes this much easier than I expected, should be done today!

(Screenshot: 2023-01-07 at 13:28:02)


All done, my first C++ code in 6 years. Thanks, ChatGPT! Pull request made!


My application needs to handle 1 or 2 word commands. In the future, maybe 3. There are 3 formats:

  1. word
  2. word word
  3. word numeric

Presently all is working pretty well. I built an error dictionary for the words, to correct Whisper's errors. Numeric input works very well. The issue is that Whisper has trouble with certain words, like "fur" and "crib". The error dictionary takes care of most of the errors.

I just implemented your command example, and I filled allowed_commands with all the words I need for commands. This is working quite well! The issue is that to use the code like your command example, you need to set max_tokens=1. Unfortunately, this breaks my little scheme of setting max_tokens=3 so I can read in a three-second chunk of audio that contains the two words. The ugly hack I thought of is to run whisper_full twice: the first time with max_tokens=1 to get the first word, and then again with max_tokens=3 to get the second word (which can be numeric or a regular word). This doubles the processing time and is not a good solution IMHO.

Is there an easy solution to this?

4 replies

ggerganov Jan 24, 2023
Maintainer Author

I am planning to improve the guided mode of the command example to support 2, 3 and more words.
You would be able to specify a set of allowed words for the first word, another set of words for the second word, etc.
This can be implemented with a single Encoder pass and then using a proper decoding strategy that conditions the token probabilities based on the allowed words.

I plan to demonstrate this with a simple Chess application that allows you to input moves with voice (#428):

"pawn to d4"
"bishop to e5"
"rook takes knight"
etc.

I think this is very similar to your use case, so keep an eye on this development.
No ETA for the moment.
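The decoding strategy described above (one set of allowed words per position, with token probabilities conditioned on them) can be illustrated with a toy sketch. This is not whisper.cpp code; it treats whole words as tokens, whereas the real decoder works on subword tokens:

```python
def constrained_decode(probs_per_step, allowed_per_position):
    """Greedy decode where each position only considers its allowed words.

    probs_per_step: one {word: probability} dict per output position.
    allowed_per_position: one set of permitted words per output position.
    """
    out = []
    for probs, allowed in zip(probs_per_step, allowed_per_position):
        masked = {w: p for w, p in probs.items() if w in allowed}
        out.append(max(masked, key=masked.get))  # pick the best allowed word
    return out
```

With the chess example, even if the raw probabilities favour "prawn" over "pawn", the constraint forces a legal move word.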


Awesome!


Hello @ggerganov, this is a great starting point and I have reviewed the current code in the main branch.
I have some questions:
a) There is the initial sentence for the activation: if one wants to use a single word, how is performance usually affected? Imagine playing a videogame which requires some fast actions; I don't really want to say an entire sentence just to issue a command. What if we allow the user to choose a single but low-probability word? This could be, for example, a Latin word like "vox".
b) For multiple keywords, can we have a simple BNF notation or similar, to allow more generic things like:
move [player|troops] [left|right|north|south]
drop (all) inventory
jump on *
So basically some form of simple regex to allow: a set of possible words, optional words, any words.
I think most of the logic is in this function: https://github.com/ggerganov/whisper.cpp/blob/e27fd6f0c0c14d51ff7035499c2c94d91e090f4d/examples/command/command.cpp#L254 right?
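For illustration, the [choice], (optional) and * forms above could be compiled to regular expressions roughly like this; a hypothetical sketch, not how whisper.cpp's grammar support is implemented:

```python
import re

def compile_pattern(pattern):
    """Compile a 'move [player|troops] [left|right]' style pattern to a regex."""
    out = []
    for tok in pattern.split():
        if tok.startswith("[") and tok.endswith("]"):
            alts = "|".join(map(re.escape, tok[1:-1].split("|")))
            out.append(f"(?:{alts}) ")                    # one word from a set
        elif tok.startswith("(") and tok.endswith(")"):
            out.append(f"(?:{re.escape(tok[1:-1])} )?")   # optional word
        elif tok == "*":
            out.append(".+ ")                             # any words
        else:
            out.append(re.escape(tok) + " ")              # literal word
    return re.compile("^" + "".join(out).rstrip() + "$")

def matches(pattern, command):
    return bool(compile_pattern(pattern).match(" ".join(command.split())))
```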


ggerganov Mar 19, 2025
Maintainer Author

BNF grammar is already supported. See the grammar-based examples and the wchess example.


I went ahead and implemented the "two-pass" method I alluded to above, except I switched the order. First I call whisper_full with max_tokens=3. Then, if word 1 is not found after running through the error dictionary, I call whisper_full again with max_tokens=1 and use the command-mode code to find a match.

There is a problem, though. Using command mode will always return a result. I need to look at the probability value to decide whether the result is accepted; otherwise I get "false positives" (incorrect but accepted commands), which I really do not want. Using Whisper in non-command mode will very rarely decide on a wrong command; it will just fail, which is preferable to a wrong command. I am now struggling with a probability threshold: if it's set too high, the command pass misses many corrections; if it's set too low, I get false positives.
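One way to structure that accept/reject decision is sketched below. The margin test against the runner-up is a suggestion for cutting false positives, not something the command example does, and both numbers would need tuning on real recordings:

```python
def accept(candidates, threshold=0.7, margin=0.2):
    """candidates: {command: probability} from the command-mode pass.

    Returns the best command only if it clears an absolute threshold AND
    beats the runner-up by a margin; otherwise returns None (prefer failing
    over acting on a wrong command).
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    best, p1 = ranked[0]
    p2 = ranked[1][1] if len(ranked) > 1 else 0.0
    if p1 >= threshold and (p1 - p2) >= margin:
        return best
    return None
```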

0 replies

Hi, can I get the output in a JSON file?

0 replies

I have succeeded in building main.exe for Windows with VS2022.
But could anybody help me: how can I build command.exe?

2 replies

Thank you for your assistance, but there is nothing there about building for Windows.


Hey, I tried command mode and it's pretty great; it takes a lot fewer resources on my old Intel Mac than regular stream does.

I have a bunch of questions and suggestions at the same time

Suggestions:

  1. complex commands - commands that MUST start with a certain word but the rest can be any text
  2. parameterized commands - commands that must take an argument with a certain pattern
  3. vocabulary - for specific domains or for command accuracy - I want to add new words that will be "prioritized" during inference

Questions:

  1. Can I make it allow more pauses in speech? Like X seconds of quiet to submit? Also, I see that Whisper "understands" unfinished sentences by putting ... or -- at the end; can we leverage that to extend the quiet-pause time? I think it would result in a great experience.
  2. What is the actual difference between command mode and waiting-for-command mode? I'm currently just parsing out what's after "Heard" and not even bothering with a wake word, but I'm not even sure what the difference is with regard to capturing the spoken text (if there's no commands.txt file).

Thanks!

p.s. if anyone got cool examples of how they're using this feature, please share!

0 replies

How would I build on Windows?

1 reply

Have you already successfully built the basic whisper.cpp package?

It worked for me using CMake and the MSVC build, with the community version of the Microsoft compiler and tools. You will probably need CMake installed and on the PATH.

Follow the quickstart instructions: https://github.com/ggerganov/whisper.cpp/tree/master

There's a download-ggml-model.cmd for downloading the model, instead of the sh script.

If this works, you can then try to build the command tools. These require the SDL2 library. I use vcpkg under Windows to make this and many other ported libraries available to MSVC (install vcpkg, then use it to install SDL2).

For some unknown reason, CMake's find_package(SDL2) did not work for me. I had to manually tell CMake where SDL2 is installed with the command:

setx SDL2_DIR "C:\prog\vcpkg\installed\x64-windows\share\sdl2\"

Change the path to suit your installation.

Then:

cmake -B build -DWHISPER_SDL2=ON
cmake --build build --config Release

worked fine for me (quite a few compiler warnings, though). The binaries are then available at ./build/bin/Release/.

Labels: ideas (Interesting ideas for experimentation)

This discussion was converted from issue #171 on November 27, 2022 09:36.
