Non-English translations · ggml-org/whisper.cpp · Discussion #526

sanjaymk908
Feb 22, 2023

Thanks @ggerganov for a phenomenal piece of s/w! OpenAI has def created fantastic models (more below). But my inference tests using the OpenAI Whisper impl on my puny MacBook Air and on beefy AWS servers were phenomenally bad :( e.g. 11 mins for a 30 sec audio clip.

Compare OpenAI Whisper w/ whisper.cpp:
- Same puny MacBook Air
- Tiny model (77 MB on disk)
- "-l auto" because my tests mostly involve non-English dialogs
- ~3-4 secs total for ~60 secs of audio
- Error rates < 10% when dialogs are mostly in one language
- Mixed dialogs (say English+Hindi) not handled well, but that's not important to me

But... I have a Q re outputs with non-English audio. Why is the output transliterated into English (Latin script)? E.g. "box office pe asa tandav kiya hai fund ne" is whisper.cpp's output, but that's a transliteration of "बॉक्स ऑफिस पर ऐसा तांडव किया है फंड ने".

Does this mean OpenAI's models were trained on transliterated dialogs? One can use other APIs to convert these back to Hindi, but the concern is that there's always some lossy conversion. So it would be best to go from speech -> Hindi directly. Or am I missing something here?

Replies: 2 comments 7 replies

tinoue
Feb 27, 2023

Did you add the --language option?

7 replies

ggerganov Feb 27, 2023
Maintainer

Ah, I now read your question more carefully and understand the problem that you have.
I think it is possible that the model simply decides to use English transliteration instead of Hindi script.

You should be able to fix this by passing a sample Hindi prompt to help the model "understand" what you want.
For example, try running this:

./main -m ./models/tiny.bin -l auto -p "यह हिंदी में कुछ यादृच्छिक पाठ है"
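If you script your runs, the same invocation can be assembled in Python. This is only a rough sketch: it uses the `-m`, `-l`, `-f` and `-p` flags that appear in this thread, while `build_whisper_cmd` and `sample.wav` are made-up names for illustration. It only builds and prints the command, it does not run it.

```python
import shlex


def build_whisper_cmd(model, audio=None, language="auto", prompt=None, binary="./main"):
    """Build an argv list for the whisper.cpp CLI (hypothetical helper).

    Only flags mentioned in this thread are used: -m, -l, -f, -p.
    """
    cmd = [binary, "-m", model, "-l", language]
    if audio is not None:
        cmd += ["-f", audio]  # input audio file
    if prompt:
        cmd += ["-p", prompt]  # initial prompt, e.g. a sample sentence in Hindi
    return cmd


# "sample.wav" is a placeholder file name, not something from this thread.
cmd = build_whisper_cmd(
    "./models/tiny.bin",
    audio="sample.wav",
    prompt="यह हिंदी में कुछ यादृच्छिक पाठ है",
)

# shlex.join quotes the prompt so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would avoid shell quoting issues with the Devanagari prompt entirely.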

sanjaymk908 Feb 28, 2023
Author

That's a good tip, thanks! I should experiment w/ prompts.
I also got ggml-medium.bin, and I can see "./main -m ./models/ggml-medium.bin -l auto" detecting languages much better and producing transcripts in the desired language now.
It has the expected side effects, however: larger models consume more mem+time. So it definitely needs beefier Macs.

thewh1teagle May 22, 2024

I'm also trying to translate from English to Hebrew, but it always transcribes in English.

baseliners Oct 7, 2024

@ggerganov i'm trying to do what the OP was accidentally getting.. i.e. i'm trying to generate the transliterated hindi text. i tried --prompt "box office pe asa tandav kiya hai fund ne" to tell the model what i want but not able to get it to work. my command looks like this: ./main -m models/ggml-large-v3.bin -f filename.wav --prompt "box office pe asa tandav kiya hai fund ne"

also tried with -l auto and -l hi but still not able to get it to transliterate. i either get the Devanagari script or the translated english text.

any thoughts? thanks!
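When comparing runs with different models, languages and prompts, it can help to check automatically which script each transcript came out in. Below is a minimal sketch using only Python's standard `unicodedata` module. `dominant_script` is a hypothetical helper, and taking the first word of the Unicode character name as the script is a simplification that works for Latin and Devanagari but is not a general script detector.

```python
import unicodedata


def dominant_script(text):
    """Return the most common script prefix (e.g. "LATIN", "DEVANAGARI").

    Counts letters and combining marks only; returns None if there are none.
    """
    counts = {}
    for ch in text:
        # Keep letters (L*) and marks (M*), skip spaces, digits, punctuation.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        script = name.split(" ")[0] if name else "UNKNOWN"
        counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else None


print(dominant_script("box office pe asa tandav kiya hai fund ne"))
print(dominant_script("बॉक्स ऑफिस पर ऐसा तांडव किया है फंड ने"))
```

Note that this only tells Latin from Devanagari output. It cannot tell romanized Hindi apart from an English translation, since both are Latin script.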

thewh1teagle Oct 7, 2024

You can translate ONLY INTO English

mrfragger
Oct 13, 2024

I believe Hindi and maybe Tamil were the only languages I've used with whisper.cpp where I had to specify something other than UTF-8.

echo "Text encoding 99% of time utf-8 but for Hindi"
echo "Had to use latin-1 for whatever reason"
echo "chardetect even detected it as utf-8"
echo "https://docs.python.org/3/library/codecs.html#standard-encodings"
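For anyone hitting something similar, here is a small sketch of how the two codecs treat the same bytes. It does not reproduce the setup above (that pipeline isn't shown here); it only illustrates, with the Python standard library, that latin-1 will decode any byte sequence without error, so UTF-8 Devanagari read as latin-1 gives garbled but recoverable text.

```python
hindi = "बॉक्स ऑफिस"

# Devanagari needs multiple bytes per character in UTF-8.
raw = hindi.encode("utf-8")

# latin-1 maps every byte 0-255 to one character, so decoding never fails,
# but the result is mojibake rather than Hindi.
garbled = raw.decode("latin-1")

# Because that mapping is one-to-one, the damage can be reversed.
restored = garbled.encode("latin-1").decode("utf-8")

print(len(hindi), len(raw), len(garbled))
print(restored == hindi)
```

So if text only looks right after going through latin-1, it may be worth checking whether some earlier step already decoded the UTF-8 bytes with the wrong codec.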

0 replies
