Non-English translations #526
-
Thanks @ggerganov for a phenomenal piece of s/w! OpenAI has def created fantastic models (more below). But my inference tests using the OpenAI Whisper impl on my puny MacBook Air and on beefy AWS servers were phenomenally bad :( e.g. 11 mins for a 30 sec audio clip.
Compare OpenAI Whisper w/ Whisper.cpp:
- Same puny MacBook Air
- Tiny model (77 MB on disk)
- "-l auto" because my tests mostly involve non-English dialogs
- ~3-4 secs total for ~60 seconds of audio
- Error rates < 10% when dialogs are mostly in one language
- Mixed dialogs (say English+Hindi) are not handled well, but that's not important to me
But... I have a Q re outputs with non-English audio. Why is it transliterated into Latin script? E.g. "box office pe asa tandav kiya hai fund ne" is whisper.cpp's output, but it's a transliteration of "बॉक्स ऑफिस पर ऐसा तांडव किया है फंड ने".
Does this mean OpenAI's models were trained on transliterated dialogs? One can use other APIs, e.g., to convert these back to Hindi. But the concern is that there's always some lossy conversion along the way. So it would be best to go from speech -> Hindi directly. Or am I missing something here?
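To tell at a glance whether a given transcript came out in Devanagari or in romanized form, one option is to count the characters by Unicode name. This is a minimal Python sketch, not part of whisper.cpp; the helper name `devanagari_ratio` is hypothetical, and it assumes the text only uses characters from the main Devanagari block, whose Unicode names all begin with "DEVANAGARI".

```python
import unicodedata


def devanagari_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are Devanagari."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # unicodedata.name() returns e.g. "DEVANAGARI LETTER BA"; the default ""
    # covers characters that have no name.
    hits = sum(
        1 for ch in chars
        if unicodedata.name(ch, "").startswith("DEVANAGARI")
    )
    return hits / len(chars)


# The two strings quoted in this post.
romanized = "box office pe asa tandav kiya hai fund ne"
native = "बॉक्स ऑफिस पर ऐसा तांडव किया है फंड ने"

print(devanagari_ratio(romanized))
print(devanagari_ratio(native))
```

A low ratio on audio that is known to be Hindi suggests the model fell back to romanized output, which is the case described above.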
Replies: 2 comments 7 replies
-
Did you add the --language option?
-
Ah, I now read your question more carefully and understand the problem that you have.
I think it is possible that the model simply decides to use English transliteration instead of Hindi script.
You should be able to fix this by passing a sample Hindi prompt to help the model "understand" what you want.
For example, try running this:
./main -m ./models/tiny.bin -l auto --prompt "यह हिंदी में कुछ यादृच्छिक पाठ है"
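If you drive whisper.cpp from a script, the same invocation can be assembled as an argument list so the Hindi prompt never passes through shell quoting. This is only a sketch: the flags -m, -f, -l and --prompt are the ones used in this thread, while the helper name `build_whisper_cmd` and the file name `filename.wav` are placeholders.

```python
import shlex
from typing import List, Optional


def build_whisper_cmd(
    model: str,
    audio: str,
    language: str = "auto",
    prompt: Optional[str] = None,
    binary: str = "./main",
) -> List[str]:
    """Build the argument list for a whisper.cpp run (hypothetical helper)."""
    cmd = [binary, "-m", model, "-f", audio, "-l", language]
    if prompt:
        # The initial prompt nudges the model toward the desired script.
        cmd += ["--prompt", prompt]
    return cmd


cmd = build_whisper_cmd(
    "./models/tiny.bin",
    "filename.wav",  # placeholder input file
    prompt="यह हिंदी में कुछ यादृच्छिक पाठ है",
)
print(shlex.join(cmd))

# To actually run it (requires a built whisper.cpp binary and model file):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids quoting problems with non-ASCII prompts.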
-
That's a good tip, thanks! I should experiment w/ prompts.
I also got ggml-medium.bin, and I can see "./main -m ./models/ggml-medium.bin -l auto" detecting languages much better and also producing transcripts in the desired language now.
It has the expected side effects, however: larger models consume more memory and time. So it definitely needs beefier Macs.
-
I'm also trying to translate from English to Hebrew, but it always transcribes in English.
-
@ggerganov i'm trying to do what the OP was accidentally getting, i.e. i'm trying to generate the transliterated Hindi text. i tried --prompt "box office pe asa tandav kiya hai fund ne" to tell the model what i want, but i'm not able to get it to work. my command looks like this: ./main -m models/ggml-large-v3.bin -f filename.wav --prompt "box office pe asa tandav kiya hai fund ne"
i also tried with -l auto and -l hi, but i'm still not able to get it to transliterate. i either get the Devanagari script or the translated English text.
any thoughts? thanks!
-
You can translate ONLY INTO English.
-
I believe Hindi and maybe Tamil were the only languages I've worked with in whisper.cpp where I had to specify something other than UTF-8.
echo "Text encoding 99% of time utf-8 but for Hindi"
echo "Had to use latin-1 for whatever reason"
echo "chardetect even detected it as utf-8"
echo "https://docs.python.org/3/library/codecs.html#standard-encodings"
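One possible reading of the latin-1 observation (an assumption, nothing in this thread confirms it) is that the bytes were valid UTF-8 all along and were decoded as Latin-1 somewhere downstream, which turns Hindi into mojibake. A minimal Python sketch of that round trip, with `fix_mojibake` as a hypothetical helper name:

```python
def fix_mojibake(s: str) -> str:
    """Undo a UTF-8 -> Latin-1 mis-decoding; return s unchanged if it does not apply."""
    try:
        # Latin-1 maps code points 0-255 one-to-one onto bytes, so this
        # recovers the original byte sequence of a mis-decoded string.
        return s.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Covers real Devanagari text (not encodable as Latin-1) and byte
        # sequences that are not valid UTF-8.
        return s


text = "बॉक्स ऑफिस"
raw = text.encode("utf-8")        # bytes as a UTF-8 writer would emit them
garbled = raw.decode("latin-1")   # what a reader assuming Latin-1 would see
recovered = fix_mojibake(garbled)

print(recovered == text)
```

If the recovered string matches, the file was UTF-8 and only the reader's encoding setting needed changing.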