Mixed languages

#2
by Bidutree - opened

Hi. I have sound files with mixed languages (Swedish and English), and when I try to transcribe them your model seems to completely ignore the spoken English. Is there any way to solve this?

Update: The Base and Small models both either ignore the English audio or try to translate it, seemingly with no preference for one behavior over the other, and when they translate, the output is a mix of English and Swedish.

National Library of Sweden / KBLab org
edited 7 days ago

We continued training these models on top of models that had already been trained on many languages, and on translation from X -> English.

Our training data only consisted of Swedish transcriptions and Swedish audio. However, it seems the models we trained have a tendency to attempt to directly translate X -> Swedish when other languages are spoken.

This is not something we have explicitly trained the models to do, and we haven't used a special token to control when the model should translate (the original openai/whisper-large uses a translate token). Therefore it will be hard to fix this behavior in our model.

I think the best solution for you would be to set up a pipeline where you first run language detection on your audio with openai/whisper-large-v3. Once you've run language detection on the audio chunks, you transcribe the Swedish chunks with kb-whisper and the English chunks with a different model (OpenAI's). You won't get perfect results for chunks that contain transitions between languages, but it's probably more in line with what you expect.

Unfortunately I can't provide a ready code example for this, but have a look at doing language detection here: https://discuss.huggingface.co/t/language-detection-with-whisper/26003/13
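
As a rough, untested starting point, the routing step of such a pipeline could look something like the sketch below. It only shows the dispatch logic: `route_chunks`, `detect_language`, `transcribers`, and `fallback` are hypothetical names made up for illustration, not part of kb-whisper or any library. You would have to supply `detect_language` yourself (for example, a wrapper around whisper-large-v3 language detection as in the linked thread) and plug in the actual transcription calls for each model.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical helper: all names here are illustrative assumptions.
def route_chunks(
    chunks: List[Any],
    detect_language: Callable[[Any], str],
    transcribers: Dict[str, Callable[[Any], str]],
    fallback: Callable[[Any], str],
) -> List[Tuple[str, str]]:
    """Detect the language of each audio chunk and transcribe it with
    the model registered for that language.

    chunks          -- audio chunks in whatever form your models accept
    detect_language -- returns a language code such as "sv" or "en"
    transcribers    -- maps a language code to a transcription function
    fallback        -- used when the detected language has no entry
    """
    results = []
    for chunk in chunks:
        lang = detect_language(chunk)
        # Pick the model for this language, or the fallback if unknown.
        transcribe = transcribers.get(lang, fallback)
        results.append((lang, transcribe(chunk)))
    return results
```

In this setup you would register a kb-whisper transcription function under `"sv"` and an OpenAI Whisper one under `"en"` (and probably reuse the latter as `fallback`). Chunks that switch language partway through will still be sent to a single model, so the limitation mentioned above still applies.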
