Demo code not functional

#2 · opened by urroxyz

I am having a serious issue where inference hangs indefinitely on GPU after the model files are fetched. GPU usage spikes to 100% and stays there forever, with no output, no error, and no progress. This happens even on 5-second audio clips, so it is not related to audio length. I have tried everything I can think of and need help.

Environment:

  • Windows 11
  • Python 3.10
  • ESPnet latest (pip install espnet -U)
  • PyTorch 2.4+ with CUDA 12.1 (torch.cuda.is_available() = True)
  • RTX 4070 GPU

Basic failing code:

from espnet2.bin.s2t_inference import Speech2Text
import soundfile
import librosa

model = Speech2Text.from_pretrained("espnet/owls_025B_180K", device="cuda")
speech, rate = soundfile.read("ado.mp3")
if speech.ndim > 1:            # soundfile returns (frames, channels) for stereo
    speech = speech.mean(axis=1)
speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
text, *_ = model(speech)[0]    # first hypothesis of the n-best list
print(text)

Output before hang:

Failed to import Flash Attention, using ESPnet default: No module named 'flash_attn'
Fetching 8 files: 100% 8/8 [00:00<00:00, XXX.XXit/s]
FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
Then hangs forever. nvidia-smi shows high VRAM and GPU usage but no crash.
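To see where the process is actually stuck, a generic CPython diagnostic (not ESPnet-specific) is to arm `faulthandler` before the inference call; `dump_stacks` below is a hypothetical helper name, and the 60 s timeout is an arbitrary choice:

```python
import faulthandler
import tempfile

def dump_stacks() -> str:
    """Return a snapshot of every Python thread's stack (hypothetical helper)."""
    with tempfile.TemporaryFile() as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read().decode()

# Arm a watchdog before the Speech2Text call: if the process is still
# stuck 60 s later, CPython prints every thread's stack to stderr
# (repeat=True keeps printing every 60 s while the hang lasts).
faulthandler.dump_traceback_later(60, repeat=True)
# ... text, *_ = model(speech)[0] would go here ...
faulthandler.cancel_dump_traceback_later()
```

The periodic stack dumps show which frame (decoder loop, CUDA kernel launch, data loading, ...) the process is sitting in, which helps distinguish a hang inside the model from one inside the driver stack.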

What I have tried:

  1. Audio length and chunking
    Tested 4-minute, 30-second, 10-second, and 5-second clips.
    Manually chunked into 30s segments with padding.
    Same result on every chunk.

  2. CPU inference
    device="cpu" also does not work.

  3. Other models
    Tried owls_05B_180K, owls_1B_180K, etc.
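For reference, the manual 30-second chunking described in step 1 can be sketched as follows (a minimal NumPy version; `chunk_audio` is a name I made up for this sketch):

```python
import numpy as np

def chunk_audio(speech: np.ndarray, sr: int = 16000, seconds: float = 30.0) -> np.ndarray:
    """Split a 1-D mono waveform into fixed-length chunks,
    zero-padding the last chunk to the full length."""
    size = int(sr * seconds)
    n_chunks = -(-len(speech) // size)  # ceiling division
    padded = np.zeros(n_chunks * size, dtype=speech.dtype)
    padded[: len(speech)] = speech
    return padded.reshape(n_chunks, size)

# 70 s of audio at 16 kHz -> three 30 s chunks, last one zero-padded
chunks = chunk_audio(np.ones(16000 * 70, dtype=np.float32))
print(chunks.shape)  # → (3, 480000)
```

Each row can then be passed to the model separately, which is the pattern that still hangs on every chunk.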

Thanks

ESPnet org

Hi! Thanks for reporting this. What happens if you try espnet/owsm_v3.1_ebf_base? The two models share mostly the same underlying code, but the checkpoints are saved differently.

If you encounter the same issue, it is likely some kind of PyTorch or ESPnet installation issue.
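One quick way to separate a broken PyTorch/CUDA install from a model-specific problem is a tiny GPU smoke test outside ESPnet; `cuda_smoke_test` is a hypothetical helper, and the snippet degrades gracefully when torch or a GPU is missing:

```python
import importlib.util

def cuda_smoke_test() -> str:
    """Run a tiny matmul on the GPU; return a short status string."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if not torch.cuda.is_available():
        return "cuda not available"
    x = torch.randn(64, 64, device="cuda")
    y = (x @ x).sum()
    torch.cuda.synchronize()  # force the kernel to actually execute
    return f"ok: {y.item():.1f}"

print(cuda_smoke_test())
```

If this trivial op also hangs, the problem is in the CUDA/driver/PyTorch stack rather than in the OWLS checkpoints.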
