Demo code not functional
I am having a serious issue with the model where inference hangs indefinitely on GPU after the files are fetched. The GPU spikes to full usage and stays there forever with no output, no error, and no progress. This happens even on 5-second audio clips, so it is not related to audio length. I have tried everything I can think of and need help.
Environment:
- Windows 11
- Python 3.10
- ESPnet latest (`pip install espnet -U`)
- PyTorch 2.4+ with CUDA 12.1 (`torch.cuda.is_available()` returns `True`)
- RTX 4070 GPU
Basic failing code:

```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile
import librosa

model = Speech2Text.from_pretrained("espnet/owls_025B_180K", device="cuda")
speech, rate = soundfile.read("ado.mp3")
speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
text, *_ = model(speech)[0]
print(text)
```
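In case it matters: `soundfile.read` returns a 2-D `(samples, channels)` array for stereo files, while the model expects a 1-D mono waveform, so I also tried down-mixing before resampling. A minimal sketch of that step (the random array below just stands in for the real file, which I can't attach here):

```python
import numpy as np

# Stand-in for soundfile.read("ado.mp3") on a stereo file:
# shape (samples, channels), float32.
rate = 44100
stereo = np.random.randn(rate, 2).astype(np.float32)

# Average the channels into a 1-D mono waveform before resampling.
mono = stereo.mean(axis=1).astype(np.float32)
print(mono.shape)  # (44100,)
```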
Output before hang:

```
Failed to import Flash Attention, using ESPnet default: No module named 'flash_attn'
Fetching 8 files: 100% 8/8 [00:00<00:00, XXX.XXit/s]
FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
```

Then it hangs forever. `nvidia-smi` shows high VRAM and GPU usage but no crash.
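To see *where* it is stuck, a standard-library watchdog can dump every thread's stack after a timeout; this is just `faulthandler`, nothing model-specific, and the resulting traceback would pinpoint the hang:

```python
import faulthandler
import sys

# If the process is still running after 60 s, dump all thread
# stacks to stderr so the hang location becomes visible.
faulthandler.dump_traceback_later(60, exit=False, file=sys.stderr)

# ... run the hanging inference call here ...

# Disarm the watchdog once inference returns normally.
faulthandler.cancel_dump_traceback_later()
print("watchdog armed and cancelled")
```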
What I have tried:
- **Audio length and chunking:** Tested 4-minute, 30-second, 10-second, and 5-second clips; also manually chunked into 30 s segments with padding. Same result on every chunk.
- **CPU inference:** `device="cpu"` also does not work.
- **Other models:** Tried `owls_05B_180K`, `owls_1B_180K`, etc.
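A bare PyTorch sanity check, outside ESPnet, can also help isolate whether the CUDA stack itself is at fault (a sketch; it falls back to CPU when CUDA is unavailable, so the same snippet runs anywhere):

```python
import torch

# A single matmul plus an explicit synchronize: if this hangs too,
# the problem is the PyTorch/CUDA install, not ESPnet.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 256, device=device)
y = x @ x
if device == "cuda":
    torch.cuda.synchronize()  # force the kernel to actually finish
print(device, tuple(y.shape))
```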
Thanks
Hi! Thanks for reporting this. Could you try this model instead: `espnet/owsm_v3.1_ebf_base`? The two share mostly the same underlying code, but the checkpoints are saved differently.
If you encounter the same issue with that model, it is likely a PyTorch or ESPnet installation problem rather than a problem with the checkpoint.