Demo code not functional
I am having a serious issue with the model where inference hangs indefinitely on GPU after the files are fetched. The GPU spikes to full usage and stays there forever with no output, no error, and no progress. This happens even on 5-second audio clips, so it is not related to audio length. I have tried everything I can think of and need help.
Environment:
- Windows 11
- Python 3.10
- ESPnet latest (`pip install espnet -U`)
- PyTorch 2.4+ with CUDA 12.1 (`torch.cuda.is_available()` returns `True`)
- RTX 4070 GPU
Basic failing code:

```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile
import librosa

model = Speech2Text.from_pretrained("espnet/owls_025B_180K", device="cuda")
speech, rate = soundfile.read("ado.mp3")
speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
text, *_ = model(speech)[0]
print(text)
```
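In case it matters: `soundfile.read` returns a 2-D `(samples, channels)` array for stereo files, while the model expects a 1-D mono waveform, so I also tried down-mixing before resampling. A minimal sketch of that step (the random array below just stands in for the real file, which I can't attach here):

```python
import numpy as np

# Stand-in for soundfile.read("ado.mp3") on a stereo file:
# shape (samples, channels), float32.
rate = 44100
stereo = np.random.randn(rate, 2).astype(np.float32)

# Average the channels into a 1-D mono waveform before resampling.
mono = stereo.mean(axis=1).astype(np.float32)
print(mono.shape)  # (44100,)
```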
Output before hang:

```
Failed to import Flash Attention, using ESPnet default: No module named 'flash_attn'
Fetching 8 files: 100% 8/8 [00:00<00:00, XXX.XXit/s]
FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
```

Then it hangs forever. `nvidia-smi` shows high VRAM and GPU usage but no crash.
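To see *where* it is stuck, a standard-library watchdog can dump every thread's stack after a timeout; this is just `faulthandler`, nothing model-specific, and the resulting traceback would pinpoint the hang:

```python
import faulthandler
import sys

# If the process is still running after 60 s, dump all thread
# stacks to stderr so the hang location becomes visible.
faulthandler.dump_traceback_later(60, exit=False, file=sys.stderr)

# ... run the hanging inference call here ...

# Disarm the watchdog once inference returns normally.
faulthandler.cancel_dump_traceback_later()
print("watchdog armed and cancelled")
```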
What I have tried:
- **Audio length and chunking:** Tested 4-minute, 30-second, 10-second, and 5-second clips; also manually chunked into 30 s segments with padding. Same result on every chunk.
- **CPU inference:** `device="cpu"` also does not work.
- **Other models:** Tried `owls_05B_180K`, `owls_1B_180K`, etc.
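A bare PyTorch sanity check, outside ESPnet, can also help isolate whether the CUDA stack itself is at fault (a sketch; it falls back to CPU when CUDA is unavailable, so the same snippet runs anywhere):

```python
import torch

# A single matmul plus an explicit synchronize: if this hangs too,
# the problem is the PyTorch/CUDA install, not ESPnet.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 256, device=device)
y = x @ x
if device == "cuda":
    torch.cuda.synchronize()  # force the kernel to actually finish
print(device, tuple(y.shape))
```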
Thanks
Hi! Thanks for reporting this. Could you try this model instead: `espnet/owsm_v3.1_ebf_base`? The two share mostly the same underlying code, but the checkpoints are saved differently.
If you encounter the same issue with that model, it is likely a PyTorch or ESPnet installation problem rather than a problem with the checkpoint.