Speech Recognition and Speaker Diarization with WhisperX

macOS Edition

Using the GPU (MPS)

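Apple Silicon also includes a GPU. MPS (Metal Performance Shaders) is the counterpart of NVIDIA's CUDA. WhisperX does not support MPS, but OpenAI whisper does, so speech recognition can be run with the combination of OpenAI whisper and MPS. However, the speedup is smaller than with an NVIDIA GPU on Windows. In addition, the model can break down internally during a run and abort with an error, and hallucinations can occur. Furthermore, although speech recognition can use the GPU, speaker diarization does not currently support it and ends up running on the CPU. For these reasons, the combination of Windows and an NVIDIA GPU currently appears to be the best choice for speech recognition and speaker diarization.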

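First, create a new Python virtual environment. The environment used here is named py310whisper_mps; assuming conda and Python 3.10, as the name suggests, the command would be:

conda create -n py310whisper_mps python=3.10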

Activate the environment you created.

conda activate py310whisper_mps

Install OpenAI whisper.

pip install openai-whisper

Confirm that the installation succeeded.

(py310whisper_mps) whisperx % pip list | grep whisper ⏎
openai-whisper     20250625
(py310whisper_mps) whisperx % pip list | grep torch ⏎
torch              2.12.0
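
The same check can also be made from Python, which is convenient inside scripts. The following is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is hypothetical and is not part of whisper or PyTorch.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The two packages checked with pip list above.
    for name in ("openai-whisper", "torch"):
        print(f"{name:<18} {installed_version(name)}")
```

A result of None for either package suggests the install step did not complete in the active environment.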

Once the environment is ready, run a quick check that GPU computation is available. Create the following code in a text editor and save it under any file name (for example, mps_test.py).


import torch
print(torch.backends.mps.is_available())

If running the code above prints True, the check succeeded and you can move on.

(py310whisper_mps) whisperx % python mps_test.py ⏎
True
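
If a script should keep working on machines without MPS, the same check can be used to choose a device name. The following is a minimal sketch; pick_device is a hypothetical helper, and falling back to "cpu" when PyTorch is missing is an assumption made only so the snippet runs anywhere.

```python
import importlib.util


def pick_device() -> str:
    """Return "mps" when PyTorch reports a usable MPS backend, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch

    return "mps" if torch.backends.mps.is_available() else "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The returned string is the kind of value that can be passed as a device name.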

On macOS, passing the --device mps option enables computation on the GPU (MPS). Adding the --fp16 False option, which disables 16-bit half-precision floating point, appears (counterintuitively) to make processing faster. Also note that the command is whisper, not whisperx.

whisper --model turbo --language ja --device mps --fp16 False -o output_turbo_mps voice.m4a

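On a Mac Studio with an M1 Max (32-core GPU), the command above finished all processing in about 1 minute 20 seconds. However, recognition accuracy was low, and the result looked like the following. Compare it with the result of CPU processing.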


1
00:00:00,000 --> 00:00:07,040
日銀は今月15日から2日間金融政策決定会合を開きます

2
00:00:07,040 --> 00:00:12,100
会合を前に都内で講演した日銀の上田総裁は

60
00:09:30,880 --> 00:09:32,160
出しを求めています。
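
To compare this output with the CPU result programmatically, the SRT text can be parsed into segments first. The following is a minimal sketch that assumes the plain index / time range / text layout shown above; Segment, to_seconds, and parse_srt are hypothetical names, not part of whisper.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


_TIME = re.compile(r"(\d+):(\d+):(\d+),(\d+)")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:09:30,880 to seconds."""
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> List[Segment]:
    """Parse SRT text into segments; blocks are separated by blank lines."""
    segments = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip incomplete blocks
        start, end = lines[1].split("-->")
        segments.append(
            Segment(int(lines[0]), to_seconds(start), to_seconds(end),
                    "\n".join(lines[2:]))
        )
    return segments


# First two segments of the output shown above.
SAMPLE = """\
1
00:00:00,000 --> 00:00:07,040
日銀は今月15日から2日間金融政策決定会合を開きます

2
00:00:07,040 --> 00:00:12,100
会合を前に都内で講演した日銀の上田総裁は
"""

if __name__ == "__main__":
    for seg in parse_srt(SAMPLE):
        print(seg.index, seg.start, seg.end, seg.text)
```

With both the MPS and CPU files parsed this way, the text fields can be compared segment by segment, for example with the standard library's difflib.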

Running in half precision, without the --fp16 False option, produced the same recognition result but took about 2 minutes 40 seconds.

whisper --model turbo --language ja --device mps -o output_turbo_mps_fp16true voice.m4a
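
To reproduce the timing comparison between the two commands above, the runs can be driven and timed from Python. The following is a minimal sketch; build_whisper_command, timed_run, and speedup are hypothetical helpers, and only the options already used in this section are passed. Note that the sketch always passes --fp16 explicitly, whereas the second command above simply omits it and relies on the default.

```python
import subprocess
import time
from typing import List


def build_whisper_command(audio: str, output_dir: str, model: str = "turbo",
                          language: str = "ja", device: str = "mps",
                          fp16: bool = False) -> List[str]:
    """Assemble the whisper CLI call used in this section as an argument list."""
    return [
        "whisper",
        "--model", model,
        "--language", language,
        "--device", device,
        "--fp16", str(fp16),
        "-o", output_dir,
        audio,
    ]


def timed_run(command: List[str]) -> float:
    """Run a command and return the elapsed wall-clock time in seconds."""
    started = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - started


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Timings reported in this section: about 2 min 40 s with half precision
    # versus about 1 min 20 s with --fp16 False.
    print(speedup(160, 80))  # 2.0

    # To measure on your own machine (this runs whisper, so it takes a while):
    # fp32 = timed_run(build_whisper_command("voice.m4a", "output_turbo_mps"))
```

With the figures reported here, disabling half precision was roughly twice as fast on this machine.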

Next, try the large model.

whisper --model large --language ja --device mps --fp16 False -o output_large_mps voice.m4a

Recognition seemed to work partway through, but after about 10 minutes the model appeared to break down internally. (Even without --fp16 False, it broke down after about 5 minutes.)

ValueError: Expected parameter logits (Tensor of shape (5, 51866)) of distribution Categorical(logits: torch.Size([5, 51866])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[-95.9503,     -inf,     -inf,  ...,     -inf,     -inf,     -inf],
        [-42.2203,     -inf,     -inf,  ...,     -inf,     -inf,     -inf],
        [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
        [-19.7479,     -inf,     -inf,  ...,     -inf,     -inf,     -inf],
        [-22.9333,     -inf,     -inf,  ...,     -inf,     -inf,     -inf]],
       device='mps:0')
Skipping voice.m4a due to ValueError: Expected parameter logits (Tensor of shape (5, 51866)) of distribution Categorical(logits: torch.Size([5, 51866])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[-95.9503,     -inf,     -inf,  ...,     -inf,     -inf,     -inf],
        [-42.2203,     -inf,     -inf,  ...,     -inf,     -inf,     -inf],
        [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
        [-19.7479,     -inf,     -inf,  ...,     -inf,     -inf,     -inf],
        [-22.9333,     -inf,     -inf,  ...,     -inf,     -inf,     -inf]],
       device='mps:0')
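
Looking at the tensor in the message, the third of the five rows consists entirely of nan, while the other rows hold ordinary values and -inf. It appears to be the nan values, not the -inf entries, that the validation rejects, although this reading is an assumption about PyTorch's check rather than something confirmed here. The following is a minimal sketch, in plain Python, of locating such rows; rows_with_nan is a hypothetical helper, and only the leading values printed in the message are used.

```python
import math
from typing import List, Sequence


def rows_with_nan(logits: Sequence[Sequence[float]]) -> List[int]:
    """Return the indices of rows that contain at least one NaN."""
    return [i for i, row in enumerate(logits)
            if any(math.isnan(v) for v in row)]


if __name__ == "__main__":
    ninf = float("-inf")
    nan = float("nan")
    # Leading values of the five rows in the error message above.
    logits = [
        [-95.9503, ninf, ninf],
        [-42.2203, ninf, ninf],
        [nan, nan, nan],
        [-19.7479, ninf, ninf],
        [-22.9333, ninf, ninf],
    ]
    print(rows_with_nan(logits))  # [2]
```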

For now, the combination of Windows and NVIDIA appears to be far better.
