Trying Speech Recognition and Speaker Diarization with WhisperX

Windows Edition

Speech Recognition and Speaker Diarization on the CPU

To run speaker diarization, log in to huggingface_hub and then simply add the --diarize option when running whisperx. First, we run speech recognition with the large model, followed by speaker diarization.

(whisperx_cpu) PS C:\Users\...\whisperx> whisperx --model large --language ja voice.m4a -o output_large_diarize --diarize ⏎

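Next, we run speech recognition with the turbo model, followed by speaker diarization. Note that the model choice (large, turbo, and so on) does not affect the speaker diarization step itself.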

(whisperx_cpu) PS C:\Users\...\whisperx> whisperx --model turbo --language ja voice.m4a -o output_turbo_diarize --diarize ⏎
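
If you want to measure processing times for the two runs yourself, the commands can be wrapped in a small timing script. The following is only a sketch, assuming that whisperx is available on the PATH of the active environment; the helper names build_command and timed_run are hypothetical and are not part of WhisperX. Only the options that appear in the commands above are used.

```python
import subprocess
import time


def build_command(model, audio, out_dir, diarize=True):
    """Build the whisperx command line used in this section (hypothetical helper)."""
    cmd = ["whisperx", "--model", model, "--language", "ja", audio, "-o", out_dir]
    if diarize:
        cmd.append("--diarize")
    return cmd


def timed_run(cmd):
    """Run a command and return the elapsed wall-clock time in seconds."""
    started = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError if the command fails
    return time.perf_counter() - started


# Show the command that would be executed for the turbo model.
print(" ".join(build_command("turbo", "voice.m4a", "output_turbo_diarize")))
```

Calling timed_run(build_command("turbo", "voice.m4a", "output_turbo_diarize")) would then run the recognition and return the elapsed seconds. Wall-clock time also includes model loading, so results will vary from machine to machine.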

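With speech recognition alone, the large model took 8 min 20 s; with speaker diarization added, it took about 12 min 00 s. Likewise, the turbo model took 5 min 00 s for speech recognition alone and about 9 min 00 s with speaker diarization included.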
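
The time added by diarization can be worked out from these figures. The sketch below only restates the numbers quoted above (the dictionary layout and function name are illustrative). The extra time comes to about 3 min 40 s for large and about 4 min 00 s for turbo, which is at least consistent with the note that the model choice does not affect diarization. These are approximate figures from single runs, so the small difference should not be over-interpreted.

```python
from datetime import timedelta

# Processing times quoted in the text.
timings = {
    "large": {"asr_only": timedelta(minutes=8, seconds=20),
              "with_diarize": timedelta(minutes=12)},
    "turbo": {"asr_only": timedelta(minutes=5),
              "with_diarize": timedelta(minutes=9)},
}


def diarization_overhead(timing):
    """Extra time spent when --diarize is added."""
    return timing["with_diarize"] - timing["asr_only"]


overheads = {model: diarization_overhead(t) for model, t in timings.items()}
for model, extra in overheads.items():
    print(f"{model}: +{extra}")
```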

Let us check the recognition results. Only the SRT output of the turbo model is shown here; as you can see, speaker information has been added to the results.

output_turbo_diarize/voice.srt
1
00:00:02,209 --> 00:00:20,488
[SPEAKER_01]: 日銀は今月15日から2日間、金融政策決定会合を開きます。会合を前に都内で講演した日銀の上田総裁は、仮に中東情勢が不透明な状況が続くとしても、日上に踏み切る可能性があるとの考えを示しました。

2
00:00:22,849 --> 00:00:51,995
[SPEAKER_01]: 先行き経済の下振れリスクに比べて物価の上振れリスクが高まると判断される場合には利上げの是非についてしっかりと議論する必要があると考えています日銀は去年12月の金融政策決定会合で政策金利を0.75%に引き上げていますが上田総裁はこれまでの利上げによっても金融経済活動は抑制されていない

... (omitted) ...

20
00:08:21,235 --> 00:08:49,042
[SPEAKER_04]: どうしてこんな病気になったのかと悔しい思いでいっぱいでした一人でも多くの全職患者が救済され安心と希望を持って生きられることを願っています私には兄が3人と弟が1人います今も水俣に住んでいる3男の兄は特措法で救済されましたが他の4人は

... (omitted) ...

23
00:09:45,465 --> 00:09:58,878
[SPEAKER_01]: 今日の日経平均株価の終わり値は1667円高い6万8402円で、初めて終わり値ベースで6万8000円台をつけ、市場最高値を更新しました。

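Here, let us take a quick look at how WhisperX decides the speaker of each segment, using segments 1 and 2 above. Segment 1 is spoken by the announcer (SPEAKER_01). In segment 2, the part from the beginning up to 「...必要があると考えています」 is spoken by Bank of Japan Governor Ueda, and everything from 「日銀は去年...」 onward is the announcer again. In the SRT file, however, the whole of segment 2 appears to be attributed to the announcer (SPEAKER_01).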

output_turbo_diarize/voice.srt
1
00:00:02,209 --> 00:00:20,488
[SPEAKER_01]: 日銀は今月15日から2日間、金融政策決定会合を開きます。会合を前に都内で講演した日銀の上田総裁は、仮に中東情勢が不透明な状況が続くとしても、日上に踏み切る可能性があるとの考えを示しました。

2
00:00:22,849 --> 00:00:51,995
[SPEAKER_01]: 先行き経済の下振れリスクに比べて物価の上振れリスクが高まると判断される場合には利上げの是非についてしっかりと議論する必要があると考えています日銀は去年12月の金融政策決定会合で政策金利を0.75%に引き上げていますが上田総裁はこれまでの利上げによっても金融経済活動は抑制されていない

To see what happened in more detail, let us look at the segment 2 portion of the JSON file. Speaker information is attached to each short token (word): lines 3 to 72 are spoken by SPEAKER_02, and lines 73 to 143 are recognized as spoken by SPEAKER_01. SPEAKER_02 spoke for 37.272 - 22.849 = 14.423 s, whereas SPEAKER_01 spoke for 51.995 - 37.272 = 14.723 s. Because SPEAKER_01 spoke for just 0.3 s longer than SPEAKER_02, the speaker of segment 2 was judged to be SPEAKER_01, that is, the announcer, and that result was stored on line 144.

output_turbo_diarize/voice.json
{"start": 22.849, "end": 51.995, "text": "先行き経済の下振れリスクに比べて物価の上振れリスクが高まると判断される場合には利上げの是非についてしっかりと議論する必要があると考えています日銀は去年12月の金融政策決定会合で政策金利を0.75%に引き上げていますが上田総裁はこれまでの利上げによっても金融経済活動は抑制されていない",
"words": [
{"word": "先", "start": 22.849, "end": 23.109, "score": 0.92, "speaker": "SPEAKER_02"},
{"word": "行", "start": 23.109, "end": 23.269, "score": 0.875, "speaker": "SPEAKER_02"},
{"word": "き", "start": 23.269, "end": 23.809, "score": 0.963, "speaker": "SPEAKER_02"},
{"word": "経", "start": 23.809, "end": 23.969, "score": 0.875, "speaker": "SPEAKER_02"},
{"word": "済", "start": 23.969, "end": 23.989, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "の", "start": 23.989, "end": 24.009, "score": 0.003, "speaker": "SPEAKER_02"},
{"word": "下", "start": 24.009, "end": 24.029, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "振", "start": 24.029, "end": 24.089, "score": 0.666, "speaker": "SPEAKER_02"},
{"word": "れ", "start": 24.089, "end": 24.129, "score": 0.495, "speaker": "SPEAKER_02"},
{"word": "リ", "start": 24.129, "end": 24.149, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "ス", "start": 24.149, "end": 24.169, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "ク", "start": 24.169, "end": 24.189, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "に", "start": 24.189, "end": 24.209, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "比", "start": 24.209, "end": 24.229, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "べ", "start": 24.229, "end": 24.249, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "て", "start": 24.249, "end": 24.289, "score": 0.499, "speaker": "SPEAKER_02"},
{"word": "物", "start": 24.289, "end": 24.369, "score": 0.75, "speaker": "SPEAKER_02"},
{"word": "価", "start": 24.369, "end": 24.509, "score": 0.857, "speaker": "SPEAKER_02"},
{"word": "の", "start": 24.509, "end": 24.589, "score": 0.749, "speaker": "SPEAKER_02"},
{"word": "上", "start": 24.589, "end": 24.609, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "振", "start": 24.609, "end": 24.709, "score": 0.798, "speaker": "SPEAKER_02"},
{"word": "れ", "start": 24.709, "end": 24.729, "score": 0.025, "speaker": "SPEAKER_02"},
{"word": "リ", "start": 24.729, "end": 24.829, "score": 0.799, "speaker": "SPEAKER_02"},
{"word": "ス", "start": 24.829, "end": 24.949, "score": 0.833, "speaker": "SPEAKER_02"},
{"word": "ク", "start": 24.949, "end": 25.229, "score": 0.926, "speaker": "SPEAKER_02"},
{"word": "が", "start": 25.229, "end": 25.51, "score": 0.928, "speaker": "SPEAKER_02"},
{"word": "高", "start": 25.51, "end": 26.43, "score": 0.978, "speaker": "SPEAKER_02"},
{"word": "ま", "start": 26.43, "end": 26.55, "score": 0.833, "speaker": "SPEAKER_02"},
{"word": "る", "start": 26.55, "end": 26.71, "score": 0.875, "speaker": "SPEAKER_02"},
{"word": "と", "start": 26.71, "end": 26.87, "score": 0.866, "speaker": "SPEAKER_02"},
{"word": "判", "start": 26.87, "end": 26.95, "score": 0.749, "speaker": "SPEAKER_02"},
{"word": "断", "start": 26.95, "end": 27.23, "score": 0.927, "speaker": "SPEAKER_02"},
{"word": "さ", "start": 27.23, "end": 29.27, "score": 0.989, "speaker": "SPEAKER_02"},
{"word": "れ", "start": 29.27, "end": 29.35, "score": 0.75, "speaker": "SPEAKER_02"},
{"word": "る", "start": 29.35, "end": 29.49, "score": 0.857, "speaker": "SPEAKER_02"},
{"word": "場", "start": 29.49, "end": 29.51, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "合", "start": 29.51, "end": 29.67, "score": 0.874, "speaker": "SPEAKER_02"},
{"word": "に", "start": 29.67, "end": 29.79, "score": 0.833, "speaker": "SPEAKER_02"},
{"word": "は", "start": 29.79, "end": 30.551, "score": 0.968, "speaker": "SPEAKER_02"},
{"word": "利", "start": 30.551, "end": 30.571, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "上", "start": 30.571, "end": 30.771, "score": 0.899, "speaker": "SPEAKER_02"},
{"word": "げ", "start": 30.771, "end": 30.971, "score": 0.899, "speaker": "SPEAKER_02"},
{"word": "の", "start": 30.971, "end": 31.651, "score": 0.969, "speaker": "SPEAKER_02"},
{"word": "是", "start": 31.651, "end": 31.671, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "非", "start": 31.671, "end": 31.811, "score": 0.857, "speaker": "SPEAKER_02"},
{"word": "に", "start": 31.811, "end": 31.851, "score": 0.498, "speaker": "SPEAKER_02"},
{"word": "つ", "start": 31.851, "end": 31.911, "score": 0.666, "speaker": "SPEAKER_02"},
{"word": "い", "start": 31.911, "end": 32.051, "score": 0.857, "speaker": "SPEAKER_02"},
{"word": "て", "start": 32.051, "end": 32.471, "score": 0.952, "speaker": "SPEAKER_02"},
{"word": "し", "start": 32.471, "end": 32.491, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "っ", "start": 32.491, "end": 32.631, "score": 0.857, "speaker": "SPEAKER_02"},
{"word": "か", "start": 32.631, "end": 32.751, "score": 0.831, "speaker": "SPEAKER_02"},
{"word": "り", "start": 32.751, "end": 32.831, "score": 0.75, "speaker": "SPEAKER_02"},
{"word": "と", "start": 32.831, "end": 32.851, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "議", "start": 32.851, "end": 32.871, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "論", "start": 32.871, "end": 32.971, "score": 0.8, "speaker": "SPEAKER_02"},
{"word": "す", "start": 32.971, "end": 32.991, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "る", "start": 32.991, "end": 33.011, "score": 0.001, "speaker": "SPEAKER_02"},
{"word": "必", "start": 33.011, "end": 33.031, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "要", "start": 33.031, "end": 33.051, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "が", "start": 33.051, "end": 33.071, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "あ", "start": 33.071, "end": 33.091, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "る", "start": 33.091, "end": 33.211, "score": 0.832, "speaker": "SPEAKER_02"},
{"word": "と", "start": 33.211, "end": 33.271, "score": 0.666, "speaker": "SPEAKER_02"},
{"word": "考", "start": 33.271, "end": 33.291, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "え", "start": 33.291, "end": 33.351, "score": 0.667, "speaker": "SPEAKER_02"},
{"word": "て", "start": 33.351, "end": 33.371, "score": 0.0, "speaker": "SPEAKER_02"},
{"word": "い", "start": 33.371, "end": 33.391, "score": 0.003, "speaker": "SPEAKER_02"},
{"word": "ま", "start": 33.391, "end": 33.511, "score": 0.833, "speaker": "SPEAKER_02"},
{"word": "す", "start": 33.511, "end": 37.272, "score": 0.995, "speaker": "SPEAKER_02"},
{"word": "日", "start": 37.272, "end": 37.532, "score": 0.91, "speaker": "SPEAKER_01"},
{"word": "銀", "start": 37.532, "end": 37.732, "score": 0.904, "speaker": "SPEAKER_01"},
{"word": "は", "start": 37.732, "end": 38.192, "score": 0.957, "speaker": "SPEAKER_01"},
{"word": "去", "start": 38.192, "end": 38.332, "score": 0.857, "speaker": "SPEAKER_01"},
{"word": "年", "start": 38.332, "end": 38.632, "score": 1.0, "speaker": "SPEAKER_01"},
{"word": "1", "start": 38.632, "end": 38.792, "score": 0.874, "speaker": "SPEAKER_01"},
{"word": "2", "start": 38.792, "end": 38.912, "score": 0.833, "speaker": "SPEAKER_01"},
{"word": "月", "start": 38.912, "end": 39.132, "score": 0.908, "speaker": "SPEAKER_01"},
{"word": "の", "start": 39.132, "end": 39.352, "score": 0.996, "speaker": "SPEAKER_01"},
{"word": "金", "start": 39.352, "end": 39.492, "score": 0.886, "speaker": "SPEAKER_01"},
{"word": "融", "start": 39.492, "end": 39.672, "score": 0.887, "speaker": "SPEAKER_01"},
{"word": "政", "start": 39.672, "end": 39.892, "score": 0.908, "speaker": "SPEAKER_01"},
{"word": "策", "start": 39.892, "end": 40.132, "score": 0.908, "speaker": "SPEAKER_01"},
{"word": "決", "start": 40.132, "end": 40.313, "score": 0.873, "speaker": "SPEAKER_01"},
{"word": "定", "start": 40.313, "end": 40.473, "score": 0.863, "speaker": "SPEAKER_01"},
{"word": "会", "start": 40.473, "end": 40.673, "score": 0.706, "speaker": "SPEAKER_01"},
{"word": "合", "start": 40.673, "end": 40.833, "score": 0.874, "speaker": "SPEAKER_01"},
{"word": "で", "start": 40.833, "end": 41.473, "score": 0.986, "speaker": "SPEAKER_01"},
{"word": "政", "start": 41.473, "end": 41.653, "score": 0.888, "speaker": "SPEAKER_01"},
{"word": "策", "start": 41.653, "end": 41.873, "score": 0.909, "speaker": "SPEAKER_01"},
{"word": "金", "start": 41.873, "end": 42.093, "score": 0.959, "speaker": "SPEAKER_01"},
{"word": "利", "start": 42.093, "end": 42.213, "score": 0.833, "speaker": "SPEAKER_01"},
{"word": "を", "start": 42.213, "end": 42.753, "score": 0.933, "speaker": "SPEAKER_01"},
{"word": "0", "start": 42.753, "end": 42.953, "score": 0.9, "speaker": "SPEAKER_01"},
{"word": ".", "start": 42.953, "end": 43.073, "score": 0.969, "speaker": "SPEAKER_01"},
{"word": "7", "start": 43.073, "end": 43.413, "score": 0.93, "speaker": "SPEAKER_01"},
{"word": "5", "start": 43.413, "end": 43.793, "score": 0.899, "speaker": "SPEAKER_01"},
{"word": "%", "start": 43.793, "end": 43.913, "score": 0.964, "speaker": "SPEAKER_01"},
{"word": "に", "start": 43.913, "end": 44.013, "score": 0.8, "speaker": "SPEAKER_01"},
{"word": "引", "start": 44.013, "end": 44.033, "score": 0.0, "speaker": "SPEAKER_01"},
{"word": "き", "start": 44.033, "end": 44.233, "score": 0.889, "speaker": "SPEAKER_01"},
{"word": "上", "start": 44.233, "end": 44.333, "score": 0.792, "speaker": "SPEAKER_01"},
{"word": "げ", "start": 44.333, "end": 44.473, "score": 0.893, "speaker": "SPEAKER_01"},
{"word": "て", "start": 44.473, "end": 44.553, "score": 0.839, "speaker": "SPEAKER_01"},
{"word": "い", "start": 44.553, "end": 44.633, "score": 0.751, "speaker": "SPEAKER_01"},
{"word": "ま", "start": 44.633, "end": 44.753, "score": 0.999, "speaker": "SPEAKER_01"},
{"word": "す", "start": 44.753, "end": 44.873, "score": 0.991, "speaker": "SPEAKER_01"},
{"word": "が", "start": 44.873, "end": 45.934, "score": 0.983, "speaker": "SPEAKER_01"},
{"word": "上", "start": 45.934, "end": 46.074, "score": 0.856, "speaker": "SPEAKER_01"},
{"word": "田", "start": 46.074, "end": 46.194, "score": 0.833, "speaker": "SPEAKER_01"},
{"word": "総", "start": 46.194, "end": 46.374, "score": 0.888, "speaker": "SPEAKER_01"},
{"word": "裁", "start": 46.374, "end": 46.574, "score": 0.898, "speaker": "SPEAKER_01"},
{"word": "は", "start": 46.574, "end": 47.374, "score": 0.975, "speaker": "SPEAKER_01"},
{"word": "こ", "start": 47.374, "end": 47.454, "score": 0.751, "speaker": "SPEAKER_01"},
{"word": "れ", "start": 47.454, "end": 47.594, "score": 0.993, "speaker": "SPEAKER_01"},
{"word": "ま", "start": 47.594, "end": 47.714, "score": 0.943, "speaker": "SPEAKER_01"},
{"word": "で", "start": 47.714, "end": 47.854, "score": 0.997, "speaker": "SPEAKER_01"},
{"word": "の", "start": 47.854, "end": 47.974, "score": 0.864, "speaker": "SPEAKER_01"},
{"word": "利", "start": 47.974, "end": 48.094, "score": 0.832, "speaker": "SPEAKER_01"},
{"word": "上", "start": 48.094, "end": 48.194, "score": 0.798, "speaker": "SPEAKER_01"},
{"word": "げ", "start": 48.194, "end": 48.314, "score": 0.878, "speaker": "SPEAKER_01"},
{"word": "に", "start": 48.314, "end": 48.454, "score": 0.965, "speaker": "SPEAKER_01"},
{"word": "よ", "start": 48.454, "end": 48.534, "score": 0.781, "speaker": "SPEAKER_01"},
{"word": "っ", "start": 48.534, "end": 48.614, "score": 0.753, "speaker": "SPEAKER_01"},
{"word": "て", "start": 48.614, "end": 48.734, "score": 0.879, "speaker": "SPEAKER_01"},
{"word": "も", "start": 48.734, "end": 49.314, "score": 0.999, "speaker": "SPEAKER_01"},
{"word": "金", "start": 49.314, "end": 49.514, "score": 0.939, "speaker": "SPEAKER_01"},
{"word": "融", "start": 49.514, "end": 49.954, "score": 0.948, "speaker": "SPEAKER_01"},
{"word": "経", "start": 49.954, "end": 50.135, "score": 0.887, "speaker": "SPEAKER_01"},
{"word": "済", "start": 50.135, "end": 50.315, "score": 0.888, "speaker": "SPEAKER_01"},
{"word": "活", "start": 50.315, "end": 50.555, "score": 0.913, "speaker": "SPEAKER_01"},
{"word": "動", "start": 50.555, "end": 50.735, "score": 0.888, "speaker": "SPEAKER_01"},
{"word": "は", "start": 50.735, "end": 50.875, "score": 1.0, "speaker": "SPEAKER_01"},
{"word": "抑", "start": 50.875, "end": 51.095, "score": 0.879, "speaker": "SPEAKER_01"},
{"word": "制", "start": 51.095, "end": 51.335, "score": 0.929, "speaker": "SPEAKER_01"},
{"word": "さ", "start": 51.335, "end": 51.455, "score": 0.838, "speaker": "SPEAKER_01"},
{"word": "れ", "start": 51.455, "end": 51.575, "score": 0.833, "speaker": "SPEAKER_01"},
{"word": "て", "start": 51.575, "end": 51.695, "score": 0.833, "speaker": "SPEAKER_01"},
{"word": "い", "start": 51.695, "end": 51.795, "score": 0.805, "speaker": "SPEAKER_01"},
{"word": "な", "start": 51.795, "end": 51.975, "score": 0.983, "speaker": "SPEAKER_01"},
{"word": "い", "start": 51.975, "end": 51.995, "score": 0.035, "speaker": "SPEAKER_01"}],
"avg_logprob": -0.031129870668683255, "speaker": "SPEAKER_01"},
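
The boundary between the two speakers and the time each one spoke can also be extracted programmatically from the "words" list. The sketch below is illustrative only: the function name speaker_spans is hypothetical, and the input is a four-word excerpt of the listing above (the first and last word of each speaker's run) rather than the full segment. It assumes every word carries "start", "end", and "speaker" keys, as in this listing.

```python
from itertools import groupby


def speaker_spans(words):
    """Collapse consecutive words with the same speaker into (speaker, start, end)."""
    spans = []
    for speaker, run in groupby(words, key=lambda w: w["speaker"]):
        run = list(run)
        # A span runs from the first word's start to the last word's end.
        spans.append((speaker, run[0]["start"], run[-1]["end"]))
    return spans


# Excerpt of segment 2: first and last word of each speaker's run.
words = [
    {"word": "先", "start": 22.849, "end": 23.109, "speaker": "SPEAKER_02"},
    {"word": "す", "start": 33.511, "end": 37.272, "speaker": "SPEAKER_02"},
    {"word": "日", "start": 37.272, "end": 37.532, "speaker": "SPEAKER_01"},
    {"word": "い", "start": 51.975, "end": 51.995, "speaker": "SPEAKER_01"},
]

for speaker, start, end in speaker_spans(words):
    print(f"{speaker}: {start:.3f} -> {end:.3f} ({end - start:.3f} s)")
```

On this excerpt the spans come out as 14.423 s for SPEAKER_02 and 14.723 s for SPEAKER_01, matching the figures computed by hand, with the speaker change at 37.272 s.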

More precisely, WhisperX appears to determine the speaker of each segment with logic similar to the following code.


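from collections import defaultdict

speaker_sum = defaultdict(float)  # total speaking time per speaker label
for word in words:  # words: the "words" list of one segment in the JSON
    duration = word["end"] - word["start"]
    speaker_sum[word["speaker"]] += duration

# The speaker with the longest total speaking time labels the whole segment.
segment_speaker = max(speaker_sum, key=speaker_sum.get)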

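Therefore, when you need to know exactly who said what, for example when producing meeting minutes from audio data, it is a good idea to add processing that also uses the information in the JSON file, such as re-splitting the segments.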
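
One possible way to re-split a segment is to cut it wherever the word-level speaker label changes and to emit each piece as its own SRT entry. The following is a minimal sketch, not a WhisperX feature: split_by_speaker and srt_time are hypothetical helpers, the input is a short excerpt around the speaker change in segment 2, and every word is assumed to have "start", "end", and "speaker" keys. Words are joined without spaces, which suits Japanese text; a space-delimited language would need a different join.

```python
from itertools import groupby


def split_by_speaker(segment):
    """Split one segment into pieces wherever the word-level speaker changes."""
    pieces = []
    for speaker, run in groupby(segment["words"], key=lambda w: w["speaker"]):
        run = list(run)
        pieces.append({
            "speaker": speaker,
            "start": run[0]["start"],
            "end": run[-1]["end"],
            "text": "".join(w["word"] for w in run),  # no spaces: Japanese text
        })
    return pieces


def srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


# Excerpt of segment 2 around the speaker change.
segment = {"words": [
    {"word": "ま", "start": 33.391, "end": 33.511, "speaker": "SPEAKER_02"},
    {"word": "す", "start": 33.511, "end": 37.272, "speaker": "SPEAKER_02"},
    {"word": "日", "start": 37.272, "end": 37.532, "speaker": "SPEAKER_01"},
    {"word": "銀", "start": 37.532, "end": 37.732, "speaker": "SPEAKER_01"},
]}

pieces = split_by_speaker(segment)
for index, piece in enumerate(pieces, start=1):
    print(index)
    print(f"{srt_time(piece['start'])} --> {srt_time(piece['end'])}")
    print(f"[{piece['speaker']}]: {piece['text']}")
    print()
```

Applied to the full segment 2, the same approach would yield one entry for the governor's remarks and a separate entry for the announcer's, instead of a single entry labeled SPEAKER_01.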
