How can I invert a MelSpectrogram with torchaudio and get an audio waveform?
Question:
I have a MelSpectrogram generated from:
eval_seq_specgram = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_fft=256)(eval_audio_data).transpose(1, 2)
So eval_seq_specgram now has a size of torch.Size([1, 128, 499]), where 499 is the number of timesteps and 128 is n_mels.
I’m trying to invert it, so I’m trying to use GriffinLim, but before doing that I think I need to invert the mel scale, so I have:
inverse_mel_pred = torchaudio.transforms.InverseMelScale(sample_rate=sample_rate, n_stft=256)(eval_seq_specgram)
inverse_mel_pred has a size of torch.Size([1, 256, 499]). Then I’m trying to use GriffinLim:
pred_audio = torchaudio.transforms.GriffinLim(n_fft=256)(inverse_mel_pred)
but I get an error:
Traceback (most recent call last):
File "evaluate_spect.py", line 63, in <module>
main()
File "evaluate_spect.py", line 51, in main
pred_audio = torchaudio.transforms.GriffinLim(n_fft=256)(inverse_mel_pred)
File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torchaudio/transforms.py", line 169, in forward
return F.griffinlim(specgram, self.window, self.n_fft, self.hop_length, self.win_length, self.power,
File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torchaudio/functional.py", line 179, in griffinlim
inverse = torch.istft(specgram * angles,
RuntimeError: The size of tensor a (256) must match the size of tensor b (129) at non-singleton dimension 1
Not sure what I’m doing wrong or how to resolve this.
Answers:
Just from looking at the Torch documentation: the shape of the input to the Griffin-Lim reconstruction should be (..., freq, frame), where freq is n_fft/2 + 1 (presumably it omits the negative frequencies). Therefore, if you did a 256-point FFT, the shape of inverse_mel_pred should be [1, 129, 499], not [1, 256, 499]. To get this shape you should just omit the negative-frequency bins of each spectrogram in inverse_mel_pred. I don’t use Torch, but generally the bins are ordered from negative to positive frequencies (and Torch’s utilities are just wrappers for other tools, so I am fairly sure it does the same). Therefore, to get the desired shape:
inverse_mel_pred = inverse_mel_pred[:, 127:, :]  # keep the last n_fft // 2 + 1 = 129 bins
Then pass it to GriffinLim just like you already did. I might be off by one or so in the line above, so make sure the shape is correct.
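As a quick sanity check before calling GriffinLim (a minimal sketch; inverse_mel_pred is the tensor from the question, and the expected bin count follows from the documentation quoted above):
# GriffinLim expects the frequency dimension to be n_fft // 2 + 1
n_fft = 256
expected_freq = n_fft // 2 + 1  # 129 bins for a 256-point FFT
assert inverse_mel_pred.shape[1] == expected_freq, inverse_mel_pred.shape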
The input specgram (Tensor) has the shape (…, freq, frames), where freq is n_fft // 2 + 1. So if inverse_mel_pred has a size of torch.Size([1, 256, 499]), n_fft should be (256 - 1) * 2 = 510.
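In other words, the bin count and n_fft determine each other. A quick check of the arithmetic (plain Python; the variable names are illustrative):
n_fft = 256
freq = n_fft // 2 + 1            # a 256-point FFT yields 129 bins
implied_n_fft = (freq - 1) * 2   # and 129 bins imply n_fft = 256
# conversely, the 256 bins fed in here would imply n_fft = (256 - 1) * 2 = 510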
By looking at the documentation and by doing a quick test on Colab, it seems that:
- When you create the MelSpectrogram with n_fft = 256, 256/2 + 1 = 129 frequency bins are generated
- At the same time, InverseMelScale takes a parameter called n_stft that indicates the number of frequency bins, so in your case it should be set to 129 (a minimal fix is sketched below)
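Applied to the question’s code, a minimal sketch of the fix (this assumes sample_rate and eval_seq_specgram from the question, with eval_seq_specgram in (channel, n_mels, time) layout):
n_fft = 256
# n_stft must match the STFT bin count, n_fft // 2 + 1 = 129, not n_fft itself
inverse_mel_pred = torchaudio.transforms.InverseMelScale(
    sample_rate=sample_rate, n_stft=n_fft // 2 + 1
)(eval_seq_specgram)
pred_audio = torchaudio.transforms.GriffinLim(n_fft=n_fft)(inverse_mel_pred)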
As a side note, I don’t understand why you need the transpose call, since according to the docs and my tests
waveform, sample_rate = torchaudio.load('test.wav')
mel_specgram = transforms.MelSpectrogram(sample_rate)(waveform)  # (channel, n_mels, time)
already returns a (channel, n_mels, time) tensor, and InverseMelScale wants a tensor of shape (…, n_mels, time).
For reference, the full code:
import torch
import torchaudio
import IPython

waveform, sample_rate = torchaudio.load("wavs/LJ030-0196.wav", normalize=True)

n_fft = 256
n_stft = int((n_fft // 2) + 1)  # 129 frequency bins for a 256-point FFT

transform = torchaudio.transforms.MelSpectrogram(sample_rate, n_fft=n_fft)
inverse_transform = torchaudio.transforms.InverseMelScale(sample_rate=sample_rate, n_stft=n_stft)
griffinlim_transform = torchaudio.transforms.GriffinLim(n_fft=n_fft)

mel_specgram = transform(waveform)                        # (channel, n_mels, time)
inverse_waveform = inverse_transform(mel_specgram)        # back to linear-frequency bins
pseudo_waveform = griffinlim_transform(inverse_waveform)  # recover phase with Griffin-Lim
And to compare the original with the reconstruction:
IPython.display.Audio(waveform.numpy(), rate=sample_rate)
IPython.display.Audio(pseudo_waveform.numpy(), rate=sample_rate)
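If you also want to write the reconstruction to disk for comparison (the output path here is illustrative):
# pseudo_waveform is a (channel, time) tensor, which torchaudio.save expects
torchaudio.save("reconstructed.wav", pseudo_waveform, sample_rate)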