
I converted some audio files to spectrograms and saved them to files using the following code:

import os
from matplotlib import pyplot as plt
import librosa
import librosa.display
import IPython.display as ipd

audio_fpath = "./audios/"
spectrograms_path = "./spectrograms/"
audio_clips = os.listdir(audio_fpath)

def generate_spectrogram(x, sr, save_name):
    X = librosa.stft(x)
    Xdb = librosa.amplitude_to_db(abs(X))
    fig = plt.figure(figsize=(20, 20), dpi=1000, frameon=False)
    ax = fig.add_axes([0, 0, 1, 1], frameon=False)
    ax.axis('off')
    librosa.display.specshow(Xdb, sr=sr, cmap='gray', x_axis='time', y_axis='hz')
    plt.savefig(save_name, quality=100, bbox_inches=0, pad_inches=0)
    librosa.cache.clear()

for i in audio_clips:
    audio_fpath = "./audios/"
    spectrograms_path = "./spectrograms/"
    audio_length = librosa.get_duration(filename=audio_fpath + i)
    j = 60
    while j < audio_length:
        x, sr = librosa.load(audio_fpath + i, offset=j-60, duration=60)
        save_name = spectrograms_path + i + str(j) + ".jpg"
        generate_spectrogram(x, sr, save_name)
        j += 60
        if j >= audio_length:
            j = audio_length
            x, sr = librosa.load(audio_fpath + i, offset=j-60, duration=60)
            save_name = spectrograms_path + i + str(j) + ".jpg"
            generate_spectrogram(x, sr, save_name)

I wanted to keep as much detail and quality from the audio as possible, so that I could turn the spectrograms back into audio without too much loss (they are 80 MB each).

Is it possible to turn them back to audio files? How can I do it?

[Image: example spectrograms]

I tried using librosa.feature.inverse.mel_to_audio, but it didn't work, and I don't think it applies here anyway (the code above produces ordinary STFT spectrograms, not mel spectrograms).

I now have 1300 spectrogram files and want to train a Generative Adversarial Network on them, so that I can generate new audio, but I don't want to do it if I won't be able to listen to the results later.

asked Apr 10, 2020 at 1:04
  • Not really - you've thrown away a lot of information (all of the phase, and some of the magnitude). Commented Apr 10, 2020 at 6:33
  • @PaulR STFT typically contains a lot of redundant information that can be used to estimate the phase. It's hardly perfect, but if you combine the Griffin-Lim algorithm with e.g. advances in generative deep neural networks, it can get pretty good. Commented Apr 10, 2020 at 7:05
  • @LukaszTracewski: very interesting - the OP is only saving the log magnitude spectrum though (not sure if this is quantized?) - do you think this will still work? Commented Apr 10, 2020 at 8:29
  • @PaulR It's a valid point that a full inverse transformation is not possible (due to the thresholding applied in amplitude_to_db and to saving in a lossy format, JPEG). That being said, unless the OP is dealing with some extreme cases, it should not be a big issue. The OP wants to "train a Generative Adversarial Network with them, so that I can generate new audios" and that's not exact math anyway. Combine that with e.g. tensorflow/magenta and the OP is off to a good start. Commented Apr 10, 2020 at 11:58
  • Thanks - very interesting. Commented Apr 10, 2020 at 20:24

2 Answers


Yes, it is possible to recover most of the signal and estimate the phase with e.g. the Griffin-Lim algorithm (GLA). A "fast" implementation for Python can be found in librosa. Here's how you can use it:

import numpy as np
import librosa

# Load a short example clip, take its STFT and keep only the magnitude
# (i.e. deliberately throw away the phase, as in your pipeline).
y, sr = librosa.load(librosa.util.example_audio_file(), duration=10)
S = np.abs(librosa.stft(y))

# Reconstruct a time-domain signal from the magnitude alone.
y_inv = librosa.griffinlim(S)

And this is how the original and the reconstruction look:

[Figure: the original and the reconstruction]

By default the algorithm randomly initialises the phases and then iterates forward and inverse STFT operations to refine the phase estimate.
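
For intuition only, here is a stripped-down sketch of that iteration (librosa.griffinlim itself adds the "fast" momentum update and other refinements); the n_fft and hop_length defaults below are assumptions that have to match the magnitude matrix you pass in:

import numpy as np
import librosa

def naive_griffin_lim(S, n_iter=32, n_fft=2048, hop_length=512):
    # Start from a random phase for every time-frequency bin.
    angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
    for _ in range(n_iter):
        # Inverse STFT with the current phase estimate...
        y = librosa.istft(S * angles, hop_length=hop_length)
        # ...then a forward STFT to obtain a new, more consistent phase,
        # while the magnitude is reset to the measured one each time.
        rebuilt = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
        angles = np.exp(1j * np.angle(rebuilt))
    return librosa.istft(S * angles, hop_length=hop_length)

naive_griffin_lim(S) would then give roughly the same kind of result as librosa.griffinlim(S) above, just slower and without the acceleration tricks.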

Looking at your code, to reconstruct the signal, you'd just need to do:

import numpy as np
X_inv = librosa.griffinlim(np.abs(X))

It's just an example of course. As pointed out by @PaulR, in your case you'd need to load the data from the JPEG (which is lossy!) and then undo the amplitude_to_db step (e.g. with librosa.db_to_amplitude) first.
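
A minimal sketch of what that round trip could look like, not part of the original answer: it assumes the images are grayscale, that you know (or guess) the dB range they span and the STFT parameters, and that the picture is resized back to the STFT shape. The file name, n_fft, sr, frame count and dB range below are placeholders.

import numpy as np
import librosa
import soundfile as sf
from PIL import Image

# Placeholder values -- they must match whatever was used to create the image.
n_fft = 2048
sr = 22050
n_frames = 2584               # frame count of the original STFT (a guess here)
db_min, db_max = -80.0, 0.0   # assumed dB values of the darkest/brightest pixels

# Load the grayscale JPEG, resize it to the STFT shape and flip it vertically,
# because specshow draws low frequencies at the bottom of the image.
img = Image.open("./spectrograms/example.jpg").convert("L")
img = img.resize((n_frames, 1 + n_fft // 2))
pixels = np.flipud(np.asarray(img, dtype=np.float32))

# Map pixel intensities [0, 255] back to dB, then to linear magnitude.
S_db = db_min + pixels / 255.0 * (db_max - db_min)
S = librosa.db_to_amplitude(S_db)

# Estimate the phase with Griffin-Lim and write the result to disk.
y = librosa.griffinlim(S, n_iter=64, n_fft=n_fft)
sf.write("reconstructed.wav", y, sr)

Because of the JPEG compression, the figure padding and the resizing, expect audible artefacts; the closer you stay to the raw dB matrix, the better the reconstruction.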

The algorithm, especially the phase estimation, can be further improved thanks to advances in artificial neural networks. Here is one paper that discusses some enhancements.

answered Apr 10, 2020 at 7:01

7 Comments

  • @RamonGriffo Good luck! Setting quality to 100 does not typically give you lossless compression, see e.g. this answer for details: stackoverflow.com/questions/7982409/… If you can afford the space, use a lossless format; I often go for HDF5, optionally with high compression (a small sketch follows after these comments). If that answers your question, please accept the answer - thanks!
  • Did you find out how to load/transform the jpg image as a spectrogram? I don't think this answer covers exactly that part.
  • @materialvision That's because there's no unambiguous way to do that. How can you tell how the colour scale of the image translates into amplitude? With grayscale images you at least know the relative differences, so recovering a signal is not a big issue.
  • @LukaszTracewski Thanks, and a hint on how to do it with a greyscale image would also be great. I can run griffinlim on a mel object, but not directly on an image of a mel, so I am looking for a way to reverse the process: first generate spectrogram images, train the model (with various existing image-based GANs), generate new images, and then transform those images back into sound. That last part is the problem.
  • Does this work for an image of a spectrogram? If I have an image, can I pass it as the input and get the audio back from it? Could you please share a code snippet for that?
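
Regarding the lossless-format suggestion in the first comment above, a tiny illustration of what that could look like; the file names and dataset name are arbitrary, and Xdb is the same dB matrix computed in the question's code:

import numpy as np
import h5py
import librosa

y, sr = librosa.load("./audios/example.wav")            # placeholder input file
Xdb = librosa.amplitude_to_db(np.abs(librosa.stft(y)))

# Store the dB matrix losslessly instead of rendering it to a JPEG.
np.save("./spectrograms/example.npy", Xdb)              # simplest option
with h5py.File("./spectrograms/spectrograms.h5", "a") as f:
    f.create_dataset("example", data=Xdb, compression="gzip")

Inverting a float32 dB matrix stored this way only loses what amplitude_to_db's thresholding already discarded, rather than what JPEG quantisation throws away on top of it.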

I did this ex-novo in 2016 to recover audio from spectrograms for which no audio was available. I didn't know about the GLA (thanks!) but the algorithm sounds similar, complete with random phases.

As for importing the spectrograms: with my tool you indicate the corners of the graphic, its pixels-per-second and frequency range, and the start and end points of the colour scale and its dB range, and a script then does the colour-to-dB mapping of the graph.

Code: https://gitlab.com/martinwguy/delia-derbyshire/-/tree/master/anal
Examples of its output: https://wikidelia.net/wiki/Spectrograms#Inverse_spectrograms
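
A rough illustration of the kind of per-pixel calibration that workflow implies (all numbers below are invented; see the linked repository for the actual implementation):

# Invented calibration values: where the graph sits in the image and what its axes mean.
left_px, top_px, bottom_px = 37, 12, 524   # pixel coordinates of the graph corners
px_per_second = 100.0                      # horizontal resolution of the plot
f_min, f_max = 0.0, 5000.0                 # frequency range of the vertical axis, in Hz
db_min, db_max = -80.0, 0.0                # dB values of the darkest/brightest pixels

def pixel_to_coords(col, row, value):
    """Map one pixel of the spectrogram graphic to (time in s, frequency in Hz, level in dB)."""
    t = (col - left_px) / px_per_second
    f = f_max - (row - top_px) / (bottom_px - top_px) * (f_max - f_min)
    db = db_min + value / 255.0 * (db_max - db_min)
    return t, f, db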

answered Jun 24, 2024 at 8:47

