The last part of my speech recognition series: finally training my network. Here's the dataset I did it with (self-generated, small I know), and the code I used.
After running this code (takes about an hour on my Mac), I get a validation accuracy of roughly 30%... not spectacular. Any ideas on how to improve the training speed, or the neural network's accuracy? Any other suggestions in general?
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import numpy as np
import tflearn


def main():
    LABELED_DIR = 'labeled_data'
    width = 512
    height = 512
    classes = 26  # characters
    learning_rate = 0.0001
    batch_size = 25

    # load data
    print('Loading data')
    X, Y = tflearn.data_utils.image_preloader(
        LABELED_DIR, image_shape=(width, height), mode='folder',
        normalize=True, grayscale=True, categorical_labels=True,
        files_extension=None, filter_channel=False)
    X_shaped = np.squeeze(X)
    trainX, trainY = X_shaped, Y

    # Network building
    print('Building network')
    net = tflearn.input_data(shape=[None, width, height])
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, classes, activation='softmax')
    net = tflearn.regression(net, optimizer='adam',
                             learning_rate=learning_rate,
                             loss='categorical_crossentropy')
    model = tflearn.DNN(net, tensorboard_verbose=3)

    print('Training network')
    model.fit(trainX, trainY, validation_set=0.15, n_epoch=100,
              show_metric=True, batch_size=batch_size)
    model.save("tflearn.lstm.model")


if __name__ == '__main__':
    main()
First things first: you can get far better results by fine-tuning the arguments.
Yes, validation accuracy goes over 52% at times and I'm sure it can go even higher. Let's take a look at which colour in the charts corresponds to which run:
- `ORE1X4`: default settings
- `EAPX5J`: 200 epochs instead of 100
- `6DHKQJ`: dropout 0.9 instead of 0.8
- `53O25D`: learning_rate 1e-3 instead of 1e-4
- `0TWRK8`: 256 LSTM units instead of 128
- `QCAZN8`: 300 epochs, dropout 0.9, learning_rate 1e-3, 256 LSTM units
- `H57Z4I`: learning_rate 1e-3, 400 epochs
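For reference, those runs can be collected into a small sweep table. This is just a sketch for bookkeeping (the run IDs are the labels from the chart legend, not anything TFLearn generates):

```python
# Hyperparameter combinations tried, keyed by run ID. Anything not
# listed falls back to the defaults from the question.
DEFAULTS = {'epochs': 100, 'dropout': 0.8,
            'learning_rate': 1e-4, 'lstm_units': 128}

RUNS = {
    'ORE1X4': {},                     # default settings
    'EAPX5J': {'epochs': 200},
    '6DHKQJ': {'dropout': 0.9},
    '53O25D': {'learning_rate': 1e-3},
    '0TWRK8': {'lstm_units': 256},
    'QCAZN8': {'epochs': 300, 'dropout': 0.9,
               'learning_rate': 1e-3, 'lstm_units': 256},
    'H57Z4I': {'learning_rate': 1e-3, 'epochs': 400},
}

def config_for(run_id):
    """Return the full hyperparameter set for one run."""
    return {**DEFAULTS, **RUNS[run_id]}
```

Keeping the sweep in one place like this makes it trivial to re-run or extend a configuration later.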
I've rewritten `0.0001` as `1e-4`, which is easier on the eyes. And you made a good start by putting it in a variable (which, according to PEP 8, should be CAPITAL_CASED, since it's a pseudo-constant). So why didn't you put the other magic numbers in variables as well? Look at how the functions you call name their arguments and use that as inspiration for your variable names.
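A sketch of how the top of the script could look with every tuning knob pulled out as a module-level pseudo-constant (the values here are taken from the better runs above, not from your original script):

```python
# Pseudo-constants, CAPITAL_CASED per PEP 8. Names mirror the
# keyword arguments of the tflearn calls that consume them.
LABELED_DIR = 'labeled_data'
IMAGE_WIDTH = 512
IMAGE_HEIGHT = 512
N_CLASSES = 26        # one class per character
LEARNING_RATE = 1e-3  # 1e-4 originally; 1e-3 scored better above
DROPOUT = 0.9
LSTM_UNITS = 256
BATCH_SIZE = 25
N_EPOCH = 200
```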
Keep in mind you're using spectrograms as input. Spectrograms have a downside: they only measure the intensity of frequencies, not the phase. You may have heard this described as the phase problem. It means every spectrogram carries broadband noise, which impacts the overall effectiveness of your output. The measured effectiveness might not even be the real effectiveness, since the network effectively assumes the noise is part of the signal.
So, not only could you use more data to achieve a higher accuracy, you may eventually need more complete data. As in, less noise and with phase information.
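Your preprocessing code isn't shown, so this is only a sketch of what "data with phase information" could look like: `scipy.signal.stft` returns a complex result, from which you can keep both magnitude (the usual spectrogram) and phase as separate channels.

```python
import numpy as np
from scipy.signal import stft

def complex_spectrogram(samples, sample_rate=16000, nperseg=512):
    """Magnitude AND phase of the STFT, instead of the
    magnitude-only spectrogram that discards phase."""
    _, _, zxx = stft(samples, fs=sample_rate, nperseg=nperseg)
    magnitude = np.abs(zxx)
    phase = np.angle(zxx)  # this is what a plain spectrogram throws away
    # Stack as two channels so a network can see both.
    return np.stack([magnitude, phase], axis=-1)

# Example: one second of a 440 Hz tone at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
features = complex_spectrogram(np.sin(2 * np.pi * 440 * t))
# features has shape (freq_bins, time_frames, 2)
```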
As for performance, there's not much you can do. Your code runs significantly faster on my laptop than on your Mac (the original set-up finishes in under 15 minutes), even without GPU acceleration. TensorFlow is pretty well optimized to use multiple cores.
Keep in mind the X-axis displays steps. The time it takes to reach a certain number of steps can vary wildly depending on the arguments you provide. `0TWRK8` took three times as long to reach step 500 as `H57Z4I`, while the latter appears to score better. Figure out which arguments are 'worth their weight' and which simply slow you down for little to no gain.
My advice? Experiment! After a couple hundred epochs the data will just about flatline, so going above 200 isn't particularly useful when going for sample runs.
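TFLearn lets you pass custom callbacks to `model.fit`, but the flatline check itself is framework-agnostic. A sketch of the stopping rule (the `patience` and `min_delta` names are my own, not TFLearn's):

```python
class FlatlineMonitor:
    """Signal a stop once validation accuracy hasn't improved by at
    least `min_delta` for `patience` consecutive epochs."""

    def __init__(self, patience=20, min_delta=1e-3):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('-inf')
        self.stale = 0

    def should_stop(self, val_acc):
        if val_acc > self.best + self.min_delta:
            self.best = val_acc   # new best: reset the counter
            self.stale = 0
        else:
            self.stale += 1       # no meaningful improvement
        return self.stale >= self.patience
```

Call `should_stop` with the validation accuracy after each epoch; once it returns `True`, the curve has flatlined and further epochs are wasted time.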
Fidgeting with the input reminded me of a game I played a long while back: Foldit. There's the early game, the mid game and the end game. In the early game, you're looking for the big changes. The later you get, the more your focus shifts to fine-tuning different aspects of your approach. However, if you had an inefficient start, you can't fine-tune enough to reach the score you want; the score flat-lines.
Consider developing this machine in the same manner. Don't rush the development to make it go fast if that will hurt its accuracy in the end. After all, nothing is as annoying as speech recognition that only works half the time. If you need certain functions to keep your output quality high, don't optimize them away only to regret it later.
Something else your dataset doesn't take into account is combinations of characters. `ch` isn't exactly pronounced as a combination of `c` and `h`. The same goes for `ph`, `th` and other combinations. Keep this in mind when you start field testing your network.
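One way to account for that would be to label on digraph-aware units instead of single letters. A sketch (the digraph list here is illustrative, not complete, and your dataset would need relabelling to match):

```python
# Treat common English digraphs as single label classes instead of
# two independent letters.
DIGRAPHS = ('ch', 'ph', 'sh', 'th', 'wh')

def character_units(word):
    """Split a word into labelling units, greedily matching digraphs."""
    units, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units

print(character_units('photograph'))
# -> ['ph', 'o', 't', 'o', 'g', 'r', 'a', 'ph']
```

Note this also changes `classes = 26` in the network: each digraph you add is an extra output class.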
- Would you mind sharing the name of the tool/framework that you used to create the charts? :) – Yoryo, Mar 29, 2018 at 22:21
- @Yoryo Load the logs with Tensorboard and it does all the plotting automagically. – Mar 29, 2018 at 22:22