The last part of my speech recognition series: finally training my network. Here's the dataset I did it with (self-generated, small I know), and the code I used.
After running this code (takes about an hour on my Mac), I get a validation accuracy of roughly 30%... not spectacular. Any ideas on how to improve the training speed, or the neural network's accuracy? Any other suggestions in general?
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import numpy as np
import tflearn


def main():
    LABELED_DIR = 'labeled_data'
    width = 512
    height = 512
    classes = 26  # characters
    learning_rate = 0.0001
    batch_size = 25

    # load data
    print('Loading data')
    X, Y = tflearn.data_utils.image_preloader(
        LABELED_DIR, image_shape=(width, height), mode='folder',
        normalize=True, grayscale=True, categorical_labels=True,
        files_extension=None, filter_channel=False)
    X_shaped = np.squeeze(X)
    trainX, trainY = X_shaped, Y

    # Network building
    print('Building network')
    net = tflearn.input_data(shape=[None, width, height])
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, classes, activation='softmax')
    net = tflearn.regression(net, optimizer='adam',
                             learning_rate=learning_rate,
                             loss='categorical_crossentropy')
    model = tflearn.DNN(net, tensorboard_verbose=3)

    print('Training network')
    model.fit(trainX, trainY, validation_set=0.15, n_epoch=100,
              show_metric=True, batch_size=batch_size)
    model.save("tflearn.lstm.model")


if __name__ == '__main__':
    main()
First things first: you can get far better results by fine-tuning the arguments.
Yes, validation accuracy goes over 52% at times and I'm sure it can go even higher. Let's take a look at which colour in the charts corresponds to which run:
- `ORE1X4`: default settings
- `EAPX5J`: 200 epochs instead of 100
- `6DHKQJ`: dropout 0.9 instead of 0.8
- `53O25D`: learning_rate 1e-3 instead of 1e-4
- `0TWRK8`: 256 LSTM units instead of 128
- `QCAZN8`: 300 epochs, dropout 0.9, learning_rate 1e-3, 256 LSTM units
- `H57Z4I`: learning_rate 1e-3, 400 epochs
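For reference, those runs can be collected into a small sweep table. This is just a sketch for bookkeeping (the run IDs are the labels from the chart legend, not anything TFLearn generates):

```python
# Hyperparameter combinations tried, keyed by run ID. Anything not
# listed falls back to the defaults from the question.
DEFAULTS = {'epochs': 100, 'dropout': 0.8,
            'learning_rate': 1e-4, 'lstm_units': 128}

RUNS = {
    'ORE1X4': {},                     # default settings
    'EAPX5J': {'epochs': 200},
    '6DHKQJ': {'dropout': 0.9},
    '53O25D': {'learning_rate': 1e-3},
    '0TWRK8': {'lstm_units': 256},
    'QCAZN8': {'epochs': 300, 'dropout': 0.9,
               'learning_rate': 1e-3, 'lstm_units': 256},
    'H57Z4I': {'learning_rate': 1e-3, 'epochs': 400},
}

def config_for(run_id):
    """Return the full hyperparameter set for one run."""
    return {**DEFAULTS, **RUNS[run_id]}
```

Keeping the sweep in one place like this makes it trivial to re-run or extend a configuration later.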
I've rewritten `0.0001` as `1e-4`, which is easier on the eyes. And you made a good start by putting it in a variable (which, according to PEP 8, should be CAPITAL_CASED, since it's a pseudo-constant). So why didn't you put the other magic numbers in variables as well? Look at how the functions you call name their arguments and use that as inspiration for your variable names.
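A sketch of how the top of the script could look with every tuning knob pulled out as a module-level pseudo-constant (the values here are taken from the better runs above, not from your original script):

```python
# Pseudo-constants, CAPITAL_CASED per PEP 8. Names mirror the
# keyword arguments of the tflearn calls that consume them.
LABELED_DIR = 'labeled_data'
IMAGE_WIDTH = 512
IMAGE_HEIGHT = 512
N_CLASSES = 26        # one class per character
LEARNING_RATE = 1e-3  # 1e-4 originally; 1e-3 scored better above
DROPOUT = 0.9
LSTM_UNITS = 256
BATCH_SIZE = 25
N_EPOCH = 200
```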
Keep in mind you're using spectrograms as input. Spectrograms have a downside: they only measure the intensity of frequencies, not the phase. You may have heard this described as the phase problem. It means every spectrogram carries broadband noise, which impacts the overall effectiveness of your output. The measured effectiveness might not even be the real effectiveness, since the network effectively assumes the noise is part of the signal.
So, not only could you use more data to achieve a higher accuracy, you may eventually need more complete data. As in, less noise and with phase information.
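Your preprocessing code isn't shown, so this is only a sketch of what "data with phase information" could look like: `scipy.signal.stft` returns a complex result, from which you can keep both magnitude (the usual spectrogram) and phase as separate channels.

```python
import numpy as np
from scipy.signal import stft

def complex_spectrogram(samples, sample_rate=16000, nperseg=512):
    """Magnitude AND phase of the STFT, instead of the
    magnitude-only spectrogram that discards phase."""
    _, _, zxx = stft(samples, fs=sample_rate, nperseg=nperseg)
    magnitude = np.abs(zxx)
    phase = np.angle(zxx)  # this is what a plain spectrogram throws away
    # Stack as two channels so a network can see both.
    return np.stack([magnitude, phase], axis=-1)

# Example: one second of a 440 Hz tone at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
features = complex_spectrogram(np.sin(2 * np.pi * 440 * t))
# features has shape (freq_bins, time_frames, 2)
```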
As for performance, there's not much you can do. Your code runs significantly faster on my laptop than on your Mac (the original set-up finishes in under 15 minutes), even without GPU acceleration. TensorFlow is pretty well optimized to use multiple cores.
Keep in mind the X-axis displays steps. The time it takes to reach a certain number of steps can vary wildly depending on the arguments you provide. `0TWRK8` took three times as long to reach step 500 as `H57Z4I`, while the latter appears to score better. Figure out which arguments are 'worth their weight' and which simply slow you down for little to no gain.
My advice? Experiment! After a couple hundred epochs the data will just about flatline, so going above 200 isn't particularly useful when going for sample runs.
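TFLearn lets you pass custom callbacks to `model.fit`, but the flatline check itself is framework-agnostic. A sketch of the stopping rule (the `patience` and `min_delta` names are my own, not TFLearn's):

```python
class FlatlineMonitor:
    """Signal a stop once validation accuracy hasn't improved by at
    least `min_delta` for `patience` consecutive epochs."""

    def __init__(self, patience=20, min_delta=1e-3):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('-inf')
        self.stale = 0

    def should_stop(self, val_acc):
        if val_acc > self.best + self.min_delta:
            self.best = val_acc   # new best: reset the counter
            self.stale = 0
        else:
            self.stale += 1       # no meaningful improvement
        return self.stale >= self.patience
```

Call `should_stop` with the validation accuracy after each epoch; once it returns `True`, the curve has flatlined and further epochs are wasted time.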
Fidgeting with the input reminded me of a game I played a long while back: Foldit. There's the early game, the mid game and the end game. In the early game, you're looking for the big changes. The later you get, the more your focus shifts to fine-tuning different aspects of your approach. However, if you had an inefficient start, you can't fine-tune enough to reach the score you want; the score flat-lines.
Consider developing this machine in the same manner. Don't rush the development to make it go fast if that will hurt its accuracy in the end. After all, nothing is as annoying as speech recognition that only works half the time. If you need certain functions to keep your output quality high, don't optimize them away only to regret it later.
Something else your dataset doesn't take into account is combinations of characters. `ch` isn't exactly pronounced as a combination of `c` and `h`. The same goes for `ph`, `th` and other combinations. Keep this in mind when you start field testing your network.
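One way to account for that would be to label on digraph-aware units instead of single letters. A sketch (the digraph list here is illustrative, not complete, and your dataset would need relabelling to match):

```python
# Treat common English digraphs as single label classes instead of
# two independent letters.
DIGRAPHS = ('ch', 'ph', 'sh', 'th', 'wh')

def character_units(word):
    """Split a word into labelling units, greedily matching digraphs."""
    units, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units

print(character_units('photograph'))
# -> ['ph', 'o', 't', 'o', 'g', 'r', 'a', 'ph']
```

Note this also changes `classes = 26` in the network: each digraph you add is an extra output class.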
- Would you mind sharing the name of the tool/framework that you used to create the charts? :) – Yoryo, Mar 29, 2018 at 22:21
- @Yoryo Load the logs with Tensorboard and it does all the plotting automagically. – Mar 29, 2018 at 22:22