
My Keras model seems to have hit a saddle point in its training. Of course, this is just an assumption; I'm not really sure. In any case, the loss stops at .0025 and nothing I have tried has reduced it any further.

What I have tried so far is:

  1. Using Adam and RMSProp, with and without cyclical learning rates. The result is that the loss starts at .0989 and stays there. The learning rate range for cyclical learning was .001 to .1.

  2. After 4 or 5 epochs of no movement, I tried SGD instead and the loss steadily declined to .0025, which is where it stalls out. After about 5 epochs of no change I tried SGD with cyclical learning enabled (roughly as in the sketch after this list), hoping the loss would decrease, but I get the same result.

  3. Increasing network capacity (as well as decreasing it), thinking maybe the network had hit its capacity limit. I increased all 4 dense layers to 4096, and that didn't change anything.

  4. I've tried different batch sizes.

The most epochs I have trained the network for is 7, but for 6 of those epochs neither the loss nor the validation loss changed. Do I need to train for more epochs, or could it be that .0025 is not a saddle point but the global minimum for my dataset? I would think there is more room to improve: I tested the network's predictions at the .0025 plateau (roughly as in the sketch below) and they aren't that great.
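
For reference, this is roughly how I spot-check the predictions at the plateau (model and valid_ds are the ones defined in the code further down):

import numpy as np
# predict on the validation set and look at the per-coordinate absolute error
preds = model.predict(valid_ds)                              # shape (num_samples, 9)
targets = np.concatenate([y.numpy() for _, y in valid_ds])   # collect the batched labels
print(np.abs(preds - targets).mean(axis=0))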

Any advice on how to continue? My code is below.

For starters, my Keras model is similar in style to VGG-16:

# imports
# (install tensorflow_addons first, e.g.: pip install -q -U tensorflow_addons)
import tensorflow_addons as tfa
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def get_model(input_shape):
    input = keras.Input(shape=input_shape)
    x = layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding="same")(input)
    x = layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding="same")(x)
    x = layers.Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding="same")(x)
    x = layers.Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding="same")(x)
    x = layers.Conv2D(filters=512, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=512, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=512, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.Conv2D(filters=512, kernel_size=(3, 3), activation='relu', padding="same")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding="same")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation='relu')(x)
    x = layers.Dense(2048, activation='relu')(x)
    x = layers.Dense(1024, activation='relu')(x)
    x = layers.Dense(512, activation='relu')(x)
    # 9 outputs in [0, 1]: the vertex coordinates
    output = layers.Dense(9, activation='sigmoid')(x)
    return keras.models.Model(inputs=input, outputs=output)
# define learning rate range
lr_range = [.001, .1]
epochs = 100
batch_size = 32
# based on https://www.tensorflow.org/addons/tutorials/optimizers_cyclicallearningrate
steps_per_epoch = len(training_data) // batch_size
clr = tfa.optimizers.CyclicalLearningRate(initial_learning_rate=lr_range[0],
    maximal_learning_rate=lr_range[1],
    scale_fn=lambda x: 1 / (2. ** (x - 1)),
    step_size=2 * steps_per_epoch
)
optimizer = tf.keras.optimizers.Adam(clr)
model = get_model((224, 224, 3))
model.compile(optimizer=optimizer, loss='mean_squared_error')
# tf.data.Dataset objects are used for model input; they are already batched,
# so batch_size is not passed to fit() (see the note after this block)
model.fit(train_ds, validation_data=valid_ds, epochs=epochs)
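
Note that train_ds and valid_ds are tf.data.Dataset objects that already carry their own batch size, which is why batch_size is not given to fit(). Roughly, they are batched like this (the placeholder arrays stand in for my real images and labels):

import numpy as np
# placeholder arrays standing in for the real images and 9-value labels
images = np.zeros((64, 224, 224, 3), dtype=np.float32)
labels = np.zeros((64, 9), dtype=np.float32)
train_ds = tf.data.Dataset.from_tensor_slices((images, labels)).shuffle(64).batch(batch_size)
valid_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(batch_size)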
  • Why do you think it is a saddle point? Usually Adam or SGD is enough to move away from a saddle point (due to the noisy gradient), so it is probably something else. Is your label a soft label, i.e. a [1 x 9] vector that sums to 1? If it is discrete, i.e. a whole number, try switching to sparse_categorical_crossentropy instead. Commented Feb 14, 2023 at 18:04
  • If the task is single-class classification, try changing activation='sigmoid' to activation='softmax' (assuming from_logits=False in the cross-entropy loss). Commented Feb 14, 2023 at 18:10
  • Well, I'm a bit confused, because all of the literature I'm reading says Adam can usually escape a saddle point. The labels are floating point numbers between 0 and 1; they represent vertex coordinates, points in 3D space (a quick sketch of the label format is below this comment thread). Commented Feb 14, 2023 at 18:14
  • Are you trying to overfit the model? If so, what happens when you reduce the training data size? Commented Feb 14, 2023 at 18:29
  • I plan on adding more data later on, so I'm trying to leave some room for more capacity if needed, although my hardware does not permit much more. Removing the relu activation does not appear to change anything. I can try to reduce the data size further. Will let you know. Thank you. Commented Feb 14, 2023 at 18:34
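
For clarity on the comments above, a label is roughly a flat vector of 9 floats in [0, 1], presumably three (x, y, z) points given the 9 outputs; the actual values here are made up:

import numpy as np
# one label: 9 floats in [0, 1], i.e. three normalized (x, y, z) vertex coordinates
# (values are made up for illustration)
label = np.array([0.12, 0.48, 0.77,
                  0.05, 0.91, 0.33,
                  0.64, 0.27, 0.58], dtype=np.float32)  # shape (9,)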
