Speech Emotion Recognition on RAVDESS πŸ˜€

βœ… This is a CABG(Convolutional Attention based Bi-directional GRU) model named "EmoCatcher".

  • The convolutional part consists of 3 ConvBlocks that extract local information from the mel spectrogram input.
    ConvBlock: Conv1d -> ConvLN -> GELU -> Dropout
  • The output of the convolutional part then passes through MaxPool1d and LayerNorm.
  • The GRU part extracts global information bidirectionally; a GRU also has the advantage of handling variable-length input.
  • Bahdanau attention is applied on top of the GRU outputs. This lets the model learn where to pay attention (see the sketch below).
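
Below is a minimal PyTorch sketch of this architecture, not the repository's actual implementation: the channel counts, kernel size, pooling width, and the reading of ConvLN as a channel-wise LayerNorm are all assumptions.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Conv1d -> LayerNorm over channels (assumed 'ConvLN') -> GELU -> Dropout."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(out_ch)
        self.act = nn.GELU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                                 # x: (batch, channels, time)
        x = self.conv(x)
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)  # normalize channel dim
        return self.drop(self.act(x))


class BahdanauAttention(nn.Module):
    """Additive attention pooling over the GRU time steps."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1, bias=False)

    def forward(self, h):                                 # h: (batch, time, dim)
        w = torch.softmax(self.score(torch.tanh(self.proj(h))), dim=1)
        return (w * h).sum(dim=1), w.squeeze(-1)          # context, weights


class EmoCatcher(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_classes=8):
        super().__init__()
        self.convs = nn.Sequential(ConvBlock(n_mels, hidden),
                                   ConvBlock(hidden, hidden),
                                   ConvBlock(hidden, hidden))
        self.pool = nn.MaxPool1d(2)
        self.norm = nn.LayerNorm(hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.attn = BahdanauAttention(2 * hidden)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel):                               # mel: (batch, n_mels, time)
        x = self.pool(self.convs(mel))                    # local features
        x = self.norm(x.transpose(1, 2))                  # (batch, time, hidden)
        h, _ = self.gru(x)                                # global, bidirectional
        context, attn = self.attn(h)                      # attention pooling
        return self.fc(context), attn


logits, attn = EmoCatcher()(torch.randn(4, 64, 200))      # 4 clips, 64 mel bins
print(logits.shape, attn.shape)                           # (4, 8), (4, 100)
```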

βœ… Best test accuracy 87.15% on a hold-out dataset. (During the last 10 epochs, mean accuracy was 85.26%).

  • Train/test split with an 8:2 ratio (sketched below)
  • Stratified sampling that preserves the emotion class distribution
  • Mean test accuracy of 84.17 (± 2.36)% under stratified 5-fold cross-validation
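
A hedged sketch of such a stratified 8:2 split with scikit-learn's train_test_split; the toy labels stand in for the real file list.

```python
import numpy as np
from sklearn.model_selection import train_test_split

labels = np.repeat(np.arange(8), 20)          # toy stand-in: 8 emotions x 20 clips
indices = np.arange(len(labels))

# stratify= keeps the 8:2 proportion within every emotion class
train_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42)
print(len(train_idx), len(test_idx))          # 128 32
```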

βœ… To increase the computational efficiency, VAD(Voice Activity Detection) is applied in the preprocessing stage.

  • The start and end points are obtained from the power of the mel spectrogram (sketched below).
  • This helps the model learn features of the voiced region: the silent regions (before/after speech) cannot carry any emotion, so they are excluded from the analysis.
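
A minimal sketch of power-based endpoint detection; the synthetic input and the 30 dB threshold are illustrative assumptions, not the repository's actual VAD.

```python
import numpy as np
import librosa

sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)        # 1 s of "speech"
y = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)]) # silent head/tail

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
db = 10 * np.log10(mel.sum(axis=0) + 1e-10)       # total power per frame, in dB
active = db > db.max() - 30                       # frames within 30 dB of the peak
start = int(np.argmax(active))                    # first active frame
end = len(active) - int(np.argmax(active[::-1]))  # one past the last active frame
print(mel.shape, '->', mel[:, start:end].shape)   # silence trimmed
```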

RAVDESS Dataset

  • Speech audio-only files from the full dataset
  • 8 emotion classes: neutral, calm, happy, sad, angry, fearful, surprised, and disgust; the label is encoded in each file name (sketched below)
    • View more dataset information here
  • No augmentation or external datasets
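
RAVDESS stores its metadata in hyphen-separated file-name fields, the third of which is the emotion code. A sketch of label extraction based on the published naming convention:

```python
# Emotion codes per the RAVDESS naming convention
EMOTIONS = {'01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
            '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'}

def emotion_from_filename(path: str) -> str:
    # fields: modality-vocalchannel-emotion-intensity-statement-repetition-actor
    code = path.rsplit('/', 1)[-1].split('-')[2]
    return EMOTIONS[code]

print(emotion_from_filename('03-01-05-01-02-01-12.wav'))  # angry
```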

Implementation

  • Loss function: LabelSmoothingLoss (source)
  • Optimizer: adabelief_pytorch.AdaBelief (source)
  • Scheduler: torch.optim.lr_scheduler.ReduceLROnPlateau
  • See train.py for the detailed hyper-parameters; a rough sketch of the setup follows below.
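
A hedged sketch of wiring these pieces together. The learning rate, smoothing epsilon, and scheduler settings are placeholders, and torch's built-in label_smoothing stands in for the linked LabelSmoothingLoss; consult train.py for the real values.

```python
import torch
from adabelief_pytorch import AdaBelief

model = torch.nn.Linear(64, 8)                # stand-in; use EmoCatcher in practice
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # stand-in smoothing
optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-16,
                      betas=(0.9, 0.999), weight_decouple=True, rectify=False)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5)

# inside the training loop, step the scheduler on the validation loss:
# scheduler.step(val_loss)
```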

Performance Evaluation

Hold-Out

Accuracy & Loss curve


Best Test Accuracy: 87.15% / Loss: 0.86552

+ Last 10 epochs (mean ± std)

Train Accuracy: 0.98881(Β± 0.00367)
Train Loss: 0.63214(Β± 0.00590)
Test Accuracy: 0.85262(Β± 0.01182)
Test Loss: 0.86931(Β± 0.01257)
  • ❗️ The mean accuracy over the last 10 epochs may be a more reliable estimate than the single best accuracy.

Confusion Matrix

  • The confusion matrix is 8 × 8: rows represent the actual emotion classes, columns the predicted classes.
  • The normalized confusion matrix shows the recall for each emotion class (a toy sketch follows this list).
  • The accuracy metric coincides with WAR (Weighted Average Recall) because of the stratified sampling.

  • ❗️ We can see that the model is most confused about the happy class.
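
A toy scikit-learn sketch of how row-normalizing the confusion matrix yields per-class recall; the labels here are synthetic.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ['neutral', 'calm', 'happy', 'sad',
           'angry', 'fearful', 'surprised', 'disgust']
rng = np.random.default_rng(0)
y_true = rng.integers(0, 8, size=200)                      # toy labels
y_pred = np.where(rng.random(200) < 0.85,                  # ~85% correct
                  y_true, rng.integers(0, 8, size=200))

cm = confusion_matrix(y_true, y_pred, normalize='true')    # rows: actual classes
for name, recall in zip(classes, cm.diagonal()):           # diagonal = recall
    print(f'{name:>9}: recall {recall:.2f}')
```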


5-Fold CV

Because the dataset is small, there was a concern that the hold-out result might be biased, so I also performed 5-fold cross-validation (sketched below).
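
A sketch of the stratified 5-fold protocol using scikit-learn's StratifiedKFold, with toy labels in place of the real dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.repeat(np.arange(8), 20)          # toy: 8 emotions x 20 clips
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (tr, te) in enumerate(skf.split(np.zeros(len(labels)), labels), 1):
    # train a fresh model on tr, evaluate on te; class ratios match across folds
    print(f'[Fold {fold}] train={len(tr)} test={len(te)}')
```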


Accuracy & Loss curve


[Fold 1] Best Test Accuracy: 0.84375 / Loss: 0.97918
[Fold 2] Best Test Accuracy: 0.80903 / Loss: 1.04934
[Fold 3] Best Test Accuracy: 0.85764 / Loss: 0.87293
[Fold 4] Best Test Accuracy: 0.87500 / Loss: 0.90747
[Fold 5] Best Test Accuracy: 0.82292 / Loss: 0.98223
[5-Fold CV Best Test Accuracy] max: 0.87500 min: 0.80903 mean: 0.84167 std: 0.02361
  • ❗️ The mean test accuracy is approximately 84%, slightly (almost 3 percentage points) lower than the hold-out result, but still good overall performance.
  • ❗️ One drawback is the significant variation in test performance across folds, which could be due to the shuffling of the data, but might also be a reproducibility issue with the optimizer.

Confusion Matrix

  • The worst-performing and best-performing folds are compared in the following figures.
  • The confusion matrices for all folds can be found in /output/cv5/img/

Lowest in Fold 2

Highest in Fold 4

  • ❗️ There was a significant gap between the highest and lowest folds; in particular, the accuracy for the "happy" class was notably low.

Outro

  • With EmoCatcher, I achieved a test accuracy of 87.15% on a hold-out dataset and a mean test accuracy of 84.17% under 5-fold CV.
  • I have only tested this on the CPU. According to the PyTorch documentation, outputs are not guaranteed to be reproducible across different devices. Since I haven't run many experiments yet, more experiments are needed to find consistent parameters and improve the reproducibility of the above results.
  • The model overfit the training data, probably because of the small dataset (the limited data size may also have made the performance estimates noisy). Given the circumstances, the performance is good enough.
  • When I listened to the clips and guessed myself, my accuracy was about 60%. 😅 I found it difficult to detect the emotional characteristics in the speech, probably due to cultural differences (I'm Korean).
  • So next time, I will take on Korean Speech Emotion Recognition.

Appendix

The following figures are the output of the best model on the hold-out: output/holdout/model/best_model_0.8715_0.8655.pth

Attention Weights

  • The following figure shows which parts of the input the attention mechanism focuses on for each class; a plotting sketch follows below.
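
A hedged plotting sketch, assuming the model's forward pass returns (logits, attention weights) as in the architecture sketch above; the real checkpoint's interface may differ.

```python
import torch
import matplotlib.pyplot as plt

model = EmoCatcher().eval()                   # from the architecture sketch
with torch.no_grad():
    logits, attn = model(torch.randn(1, 64, 200))   # one toy mel input

plt.plot(attn[0].numpy())                     # one weight per pooled time frame
plt.xlabel('frame')
plt.ylabel('attention weight')
plt.title(f'predicted class: {logits.argmax(-1).item()}')
plt.show()
```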

Predicted Examples

  • Plot title format: true emotion class / predicted emotion class (predicted probability × 100%)

+ Cite

If you want to use this code, please cite as follows:

@misc{hwang9u-emocatcher,
 author = {Kim, Seonju},
 title = {emocatcher},
 year = {2023},
 publisher = {GitHub},
 journal = {GitHub repository},
 howpublished = {\url{https://github.com/hwang9u/emocatcher}},
}
