Speech Emotion Recognition

Speech is a complex signal, characterized by features such as pitch, prosody, formant frequencies, spectral density, duration, and the location of plosives and fricatives. Understanding the emotional context of a signal requires segmenting each utterance, tracking all of these features over time, and analyzing how they relate to one another.

Numerous studies have used hand-crafted features for emotion recognition, such as prosody features (pitch, intensity, and duration), voice quality features (formants, spectral energy distribution, and harmonics-to-noise ratio), and spectral features (MFCC, LPC, and PLP coefficients). Emotions are then predicted from these features using classical techniques such as SVMs and clustering. Hand-crafted features, however, often fail to capture the emotional context of speech: they are not expressive enough to reflect the complex temporal relationships between speech events.
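As a concrete illustration of one such hand-crafted prosody feature, pitch can be estimated per utterance with a simple autocorrelation method. This is a minimal sketch, not the exact pipeline used in the studies cited above; a real extractor would add framing, windowing, voicing detection, and smoothing.

```python
import numpy as np

def estimate_pitch(signal, sr, fmin=80.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) via autocorrelation.

    A classic hand-crafted prosody feature: the autocorrelation of a
    voiced signal peaks at a lag equal to the pitch period.
    """
    # Full autocorrelation; keep only non-negative lags
    corr = np.correlate(signal, signal, mode="full")
    corr = corr[len(corr) // 2:]
    # Restrict the lag search to the plausible pitch range
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sr / lag

sr = 8000
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 220.0 * t)  # 220 Hz synthetic test tone
pitch = estimate_pitch(tone, sr)      # close to 220 Hz
```

Features like this (pitch, intensity, duration statistics) are what a downstream SVM or clustering model would consume, one fixed-length vector per utterance.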

In this project, we analyze a deep-learning-based approach that automatically learns features from audio spectrograms for emotion recognition, and compare its results with a hand-crafted baseline.

In the proposed approach, audio spectrograms are segmented using context windows of various sizes. For each segment, the mel-spectrogram is computed along with its delta and delta-delta features, so each segment is represented by three distinct images analogous to the three RGB channels of an ordinary image. These segments are used as input to a CNN, which learns a feature representation for each segment. The resulting feature vectors are fed to an LSTM that predicts the emotion of the audio sample. This process is repeated for each context window size, and the final emotion of the sample is determined by weighting the individual results.
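The three-channel segment construction can be sketched as follows. The delta computation below uses the standard regression formula over ±N neighboring frames (the mel-spectrogram here is a random stand-in, since the actual audio front end is not shown in this summary):

```python
import numpy as np

def delta(feat, N=2):
    """Regression-based delta features.

    feat: 2-D array, rows = mel bands, cols = time frames.
    d_t = sum_{n=1..N} n*(c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)
    """
    denom = 2 * sum(n * n for n in range(1, N + 1))
    T = feat.shape[1]
    # Replicate edge frames so every frame has +/-N neighbors
    padded = np.pad(feat, ((0, 0), (N, N)), mode="edge")
    d = np.zeros(feat.shape, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[:, N + n:N + n + T] - padded[:, N - n:N - n + T])
    return d / denom

mel = np.random.rand(40, 100)     # stand-in for one log-mel segment
d1 = delta(mel)                   # delta (first-order dynamics)
d2 = delta(d1)                    # delta-delta (second-order dynamics)
channels = np.stack([mel, d1, d2])  # shape (3, 40, 100), like RGB planes
```

The stacked `(3, bands, frames)` tensor is exactly the shape a standard 2-D CNN expects for a three-channel image, which is what lets an image-style convolutional front end be reused for audio.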

The CRNN achieved an accuracy of over 79%, significantly higher than systems that use linear approaches over hand-crafted features.
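The final weighting step over context window sizes can be sketched as a weighted average of per-window class probabilities. The window sizes, class probabilities, and weights below are illustrative assumptions, not values from the project:

```python
import numpy as np

# Hypothetical per-window softmax outputs over 4 emotion classes,
# keyed by context window size in seconds (illustrative values)
probs = {
    0.5: np.array([0.1, 0.6, 0.2, 0.1]),
    1.0: np.array([0.2, 0.5, 0.2, 0.1]),
    2.0: np.array([0.3, 0.3, 0.2, 0.2]),
}
# Assumed per-window weights, e.g. proportional to validation accuracy
weights = {0.5: 0.3, 1.0: 0.4, 2.0: 0.3}

# Weighted average of the class distributions, then argmax
combined = sum(weights[w] * p for w, p in probs.items())
final_emotion = int(np.argmax(combined))  # -> class 1 for these values
```

Weighting the distributions (rather than taking a majority vote over hard labels) lets a confident prediction at one window size outweigh uncertain predictions at the others.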