Article

DNN-HMM-Based Speaker-Adaptive Emotion Recognition Using MFCC and Epoch-Based Features

Journal

Circuits, Systems, and Signal Processing
Volume 40, Issue 1, Pages 466-489

Publisher

Springer Birkhäuser
DOI: 10.1007/s00034-020-01486-8

Keywords

Emotion recognition; Epoch-based features; Deep neural network (DNN); Gaussian mixture model (GMM); Hidden Markov model (HMM); Speaker-adaptive training (SAT); Zero-time windowing (ZTW)

Funding

  1. Young Faculty Research Fellowship (YFRF) of Visvesvaraya PhD Programme of Ministry of Electronics & Information Technology, MeitY, Government of India


This paper proposes a speaker-adaptive DNN-HMM SER system that uses both MFCC and epoch-based features, incorporating speaker adaptation via the feature-space maximum likelihood linear regression (fMLLR) technique. Experimental results emphasize the importance of speaker adaptation for SER systems and the complementary nature of MFCC and epoch-based features in emotion recognition.
Speech emotion recognition (SER) systems are often evaluated in a speaker-independent manner. However, the variation in the acoustic features of the different speakers used during training and evaluation results in a significant drop in accuracy during evaluation. While speaker-adaptive techniques have been used for speech recognition, to the best of our knowledge, they have not been employed for emotion recognition. Motivated by this, a speaker-adaptive DNN-HMM-based SER system is proposed in this paper. The feature-space maximum likelihood linear regression (fMLLR) technique has been used for speaker adaptation during both the training and testing phases. The proposed system uses MFCC and epoch-based features. We have exploited our earlier work on robust detection of epochs from emotional speech to obtain emotion-specific epoch-based features, namely instantaneous pitch, phase, and the strength of excitation. The combined feature set improves on the MFCC features, which have been the baseline for SER systems in the literature, by 5.07%, and over the state-of-the-art techniques by 7.13%. Using just the MFCC features, the proposed model improves upon the state-of-the-art techniques by 2.06%. These results bring out the importance of speaker adaptation for SER systems and highlight the complementary nature of the MFCC and epoch-based features for emotion recognition using speech. All experiments were carried out on the IEMOCAP emotional dataset.
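The two ideas central to the abstract can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: fMLLR maps each speaker's feature vectors through a per-speaker affine transform x' = Ax + b (with A and b estimated by maximizing likelihood under the model; the values below are placeholders), and instantaneous pitch is derived from epoch locations as the reciprocal of the interval between successive epochs.

```python
import numpy as np

def apply_fmllr(features, A, b):
    """fMLLR-style speaker adaptation: affine transform x' = A x + b
    applied to every row of a (frames x dims) feature matrix.
    A and b would normally be estimated per speaker by maximum
    likelihood; here they are supplied directly for illustration."""
    return features @ A.T + b

def instantaneous_pitch(epoch_times):
    """Instantaneous pitch (Hz) from epoch locations (in seconds):
    the reciprocal of the interval between successive epochs."""
    intervals = np.diff(epoch_times)
    return 1.0 / intervals

# Toy data: 10 frames of 13-dimensional MFCC-like features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 13))

# Placeholder per-speaker transform (illustrative, not estimated).
A = 0.9 * np.eye(13)   # mild scaling
b = np.full(13, 0.1)   # mild shift
adapted = apply_fmllr(feats, A, b)

# Epochs spaced 5 ms apart correspond to a 200 Hz instantaneous pitch.
f0 = instantaneous_pitch(np.array([0.000, 0.005, 0.010, 0.015]))
```

The adaptation step leaves the feature dimensionality unchanged, so the same DNN-HMM front end can consume raw or speaker-adapted features interchangeably.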

