Article

DNN-HMM-Based Speaker-Adaptive Emotion Recognition Using MFCC and Epoch-Based Features

Journal

Circuits, Systems, and Signal Processing
Volume 40, Issue 1, Pages 466-489

Publisher

Springer Birkhäuser
DOI: 10.1007/s00034-020-01486-8

Keywords

Emotion recognition; Epoch-based features; Deep neural network (DNN); Gaussian mixture model (GMM); Hidden Markov model (HMM); Speaker-adaptive training (SAT); Zero-time windowing (ZTW)

Funding

  1. Young Faculty Research Fellowship (YFRF) of Visvesvaraya PhD Programme of Ministry of Electronics & Information Technology, MeitY, Government of India


This paper proposes a speaker-adaptive DNN-HMM SER system that uses both MFCC and epoch-based features, incorporating speaker adaptation via the feature-space maximum likelihood linear regression (fMLLR) technique. Experimental results emphasize the importance of speaker adaptation for SER systems and the complementary nature of MFCC and epoch-based features in emotion recognition.
Speech emotion recognition (SER) systems are often evaluated in a speaker-independent manner. However, the variation in the acoustic features of the different speakers used during training and evaluation results in a significant drop in accuracy during evaluation. While speaker-adaptive techniques have been used for speech recognition, to the best of our knowledge, they have not been employed for emotion recognition. Motivated by this, a speaker-adaptive DNN-HMM-based SER system is proposed in this paper. The feature-space maximum likelihood linear regression (fMLLR) technique has been used for speaker adaptation during both the training and testing phases. The proposed system uses MFCC and epoch-based features. We have exploited our earlier work on robust detection of epochs from emotional speech to obtain emotion-specific epoch-based features, namely instantaneous pitch, phase, and the strength of excitation. The combined feature set improves on the MFCC features, which have been the baseline for SER systems in the literature, by 5.07%, and over the state-of-the-art techniques by 7.13%. Using just the MFCC features, the proposed model improves upon the state-of-the-art techniques by 2.06%. These results bring out the importance of speaker adaptation for SER systems and highlight the complementary nature of the MFCC and epoch-based features for emotion recognition using speech. All experiments were carried out on the IEMOCAP emotional dataset.
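The two ideas central to the abstract can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: fMLLR maps each speaker's feature vectors through a per-speaker affine transform x' = Ax + b (with A and b estimated by maximizing likelihood under the model; the values below are placeholders), and instantaneous pitch is derived from epoch locations as the reciprocal of the interval between successive epochs.

```python
import numpy as np

def apply_fmllr(features, A, b):
    """fMLLR-style speaker adaptation: affine transform x' = A x + b
    applied to every row of a (frames x dims) feature matrix.
    A and b would normally be estimated per speaker by maximum
    likelihood; here they are supplied directly for illustration."""
    return features @ A.T + b

def instantaneous_pitch(epoch_times):
    """Instantaneous pitch (Hz) from epoch locations (in seconds):
    the reciprocal of the interval between successive epochs."""
    intervals = np.diff(epoch_times)
    return 1.0 / intervals

# Toy data: 10 frames of 13-dimensional MFCC-like features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 13))

# Placeholder per-speaker transform (illustrative, not estimated).
A = 0.9 * np.eye(13)   # mild scaling
b = np.full(13, 0.1)   # mild shift
adapted = apply_fmllr(feats, A, b)

# Epochs spaced 5 ms apart correspond to a 200 Hz instantaneous pitch.
f0 = instantaneous_pitch(np.array([0.000, 0.005, 0.010, 0.015]))
```

The adaptation step leaves the feature dimensionality unchanged, so the same DNN-HMM front end can consume raw or speaker-adapted features interchangeably.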

