4.6 Article

Speech Emotion Classification Using Attention-Based LSTM

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TASLP.2019.2925934

Keywords

Speech emotion; frame-level features; LSTM; attention mechanism

Funding

  1. National Natural Science Foundation of China [61871213, 61673108, 61571106]
  2. Six Talent Peaks Project in Jiangsu Province [2016-DZXX-023]
  3. Natural Science Foundation of Jiangsu Province [BK20161517]

Abstract

Automatic speech emotion recognition has been a research hotspot in the field of human-computer interaction over the past decade. However, because the inherent temporal relationships within the speech waveform have received little attention, current recognition accuracy still needs improvement. To make full use of the differences in emotional saturation between time frames, a novel method is proposed for speech emotion recognition that combines frame-level speech features with attention-based long short-term memory (LSTM) recurrent neural networks. Frame-level speech features are extracted from the waveform to replace traditional statistical features, preserving the temporal relations of the original speech through the sequence of frames. To distinguish emotional saturation across frames, two improvements to LSTM based on the attention mechanism are proposed: first, the forget gate of the traditional LSTM is modified to reduce computational complexity without sacrificing performance; second, at the final output of the LSTM, attention is applied to both the time and the feature dimension to extract task-relevant information, instead of using only the output of the last iteration as in the traditional algorithm. Extensive experiments on the CASIA, eNTERFACE, and GEMEP emotion corpora demonstrate that the proposed approach outperforms the state-of-the-art algorithms reported to date.
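A minimal sketch of the idea the abstract describes — frame-level features fed to an LSTM whose frame-wise outputs are pooled by attention over both the time and the feature dimension, rather than taking only the last time step — is given below. This is not the authors' implementation: the layer sizes, the single-layer LSTM, the particular softmax attention form, and the six-class output are illustrative assumptions, and the modified forget gate is not reproduced here (PyTorch).

```python
# Sketch only: attention pooling over LSTM outputs along time and feature
# dimensions, as described in the abstract. All shapes and layer sizes are
# illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class AttentionLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=120, hidden_dim=128, num_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Temporal attention: one relevance score per frame.
        self.time_attn = nn.Linear(hidden_dim, 1)
        # Feature attention: one weight per hidden unit.
        self.feat_attn = nn.Linear(hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                                   # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)                                 # h: (batch, frames, hidden_dim)
        # Attention over time: weight emotionally salient frames instead of
        # keeping only the last LSTM output.
        time_w = torch.softmax(self.time_attn(h), dim=1)    # (batch, frames, 1)
        pooled = (time_w * h).sum(dim=1)                    # (batch, hidden_dim)
        # Attention over the feature dimension of the pooled representation.
        feat_w = torch.softmax(self.feat_attn(pooled), dim=-1)
        z = feat_w * pooled
        return self.classifier(z)                           # emotion logits

# Usage with dummy frame-level features (e.g. 300 frames of 120-d features):
model = AttentionLSTMClassifier()
logits = model(torch.randn(8, 300, 120))                   # shape: (8, 6)
```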


Recommended

Article Multidisciplinary Sciences

Perception and classification of emotions in nonsense speech: Humans versus machines

Emilia Parada-Cabaleiro, Anton Batliner, Maximilian Schmitt, Markus Schedl, Giovanni Costantini, Bjoern Schuller

Summary: This article addresses four fallacies in traditional affective computing and proposes a more adequate modelling of emotions encoded in speech. The fallacies are a limited focus on only a few emotions, a lack of comparison between clean and noisy data, insufficient assessment of machine learning approaches, and the absence of a strict comparison between human perception and machine classification. The article demonstrates that machine learning based on state-of-the-art feature representations can reflect the main emotional categories even under degraded acoustic conditions.

PLOS ONE (2023)

Article Health Care Sciences & Services

Assessing the Feasibility of a Text-Based Conversational Agent for Asthma Support: Protocol for a Mixed Methods Observational Study

Rafael A. Calvo, Dorian Peters, Laura Moradbakhti, Darren Cook, Georgios Rizos, Bjoern Schuller, Constantinos Kallis, Ernie Wong, Jennifer Quint

Summary: This study aims to determine the feasibility and usability of a text-based conversational agent to assess asthma risk and provide information for improving asthma control. The study will recruit 300 adult participants through various channels and assess their asthma outcomes. The study is expected to be completed in 2023, and will inform future pilot studies and randomized controlled trials.

JMIR RESEARCH PROTOCOLS (2023)

Article Engineering, Biomedical

Exploring interpretable representations for heart sound abnormality detection

Zhihua Wang, Kun Qian, Houguang Liu, Bin Hu, Bjorn W. Schuller, Yoshiharu Yamamoto

Summary: Non-invasive, real-time, and convenient computer audition-based heart sound abnormality detection has attracted increasing attention from the cardiovascular disease community. Motivated by the urgent need for robust detection algorithms in real environments, a comprehensive investigation of time-frequency methods for analyzing heart sounds is proposed. Experimental results show that the Stockwell transform outperforms the other methods with the highest overall score of 65.2%, and the interpretable results demonstrate that it provides more information and better noise robustness for heart sounds.

BIOMEDICAL SIGNAL PROCESSING AND CONTROL (2023)

Article Computer Science, Artificial Intelligence

Classification of stuttering-The ComParE challenge and beyond

Sebastian P. Bayerl, Maurice Gerczuk, Anton Batliner, Christian Bergler, Shahin Amiriparian, Bjoern Schuller, Elmar Noeth, Korbinian Riedhammer

Summary: The ACM Multimedia 2022 Computational Paralinguistics Challenge (ComParE) focused on the classification of stuttering, aiming to raise awareness and engage a wider research community. Stuttering is a complex speech disorder characterized by blocks, prolongations, and repetitions in speech. Accurate classification of stuttering symptoms is important for the development of self-help tools and specialized automatic speech recognition systems. This paper reviews the challenge contributions, presents improved state-of-the-art classification results, and explores cross-language training using the KSF-C dataset.

COMPUTER SPEECH AND LANGUAGE (2023)

Article Computer Science, Artificial Intelligence

Will Affective Computing Emerge From Foundation Models and General Artificial Intelligence? A First Evaluation of ChatGPT

Mostafa Amin, Erik W. Cambria, Bjorn Schuller

Summary: ChatGPT demonstrates the potential of general artificial intelligence capabilities and performs well across various natural language processing tasks. This study evaluates ChatGPT's text classification abilities for affective computing problems including personality prediction, sentiment analysis, and suicide tendency detection. Results show that task-specific RoBERTa models generally outperform other baselines, while ChatGPT performs decently and is comparable to Word2Vec and BoW baselines. ChatGPT exhibits robustness against noisy data, outperforming Word2Vec in such scenarios. The study concludes that ChatGPT is a good generalist model but not as specialized as task-specific models for optimal performance.

IEEE INTELLIGENT SYSTEMS (2023)

Article Acoustics

Robust Audio Watermarking Based on Empirical Mode Decomposition and Group Differential Relations

Wen-Hsing Lai, Tsung-Yuan Chou, Meng-Chen Chou, Bjoern W. Schuller

Summary: This paper proposes an audio watermarking technique using Complementary Ensemble Empirical Mode Decomposition and group differential relations. The technique achieves near-imperceptibility and robustness under various attacks, and the experimental results validate its effectiveness.

JOURNAL OF THE AUDIO ENGINEERING SOCIETY (2023)

Article Biology

Automated acoustic detection of Geoffroy's spider monkey highlights tipping points of human disturbance

Jenna Lawson, George Rizos, Dui Jasinghe, Andrew Whitworth, Bjoern Schuller, Cristina Banks-leite

Summary: With increasing human activity putting threatened species at risk of extinction, it is important to understand how to conserve them across human-modified landscapes. Passive acoustic monitoring (PAM) is an efficient method for collecting data on vocal species, but automated species detectors for analyzing large volumes of acoustic data are lacking. In this study, we used PAM and a newly developed automated detector to successfully detect the endangered Geoffroy's spider monkey, finding that the species was absent below a certain forest-cover threshold and near primary paved roads, and occurred equally in old-growth and secondary forests.

PROCEEDINGS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES (2023)

Article Computer Science, Artificial Intelligence

Automated composition of Galician Xota-tuning RNN-based composers for specific musical styles using deep Q-learning

Rodrigo Mira, Eduardo Coutinho, Emilia Parada-Cabaleiro, Bjoern W. Schuller

Summary: Music composition is challenging to automate due to the subjective nature of what is considered aesthetically pleasing. Past neural network-based methods have lacked consistency and failed to produce impressive results. In this project, we built upon Magenta's RL Tuner model and extended it to emulate the Galician Xota genre. By implementing a new rule-set and training a Deep Q Network using reward functions, we effectively enforced the desired style and structure on the generated compositions. Our research methodology provides a solid foundation for future studies using this architecture, and we propose further applications and improvements for this model in future work.

PEERJ COMPUTER SCIENCE (2023)

Article Computer Science, Artificial Intelligence

Can ChatGPT's Responses Boost Traditional Natural Language Processing?

Mostafa M. Amin, Erik Cambria, Bjoern W. Schuller

Summary: The use of foundation models is expanding, and ChatGPT has the potential to enhance existing NLP techniques with its novel knowledge.

IEEE INTELLIGENT SYSTEMS (2023)

Article Computer Science, Artificial Intelligence

Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap

Johannes Wagner, Andreas Triantafyllopoulos, Hagen Wierstorf, Maximilian Schmitt, Felix Burkhardt, Florian Eyben, Bjoern W. W. Schuller

Summary: Recent advances in transformer-based architectures have shown promise in several machine learning tasks, specifically speech emotion recognition (SER) in the audio domain. However, existing works have not thoroughly evaluated the influence of model size and pre-training data on downstream performance, and have shown limited attention to generalisation, robustness, fairness, and efficiency. This study conducts a thorough analysis on pre-trained variants of wav2vec 2.0 and HuBERT, demonstrating their top performance for valence prediction without explicit linguistic information, and releasing the best performing model to the community for reproducibility.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)

Article Clinical Neurology

Multilingual markers of depression in remotely collected speech samples: A preliminary analysis

Nicholas Cummins, Judith Dineley, Pauline Conde, Faith Matcham, Sara Siddi, Femke Lamers, Ewan Carr, Grace Lavelle, Daniel Leightley, Katie M. White, Carolin Oetzmann, Edward L. Campbell, Sara Simblett, Stuart Bruce, Josep Maria Haro, Brenda W. J. H. Penninx, Yatharth Ranjan, Zulqarnain Rashid, Callum Stewart, Amos A. Folarin, Raquel Bailon, Bjoern W. Schuller, Til Wykes, Srinivasan Vairavan, Richard J. B. Dobson, Vaibhav A. Narayan, RADAR-CNS Consortium

Summary: Speech rate, articulation rate, and intensity of speech are associated with depressive symptoms, suggesting that these speech features may serve as biomarkers for major depressive disorder (MDD). This study collected real-world data, providing significant insights into the onset and progress of MDD.

JOURNAL OF AFFECTIVE DISORDERS (2023)

Article Computer Science, Artificial Intelligence

Audio-Visual Gated-Sequenced Neural Networks for Affect Recognition

Decky Aspandi, Federico Sukno, Bjorn W. Schuller, Xavier Binefa

Summary: There is growing interest in automatic emotion recognition and affective computing. The availability of large video-based affect datasets has facilitated the development of deep learning-based models for automatic affect analysis. However, current approaches to processing these multimodal inputs are oversimplified and fail to fully exploit their potential. This work proposes a multi-modal, sequence-based neural network with gating mechanisms for affect recognition, achieving state-of-the-art accuracy on two affect datasets.

IEEE TRANSACTIONS ON AFFECTIVE COMPUTING (2023)

Editorial Material Computer Science, Artificial Intelligence

Guest Editorial Neurosymbolic AI for Sentiment Analysis

Frank Xing, Bjoern Schuller, Iti Chaturvedi, Erik Cambria, Amir Hussain

Summary: Neural network-based methods, such as word2vec and GPT-based models, have achieved significant progress in AI research, especially in handling large datasets. However, these methods lack in-depth understanding of the internal features and representations of the data, leading to various problems and concerns.

IEEE TRANSACTIONS ON AFFECTIVE COMPUTING (2023)

Article Computer Science, Artificial Intelligence

Self Supervised Adversarial Domain Adaptation for Cross-Corpus and Cross-Language Speech Emotion Recognition

Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, Bjorn Schuller

Summary: Despite recent advancements in speech emotion recognition (SER) within a single corpus, the performance of these systems degrades significantly for cross-corpus and cross-language scenarios. This is due to the lack of generalization in SER systems towards unseen conditions. Adversarial methods have been used to address this issue, but many only focus on cross-corpus SER and ignore the cross-language performance degradation. This study proposes an adversarial dual discriminator (ADDi) network and a self-supervised ADDi (sADDi) network to improve cross-corpus and cross-language SER without requiring target data labels. Experimental results demonstrate improved performance compared to state-of-the-art methods.

IEEE TRANSACTIONS ON AFFECTIVE COMPUTING (2023)

Article Computer Science, Artificial Intelligence

FENP: A Database of Neonatal Facial Expression for Pain Analysis

Jingjie Yan, Guanming Lu, Xiaonan Li, Wenming Zheng, Chengwei Huang, Zhen Cui, Yuan Zong, Mengying Chen, Qiang Hao, Yi Liu, Jindu Zhu, Haibo Li

Summary: In this article, a new neonatal facial expression database for pain analysis is introduced. The database, called facial expression of neonatal pain (FENP), consists of 11,000 neonatal facial expression images associated with 106 Chinese neonates. The experimental results show that the proposed database is suitable for studying neonatal pain and facial expression recognition.

IEEE TRANSACTIONS ON AFFECTIVE COMPUTING (2023)
