Article

The Emotion Probe: On the Universality of Cross-Linguistic and Cross-Gender Speech Emotion Recognition via Machine Learning

Journal

SENSORS
Volume 22, Issue 7

Publisher

MDPI
DOI: 10.3390/s22072461

Keywords

speech; emotion recognition; artificial intelligence; English; cross-linguistic; cross-gender; SVM; machine learning; SER

This study investigates the feasibility and characteristics of cross-linguistic and cross-gender speech emotion recognition (SER). The results show that the MLP classifier is the most effective, with accuracies exceeding 90% for single-language approaches and over 80% for cross-language classification. Cross-gender tasks are found to be more challenging than tasks involving different languages, indicating significant differences in emotions expressed by male and female subjects. RASTA, F0, MFCC, and spectral energy are identified as the most effective feature domains.
Machine Learning (ML) algorithms within a human-computer framework are the leading force in speech emotion recognition (SER). However, few studies explore cross-corpora aspects of SER; this work aims to explore the feasibility and characteristics of cross-linguistic, cross-gender SER. Three ML classifiers (SVM, Naive Bayes and MLP) are applied to acoustic features, obtained through a procedure based on Kononenko's discretization and correlation-based feature selection. The system encompasses five emotions (disgust, fear, happiness, anger and sadness), using the Emofilm database, which comprises short clips of English movies and the respective Italian and Spanish dubbed versions, for a total of 1115 annotated utterances. MLP proves to be the most effective classifier, with accuracies higher than 90% for single-language approaches, while the cross-language classifier still yields accuracies higher than 80%. Cross-gender tasks prove more difficult than those involving two languages, suggesting greater differences between emotions expressed by male versus female subjects than between different languages. Four feature domains, namely RASTA, F0, MFCC and spectral energy, are algorithmically assessed as the most effective, refining existing literature and approaches based on standard sets. To our knowledge, this is one of the first studies encompassing cross-gender and cross-linguistic assessments on SER.
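The abstract names correlation-based feature selection without giving details, so the following is only a minimal sketch of the general idea, not the authors' procedure. It scores a feature subset with the commonly used CFS merit, k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-to-class correlation and r_ff the mean feature-to-feature correlation, and grows the subset greedily. Two assumptions are made for illustration: plain Pearson correlation on raw values stands in for the discretization-based measure the paper's pipeline would use, and the feature names and toy values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_x * var_y)


def cfs_merit(features, labels):
    """Merit of a subset: high class correlation, low mutual redundancy."""
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(columns, labels):
    """Forward selection: add the feature that raises merit most, else stop."""
    selected, best = [], 0.0
    remaining = dict(columns)
    while remaining:
        scores = {
            name: cfs_merit([columns[s] for s in selected] + [col], labels)
            for name, col in remaining.items()
        }
        name = max(scores, key=scores.get)
        if scores[name] <= best:
            break  # no candidate improves the merit
        selected.append(name)
        best = scores[name]
        del remaining[name]
    return selected, best


# Hypothetical toy data: six utterances, binary emotion label.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "f0": [1, 2, 1, 5, 6, 5],      # strongly tied to the label
    "energy": [2, 2, 1, 4, 6, 5],  # informative but redundant with f0
    "noise": [3, 1, 2, 3, 1, 2],   # uncorrelated with the label
}

selected, merit = greedy_cfs(columns, labels)
print(selected, round(merit, 4))
```

On this toy input the redundant and uninformative columns are both rejected, which is the behavior the merit function is designed to produce: a second feature is only kept if what it adds about the class outweighs its overlap with features already chosen.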

Recommended

Article Environmental Sciences

An Exploratory Study on the Acoustic Musical Properties to Decrease Self-Perceived Anxiety

Emilia Parada-Cabaleiro, Anton Batliner, Markus Schedl

Summary: Musical listening is widely used to reduce anxiety, but the acoustic properties of anxiety-reducing music have not been thoroughly studied. This study explores whether the acoustic parameters used in music emotion recognition are also suitable for identifying music with relaxing properties. The results show that when using classical Western music to reduce anxiety, tonal music should be considered and harmonicity is an appropriate indicator of relaxing music. Further research is needed to understand the role of scoring and dynamics in reducing listener distress.

INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH (2022)

Correction Engineering, Electrical & Electronic

The perception of emotional cues by children in artificial background noise (vol 23, pg 169, 2020)

Emilia Parada-Cabaleiro, Anton Batliner, Alice Baird, Bjorn Schuller

INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY (2022)

Article Computer Science, Information Systems

Capturing Time Dynamics From Speech Using Neural Networks for Surgical Mask Detection

Shuo Liu, Adria Mallol-Ragolta, Tianhao Yan, Kun Qian, Emilia Parada-Cabaleiro, Bin Hu, Bjoern W. Schuller

Summary: This paper presents two effective neural network models to detect surgical masks from audio, which can extract more salient temporal information. By exploring the combination of LSTM and Transformers in three hybrid models, it is demonstrated that one of the hybrid models achieves the best performance.

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS (2022)

Article Computer Science, Artificial Intelligence

Deep learning and machine learning-based voice analysis for the detection of COVID-19: A proposal and comparison of architectures

Giovanni Costantini, Valerio Cesarini, Carlo Robotti, Marco Benazzo, Filomena Pietrantonio, Stefano Di Girolamo, Antonio Pisani, Pietro Canzi, Simone Mauramati, Giulia Bertino, Irene Cassaniti, Fausto Baldanti, Giovanni Saggio

Summary: A novel approach for COVID-19 assessment is adopted in this study, using different vocal tasks and two custom algorithms to identify COVID-19 positive and negative subjects as well as recovered individuals. The results suggest that this method could serve as an on-site screening tool, highlighting the ongoing relevance of traditional machine learning and deep learning in speech analysis.

KNOWLEDGE-BASED SYSTEMS (2022)

Article Computer Science, Artificial Intelligence

Machine learning- and statistical-based voice analysis of Parkinson's disease patients: A survey

Federica Amato, Giovanni Saggio, Valerio Cesarini, Gabriella Olmo, Giovanni Costantini

Summary: The preliminary diagnosis and evaluation of Parkinson's disease is crucial. Real-time, non-invasive voice analysis enhanced by machine learning is gaining interest. This review aims to identify the most widely used feature-based machine learning methods and present their effectiveness. A total of 102 works and 5 review articles were selected, analyzing commonly used features, algorithms, datasets, and metadata. Jitter, Shimmer, Harmonic-to-Noise Ratio, Fundamental Frequency, and Mel Frequency Cepstral Coefficients were found to be the most adopted features, with a prevalence of glottal-like models and additional filtering options.

EXPERT SYSTEMS WITH APPLICATIONS (2023)

Article Chemistry, Analytical

High-Level CNN and Machine Learning Methods for Speaker Recognition

Giovanni Costantini, Valerio Cesarini, Emanuele Brenna

Summary: This paper compares deep learning and traditional machine learning on a speaker recognition task using the DEMoS dataset. The results show that a custom CNN trained on grayscale spectrogram images achieves the most accurate results, with an accuracy of 90.15% for grayscale spectrograms and 83.17% for colored MFCC.

SENSORS (2023)

Article Chemistry, Analytical

Artificial Intelligence-Based Voice Assessment of Patients with Parkinson's Disease Off and On Treatment: Machine vs. Deep-Learning Comparison

Giovanni Costantini, Valerio Cesarini, Pietro Di Leo, Federica Amato, Antonio Suppa, Francesco Asci, Antonio Pisani, Alessandra Calculli, Giovanni Saggio

Summary: This study analyzed the voice characteristics of Parkinson's disease patients using machine learning techniques, and compared different feature selection and classification algorithms. The results showed that feature-based machine learning and deep learning achieved comparable classification results, with KNN, SVM, and naive Bayes classifiers performing similarly. CFS stood out more clearly as the best feature selector, and the selected features acted as relevant vocal biomarkers capable of differentiating healthy subjects, early untreated PD patients, and mid-advanced L-Dopa treated patients.

SENSORS (2023)

Proceedings Paper Computer Science, Cybernetics

LFM-2b: A Dataset of Enriched Music Listening Events for Recommender Systems Research and Fairness Analysis

Markus Schedl, Stefan Brandl, Oleg Lesota, Emilia Parada-Cabaleiro, David Penz, Navid Rekabsaz

Summary: The LFM-2b dataset contains the listening records of over 120,000 users on Last.fm, spanning 15 years and involving 50 million distinct tracks and 5 million distinct artists. In addition to common metadata, the dataset also includes demographic information of users and fine-grained genre, style, and lyrics embeddings of items. This rich dataset enables research on various recommender system algorithms and investigation of fairness aspects.

CHIIR'22: PROCEEDINGS OF THE 2022 CONFERENCE ON HUMAN INFORMATION INTERACTION AND RETRIEVAL (2022)

Proceedings Paper Engineering, Biomedical

Obesity and Gastro-Esophageal Reflux voice disorders: a Machine Learning approach

Federica Amato, Maria Fasani, Glauco Raffaelli, Valerio Cesarini, Gabriella Olmo, Nicola Di Lorenzo, Giovanni Costantini, Giovanni Saggio

Summary: Automatic assessment of the influence of obesity and GERD on voice and their mutual influence was conducted using vocal tests from 92 subjects. Machine Learning models achieved high accuracies in scoring the presence of GERD and obesity. Sentence repetition was found to be more effective than vowel phonation, and certain features such as Mel Frequency Cepstral Coefficients were identified as significant for this application.

2022 IEEE INTERNATIONAL SYMPOSIUM ON MEDICAL MEASUREMENTS AND APPLICATIONS (MEMEA 2022) (2022)

Proceedings Paper Computer Science, Interdisciplinary Applications

Machine Learning-based Study of Dysphonic Voices for the Identification and Differentiation of Vocal Cord Paralysis and Vocal Nodules

Valerio Cesarini, Carlo Robotti, Ylenia Piromalli, Francesco Mozzanica, Antonio Schindler, Giovanni Saggio, Giovanni Costantini

Summary: This study developed a machine-learning framework to automatically identify and differentiate dysphonic voices. The framework achieved high accuracy and differentiation rates, suggesting the potential for distinguishing the etiologies of dysphonia. The analysis also highlighted a trend of poor volume control in dysphonic subjects, refining existing literature.

BIOSIGNALS: PROCEEDINGS OF THE 15TH INTERNATIONAL JOINT CONFERENCE ON BIOMEDICAL ENGINEERING SYSTEMS AND TECHNOLOGIES - VOL 4: BIOSIGNALS (2022)
