Article

Compressive Sensing for Missing Data Imputation in Noise Robust Speech Recognition

Journal

IEEE Journal of Selected Topics in Signal Processing

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/JSTSP.2009.2039171

Keywords

Automatic speech recognition (ASR); compressive sensing (CS); missing data techniques; noise robustness


An effective way to increase the noise robustness of automatic speech recognition is to label noisy speech features as either reliable or unreliable (missing), and to replace (impute) the missing ones with clean speech estimates. Conventional imputation techniques employ parametric models and impute the missing features on a frame-by-frame basis. At low signal-to-noise ratios (SNRs), these techniques fail because too many time frames may contain few, if any, reliable features. In this paper, we introduce a novel non-parametric, exemplar-based method for reconstructing clean speech from noisy observations, based on techniques from the field of Compressive Sensing. The method, dubbed sparse imputation, can impute missing features using larger time windows, such as entire words. Using an overcomplete dictionary of clean speech exemplars, the method finds the sparsest combination of exemplars that jointly approximate the reliable features of a noisy utterance. That linear combination of clean speech exemplars is then used to replace the missing features. Recognition experiments on noisy isolated digits show that sparse imputation outperforms conventional imputation techniques at SNR = - dB when using an ideal 'oracle' mask. With error-prone estimated masks, sparse imputation performs slightly worse than the best conventional technique.
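
As a rough illustration of the idea, the sketch below imputes the missing features of a single noisy word using an exemplar dictionary. It is not the authors' implementation: the dictionary, the mask, the variable names, and the use of scikit-learn's Lasso as an l1-regularized stand-in for the sparse recovery step are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_impute(dictionary, noisy, reliable_mask, alpha=0.01):
    """Illustrative sparse imputation of missing (unreliable) features.

    dictionary    : (n_features, n_exemplars) clean-speech exemplars,
                    each column a fixed-length time window (e.g. a word).
    noisy         : (n_features,) noisy observation in the same representation.
    reliable_mask : (n_features,) boolean, True where the feature is reliable.
    """
    A_rel = dictionary[reliable_mask]   # rows corresponding to reliable features
    y_rel = noisy[reliable_mask]

    # Find a sparse combination of exemplars that fits the reliable features;
    # Lasso is used here as a stand-in for the paper's l1-minimisation solver.
    solver = Lasso(alpha=alpha, max_iter=10000)
    solver.fit(A_rel, y_rel)
    x = solver.coef_                    # sparse activation vector over exemplars

    # Reconstruct the clean-speech estimate and keep the reliable observations.
    estimate = dictionary @ x
    imputed = np.where(reliable_mask, noisy, estimate)
    return imputed, x
```

With an oracle mask the reliable entries are trustworthy and only the fit over those rows matters; with an estimated mask, mask errors propagate into both the sparse fit and the reconstruction, which is why estimated masks are harder in practice.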

Authors

Jort F. Gemmeke, Hugo Van hamme, Bert Cranen, Lou Boves

Recommended

Article Computer Science, Artificial Intelligence

Low resource end-to-end spoken language understanding with capsule networks

Jakob Poncelet, Vincent Renkens, Hugo Van hamme

Summary: Designing a Spoken Language Understanding (SLU) system for command-and-control applications presents challenges. The proposed end-to-end SLU system uses capsule networks to learn new commands directly from user demonstrations, showing superior performance to baseline systems and versatility in multitask learning.

COMPUTER SPEECH AND LANGUAGE (2021)

Article Acoustics

Multi-encoder attention-based architectures for sound recognition with partial visual assistance

Wim Boes, Hugo Van Hamme

Summary: This study demonstrates a multi-encoder framework for sound recognition when visual information is only partially available. The proposed method successfully incorporates the partially available visual information into the network, leading to improved predictions.

EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING (2022)

Review Engineering, Biomedical

Relating EEG to continuous speech using deep neural networks: a review

Corentin Puffay, Bernd Accou, Lies Bollens, Mohammad Jalilpour Monesi, Jonas Vanthornhout, Hugo Van Hamme, Tom Francart

Summary: This paper reviews deep-learning-based studies that relate EEG to continuous speech, addressing methodological pitfalls and the need for a standard benchmark of model analysis.

JOURNAL OF NEURAL ENGINEERING (2023)

Article Engineering, Biomedical

Robust neural tracking of linguistic speech representations using a convolutional neural network

Corentin Puffay, Jonas Vanthornhout, Marlies Gillis, Bernd Accou, Hugo Van Hamme, Tom Francart

Summary: This study investigates neural tracking of continuous speech and finds that a non-linear CNN model outperforms linear models in measuring the coding of linguistic features.

JOURNAL OF NEURAL ENGINEERING (2023)

Article Chemistry, Multidisciplinary

Bidirectional Representations for Low-Resource Spoken Language Understanding

Quentin Meeus, Marie-Francine Moens, Hugo Van Hamme

Summary: The models proposed in this research provide a spoken command interface for applications where training data are limited and expensive to obtain. A transformer encoder-decoder framework with a multiobjective training strategy enables the model to learn contextual bidirectional representations, improving performance. In addition, class attention is introduced as an efficient module for spoken language understanding that provides explanations for model predictions and gives insight into the decision-making process.

APPLIED SCIENCES-BASEL (2023)

Proceedings Paper Acoustics

ANALYSIS OF XLS-R FOR SPEECH QUALITY ASSESSMENT

Bastiaan Tamm, Rik Vandenberghe, Hugo Van Hamme

Summary: This paper analyzes the performance of a pre-trained model for speech quality assessment. The study identifies two optimal feature regions, lower-level and high-level features, and explores how and why they differ. The paper also fuses the two optimal feature depths and assesses the performance of the resulting models.

2023 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, WASPAA (2023)

Proceedings Paper Computer Science, Artificial Intelligence

LEARNING TO JOINTLY TRANSCRIBE AND SUBTITLE FOR END-TO-END SPONTANEOUS SPEECH RECOGNITION

Jakob Poncelet, Hugo Van Hamme

Summary: TV subtitles are a valuable resource for transcribing various types of speech, but they cannot be used directly to improve Automatic Speech Recognition (ASR) models. A multitask dual-decoder Transformer model is proposed that performs ASR and automatic subtitling simultaneously. By training on subtitle data and exploiting it effectively, improvements are achieved in both regular and spontaneous/conversational ASR.

2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT (2022)

Proceedings Paper Computer Science, Artificial Intelligence

WEAK-SUPERVISED DYSARTHRIA-INVARIANT FEATURES FOR SPOKEN LANGUAGE UNDERSTANDING USING AN FHVAE AND ADVERSARIAL TRAINING

Jinzi Qi, Hugo Van hamme

Summary: This paper focuses on improving the accuracy and generalization ability of spoken language understanding systems for dysarthric speech by introducing adversarial training and weakly supervised feature extraction. The results show that this approach achieves higher accuracy in spoken language understanding tasks, especially for speakers with severe dysarthria.

2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT (2022)

Proceedings Paper Acoustics

Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications

Bastiaan Tamm, Helena Balabin, Rik Vandenberghe, Hugo Van Hamme

Summary: The quality of speech in online conferencing applications is typically evaluated with the mean opinion score (MOS) obtained from human judgments. However, this approach does not scale to large assessments, so researchers have turned to automated MOS prediction using deep neural networks. In this study, a feature extractor based on the XLS-R model is proposed to improve fine-tuning by reducing the number of trainable model parameters. The pre-trained XLS-R embeddings are compared to an MFCC-based feature extractor on the ConferencingSpeech 2022 MOS prediction task, yielding a lower root mean squared error (RMSE).

INTERSPEECH 2022 (2022)

Proceedings Paper Acoustics

Bottleneck Low-rank Transformers for Low-resource Spoken Language Understanding

Pu Wang, Hugo Van Hamme

Summary: In this study, a lean Transformer structure is proposed to build compact spoken language understanding models in low-resource settings. The dimension of the attention mechanism is reduced automatically using group sparsity, and the learned attention subspace is transferred to an attention bottleneck layer, achieving competitive accuracies without pre-training.

INTERSPEECH 2022 (2022)

Proceedings Paper Acoustics

MULTITASK LEARNING FOR LOW RESOURCE SPOKEN LANGUAGE UNDERSTANDING

Quentin Meeus, Marie Francine Moens, Hugo Van Hamme

Summary: In this paper, we investigate the advantages of multitask learning in speech processing by training models on dual objectives, including automatic speech recognition, intent classification, and sentiment classification. Our experiments show that multitask learning can improve model performance, especially in low-resource scenarios. The results demonstrate that multitask learning can effectively compete with and even outperform baseline models in Dutch and English domains.

INTERSPEECH 2022 (2022)

Proceedings Paper Acoustics

Relating the fundamental frequency of speech with EEG using a dilated convolutional network

Corentin Puffay, Jana Van Canneyt, Jonas Vanthornhout, Hugo Van Hamme, Tom Francart

Summary: This study investigates how speech is processed in the brain using nonlinear models, focusing on the fundamental frequency (f0) feature and the speech envelope. The results show that combining f0 and the speech envelope improves the performance of the state-of-the-art envelope-based model, and the dilated-convolutional model can generalize well to subjects not included during training.

INTERSPEECH 2022 (2022)

Proceedings Paper Acoustics

Continual Learning for Monolingual End-to-End Automatic Speech Recognition

Steven Vander Eeckt, Hugo Van Hamme

Summary: This paper discusses the use of Continual Learning (CL) methods to overcome the deterioration in performance of Automatic Speech Recognition (ASR) models when adapted to new domains. The study found that the best performing CL method improved the model's performance by over 40% when extending it to new tasks, using only 0.6% of the original data.

2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022) (2022)

Proceedings Paper Acoustics

LEARNING SUBJECT-INVARIANT REPRESENTATIONS FROM SPEECH-EVOKED EEG USING VARIATIONAL AUTOENCODERS

Lies Bollens, Tom Francart, Hugo Van Hamme

Summary: The electroencephalogram (EEG) is a powerful method to understand how the brain processes speech. Deep neural networks have replaced linear models and shown promising results in this field. This study utilizes factorized hierarchical variational autoencoders to analyze parallel EEG recordings.

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) (2022)

Proceedings Paper Computer Science, Interdisciplinary Applications

Cross-lingual Detection of Dysphonic Speech for Dutch and Hungarian Datasets

David Sztaho, Miklos Gabriel Tulics, Jinzi Qi, Hugo Van Hamme, Klara Vicsi

Summary: This paper extends the detection of dysphonic voices to a cross-lingual scenario. Using Hungarian and Dutch speech datasets, automatic separation and severity level estimation are performed. The results show that cross-lingual detection is possible with acceptable generalization ability, and that features calculated at the phoneme level can improve the results.

BIOSIGNALS: PROCEEDINGS OF THE 15TH INTERNATIONAL JOINT CONFERENCE ON BIOMEDICAL ENGINEERING SYSTEMS AND TECHNOLOGIES - VOL 4: BIOSIGNALS (2022)
