Article

E2E-DASR: End-to-end deep learning-based dysarthric automatic speech recognition

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 222

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2023.119797

Keywords

Dysarthria; Dysarthric ASR; Speech intelligibility; Word error; Multi-head transformer; CNN


Dysarthria is a speech disability caused by weakness in the muscles and organs involved in articulation, reducing speech intelligibility. This paper proposes a spatio-temporal dysarthric ASR (DASR) system that uses a spatial CNN (SCNN) and a multi-head attention transformer (MHAT) to visually model dysarthric speech. The DASR system outperformed the other systems examined, improving recognition accuracy on the UA-Speech database by 20.72%, with the largest gains for very-low (25.75%) and low (33.67%) intelligibility speech.
Dysarthria is a motor speech disability caused by weakness in the muscles and organs involved in the articulation process, thereby reducing the speech intelligibility of affected individuals. Because the condition is linked to physically exhausting disabilities, these individuals not only have communication difficulties but also struggle to interact with digital devices. Automatic speech recognition (ASR) makes an important difference for individuals with dysarthria, since modern digital devices offer an interaction medium that enables them to engage with their community and with computers. Still, ASR technologies perform poorly on dysarthric speech, particularly for acute dysarthria. Dysarthric ASR faces multiple challenges, including dysarthric phoneme inaccuracy and labeling imperfection. This paper proposes a spatio-temporal dysarthric ASR (DASR) system that uses a Spatial Convolutional Neural Network (SCNN) and a Multi-Head Attention Transformer (MHAT) to visually extract speech features, allowing DASR to learn the shapes of phonemes as pronounced by dysarthric individuals. This visual feature modeling sidesteps the phoneme-related challenges. The UA-Speech database, which includes speakers with different speech intelligibility levels, is used in this work. However, because the proportion of usable speech data to the number of distinct classes in UA-Speech is small, the proposed DASR system leverages transfer learning to generate synthetic visual data. In benchmarking against the other DASR systems examined in this study, the proposed system outperformed them all, improving recognition accuracy on the UA-Speech database by 20.72%. The largest improvements were achieved for very-low (25.75%) and low (33.67%) intelligibility speech.
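The paper itself does not publish code here, but the pipeline described in the abstract (a spatial CNN extracting visual features from a spectrogram, followed by a multi-head attention transformer over the time axis, ending in an isolated-word classifier) can be sketched roughly as below. This is a hedged illustration, not the authors' implementation: the layer sizes, class count, and all names are assumptions.

```python
# Hedged sketch (NOT the authors' code): a minimal SCNN + MHAT pipeline in the
# spirit of the abstract. A spatial CNN treats the (freq x time) spectrogram as
# an image; a transformer encoder applies multi-head attention over time frames;
# mean pooling feeds an isolated-word classifier. All sizes are illustrative.
import torch
import torch.nn as nn

class DASRSketch(nn.Module):
    def __init__(self, n_classes=255, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # Spatial CNN: convolutions pool the frequency axis, keep the time axis.
        self.scnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        # Project flattened (channels x freq) per frame to the model dimension.
        self.proj = nn.LazyLinear(d_model)
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.mhat = nn.TransformerEncoder(enc, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)  # isolated-word logits

    def forward(self, spec):                  # spec: (batch, 1, freq, time)
        f = self.scnn(spec)                   # (batch, C, freq', time)
        f = f.permute(0, 3, 1, 2).flatten(2)  # (batch, time, C * freq')
        f = self.proj(f)                      # (batch, time, d_model)
        f = self.mhat(f)                      # multi-head attention over frames
        return self.head(f.mean(dim=1))       # pool over time -> word logits

model = DASRSketch()
logits = model(torch.randn(2, 1, 80, 100))    # e.g. 80 mel bins, 100 frames
print(logits.shape)                           # (2, n_classes)
```

In a transfer-learning setup like the one the abstract alludes to, the SCNN front end could be pre-trained on a larger typical-speech corpus and fine-tuned on the dysarthric data; that step is omitted here.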

