Automatic speech recognition of Portuguese phonemes using neural networks ensemble

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 229

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2023.120378

Keywords

Automatic speech recognition; Phonetic recognition; Artificial neural networks; Ensemble

Abstract

Automatic speech recognition based on phoneme detection provides advantages for online speech recognition. The development of such a system is multidisciplinary, involving linguistics, signal processing, and computational intelligence. This study proposes a novel approach that divides the decision space of speech recognition using an ensemble of neural network experts, leading to improved precision, sensitivity, and accuracy. A dynamic post-processing step is also employed to mitigate the oscillatory effect during recognition.
Automatic speech recognition based on phoneme detection offers advantages for online recognition of speech represented as a sound signal. Developing such a system is a multidisciplinary effort spanning several research areas, including linguistics, signal processing, and computational intelligence. In this work, the process starts with a pre-processing step that extracts the main features of the speech signal at a given instant of time. Inspired by the divide-and-conquer principle, we address the complexity of automatic speech recognition with models based on an ensemble of neural network experts. The ensemble divides the large decision space of speech recognition so that each expert handles only a delimited region of it. This novel application of the strategy improves the precision, sensitivity, and accuracy of the recognition process. Each expert issues a decision for each pre-processed input sample, and the resulting set of decisions is weighted: the expert with the highest weight for its output determines the final classification of the sample. A dynamic post-processing step, implemented as a recurrent neural network, is then executed to mitigate the oscillatory effect that occurs when recognizing classes with similar characteristics. Two ensembles are investigated: the first is based on clustering similar phonetic classes, while the second addresses the imbalanced distribution of samples in the training set. The proposed model achieves a 7.63% improvement in accuracy over the best previously reported related model for automatic speech recognition.
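
The decision rule and the smoothing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `ensemble_decide` and `smooth_labels`, the representation of each expert's output as a `(label, weight)` pair, and the example weights are all assumptions made for illustration. In particular, the paper's post-processor is a recurrent neural network; a sliding-window majority vote is swapped in here only to show what suppressing frame-to-frame oscillation looks like.

```python
from collections import Counter


def ensemble_decide(expert_outputs):
    """Pick the label proposed by the expert with the highest weight.

    expert_outputs: list of (label, weight) pairs, one per expert, for a
    single pre-processed sample (a hypothetical representation).
    """
    label, _ = max(expert_outputs, key=lambda pair: pair[1])
    return label


def smooth_labels(labels, window=3):
    """Sliding-window majority vote over a label sequence.

    A simple stand-in for the recurrent post-processing network: a frame
    keeps its label unless another label is strictly more frequent in
    the surrounding window.
    """
    half = window // 2
    smoothed = []
    for i, current in enumerate(labels):
        lo = max(0, i - half)
        hi = min(len(labels), i + half + 1)
        counts = Counter(labels[lo:hi])
        top = max(counts.values())
        if counts[current] == top:
            smoothed.append(current)  # current label is (tied for) the majority
        else:
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed


# Five consecutive frames, two hypothetical experts per frame.
frames = [
    [("a", 0.9), ("e", 0.2)],
    [("a", 0.8), ("e", 0.3)],
    [("a", 0.4), ("e", 0.5)],  # momentary flip between similar classes
    [("a", 0.7), ("e", 0.1)],
    [("a", 0.9), ("e", 0.2)],
]
raw = [ensemble_decide(frame) for frame in frames]
smoothed = smooth_labels(raw, window=3)
```

In this toy run the third frame is initially assigned to the competing class and is then corrected by its neighbours, which is the kind of oscillation between similar classes that the post-processing step is meant to mitigate.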