Article

Domain Adaptation Speech-to-Text for Low-Resource European Portuguese Using Deep Learning

Journal

FUTURE INTERNET
Volume 15, Issue 5

Publisher

MDPI
DOI: 10.3390/fi15050159

Keywords

machine learning; deep learning; deep neural networks; speech-to-text; automatic speech recognition; NVIDIA NeMo; GPUs; data-centric; Portuguese language


This paper presents the optimization and evaluation of a deep learning automatic speech recognition (ASR) system for European Portuguese. The pipeline comprises several stages: data acquisition, analysis, pre-processing, model creation, and evaluation. Transfer learning is employed, starting from an English-optimized model and adapting it to European Portuguese with the help of a dataset containing mainly Brazilian Portuguese. Domain adaptation between European Portuguese and mixed Portuguese (mostly Brazilian) is investigated. The optimization and evaluation use the NVIDIA NeMo framework with the QuartzNet15x5 architecture, achieving a state-of-the-art word error rate (WER) of 0.0503.
Automatic speech recognition (ASR), commonly known as speech-to-text, is the process of transcribing audio recordings into text, i.e., transforming speech into the corresponding sequence of words. This paper presents the optimization and evaluation of a deep learning ASR system for the European Portuguese language. We present a pipeline composed of several stages: data acquisition, analysis, pre-processing, model creation, and evaluation. We propose a transfer learning approach with three components: an English-optimized model as the starting point; European Portuguese as the target; and, as an additional source from a different domain, a multi-variant Portuguese dataset composed mainly of Brazilian Portuguese. Domain adaptation between European Portuguese and mixed (mostly Brazilian) Portuguese was investigated. The optimization and evaluation used the NVIDIA NeMo framework implementing the QuartzNet15x5 architecture, which is based on 1D time-channel separable convolutions. Following this data-centric transfer learning approach, the model was optimized, achieving a state-of-the-art word error rate (WER) of 0.0503.
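The reported metric can be made concrete with a short sketch. WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The function below is a minimal illustration of that standard definition; the function name and the example sentences are hypothetical, and the abstract does not state how the paper's evaluation tooling normalizes text or aggregates over utterances.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (not taken from the paper): splits on whitespace
    and applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
example_wer = word_error_rate("o gato está em casa", "o gato esta em casa")
```

Under this definition, a WER of 0.0503 corresponds to roughly five word errors per hundred reference words.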
