A unified approach to transfer learning of deep neural networks with applications to speaker adaptation in automatic speech recognition

NEUROCOMPUTING (2016)

Journal

NEUROCOMPUTING

Volume 218, Pages 448-459

Publisher

ELSEVIER

DOI: 10.1016/j.neucom.2016.09.018

Keywords

Transfer learning; Speaker adaptation; Deep neural network; Multi-task learning

Abstract

In this paper, we present a unified approach to transfer learning of deep neural networks (DNNs) to address performance degradation issues caused by a potential acoustic mismatch between the training and testing conditions due to inter-speaker variability in state-of-the-art connectionist (a.k.a. hybrid) automatic speech recognition (ASR) systems. Different schemes to transfer knowledge of deep neural networks related to speaker adaptation can be developed with ease under such a unifying concept, as demonstrated in the three frameworks investigated in this study. In the first solution, knowledge is transferred between homogeneous domains, namely the source and the target domains. Moreover, the transfer takes place in a sequential manner from the source to the target speaker to boost the ASR accuracy on spoken utterances from a surprise target speaker. In the second solution, a multi-task approach is adopted to adjust the connectionist parameters to improve the ASR system performance on the target speaker. Knowledge is transferred simultaneously among heterogeneous tasks, and that is achieved by adding one or more smaller auxiliary output layers to the original DNN structure. In the third solution, DNN output classes are organised into a hierarchical structure in order to adjust the connectionist parameters and close the gap between training and testing conditions by transferring prior knowledge from the root node to the leaves in a structural maximum a posteriori fashion. Through a series of experiments on the Wall Street Journal (WSJ) speech recognition task, we show that the proposed solutions result in consistent and statistically significant word error rate reductions.
Most importantly, we show that transfer learning is an enabling technology for speaker adaptation, since it outperforms both the transformation-based adaptation algorithms usually adopted in the speech community and the multi-condition training (MCT) schemes, which are data combination methods often adopted to cover more acoustic variabilities in speech when data from the source and target domains are both available at training time. Finally, experimental evidence demonstrates that all proposed solutions are robust to negative transfer even when only a single sentence from the target speaker is available. (C) 2016 Elsevier B.V. All rights reserved.
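
The second solution in the abstract (a shared network with one or more smaller auxiliary output layers) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `MultiTaskDNN`, the single hidden layer, the layer sizes, the choice of auxiliary targets, and the `aux_weight` value are all assumptions made only for illustration, since the abstract does not specify them.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the correct class per frame.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

class MultiTaskDNN:
    """Hypothetical one-hidden-layer network with a main output layer
    and a smaller auxiliary output layer sharing the hidden layer."""

    def __init__(self, n_in, n_hidden, n_main, n_aux, seed=0):
        rng = np.random.default_rng(seed)
        self.W_h = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b_h = np.zeros(n_hidden)
        self.W_main = rng.normal(0.0, 0.1, (n_hidden, n_main))
        self.b_main = np.zeros(n_main)
        self.W_aux = rng.normal(0.0, 0.1, (n_hidden, n_aux))
        self.b_aux = np.zeros(n_aux)

    def forward(self, x):
        # Shared hidden representation feeds both output layers.
        h = np.tanh(x @ self.W_h + self.b_h)
        p_main = softmax(h @ self.W_main + self.b_main)
        p_aux = softmax(h @ self.W_aux + self.b_aux)
        return p_main, p_aux

def multitask_loss(model, x, y_main, y_aux, aux_weight=0.3):
    # Main-task loss plus a weighted auxiliary-task loss (weight assumed).
    p_main, p_aux = model.forward(x)
    return cross_entropy(p_main, y_main) + aux_weight * cross_entropy(p_aux, y_aux)

# Toy data: 8 frames of 20-dim features, 50 main classes, 10 auxiliary classes.
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 20))
y_main = rng.integers(0, 50, size=8)
y_aux = rng.integers(0, 10, size=8)
model = MultiTaskDNN(n_in=20, n_hidden=32, n_main=50, n_aux=10)
loss = multitask_loss(model, x, y_main, y_aux)
```

Because both output layers read the same hidden layer, gradients of the combined loss with respect to the shared weights would carry information from both tasks, which is the sense in which knowledge is transferred simultaneously in this sketch.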

Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages

Kurniawati Azizah, Wisnu Jatmiko

Summary: This study proposes a novel training strategy and speech synthesis model to address the issues of data scarcity in low-resource languages and unsatisfactory performance in zero-shot speaker adaptation. Through the use of multi-stage transfer learning and explicit style control, the proposed model successfully improves the intelligibility of synthesized speech and speaker similarity.

IEEE ACCESS (2022)