Review

Statistical parametric speech synthesis

Journal

SPEECH COMMUNICATION
Volume 51, Issue 11, Pages 1039-1064

Publisher

ELSEVIER
DOI: 10.1016/j.specom.2009.04.004

Keywords

Speech synthesis; Unit selection; Hidden Markov models

Funding

  1. Ministry of Education, Culture, Sports, Science and Technology (MEXT)
  2. Hori Information Science Promotion Foundation
  3. JSPS [1880009]
  4. European Community's Seventh Framework Programme [FP7/2007-2013]
  5. US National Science Foundation, Division of Information & Intelligent Systems [0415021]

Abstract

This review gives a general overview of techniques used in statistical parametric speech synthesis. One instance of these techniques, called hidden Markov model (HMM)-based speech synthesis, has recently been demonstrated to be very effective in synthesizing acceptable speech. This review also contrasts these techniques with the more conventional technique of unit-selection synthesis that has dominated speech synthesis over the last decade. The advantages and drawbacks of statistical parametric synthesis are highlighted and we identify where we expect key developments to appear in the immediate future. (C) 2009 Elsevier B.V. All rights reserved.
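As context for the contrast drawn in the abstract, the parametric approach generates a speech-parameter trajectory from statistical models rather than concatenating recorded units. A minimal sketch of that idea is maximum-likelihood parameter generation (MLPG): choose the static trajectory whose stacked static-plus-delta observations best fit per-frame Gaussians. The sketch below is illustrative only (a single feature dimension, a simplified central-difference delta window, made-up numbers) and is not code from the reviewed paper; it assumes NumPy.

```python
import numpy as np

def mlpg(mu, var, T):
    """Toy maximum-likelihood parameter generation.

    mu, var: length-2T arrays of stacked [static_t, delta_t] Gaussian
    means and (diagonal) variances for T frames.
    Returns the static trajectory c (length T) maximizing the likelihood
    of o = W c, where W maps statics to stacked static+delta features.
    """
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                 # static row: copies c[t]
        if t > 0:                         # delta row: 0.5*(c[t+1]-c[t-1])
            W[2 * t + 1, t - 1] = -0.5
        if t < T - 1:
            W[2 * t + 1, t + 1] = 0.5
    P = np.diag(1.0 / var)                # precision of the diagonal Gaussians
    # Weighted least squares: solve (W' P W) c = W' P mu
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)

# Means chosen so statics and deltas are mutually consistent.
mu = np.array([0.0, 0.5, 1.0, 1.0, 2.0, -0.5])  # frame t: [static_t, delta_t]
var = np.ones(6)
c = mlpg(mu, var, T=3)                    # -> [0., 1., 2.]
```

Because the delta constraints couple adjacent frames, the solved trajectory is smooth across state boundaries; real systems apply this per stream with delta and delta-delta windows over HMM state-level statistics.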

Recommended

Review Engineering, Electrical & Electronic

Deep Learning for Acoustic Modeling in Parametric Speech Generation

Zhen-Hua Ling, Shi-Yin Kang, Heiga Zen, Andrew Senior, Mike Schuster, Xiao-Jun Qian, Helen Meng, Li Deng

IEEE SIGNAL PROCESSING MAGAZINE (2015)

Article Acoustics

Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization

Heiga Zen, Norbert Braunschweiler, Sabine Buchholz, Mark J. F. Gales, Kate Knill, Sacha Krstulovic, Javier Latorre

IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING (2012)

Article Acoustics

Autoregressive Models for Statistical Parametric Speech Synthesis

Matt Shannon, Heiga Zen, William Byrne

IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING (2013)

Article Engineering, Electrical & Electronic

Speech Synthesis Based on Hidden Markov Models

Keiichi Tokuda, Yoshihiko Nankaku, Tomoki Toda, Heiga Zen, Junichi Yamagishi, Keiichiro Oura

PROCEEDINGS OF THE IEEE (2013)

Article Engineering, Electrical & Electronic

Speech Processing for Digital Home Assistants: Combining signal processing with deep-learning techniques

Reinhold Haeb-Umbach, Shinji Watanabe, Tomohiro Nakatani, Michiel Bacchiani, Bjoern Hoffmeister, Michael L. Seltzer, Heiga Zen, Mehrez Souden

IEEE SIGNAL PROCESSING MAGAZINE (2019)

Proceedings Paper Audiology & Speech-Language Pathology

PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS

Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu

Summary: PnG BERT is a new encoder model for neural TTS that incorporates phoneme and grapheme representations as input, resulting in more natural prosody and accurate pronunciation. Experimental results demonstrate that a neural TTS model pre-trained with PnG BERT outperforms baseline models.

INTERSPEECH 2021 (2021)

Proceedings Paper Audiology & Speech-Language Pathology

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan

Summary: WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis that generates high fidelity audio through an iterative refinement process and allows for a trade-off between inference speed and sample quality by adjusting the number of refinement steps. Experiments show that it approaches the performance of state-of-the-art neural TTS systems.

INTERSPEECH 2021 (2021)

Proceedings Paper Audiology & Speech-Language Pathology

Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation

Zhehuai Chen, Andrew Rosenberg, Yu Zhang, Heiga Zen, Mohammadreza Ghodsi, Yinghui Huang, Jesse Emond, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno

Summary: Semi and self-supervised training techniques can improve speech recognition performance without additional transcribed speech data. This study demonstrates the efficacy of two approaches by leveraging unspoken text and untranscribed audio, reducing word error rate in Indic language voice search tasks by up to 14.4%.

INTERSPEECH 2021 (2021)

Proceedings Paper Audiology & Speech-Language Pathology

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, R. J. Skerry-Ryan, Yonghui Wu

Summary: This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model that can learn token-frame alignments and durations automatically. Experimental results show that Parallel Tacotron 2 outperforms baselines in subjective naturalness in several diverse multi-speaker evaluations.

INTERSPEECH 2021 (2021)

Proceedings Paper Acoustics

Fully-Hierarchical Fine-Grained Prosody Modeling for Interpretable Speech Synthesis

Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Yonghui Wu

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (2020)

Proceedings Paper Acoustics

Generating Diverse and Natural Text-to-Speech Samples Using a Quantized Fine-Grained VAE and Autoregressive Prosody Prior

Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (2020)

Proceedings Paper Computer Science, Artificial Intelligence

Sequence-to-Sequence Neural Network Model with 2D Attention for Learning Japanese Pitch Accents

Antoine Bruguier, Heiga Zen, Arkady Arkhangorodsky

19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES (2018)

Proceedings Paper Acoustics

Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices

Heiga Zen, Yannis Agiomyrgiannakis, Niels Egberts, Fergus Henderson, Przemyslaw Szczepaniak

17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES (2016)

Proceedings Paper Acoustics

Multi-Language Multi-Speaker Acoustic Modeling for LSTM-RNN based Statistical Parametric Speech Synthesis

Bo Li, Heiga Zen

17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES (2016)

Proceedings Paper Acoustics

Directly Modeling Voiced and Unvoiced Components in Speech Waveforms by Neural Networks

Keiichi Tokuda, Heiga Zen

2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS (2016)

Article Acoustics

Compact deep neural networks for real-time speech enhancement on resource-limited devices

Fazal E. Wahab, Zhongfu Ye, Nasir Saleem, Rizwan Ullah

Summary: This study presents a compact neural model designed in a complex frequency domain for real-time speech enhancement. The proposed model outperforms benchmark models and improves speech quality and intelligibility. The incorporation of attention-gate-based skip connections further enhances the performance.

SPEECH COMMUNICATION (2024)