Article

Cost-Sensitive Multi-Label Learning for Audio Tag Annotation and Retrieval

Journal

IEEE TRANSACTIONS ON MULTIMEDIA
Volume 13, Issue 3, Pages 518-529

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TMM.2011.2129498

Keywords

Audio tag annotation; audio tag retrieval; cost-sensitive learning; multi-label; tag count

Funding

  1. National Science Council of Taiwan [NSC99-2631-H-001-020]

Abstract

Audio tags are keywords that people use to describe different aspects of a music clip. With the explosive growth of digital music available on the Web, automatic audio tagging, which can be used to annotate unknown music or retrieve desirable music, is becoming increasingly important. This can be achieved by training a binary classifier for each tag on labeled music data; our method, which won the MIREX 2009 audio tagging competition, is one such method. However, since social tags are usually assigned by people with different levels of musical knowledge, they inevitably contain noisy information. By treating tag counts as costs, we can model audio tagging as a cost-sensitive classification problem. In addition, tag correlation information is useful because some tags often co-occur; by considering these co-occurrences, we can model audio tagging as a multi-label classification problem. To exploit tag count and correlation information jointly, we formulate the audio tagging task as a novel cost-sensitive multi-label (CSML) learning problem and propose two solutions to it. The experimental results demonstrate that the new approach outperforms our MIREX 2009 winning method.
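To make the formulation concrete, the following is a minimal sketch of the cost-sensitive per-tag setup the abstract describes, assuming scikit-learn-style binary classifiers; treating each clip's tag count as a per-example cost via sample weights is one simple reading of the idea, and the feature matrix, function names, and weighting scheme here are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: cost-sensitive per-tag audio tagging (illustrative only).
# Assumes X is an (n_clips, n_features) array of audio features and
# tag_counts is an (n_clips, n_tags) array of raw social-tag counts.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_cost_sensitive_taggers(X, tag_counts):
    """Train one binary classifier per tag, weighting positives by tag count."""
    n_tags = tag_counts.shape[1]
    models = []
    for t in range(n_tags):
        counts = tag_counts[:, t]
        y = (counts > 0).astype(int)          # binary label: tag present or not
        # Cost-sensitive weighting (assumed scheme): clips tagged by more
        # users contribute more; untagged clips get unit weight.
        w = np.where(counts > 0, counts, 1.0)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, y, sample_weight=w)
        models.append(clf)
    return models

def annotate(models, X_new):
    """Return an (n_clips, n_tags) matrix of tag relevance scores."""
    return np.column_stack([m.predict_proba(X_new)[:, 1] for m in models])
```

Note that this sketch covers only the cost-sensitive part: the paper's CSML formulation additionally couples the per-tag problems through tag co-occurrence, which independent per-tag classifiers like these cannot capture.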

Recommended

Article Engineering, Electrical & Electronic

SVSNet: An End-to-End Speaker Voice Similarity Assessment Model

Cheng-Hung Hu, Yu-Huai Peng, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang

Summary: This paper introduces SVSNet, the first end-to-end neural network model for assessing speaker voice similarity in voice conversion tasks. Unlike most neural evaluation metrics, SVSNet takes raw waveform as input to make full use of speech information. Experimental results on VCC2018 and VCC2020 datasets show that SVSNet outperforms baseline systems in assessing speaker similarity at both utterance and system levels.

IEEE SIGNAL PROCESSING LETTERS (2022)

Article Acoustics

Generalization Ability Improvement of Speaker Representation and Anti-Interference for Speaker Verification

Qian-Bei Hong, Chung-Hsien Wu, Hsin-Min Wang

Summary: In this paper, two novel approaches are proposed to improve the generalization ability of speaker verification and reduce interference from other speakers. Experimental results show that these methods can significantly enhance system performance.

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING (2023)

Article Acoustics

Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features

Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

Summary: This study proposes MOSA-Net, a cross-domain multi-objective speech assessment model that can estimate speech quality, intelligibility, and distortion assessment scores simultaneously. Experimental results show that MOSA-Net improves the prediction of speech quality and short-time objective intelligibility compared to existing single-task models. Moreover, MOSA-Net can be effectively adapted to predict subjective quality and intelligibility scores with limited training data. The proposed QIA-SE approach, guided by MOSA-Net's latent representations, also outperforms the baseline SE system in terms of PESQ scores.

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING (2023)

Article Engineering, Electrical & Electronic

Multi-Target Extractor and Detector for Unknown-Number Speaker Diarization

Chin-Yi Cheng, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang

Summary: This study proposes a neural architecture that extracts speaker representations and detects the presence of each speaker on a frame-by-frame basis, regardless of the number of speakers in a conversation. The model outperforms previous methods in tests on the CALLHOME corpus and achieves significant diarization error rate reductions in a more challenging case with simultaneous speakers ranging from 2 to 7.

IEEE SIGNAL PROCESSING LETTERS (2023)

Article Acoustics

Decomposition and Reorganization of Phonetic Information for Speaker Embedding Learning

Qian-Bei Hong, Chung-Hsien Wu, Hsin-Min Wang

Summary: In this paper, a novel architecture based on self-constraint learning (SCL) and reconstruction task (RT) is proposed to remove the influence of phonetic information on speaker embedding generation. Experimental results show that the proposed DROP-TDNN system outperforms the state-of-the-art ECAPA-TDNN system on multiple datasets.

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING (2023)

Proceedings Paper Computer Science, Artificial Intelligence

Speech-enhanced and Noise-aware Networks for Robust Speech Recognition

Hung-Shin Lee, Pin-Yuan Chen, Yao-Fei Cheng, Yu Tsao, Hsin-Min Wang

Summary: A noise-aware training framework based on two cascaded neural structures is proposed in this paper to jointly optimize speech enhancement and speech recognition, achieving a lower word error rate (WER).

2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP) (2022)

Proceedings Paper Acoustics

Disentangling the Impacts of Language and Channel Variability on Speech Separation Networks

Fan-Lin Wang, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang

Summary: Given the strong performance of speech separation under complete speaker overlap, research attention has shifted toward more realistic scenarios. However, domain mismatch between training and testing conditions remains a significant problem. This study investigates the impacts of language and channel mismatches on speech separation and proposes a projection evaluation as a new remedy for channel mismatch.

INTERSPEECH 2022 (2022)

Proceedings Paper Acoustics

MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids

Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

Summary: This study proposes a multi-branched speech intelligibility prediction model (MBI-Net) to predict the subjective intelligibility scores of hearing aid users. Experimental results confirm the effectiveness of MBI-Net, which produces higher prediction scores than the baseline system.

INTERSPEECH 2022 (2022)

Proceedings Paper Acoustics

MTI-Net: A Multi-Target Speech Intelligibility Prediction Model

Ryandhimas Edo Zezario, Szu-wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

Summary: This study proposes a multi-task speech intelligibility prediction model, called MTI-Net, for simultaneously predicting human subjective listening test results and word error rate (WER) scores. Experimental results demonstrate the effectiveness of using cross-domain features, multi-task learning, and fine-tuning SSL embeddings.

INTERSPEECH 2022 (2022)

Proceedings Paper Acoustics

The VoiceMOS Challenge 2022

Wen-Chin Huang, Erica Cooper, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

Summary: The VoiceMOS Challenge aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthetic speech. Through this challenge, 22 participating teams from academia and industry tested various approaches to predict human ratings of synthesized speech. The results highlight the effectiveness of fine-tuning self-supervised speech models for MOS prediction, as well as the challenges in predicting MOS ratings for unseen speakers, listeners, and systems in the out-of-domain setting.

INTERSPEECH 2022 (2022)

Proceedings Paper Acoustics

NASTAR: Noise Adaptive Speech Enhancement with Target-Conditional Resampling

Chi-Chang Lee, Cheng-Hung Hu, Yu-Chen Lin, Chu-Song Chen, Hsin-Min Wang, Yu Tsao

Summary: In this paper, a method called NASTAR is proposed, which addresses the training-test acoustic mismatch issue in deep learning-based speech enhancement systems by using only one sample of noisy speech in the target environment. NASTAR utilizes a feedback mechanism to simulate adaptive training data and experimental results show its effectiveness in noise adaptation.

INTERSPEECH 2022 (2022)

Proceedings Paper Acoustics

Chain-based Discriminative Autoencoders for Speech Recognition

Hung-Shin Lee, Pin-Tuan Huang, Yao-Fei Cheng, Hsin-Min Wang

Summary: In this paper, we propose three new versions of a discriminative autoencoder (DcAE) for speech recognition, achieving superior experimental results.

INTERSPEECH 2022 (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Lip Sync Matters: A Novel Multimodal Forgery Detector

Sahibzada Adil Shahzad, Ammarah Hashmi, Sarwar Khan, Yan-Tsung Peng, Yu Tsao, Hsin-Min Wang

Summary: Deepfake technology has both positive and negative impacts on society. While there have been efforts to detect fake footage using unimodal deep learning models, this approach is insufficient for detecting multimodal manipulations. This study proposes a lip-reading-based multimodal Deepfake detection method called Lip Sync Matters, which shows superior performance in detecting forged videos.

PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Multimodal Forgery Detection Using Ensemble Learning

Ammarah Hashmi, Sahibzada Adil Shahzad, Wasim Ahmad, Chia Wen Lin, Yu Tsao, Hsin-Min Wang

Summary: This paper proposes a deep forgery detection method based on audiovisual ensemble learning for the task of multimodal forgery detection, achieving a high accuracy rate of 89% in experimental results.

PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Detecting Replay Attacks Using Single-Channel Audio: The Temporal Autocorrelation of Speech

Shih-Kuang Lee, Yu Tsao, Hsin-Min Wang

Summary: This paper proposes a new feature for replay detection, which utilizes the temporal autocorrelation of single-channel speech. The experimental results demonstrate that the proposed feature can effectively distinguish replay attacks, clean speech, and speech with simulated reverberation, and its utilization in a fusion system consistently improves performance. Moreover, the best fusion system achieves a zero equal error rate and a zero minimum tandem detection cost function for the first time on the development set. (A generic sketch of such an autocorrelation feature appears after this entry.)

PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) (2022)
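As a rough illustration of the kind of feature named in the summary above, here is a small NumPy sketch that computes a frame-wise normalized temporal autocorrelation of a single-channel signal; the framing parameters and normalization are assumptions for illustration, not the authors' exact feature extraction.

```python
# Illustrative sketch: frame-wise normalized temporal autocorrelation of
# single-channel speech (assumed parameters; not the paper's exact feature).
import numpy as np

def frame_autocorrelation(signal, frame_len=512, hop=256, max_lag=128):
    """Return an (n_frames, max_lag) matrix of normalized autocorrelations."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    feats = np.zeros((n_frames, max_lag))
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        frame = frame - frame.mean()
        energy = np.dot(frame, frame) + 1e-12   # guard against silent frames
        for lag in range(1, max_lag + 1):
            feats[i, lag - 1] = np.dot(frame[:-lag], frame[lag:]) / energy
    return feats
```

The intuition suggested by the summary is that the extra playback-and-rerecording channel in a replay attack alters this short-time correlation structure, so a detector trained on such frame-level features can separate replayed audio from clean or reverberant speech.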
