Article

Learning from noisy out-of-domain corpus using dataless classification

Journal

NATURAL LANGUAGE ENGINEERING
Volume 28, Issue 1, Pages 39-69

Publisher

CAMBRIDGE UNIV PRESS
DOI: 10.1017/S1351324920000340

Keywords

Text classification; Dataless classification; Noisy labels; Domain adaptation

Funding

  1. 100th Anniversary Chulalongkorn University Fund for Doctoral Scholarship
  2. 90th Anniversary Chulalongkorn University Fund (Ratchadaphiseksomphot Endowment Fund)

This study proposes a two-stage approach to mitigate the lack of accurately labelled documents in text classification. Representative keywords are mined from a noisy out-of-domain data set using statistical methods, and a dataless classification method is applied to learn from selected keywords and unlabelled in-domain data. The proposed approach outperforms supervised learning and dataless classification baselines, and in-depth analysis explains its superiority.
In real-world applications, text classification models often suffer from a lack of accurately labelled documents. The labelled documents that are available may also be out of domain, so a model trained on them performs poorly in the target domain. In this work, we mitigate this data problem using a two-stage approach. First, we mine representative keywords from a noisy out-of-domain data set using statistical methods. We then apply a dataless classification method to learn from the automatically selected keywords and unlabelled in-domain data. The proposed approach outperformed various supervised learning and dataless classification baselines by a large margin. We evaluated different keyword selection methods both intrinsically and extrinsically, the latter by measuring their impact on dataless classification accuracy. Finally, we conducted an in-depth analysis of the classifier's behaviour and explained why the proposed dataless classification method outperformed its supervised learning counterparts.
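
The two-stage idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not say which statistic is used for keyword mining or which dataless classifier is applied, so this sketch uses a smoothed log-odds score over document frequencies for stage one and plain keyword voting for stage two. The function names (`mine_keywords`, `classify`) and the toy corpus are hypothetical and are not taken from the paper. A real dataless method would also learn from the unlabelled in-domain corpus rather than only match keywords.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def mine_keywords(labelled_docs, top_k=2, smoothing=1.0):
    """Stage 1 (assumed statistic): rank words per class by smoothed log-odds.

    labelled_docs is a list of (text, noisy_label) pairs from the
    out-of-domain corpus. Scores compare a word's document frequency
    inside a class with its document frequency in all other classes.
    """
    df_per_class = defaultdict(Counter)
    for text, label in labelled_docs:
        df_per_class[label].update(set(tokenize(text)))  # count each word once per doc

    docs_per_class = Counter(label for _, label in labelled_docs)
    total_docs = len(labelled_docs)
    total_df = Counter()
    for counts in df_per_class.values():
        total_df.update(counts)

    def logit(p):
        return math.log(p / (1.0 - p))

    keywords = {}
    for label, counts in df_per_class.items():
        n_in = docs_per_class[label]
        n_out = total_docs - n_in
        scores = {}
        for word, df_in in counts.items():
            df_out = total_df[word] - df_in
            p_in = (df_in + smoothing) / (n_in + 2 * smoothing)
            p_out = (df_out + smoothing) / (n_out + 2 * smoothing)
            scores[word] = logit(p_in) - logit(p_out)
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[label] = ranked[:top_k]
    return keywords


def classify(text, keywords):
    """Stage 2 (stand-in): label a document by counting keyword hits.

    Returns None when no keyword of any class occurs in the text.
    """
    tokens = Counter(tokenize(text))
    votes = {label: sum(tokens[w] for w in words) for label, words in keywords.items()}
    best = max(sorted(votes), key=lambda label: votes[label])
    return best if votes[best] > 0 else None


# Toy out-of-domain corpus; the last label is deliberately noisy.
noisy_corpus = [
    ("the team won the match", "sports"),
    ("great match and late goal", "sports"),
    ("the team scored the goal", "sports"),
    ("the market fell as stocks dropped", "finance"),
    ("stocks rallied and the market rose", "finance"),
    ("the team bought stocks", "sports"),
]

keywords = mine_keywords(noisy_corpus, top_k=2)

# Unlabelled in-domain documents are labelled using only the mined keywords.
for doc in ["market closed higher", "the goal came late", "weather was fine"]:
    print(doc, "->", classify(doc, keywords))
```

In this toy setting the mislabelled document does not push "stocks" into the sports keyword list, because the word is still far more frequent in the finance class. That loosely mirrors the motivation for mining keywords instead of training directly on noisy labels, although the paper's own analysis should be consulted for the actual explanation.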
