Article

Data Quality Matters: A Case Study on Data Label Correctness for Security Bug Report Prediction

Journal

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
Volume 48, Issue 7, Pages 2541-2556

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TSE.2021.3063727

Keywords

Computer bugs; Noise measurement; Predictive models; Security; Chromium; Tuning; Data models; Security bug report prediction; data quality; label correctness

Funding

  1. Innovation Foundation for Doctor Dissertation of Northwestern Polytechnical University [CX202067]
  2. Key Laboratory of Advanced Perception and Intelligent Control of High-end Equipment, Ministry of Education [GDSC202006]

Abstract

This study reveals mislabeled instances in datasets for security bug report prediction, which have degraded the performance of previous models. After the datasets are cleaned, the performance of classification models improves significantly.
In mining software repositories research, a large amount of data must be labeled to construct a predictive model, and the correctness of these labels substantially affects a model's performance. However, few studies have investigated the impact of mislabeled instances on predictive models. To bridge this gap, we perform a case study on security bug report (SBR) prediction. We find that five publicly available datasets for SBR prediction contain many mislabeled instances, which leads to the poor performance of the SBR prediction models in recent studies (e.g., the work of Peters et al. and Shu et al.) and might mislead the research direction of SBR prediction. In this article, we first improve the label correctness of these five datasets by manually analyzing each bug report, identifying 749 SBRs that were originally mislabeled as non-SBRs (NSBRs). We then evaluate the impact of label correctness by comparing the performance of classification models on both the noisy (i.e., before our correction) and the clean (i.e., after our correction) datasets. The results show that the cleaned datasets improve the performance of the classification models: the approaches proposed by Peters et al. and Shu et al. perform much better on the clean datasets than on the noisy ones. Furthermore, with the clean datasets, simple text classification models significantly outperform the security keyword-matrix-based approaches used by Peters et al. and Shu et al.
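Note that the abstract does not specify which "simple text classification models" were compared against the keyword-matrix-based approaches. As a minimal sketch of the noisy-versus-clean evaluation design only, the following Python snippet (assuming scikit-learn, a TF-IDF representation, and a logistic regression classifier; the bug-report texts and both label arrays are hypothetical placeholders, not the paper's actual data or pipeline) trains the same simple classifier against noisy and corrected labels and compares F1 scores:

    # A minimal sketch of the noisy-vs-clean evaluation design; not the
    # paper's actual pipeline. Requires scikit-learn. All texts and labels
    # below are invented placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score

    # Toy bug-report summaries standing in for real reports.
    reports = [
        "buffer overflow in PDF renderer allows remote code execution",
        "use-after-free in WebRTC leads to heap corruption",
        "XSS vulnerability in settings page via crafted URL",
        "SQL injection possible through the search endpoint",
        "button alignment broken on the preferences dialog",
        "typo in tooltip text for the download manager",
        "crash when opening an empty bookmarks folder",
        "dark theme colors not applied to the context menu",
    ]
    labels_clean = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = SBR, 0 = NSBR (corrected)
    labels_noisy = [1, 0, 1, 0, 0, 0, 0, 0]  # two SBRs mislabeled as NSBRs

    def evaluate(labels):
        """Train and test a TF-IDF + logistic regression classifier."""
        x_train, x_test, y_train, y_test = train_test_split(
            reports, labels, test_size=0.5, random_state=0, stratify=labels)
        vectorizer = TfidfVectorizer(ngram_range=(1, 2))
        classifier = LogisticRegression(max_iter=1000, class_weight="balanced")
        classifier.fit(vectorizer.fit_transform(x_train), y_train)
        predictions = classifier.predict(vectorizer.transform(x_test))
        return f1_score(y_test, predictions, zero_division=0)

    print(f"F1 with noisy labels:     {evaluate(labels_noisy):.2f}")
    print(f"F1 with corrected labels: {evaluate(labels_clean):.2f}")

On real datasets, the same loop over the noisy and corrected label columns, combined with a stronger evaluation protocol (e.g., repeated stratified cross-validation), reproduces the kind of before/after comparison the abstract describes; the toy numbers above only illustrate the mechanics.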
