Article

Effectiveness of resampling methods in coping with imbalanced crash data: Crash type analysis and predictive modeling

ACCIDENT ANALYSIS AND PREVENTION (2021)

Journal

ACCIDENT ANALYSIS AND PREVENTION

Volume 159

Publisher

PERGAMON-ELSEVIER SCIENCE LTD

DOI: 10.1016/j.aap.2021.106240

Keywords

Machine learning; Gradient boosting; Tree ensemble; Nested logit; Traffic crash; Resampling; Over-sampling; Data imbalance

Categories

Ergonomics; Public, Environmental & Occupational Health; Social Sciences, Interdisciplinary; Transportation

Summary

This study compares the effects of resampling techniques on the classification and prediction of different crash types on freeways using machine learning and statistical models. It found that all three resampling methods consistently improved model performance. Among the three over-sampling methods, adaptive synthetic sampling performed best, greatly enhancing the prediction of minority crash types without affecting the prediction of the majority crash type. This is likely because its density-based approach creates synthetic instances that better match the underlying manifold structure in the high-dimensional feature space.

Abstract

Crash data are commonly imbalanced. Depending on facility and control type, some crash types are more frequent than others. However, uncommon crash types are often more severe and associated with higher economic and societal costs, and are thus crucial to prevent. It is therefore important to develop inferential models that can reliably predict crash types and identify contributing factors, especially for the severe types. Current practice in modeling infrequent events generally disregards the disparity in data representation, which can lead to biased models. Mitigating and managing imbalanced data is therefore essential to developing meaningful and robust models that help reveal effective countermeasures. This study compares the effects of resampling techniques on the performance of both machine learning and classical statistical models for classifying and predicting different crash types on freeways. Specifically, a mixed sampling approach combining cluster-based under-sampling with three popular over-sampling methods (random over-sampling, synthetic minority over-sampling, and adaptive synthetic sampling) was investigated with respect to four crash classification models: three ensemble machine learning models (CatBoost, XGBoost, and Random Forests) and one classical statistical model (Nested Logit). The study concluded that all three resampling methods consistently enhanced the performance of all models. Among the three over-sampling methods, adaptive synthetic sampling performed best, substantially improving the prediction of minority crash types without impeding the prediction of the majority crash type. This is likely because its density-based approach creates synthetic instances that are more congruent with the underlying manifold structure of the high-dimensional feature space.
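To make the density-based idea behind adaptive synthetic sampling more concrete, the following is a minimal pure-Python sketch on a two-class toy dataset. It is not the authors' implementation: the function names, the crash-type labels, the coordinates, and the choices `k=2` and `beta=1.0` are all hypothetical, and the study itself deals with several crash types rather than two. The sketch only illustrates the general principle that minority points surrounded by other classes receive more synthetic samples, which are then created by interpolating toward nearby minority points.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k (features, label) pairs closest to `point` (Euclidean)."""
    return sorted(candidates, key=lambda c: math.dist(point, c[0]))[:k]


def adasyn_allocation(X, y, minority, k=5, beta=1.0):
    """Map each minority index to the number of synthetic samples to create.

    Minority points whose k nearest neighbours are mostly from the other
    class get a larger share of the total (binary case, for illustration).
    """
    data = list(zip(X, y))
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_min = len(minority_idx)
    n_other = len(y) - n_min
    total = round((n_other - n_min) * beta)  # samples needed to balance
    ratios = []
    for i in minority_idx:
        others = data[:i] + data[i + 1:]  # exclude the point itself
        neighbours = k_nearest(X[i], others, k)
        ratios.append(sum(1 for _, lab in neighbours if lab != minority) / k)
    norm = sum(ratios)
    if norm == 0:  # no minority point borders the other class: spread evenly
        weights = [1 / n_min] * n_min
    else:
        weights = [r / norm for r in ratios]
    return {i: round(w * total) for i, w in zip(minority_idx, weights)}


def adasyn_oversample(X, y, minority, k=5, beta=1.0, seed=0):
    """Create synthetic minority points by interpolating toward neighbours."""
    rng = random.Random(seed)
    counts = adasyn_allocation(X, y, minority, k, beta)
    minority_pts = [(x, lab) for x, lab in zip(X, y) if lab == minority]
    synthetic = []
    for i, g in counts.items():
        pool = [m for m in minority_pts if m[0] is not X[i]]
        neighbours = [p for p, _ in k_nearest(X[i], pool, k)]
        for _ in range(g):
            nb = rng.choice(neighbours)
            lam = rng.random()  # position along the segment X[i] -> nb
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(X[i], nb)))
    return synthetic


# Hypothetical toy data: 7 frequent crashes, 3 rare ones (labels are made up).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (0.5, 0.1),
     (0.4, 0.4), (5, 5), (5.2, 5.1)]
y = ["rear_end"] * 7 + ["sideswipe"] * 3

counts = adasyn_allocation(X, y, "sideswipe", k=2)
synthetic = adasyn_oversample(X, y, "sideswipe", k=2)
print(counts)          # {7: 2, 8: 1, 9: 1}
print(len(synthetic))  # 4
```

In this toy setting the rare point at index 7, which sits inside the cluster of frequent crashes, is allocated twice as many synthetic samples as the two isolated rare points. Random over-sampling would instead duplicate existing rare points, and synthetic minority over-sampling would interpolate in the same way but give every rare point an equal share.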
