4.7 Article

Effectiveness of resampling methods in coping with imbalanced crash data: Crash type analysis and predictive modeling

Journal

ACCIDENT ANALYSIS AND PREVENTION
Volume 159, Issue -, Pages -

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.aap.2021.106240

Keywords

Machine learning; Gradient boosting; Tree ensemble; Nested logit; Traffic crash; Resampling; Over-sampling; Data imbalance

This study compares the effects of resampling techniques on the classification and prediction of freeway crash types using machine learning and statistical models. All three resampling methods consistently improved model performance. Among the three over-sampling methods, adaptive synthetic sampling performed best, greatly improving the prediction of minority crash types without degrading the prediction of the majority crash type. This is likely because its density-based approach creates synthetic instances that better match the underlying manifold structure of the high-dimensional feature space.
Crash data analysis commonly has to contend with imbalanced data. Depending on facility and control type, some crash types are more frequent than others. However, uncommon crash types are often more severe and associated with higher economic and societal costs, and are thus crucial to prevent. It is paramount to develop inferential models that can reliably predict crash types and identify contributing factors, especially for the severe types. Current practice in modeling infrequent events generally disregards the disparity in data representation, which can lead to biased models. Mitigating and managing imbalanced data is therefore essential to developing meaningful and robust models that help reveal effective countermeasures. This study compares the effects of resampling techniques on the performance of both machine learning and classical statistical models for classifying and predicting different crash types on freeways. Specifically, a mixed sampling approach combining cluster-based under-sampling with three popular over-sampling methods (random over-sampling, synthetic minority over-sampling, and adaptive synthetic sampling) was investigated with respect to four crash classification models: three ensemble machine learning models (CatBoost, XGBoost, and Random Forests) and one classical statistical model (Nested Logit). The study concluded that all three resampling methods consistently enhanced the performance of all models. Among the three over-sampling methods, adaptive synthetic sampling performed best and substantially improved the prediction of minority crash types without impeding the prediction of the majority crash type. This is likely because the density-based approach of adaptive synthetic sampling creates synthetic instances that are more congruent with the underlying manifold structure of the high-dimensional feature space.
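The density-based idea behind adaptive synthetic sampling can be illustrated with a small sketch. This is not the authors' implementation: the function names (`adasyn_allocation`, `synthesize`), the toy two-dimensional points, and the choice of `k=3` are all assumptions made for illustration, and the code uses only the Python standard library. Each minority point is weighted by the share of majority points among its nearest neighbours, so harder, borderline points receive more synthetic samples. Synthetic minority over-sampling, by contrast, spreads new samples without regard to that local difficulty.

```python
import math
import random

def adasyn_allocation(minority, majority, n_new, k=3):
    """Split n_new synthetic samples across minority points by local difficulty."""
    ratios = []
    for i, x in enumerate(minority):
        # Label every other point: 1 = minority, 0 = majority.
        others = [(p, 1) for j, p in enumerate(minority) if j != i]
        others += [(p, 0) for p in majority]
        nearest = sorted(others, key=lambda t: math.dist(x, t[0]))[:k]
        # Share of majority points among the k nearest neighbours.
        ratios.append(sum(1 for _, label in nearest if label == 0) / k)
    total = sum(ratios)
    if total == 0:  # no minority point borders the majority: split evenly
        return [n_new // len(minority)] * len(minority)
    return [round(n_new * r / total) for r in ratios]

def synthesize(minority, counts, k=3, seed=0):
    """Create synthetic points by interpolating towards minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, (x, count) in enumerate(zip(minority, counts)):
        peers = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(peers, key=lambda p: math.dist(x, p))[:k]
        for _ in range(count):
            z = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to z
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, z)))
    return synthetic

# Hypothetical toy data: a majority cluster near the origin, two minority
# points bordering it, and four minority points in a separate cluster.
majority = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.0, 1.0), (1.0, 0.5)]
minority = [(1.2, 1.2), (1.6, 1.6), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2), (5.2, 5.2)]

counts = adasyn_allocation(minority, majority, n_new=12)
synthetic = synthesize(minority, counts)
print(counts)
print(len(synthetic))
```

In this toy setting, the two borderline minority points receive all twelve synthetic samples, while the four points in the well-separated cluster receive none. An even allocation would have given each point two.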
