Prediction of chemical carcinogenicity by machine learning approaches

SAR AND QSAR IN ENVIRONMENTAL RESEARCH (2009)

Journal

SAR AND QSAR IN ENVIRONMENTAL RESEARCH

Volume 20, Issue 1-2, Pages 27-75

Publisher

TAYLOR & FRANCIS LTD

DOI: 10.1080/10629360902724085

Keywords

support vector machine; carcinogenicity; feature selection; Monte Carlo simulated annealing

Categories

Chemistry, Multidisciplinary; Computer Science, Interdisciplinary Applications; Environmental Sciences; Mathematical & Computational Biology; Toxicology

Funding

National Natural Science Foundation of China [20572073]

Abstract

In this paper we report a successful application of machine learning approaches to the prediction of chemical carcinogenicity. Two approaches, a support vector machine (SVM) and an artificial neural network (ANN), were evaluated for predicting chemical carcinogenicity from molecular structure descriptors. A diverse set of 844 compounds, comprising 600 carcinogenic (CG+) and 244 noncarcinogenic (CG-) molecules, was used to estimate the accuracies of these approaches. The database was divided into two sets: the model construction set and the independent test set. Relevant molecular descriptors were selected by a hybrid feature selection method combining Fisher's score and Monte Carlo simulated annealing from a wide set of molecular descriptors, including physicochemical properties and constitutional, topological, and geometrical descriptors. The first model validation method was five-fold cross-validation, in which the model construction set is split into five subsets. The five-fold cross-validation was used to select descriptors and optimise the model parameters by maximising the averaged overall accuracy. The final SVM model gave an averaged prediction accuracy of 90.7% for CG+ compounds, 81.6% for CG- compounds and 88.1% overall, while the corresponding ANN model gave an averaged prediction accuracy of 86.1% for CG+ compounds, 79.3% for CG- compounds and 84.2% overall. These results indicate that the hybrid feature selection method is very efficient and that the selected descriptors are truly relevant to the carcinogenicity of compounds.

The second model validation method was a hold-out method: using the selected descriptors and the optimised model parameters, the classification model was built on the whole model construction set, and the independent test set was used to test its predictive ability. The SVM model gave a prediction accuracy of 87.6% for CG+ compounds, 79.1% for CG- compounds and 85.0% overall. The ANN model gave a prediction accuracy of 85.6% for CG+ compounds, 79.1% for CG- compounds and 83.6% overall. The results indicate that the built models are potentially useful for facilitating the prediction of chemical carcinogenicity of untested compounds.
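
The abstract names Fisher's score as the filter half of the hybrid feature selection but does not give the formula. As a rough illustration only, the sketch below uses one common two-class definition (squared difference of the class means divided by the sum of the class variances) to rank descriptors before any wrapper search such as simulated annealing. The function names, the descriptor names `logP` and `n_rings`, and the toy values are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Two-class Fisher score of one descriptor (assumed definition).

    score = (mean_pos - mean_neg)^2 / (var_pos + var_neg)
    labels: 1 for CG+ compounds, 0 for CG- compounds.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    spread = pvariance(pos) + pvariance(neg)
    gap = (mean(pos) - mean(neg)) ** 2
    if spread == 0:
        # Constant within each class: perfectly separating or useless.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_descriptors(table, labels):
    """Return descriptor names sorted from highest to lowest score."""
    scores = {name: fisher_score(col, labels) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: six compounds, two made-up descriptors.
labels = [1, 1, 1, 0, 0, 0]
table = {
    "logP": [3.0, 3.2, 2.8, 1.0, 1.2, 0.8],  # separates the classes well
    "n_rings": [2, 1, 3, 2, 1, 3],           # same distribution in both
}
ranking = rank_descriptors(table, labels)
```

In a hybrid scheme of the kind the abstract describes, a ranking like this would typically be used to discard low-scoring descriptors, after which a stochastic search over subsets is scored by cross-validated accuracy; that second stage is not shown here.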
