Sequence-based prediction of protein-binding sites in DNA: Comparative study of two SVM models

COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE (2014)

Journal

COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE

Volume 117, Issue 2, Pages 158-167

Publisher

ELSEVIER IRELAND LTD

DOI: 10.1016/j.cmpb.2014.07.009

Keywords

DNA-protein interactions; Binding sites; Protein-binding nucleotides; Prediction model

Categories

Computer Science, Interdisciplinary Applications; Computer Science, Theory & Methods; Engineering, Biomedical; Medical Informatics

Funding

National Research Foundation of Korea (NRF) - Ministry of Science, ICT and Future Planning [NRF-2012R1A1A3011982]
Ministry of Education and Inha University [2010-0020163]

Abstract

As many structures of protein-DNA complexes have become known in recent years, several computational methods have been developed to predict DNA-binding sites in proteins. However, the inverse problem (i.e., predicting protein-binding sites in DNA) has received much less attention. One reason is that the differences between the interaction propensities of nucleotides are much smaller than those between amino acids. Another reason is that DNA exhibits less diverse sequence patterns than proteins do. Therefore, predicting protein-binding DNA nucleotides is much harder than predicting DNA-binding amino acids. We computed the interaction propensity (IP) of nucleotide triplets with amino acids using an extensive dataset of protein-DNA complexes, and developed two support vector machine (SVM) models that predict protein-binding nucleotides from sequence data alone. The first SVM model predicts protein-binding nucleotides using DNA sequence data alone, and the second SVM model predicts them using both DNA and protein sequences. In a 10-fold cross-validation with 1519 DNA sequences, the first model predicted protein-binding nucleotides with an accuracy of 67.0%, an F-measure of 67.1%, and a Matthews correlation coefficient (MCC) of 0.340. On an independent dataset of 181 DNAs that were not used in training, it achieved an accuracy of 66.2%, an F-measure of 66.3%, and an MCC of 0.324. The second model achieved an accuracy of 69.6%, an F-measure of 69.6%, and an MCC of 0.383 in a 10-fold cross-validation with 1519 DNA sequences and 859 protein sequences. On an independent dataset of 181 DNAs and 143 proteins, it showed an accuracy of 67.3%, an F-measure of 66.5%, and an MCC of 0.329. In both cross-validation and independent testing, the second model, which used both DNA and protein sequence data, outperformed the first model, which used DNA sequence data only. To the best of our knowledge, this is the first attempt to predict protein-binding nucleotides in a given DNA sequence from the sequence data alone. (C) 2014 Elsevier Ireland Ltd. All rights reserved.
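
The abstract reports accuracy, F-measure, and MCC without stating their formulas. As a minimal sketch, assuming the standard binary-classification definitions (the paper itself may define or average them differently), the three measures can be computed from confusion-matrix counts as follows. The function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, F-measure, MCC) from confusion-matrix counts.

    Assumes the standard definitions; "positive" would correspond to a
    protein-binding nucleotide. Undefined ratios fall back to 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure here is the harmonic mean of precision and recall (F1).
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0

    # Matthews correlation coefficient, ranging from -1 to +1.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return accuracy, f_measure, mcc


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not data from the paper).
    acc, f1, mcc = binary_metrics(tp=70, fp=30, tn=65, fn=35)
    print(f"accuracy={acc:.3f}  F-measure={f1:.3f}  MCC={mcc:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so a value near 0 indicates performance close to random guessing even when the classes are unbalanced.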