4.7 Article

The choice of sequence homologs included in multiple sequence alignments has a dramatic impact on evolutionary conservation analysis

Journal

BIOINFORMATICS
Volume 35, Issue 1, Pages 12-19

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/bty523

Funding

  1. National Institutes of Health (NIH) [R01 GM118709]
  2. Extreme Science and Engineering Discovery Environment (XSEDE) project (NSF) [ACI-1053575]
  3. National Research Service Award (NRSA) individual fellowship [F31GM116570]
  4. Medical Scientist Training Program (MSTP) grant [T32GM007288]

Motivation: Sequence conservation patterns have been used for over half a century to identify functionally important (catalytic and ligand-binding) protein residues. Despite decades of development, state-of-the-art non-template-based functional residue prediction methods must, on average, predict approximately 25% of a protein's total residues to correctly identify half of its functional site residues. The resulting overwhelming proportion of false positives yields reported F-scores of approximately 0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs).

Results: The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-scores of approximately 0.3. This matches the performance reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-scores of approximately 0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance given the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis.
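The workflow summarized in the abstract (score column conservation, predict the most conserved residues, evaluate by F-score, and repeat over sampled sets of homologs) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the entropy-based conservation score, the 25% prediction cutoff, and all function names (`column_conservation`, `predict_sites`, `f_score`, `best_sampled_msa`) are assumptions made for illustration, and the sampling routine is an oracle that requires the true binding site to be known.

```python
import math
import random
from collections import Counter


def column_conservation(column):
    """Return 1 - normalized Shannon entropy of one alignment column (gaps ignored)."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Assumption: normalize by the maximum entropy over 20 amino acids.
    return 1.0 - entropy / math.log2(20)


def predict_sites(msa, fraction=0.25):
    """Predict the most conserved `fraction` of columns as functional positions."""
    length = len(msa[0])
    scores = [column_conservation([seq[i] for seq in msa]) for i in range(length)]
    k = max(1, round(fraction * length))
    ranked = sorted(range(length), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def f_score(predicted, actual):
    """Harmonic mean of precision and recall over residue positions."""
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)


def best_sampled_msa(query, homologs, actual, subset_size, n_samples, seed=0):
    """Oracle search: sample homolog subsets and keep the one with the best F-score."""
    rng = random.Random(seed)
    best_score, best_subset = -1.0, None
    for _ in range(n_samples):
        subset = rng.sample(homologs, subset_size)
        score = f_score(predict_sites([query] + subset), actual)
        if score > best_score:
            best_score, best_subset = score, subset
    return best_score, best_subset


# Hypothetical numbers: a 100-residue protein with an assumed 12 site residues.
# Predicting 25 residues that recover 6 of the 12 gives an F-score of about 0.32.
example = f_score(set(range(25)), set(range(19, 31)))
```

With these assumed numbers, precision is 6/25 = 0.24 and recall is 6/12 = 0.5, which shows how recovering half of the site residues with a 25% prediction budget leads to an F-score near 0.3. In practice the oracle search in `best_sampled_msa` only establishes an upper bound on what MSA selection could achieve, since a real predictor does not have access to the true site.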
