Article, Proceedings Paper

SIBIS: a Bayesian model for inconsistent protein sequence estimation

Journal

BIOINFORMATICS
Volume 30, Issue 17, Pages 2432-2439

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/btu329

Funding

  1. Agence Nationale de la Recherche (BIPBIP) [ANR-10-BINF-03-02]
  2. Region Alsace and Institute funds from the CNRS (Centre National de la Recherche Scientifique)
  3. Universite de Strasbourg
  4. Faculte de Medecine de Strasbourg

Motivation: The prediction of protein-coding genes is a major challenge that depends on the quality of genome sequencing, the accuracy of the model used to elucidate the exonic structure of the genes, and the complexity of the gene splicing process leading to different protein variants. As a consequence, today's protein databases contain a large number of inconsistencies, due to both natural variants and sequence prediction errors.

Results: We have developed a new method, called SIBIS, to detect such inconsistencies based on the evolutionary information in multiple sequence alignments. A Bayesian framework, combined with Dirichlet mixture models, is used to estimate the probability of observing specific amino acids and to detect inconsistent or erroneous sequence segments. We evaluated the performance of SIBIS on a reference set of protein sequences with experimentally validated errors and showed that its sensitivity is significantly higher than that of previous methods, with only a small loss of specificity. We also assessed a large set of human sequences from the UniProt database and found evidence of inconsistency in 48% of the previously uncharacterized sequences. We conclude that the integration of quality control methods like SIBIS in automatic analysis pipelines will be critical for the robust inference of structural, functional and phylogenetic information from these sequences.
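
The abstract describes estimating the probability of observing each amino acid in an alignment column with a Bayesian framework and Dirichlet mixture models. The sketch below illustrates that general idea only and is not the SIBIS implementation: it computes the standard Dirichlet-mixture posterior mean for one column of residue counts. The two mixture components, their weights and their parameters in `TOY_MIXTURE`, as well as the function names, are hypothetical values chosen for illustration, and the step that turns these probabilities into flagged segments is not shown.

```python
from math import exp, lgamma, log

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical two-component mixture (weight, alpha vector); NOT the SIBIS parameters.
UNIFORM = [1.0] * len(ALPHABET)
HYDROPHOBIC = [2.0 if aa in "ILMV" else 0.1 for aa in ALPHABET]
TOY_MIXTURE = [(0.5, UNIFORM), (0.5, HYDROPHOBIC)]


def column_counts(column):
    """Count each amino acid in one alignment column (gaps and unknowns are skipped)."""
    return [column.count(aa) for aa in ALPHABET]


def log_marginal(counts, alphas):
    """Log-likelihood of the counts under one Dirichlet component.

    The multinomial coefficient is omitted because it is identical for
    every component and cancels when the responsibilities are normalized.
    """
    n, a = sum(counts), sum(alphas)
    total = lgamma(a) - lgamma(n + a)
    for c, al in zip(counts, alphas):
        total += lgamma(c + al) - lgamma(al)
    return total


def posterior_probs(counts, mixture):
    """Posterior mean amino-acid probabilities under a Dirichlet mixture prior."""
    logs = [log(w) + log_marginal(counts, alphas) for w, alphas in mixture]
    top = max(logs)  # subtract the maximum for numerical stability
    resp = [exp(value - top) for value in logs]
    norm = sum(resp)
    resp = [r / norm for r in resp]  # posterior weight of each component

    n = sum(counts)
    probs = [0.0] * len(counts)
    for r, (_, alphas) in zip(resp, mixture):
        a = sum(alphas)
        for i, (c, al) in enumerate(zip(counts, alphas)):
            probs[i] += r * (c + al) / (n + a)
    return probs


if __name__ == "__main__":
    # A mostly hydrophobic column with one aspartate (D) that looks out of place.
    probs = posterior_probs(column_counts("LLLLLIVD"), TOY_MIXTURE)
    for aa in "LIVD":
        print(aa, round(probs[ALPHABET.index(aa)], 3))
```

Under this toy prior, a residue that receives a low estimated probability relative to the rest of its column, such as the lone D above, is the kind of position a method in this family might treat as evidence of inconsistency. How SIBIS combines such evidence along a sequence is described in the paper itself.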
