Article

What's in a Likelihood? Simple Models of Protein Evolution and the Contribution of Structurally Viable Reconstructions to the Likelihood

Journal

SYSTEMATIC BIOLOGY
Volume 60, Issue 2, Pages 161-174

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/sysbio/syq088

Keywords

Ancestral state reconstruction; empirical amino acid models; maximum likelihood; phylogenetics; protein structure

Funding

  1. Marie Curie Fellowship
  2. National Science Foundation [DEB 1036500]
  3. Division of Environmental Biology
  4. Directorate for Biological Sciences [1132229] (funding source: National Science Foundation)

Abstract

Most phylogenetic models of protein evolution assume that sites are independent and identically distributed. Interactions between sites are ignored, and the likelihood can be conveniently calculated as the product of the individual site likelihoods. The calculation considers all possible transition paths (also called substitution histories or mappings) that are consistent with the observed states at the terminals, and the probability density of any particular reconstruction depends on the substitution model. The likelihood is the integral of the probability density of each substitution history taken over all possible histories that are consistent with the observed data. We investigated the extent to which transition paths that are incompatible with a protein's three-dimensional structure contribute to the likelihood. Several empirical amino acid models were tested for sequence pairs of different degrees of divergence. When simulating substitutional histories starting from a real sequence, the structural integrity of the simulated sequences quickly disintegrated. This result indicates that simple models are clearly unable to capture the constraints on sequence evolution. However, when we sampled transition paths between real sequences from the posterior probability distribution according to these same models, we found that the sampled histories were largely consistent with the tertiary structure. This suggests that simple empirical substitution models may be adequate for interpolating changes between observed sequences during phylogenetic inference despite the fact that the models cannot predict the effects of structural constraints from first principles. This study is significant because it provides a quantitative assessment of the biological realism of substitution models from the perspective of protein structure, and it provides insight on the prospects for improving models of protein sequence evolution.
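
The site-independence assumption described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: it assumes an equal-rate (Poisson) 20-state model as a stand-in for the empirical amino acid models the paper tests, and every function name, sequence, and branch length here is hypothetical. The sketch shows two things: the likelihood of a sequence pair is a product of per-site terms, and each per-site transition probability already sums over all unobserved intermediate states (checked here by summing over the state at the branch midpoint).

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = len(AMINO_ACIDS)  # 20 states


def transition_prob(a, b, t):
    """P(a -> b over branch length t) under an ASSUMED equal-rate (Poisson) model.

    t is measured in expected substitutions per site and must be > 0.
    """
    decay = math.exp(-K * t / (K - 1))
    if a == b:
        return 1.0 / K + (1.0 - 1.0 / K) * decay
    return (1.0 - decay) / K


def pairwise_log_likelihood(seq_x, seq_y, t):
    """Log-likelihood of two aligned sequences separated by branch length t.

    Sites are treated as independent, so the likelihood is a product over
    sites (a sum in log space). The stationary frequency of each state is 1/K.
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    log_lik = 0.0
    for a, b in zip(seq_x, seq_y):
        log_lik += math.log(1.0 / K) + math.log(transition_prob(a, b, t))
    return log_lik


def prob_via_midpoint(a, b, t):
    """Same probability as transition_prob, obtained by summing over every
    possible state at the branch midpoint (Chapman-Kolmogorov)."""
    return sum(
        transition_prob(a, m, t / 2) * transition_prob(m, b, t / 2)
        for m in AMINO_ACIDS
    )


if __name__ == "__main__":
    # Hypothetical toy sequences and branch length.
    print(pairwise_log_likelihood("ACDEFG", "ACDEYG", 0.2))
    print(transition_prob("A", "G", 0.2), prob_via_midpoint("A", "G", 0.2))
```

Because the sum in `prob_via_midpoint` runs over all intermediate states, including ones that would be structurally implausible in a real protein, it gives a small-scale picture of why the abstract asks how much such paths contribute to the likelihood.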

Recommended

Article Evolutionary Biology

Ambiguity Coding Allows Accurate Inference of Evolutionary Parameters from Alignments in an Aggregated State-Space

Claudia C. Weber, Umberto Perron, Dearbhaile Casey, Ziheng Yang, Nick Goldman

Summary: This study discusses how to accurately estimate parameters related to protein evolution by handling missing data, and demonstrates that combining ambiguous-coded and fully resolved data inputs can improve accuracy. By establishing connections between observed information in different state spaces, evolutionary information can be successfully recovered from sequences that were previously inaccessible.

SYSTEMATIC BIOLOGY (2021)

Article Biochemical Research Methods

Sampling bias and model choice in continuous phylogeography: Getting lost on a random walk

Antanas Kalkauskas, Umberto Perron, Yuxuan Sun, Nick Goldman, Guy Baele, Stephane Guindon, Nicola De Maio

Summary: The authors explore the effects of different model assumptions on phylogeographic inference and find that sample collection biases can strongly affect the quality of reconstruction. They suggest various strategies to counter these effects but note that these come with additional computational burden. They also investigate the differences among phylogeographic models and their suitability in different scenarios.

PLOS COMPUTATIONAL BIOLOGY (2021)

Editorial Material Multidisciplinary Sciences

Want to track pandemic variants faster? Fix the bioinformatics bottleneck

Emma B. Hodcroft, Nicola De Maio, Rob Lanfear, Duncan R. MacCannell, Bui Quang Minh, Heiko A. Schmidt, Alexandros Stamatakis, Nick Goldman, Christophe Dessimoz

Summary: Researchers are in need of new approaches to control the pandemic as existing tools, rules, and incentives are struggling to cope with the flood of coronavirus genome sequences.

NATURE (2021)

Article Genetics & Heredity

Short-range template switching in great ape genomes explored using pair hidden Markov models

Conor R. Walker, Aylwyn Scally, Nicola De Maio, Nick Goldman

Summary: Many complex genomic rearrangements arise through template switch errors during DNA replication. By using an improved statistical approach, it has been shown that template switch events have been widespread in the evolution of great apes' genomes and provide a parsimonious explanation for the presence of many complex mutation clusters in their phylogenetic context. Larger-scale mechanisms of genome rearrangement involve structural features around breakpoints, with atypical patterns of secondary structure formation and DNA bending present at the initial template switch loci.

PLOS GENETICS (2021)

Article Evolutionary Biology

Mutation Rates and Selection on Synonymous Mutations in SARS-CoV-2

Nicola De Maio, Conor R. Walker, Yatish Turakhia, Robert Lanfear, Russell Corbett-Detig, Nick Goldman

Summary: The COVID-19 pandemic has prompted an unprecedented response from the sequencing community, leading to a study of mutation rates and selective pressures using sequence data from over 140,000 SARS-CoV-2 genomes. Two specific mutation rates, G -> U and C -> U, were found to be significantly higher than others, possibly attributable to APOBEC and ROS activity. Genomic context does have an effect on mutation rates, but its impact is limited.

GENOME BIOLOGY AND EVOLUTION (2021)

Article Evolutionary Biology

OpenTree: A Python Package for Accessing and Analyzing Data from the Open Tree of Life

Emily Jane McTavish, Luna Luisa Sanchez-Reyes, Mark T. Holder

Summary: The Open Tree of Life project aims to create a comprehensive and digitally available tree of life by synthesizing published phylogenetic trees and taxonomic data, with APIs provided for easy access. The Python package opentree offers a user-friendly wrapper for these APIs and provides scripts and tutorials for data analysis. This tool has been used to estimate phylogenetic relationships for bird families and taxa observed at a specific natural reserve.

SYSTEMATIC BIOLOGY (2021)

Article Biochemical Research Methods

Incorporating the speciation process into species delimitation

Jeet Sukumaran, Mark T. Holder, L. Lacey Knowles

Summary: The traditional multispecies coalescent (MSC) model fails to distinguish genetic structures between species and within species, leading to the emergence of artifactual species under high-resolution data. The new species delimitation approach explicitly models speciation as an extended process, allowing for more accurate discrimination between genetic structures corresponding to species lineages and population lineages within species, providing insights into the relationship between population and species-level processes.

PLOS COMPUTATIONAL BIOLOGY (2021)

Article Biochemical Research Methods

A phylogenetic approach for weighting genetic sequences

Nicola De Maio, Alexander Alekseyenko, William J. Coleman-Smith, Fabio Pardi, Marc A. Suchard, Asif U. Tamuri, Jakub Truszkowski, Nick Goldman

Summary: This study introduced a novel method called "phylogenetic novelty scores" to address sequence weighting in bioinformatics, formalizing the evolutionary novelty of a sequence within an alignment. The method showed promising results in computational efficiency and accuracy improvement in sequence alignment.

BMC BIOINFORMATICS (2021)

Article Biochemistry & Molecular Biology

A Daily-Updated Database and Tools for Comprehensive SARS-CoV-2 Mutation-Annotated Trees

Jakob McBroome, Bryan Thornlow, Angie S. Hinrichs, Alexander Kramer, Nicola De Maio, Nick Goldman, David Haussler, Russell Corbett-Detig, Yatish Turakhia

Summary: A database of SARS-CoV-2 phylogenetic trees inferred with public sequences is presented, updated daily to include new sequences and encoded in MAT format. The researchers also introduce matUtils software for querying and manipulating the MATs efficiently.

MOLECULAR BIOLOGY AND EVOLUTION (2021)

Article Multidisciplinary Sciences

Genomic reconstruction of the SARS-CoV-2 epidemic in England

Harald S. Vohringer, Theo Sanderson, Matthew Sinnott, Nicola De Maio, Thuy Nguyen, Richard Goater, Frank Schwach, Ian Harrison, Joel Hellewell, Cristina Ariani, Sonia Goncalves, David K. Jackson, Ian Johnston, Alexander W. Jung, Callum Saint, John Sillitoe, Maria Suciu, Nick Goldman, Jasmine Panovska-Griffiths, Ewan Birney, Erik Volz, Sebastian Funk, Dominic Kwiatkowski, Meera Chand, Inigo Martincorena, Jeffrey C. Barrett, Moritz Gerstung

Summary: The study analyzed the dynamics of different lineages in English local authorities using real-time genomic data. The findings showed significant fluctuations in transmissibility and in the proportions of different variants over time, with the Delta variant increasing rapidly in early summer 2021.

NATURE (2021)

Article Biochemical Research Methods

phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets

Nicola R. De Maio, William Boulton, Lukas Weilguny, Conor Walker, Yatish Turakhia, Russell O. Corbett-Detig, Nick Goldman, Ville Mustonen, Joel Wertheim

Summary: This article introduces a new algorithm and software for efficiently simulating a large number of closely related genomes. The algorithm is based on the Gillespie approach and utilizes an efficient multi-layered search tree structure to achieve high computational efficiency, allowing integration with various evolutionary models.

PLOS COMPUTATIONAL BIOLOGY (2022)

Correction Multidisciplinary Sciences

Genomic reconstruction of the SARS-CoV-2 epidemic in England (vol 600, pg 506, 2021)

Harald S. Vohringer, Theo Sanderson, Matthew Sinnott, Nicola De Maio, Thuy Nguyen, Richard Goater, Frank Schwach, Ian Harrison, Joel Hellewell, Cristina V. Ariani, Sonia Goncalves, David K. Jackson, Ian Johnston, Alexander W. Jung, Callum Saint, John Sillitoe, Maria Suciu, Nick Goldman

NATURE (2022)

Article Biotechnology & Applied Microbiology

Dynamic, adaptive sampling during nanopore sequencing using Bayesian experimental design

Lukas Weilguny, Nicola De Maio, Rory Munro, Charlotte Manser, Ewan Birney, Matthew Loose, Nick Goldman

Summary: BOSS-RUNS is an algorithmic framework and software that dynamically updates decision strategies based on real-time updates of uncertainty at each genome position. It optimizes information gain by deciding whether to fully sequence each DNA fragment, leading to improved variant calling in microbial communities.

NATURE BIOTECHNOLOGY (2023)

Article Genetics & Heredity

Maximum likelihood pandemic-scale phylogenetics

Nicola De Maio, Prabhav Kalaghatgi, Yatish Turakhia, Russell Corbett-Detig, Bui Quang Minh, Nick Goldman

Summary: Phylogenetics plays a crucial role in genomic epidemiology, and the COVID-19 pandemic has generated an unprecedented amount of genome sequence data for analysis. However, most phylogenetic approaches are unable to handle the scale of these datasets. This study presents a new method called 'MAximum Parsimonious Likelihood Estimation' (MAPLE) for likelihood-based phylogenetic analysis of large genomic datasets. MAPLE is faster, more accurate, and requires significantly less memory compared to existing maximum likelihood methods, enabling the analysis of millions of genomes.

NATURE GENETICS (2023)

Article Evolutionary Biology

DNA Sequences Are as Useful as Protein Sequences for Inferring Deep Phylogenies

Paschalia Kapli, Ioanna Kotari, Maximilian J. Telford, Nick Goldman, Ziheng Yang

Summary: Inference of deep phylogenies has primarily used protein sequences, but our analysis shows that DNA sequences may be just as useful and should not be excluded. We conducted a simulation study and analyzed empirical data, which suggest that DNA sequences can recover the correct tree as often as protein sequences. Using DNA data has computational advantages and allows for advanced models that account for heterogeneity in the nucleotide-substitution process.

SYSTEMATIC BIOLOGY (2023)
