Article

Accelerating the Original Profile Kernel

Journal

PLOS ONE
Volume 8, Issue 6

Publisher

Public Library of Science
DOI: 10.1371/journal.pone.0068459

Funding

  1. Alexander von Humboldt Foundation through the German Ministry for Research and Education (BMBF: Bundesministerium fuer Bildung und Forschung)

One of the most accurate multi-class protein classification systems continues to be the profile-based SVM kernel introduced by the Leslie group. Unfortunately, its CPU requirements render it too slow for practical large-scale classification tasks. Here, we introduce several software improvements that enable significant acceleration. Using various non-redundant data sets, we demonstrate that our new implementation reaches a maximal speed-up as high as 14-fold for calculating the same kernel matrix. Some predictions are over 200 times faster, which possibly makes the kernel the top contender in terms of the trade-off between speed and performance. Additionally, we explain how to parallelize various computations and provide an integrative program that reduces creating a production-quality classifier to a single program call. The new implementation is available as a Debian package under a free academic license and does not depend on commercial software. For non-Debian based distributions, the source package ships with a traditional Makefile-based installer. Download and installation instructions can be found at https://rostlab.org/owiki/index.php/Fast_Profile_Kernel. Bugs and other issues may be reported at https://rostlab.org/bugzilla3/enter_bug.cgi?product=fastprofkernel.

Recommended

Review Biochemical Research Methods

Mutations in transmembrane proteins: diseases, evolutionary insights, prediction and comparison with globular proteins

Jan Zaucha, Michael Heinzinger, A. Kulandaisamy, Evans Kataka, Oscar Llorian Salvador, Petr Popov, Burkhard Rost, M. Michael Gromiha, Boris S. Zhorov, Dmitrij Frishman

Summary: Membrane proteins, by interacting with lipid bilayers, play crucial roles in transporting molecules and relaying signals between cells. Mutations in these proteins can profoundly affect the host's fitness, as shown by experimental studies and evolutionary signals.

BRIEFINGS IN BIOINFORMATICS (2021)

Article Multidisciplinary Sciences

Embeddings from deep learning transfer GO annotations beyond homology

Maria Littmann, Michael Heinzinger, Christian Dallago, Tobias Olenyi, Burkhard Rost

Summary: This study proposes a GO term prediction method based on SeqVec embedding and protein proximity, with promising results especially for proteins from smaller families or with intrinsically disordered regions.

SCIENTIFIC REPORTS (2021)

Article Genetics & Heredity

Embeddings from protein language models predict conservation and variant effects

Celine Marquet, Michael Heinzinger, Tobias Olenyi, Christian Dallago, Kyra Erckert, Michael Bernhofer, Dmitrii Nechaev, Burkhard Rost

Summary: The study utilized Protein Language Models (pLMs) to predict sequence conservation and SAV effects without requiring multiple sequence alignments (MSAs). The results showed that embeddings alone could accurately predict residue conservation almost as effectively as ConSeq using MSAs.

HUMAN GENETICS (2022)

Article Biochemistry & Molecular Biology

ProteomicsDB: toward a FAIR open-source resource for life-science research

Ludwig Lautenbacher, Patroklos Samaras, Julian Muller, Andreas Grafberger, Marwin Shraideh, Johannes Rank, Simon T. Fuchs, Tobias K. Schmidt, Matthew The, Christian Dallago, Holger Wittges, Burkhard Rost, Helmut Krcmar, Bernhard Kuster, Mathias Wilhelm

Summary: ProteomicsDB is a multi-omics and multi-organism resource for life science research, with efforts to improve the findability, accessibility, interoperability and reusability of data. New API and UI have been released, along with content expansions into different human biology and a newly supported organism.

NUCLEIC ACIDS RESEARCH (2022)

Editorial Material Biochemistry & Molecular Biology

Protein matchmaking through representation learning

Michael Heinzinger, Christian Dallago, Burkhard Rost

Summary: Sledzieski, Singh, Cowen, and Berger used representation learning to predict protein interactions and identify binding residues between protein pairs. Their work demonstrated the generalizability of training on one organism and evaluating on others, showcasing the potential of AI-learned representations in advancing knowledge in molecular biology.

CELL SYSTEMS (2021)

Article Biochemistry & Molecular Biology

Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction

Konstantin Weissenow, Michael Heinzinger, Burkhard Rost

Summary: This study describes a competitive prediction method that exclusively uses embeddings from pre-trained protein language models (pLMs) and does not require multiple sequence alignments (MSAs). By utilizing attention mechanisms, this method performs similarly to methods that rely on co-evolution, but at a lower cost. It may better capture features of specific protein structures, although it does not reach the level of AlphaFold2.

STRUCTURE (2022)

Correction Multidisciplinary Sciences

Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network (vol 12, 3279, 2021)

Mathys Grapotte, Manu Saraswat, Chloe Bessiere, Christophe Menichelli, Jordan A. Ramilowski, Jessica Severin, Yoshihide Hayashizaki, Masayoshi Itoh, Michihira Tagami, Mitsuyoshi Murata, Miki Kojima-Ishiyama, Shohei Noma, Shuhei Noguchi, Takeya Kasukawa, Akira Hasegawa, Harukazu Suzuki, Hiromi Nishiyori-Sueki, Martin C. Frith, Clement Chatelain, Piero Carninci, Michiel J. L. de Hoon, Wyeth W. Wasserman, Laurent Brehelin, Charles-Henri Lecellier

NATURE COMMUNICATIONS (2022)

Article Biochemical Research Methods

TMbed: transmembrane proteins predicted through language model embeddings

Michael Bernhofer, Burkhard Rost

Summary: In this study, a novel method called TMbed is proposed, which utilizes embeddings from protein language models to predict transmembrane regions of proteins. The method achieves high accuracy and low false positive rates in predicting alpha helical and beta barrel transmembrane proteins. TMbed is capable of processing large protein sequences on standard desktop computers and has the potential to be used for screening millions of predicted 3D structures.

BMC BIOINFORMATICS (2022)

Article Biochemical Research Methods

Engineering indel and substitution variants of diverse and ancient enzymes using Graphical Representation of Ancestral Sequence Predictions (GRASP)

Gabriel Foley, Ariane Mora, Connie M. Ross, Scott Bottoms, Leander Sutzl, Marnie L. Lamprecht, Julian Zaugg, Alexandra Essebier, Brad Balderson, Rhys Newell, Raine E. S. Thomson, Bostjan Kobe, Ross T. Barnard, Luke Guddat, Gerhard Schenk, Jorg Carsten, Yosephine Gumulya, Burkhard Rost, Dietmar Haltrich, Volker Sieber, Elizabeth M. J. Gillam, Mikael Boden

Summary: Ancestral sequence reconstruction is a powerful technique for recovering ancestral diversity and identifying building blocks using large data sets. The GRASP method efficiently implements maximum likelihood methods and uses partial order graphs to represent insertion and deletion events. By exploring variation over evolutionary time, GRASP enables the engineering of biologically active ancestral variants.

PLOS COMPUTATIONAL BIOLOGY (2022)

Article Biochemical Research Methods

CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models

Vamsi Nallapareddy, Nicola Bordin, Ian Sillitoe, Michael Heinzinger, Maria Littmann, Vaishali P. Waman, Neeladri Sen, Burkhard Rost, Christine Orengo

Summary: CATH is a protein domain classification resource that utilizes an automated workflow and manual curation to create a hierarchical classification of evolutionary and structural relationships. The study aimed to develop algorithms for detecting remote homologues missed by HMM-based approaches. The CATHe method, combining a neural network with sequence representations, showed high accuracy in identifying remote homologues.

BIOINFORMATICS (2023)

Article Biochemistry & Molecular Biology

LambdaPP: Fast and accessible protein-specific phenotype predictions

Tobias Olenyi, Celine Marquet, Michael Heinzinger, Benjamin Kroeger, Tiha Nikolova, Michael Bernhofer, Philip Saendig, Konstantin Schuetze, Maria Littmann, Milot Mirdita, Martin Steinegger, Christian Dallago, Burkhard Rost

Summary: The availability of accurate and fast AI solutions for predicting protein aspects is revolutionizing molecular biology. LambdaPP is a webserver aiming to replace the first internet server PredictProtein from 1992, providing AI protein predictions. LambdaPP offers accessible visualizations of protein 3D structure and predictions at both the protein level and residue level, including various phenotypes, within seconds.

PROTEIN SCIENCE (2023)

Review Biochemistry & Molecular Biology

Novel machine learning approaches revolutionize protein knowledge

Nicola Bordin, Christian Dallago, Michael Heinzinger, Stephanie Kim, Maria Littmann, Clemens Rauer, Martin Steinegger, Burkhard Rost, Christine Orengo

Summary: Breakthroughs in machine learning, protein structure prediction, and ultrafast structural aligners are revolutionizing structural biology. Large-scale acquisition of accurate protein models and functional annotation is no longer constrained by time and resources. AlphaFold 2, the latest top-ranked method in the CASP assessment, can build structural models with accuracy comparable to experimental structures. Recent advancements in protein language models and structural aligners facilitate the validation of transferred annotations for 3D models.

TRENDS IN BIOCHEMICAL SCIENCES (2023)

Article Mathematical & Computational Biology

Nearest neighbor search on embeddings rapidly identifies distant protein relations

Konstantin Schuetze, Michael Heinzinger, Martin Steinegger, Burkhard Rost

Summary: This article explores the use of embeddings for nearest neighbor searches to identify the relationships between protein pairs with diverged sequences. While the approach performs well for proteins with single domains, it faces challenges with multi-domain proteins. The authors present ideas to overcome these limitations.

FRONTIERS IN BIOINFORMATICS (2022)

Article Computer Science, Artificial Intelligence

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, Burkhard Rost

Summary: Computational biology and bioinformatics provide valuable data for the development of language models in natural language processing. In this study, six different models were trained on protein sequence data and the resulting embeddings were used for various protein structure prediction tasks, demonstrating their advantages over traditional methods.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2022)

Article Genetics & Heredity

Contrastive learning on protein embeddings enlightens midnight zone

Michael Heinzinger, Maria Littmann, Ian Sillitoe, Nicola Bordin, Christine Orengo, Burkhard Rost

Summary: The research uses ProtTucker, an embedding-based annotation transfer technique, to improve the classification of protein 3D structures from single protein representations and the recognition of distant homologous relationships. Compared to traditional techniques, this method performs better and is faster.

NAR GENOMICS AND BIOINFORMATICS (2022)
