Article

An analysis of protein language model embeddings for fold prediction

Journal

BRIEFINGS IN BIOINFORMATICS
Volume 23, Issue 3, Pages: -

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bib/bbac142

Keywords

protein fold prediction; protein language models; fine-tuning neural networks; embedding learning

Funding

  1. Ministerio de Ciencia e Innovación (MCIN)/Agencia Estatal de Investigación (AEI) [PID2019-104206GB-I00]
  2. FPI grant [BES2017-079792]

Abstract

This paper analyzes a framework for protein fold prediction that uses pre-trained protein language model (LM) embeddings, and compares the performance of different embeddings and neural network models. The results show that transformer-based embeddings, particularly those obtained at the amino acid level, combined with the RBG and LAT fine-tuning models, perform well in both pairwise fold recognition and direct fold classification tasks. Several ensemble strategies are proposed to further improve prediction accuracy.
The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings, mainly using evolutionary information in the form of multiple sequence alignments (MSAs) as the input source. In contrast, protein language models (LMs) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction using pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are trained in a supervised manner with fold labels. In particular, we compare the performance of six protein-LM embeddings: the long short-term memory-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT and ProtT5; as well as three neural networks: Multi-Layer Perceptron, ResCNN-BGRU (RBG) and Light-Attention (LAT). We separately evaluate the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at the amino acid level, with the RBG and LAT fine-tuning models performs remarkably well on both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. All this suggests that moving from traditional protein representations to protein-LM embeddings is a very promising approach for protein fold-related tasks.
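For readers who want a concrete picture of the pipeline the abstract describes, below is a minimal, illustrative sketch in Python, not the authors' code: per-residue ESM-1b embeddings extracted with the fair-esm package are pooled by a Light-Attention-style head into a fixed-size vector and mapped to fold-class logits. The kernel size, hidden width, and fold count (1195, as in common SCOP-based fold benchmarks) are assumptions, and padding masks and the training loop are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import esm  # fair-esm package (pip install fair-esm)

class LightAttentionHead(nn.Module):
    """Light-Attention-style pooling over per-residue embeddings,
    followed by a small MLP that predicts a fold class. Kernel size,
    hidden width, and fold count are illustrative choices only."""

    def __init__(self, embed_dim: int = 1280, n_folds: int = 1195):
        super().__init__()
        # Two 1D convolutions along the sequence axis: one yields
        # per-residue values, the other per-residue attention scores.
        self.values = nn.Conv1d(embed_dim, embed_dim, kernel_size=9, padding=4)
        self.scores = nn.Conv1d(embed_dim, embed_dim, kernel_size=9, padding=4)
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, 512), nn.ReLU(), nn.Linear(512, n_folds)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim) per-residue LM embeddings
        x = x.transpose(1, 2)                    # -> (batch, embed_dim, seq_len)
        v = self.values(x)
        attn = F.softmax(self.scores(x), dim=-1) # attention over residues
        pooled_sum = (attn * v).sum(dim=-1)      # attention-weighted sum
        pooled_max = v.max(dim=-1).values        # per-channel max pool
        return self.mlp(torch.cat([pooled_sum, pooled_max], dim=1))

# Extract per-residue ESM-1b embeddings for one sequence
# (standard fair-esm usage; downloads pre-trained weights on first call).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()
_, _, tokens = batch_converter([("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])
with torch.no_grad():
    reps = model(tokens, repr_layers=[33])["representations"][33]
per_residue = reps[:, 1:-1, :]  # drop BOS/EOS positions -> (1, L, 1280)

head = LightAttentionHead()
logits = head(per_residue)      # (1, n_folds) fold-class logits
```

Under this reading of the framework, such a head would be trained with cross-entropy on fold labels for DFC, while for PFR the pooled representations would presumably be compared between query and template proteins.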

