Article

SPRoBERTa: protein embedding learning with local fragment modeling

Journal

BRIEFINGS IN BIOINFORMATICS
Volume 23, Issue 6, Pages -

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bib/bbac401

Keywords

local fragment representation; protein tokenizer; protein pre-training

This study introduces SPRoBERTa, a protein pre-training approach that combines an unsupervised protein tokenizer with a deep pre-training framework to learn protein embeddings. In the reported experiments, it outperforms previous methods across a range of protein tasks.
A good understanding of protein function and structure in computational biology helps in understanding human beings. Because only a limited number of proteins are annotated structurally and functionally, the scientific community has embraced self-supervised pre-training on large amounts of unlabeled protein sequences to learn protein embeddings. However, a protein is usually represented as a sequence of individual amino acids drawn from a small vocabulary (e.g. 20 amino acid types), which ignores the strong local semantics present in protein sequences. In this work, we propose a novel pre-training approach, SPRoBERTa. We first present an unsupervised protein tokenizer that learns protein representations with local fragment patterns. We then introduce a novel framework for a deep pre-training model to learn protein embeddings. After pre-training, our method can be easily fine-tuned for different protein tasks, including amino acid-level prediction (e.g. secondary structure prediction), amino acid pair-level prediction (e.g. contact prediction) and protein-level prediction (e.g. remote homology prediction and protein function prediction). Experiments show that our approach achieves significant improvements on all tasks and outperforms previous methods. We also provide detailed ablation studies and analysis of our protein tokenizer and training framework.
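The abstract does not say which algorithm the unsupervised tokenizer uses, so the following is only a minimal sketch of the general idea: moving from a vocabulary of single amino acids to one that also contains frequent local fragments. It assumes a byte-pair-encoding-style merge procedure as a stand-in, which may differ from the method in the paper. The function names `learn_merges`, `apply_merge` and `tokenize`, and the toy sequences, are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, left, right):
    """Replace each adjacent (left, right) pair with one fused token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn up to num_merges fragment merges from unlabeled sequences.

    Assumption: a BPE-style greedy procedure, used here for illustration only.
    """
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:  # no pair repeats, so nothing worth merging
            break
        merges.append((left, right))
        corpus = [apply_merge(tokens, left, right) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Split a sequence into fragments by replaying the learned merges in order."""
    tokens = list(sequence)
    for left, right in merges:
        tokens = apply_merge(tokens, left, right)
    return tokens


if __name__ == "__main__":
    toy_sequences = ["MKVLMKVL", "MKVA"]  # made-up sequences, not real proteins
    merges = learn_merges(toy_sequences, num_merges=3)
    print(tokenize("MKVLA", merges))
```

Under this assumption, each learned merge adds one multi-residue fragment to the vocabulary, so a sequence is encoded with fewer, longer tokens than its residue-level form, and joining the tokens always recovers the original sequence.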
