☆ 4.7 Article

DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts

BIOINFORMATICS (2022)

期刊

BIOINFORMATICS

卷 38, 期 -, 页码 ii95-ii98

出版社

OXFORD UNIV PRESS

DOI: 10.1093/bioinformatics/btac474

关键词

类别

Biochemical Research Methods Biotechnology & Applied Microbiology Computer Science, Interdisciplinary Applications Mathematical & Computational Biology Statistics & Probability

资金

DSI of Bar-Ilan University [ECCB2022]

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

智能总结 New
摘要

This study adapted the concept of successful deep learning models in natural language processing to analyze protein sequences, successfully reducing the size and running time of the model by developing a distilled version. The distilled model showed promising performance in experimental tasks and distinguishing between real and random protein sequences, indicating its potential significance for protein sequence analysis.

Recently, deep learning models, initially developed in the field of natural language processing (NLP), were applied successfully to analyze protein sequences. A major drawback of these models is their size in terms of the number of parameters needed to be fitted and the amount of computational resources they require. Recently, 'distilled' models using the concept of student and teacher networks have been widely used in NLP. Here, we adapted this concept to the problem of protein sequence analysis, by developing DistilProtBert, a distilled version of the successful ProtBert model. Implementing this approach, we reduced the size of the network and the running time by 50%, and the computational resources needed for pretraining by 98% relative to ProtBert model. Using two published tasks, we showed that the performance of the distilled model approaches that of the full model. We next tested the ability of DistilProtBert to distinguish between real and random protein sequences. The task is highly challenging if the composition is maintained on the level of singlet, doublet and triplet amino acids. Indeed, traditional machine-learning algorithms have difficulties with this task. Here, we show that DistilProtBert preforms very well on singlet, doublet and even triplet-shuffled versions of the human proteome, with AUC of 0.92, 0.91 and 0.87, respectively. Finally, we suggest that by examining the small number of false-positive classifications (i.e. shuffled sequences classified as proteins by DistilProtBert), we may be able to identify de novo potential natural-like proteins based on random shuffling of amino acid sequences.

DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts

期刊

BIOINFORMATICS

出版社

OXFORD UNIV PRESS

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts

期刊

BIOINFORMATICS

出版社

OXFORD UNIV PRESS

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文