4.7 Article

Accurate identification of bacteriophages from metagenomic data using Transformer

Journal

BRIEFINGS IN BIOINFORMATICS
Volume 23, Issue 4, Pages -

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bib/bbac258

Keywords

phage identification; protein cluster-based token; transformer; deep learning

Funding

  1. City University of Hong Kong [9678241, 7005453]
  2. Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA)

In this study, the state-of-the-art language model Transformer is used to conduct contextual embedding for phage contigs. By constructing a protein-cluster vocabulary and using the self-attention mechanism, the Transformer can learn the protein organization and associations, leading to improved phage detection.
Motivation: Bacteriophages are viruses that infect bacteria. As key players in microbial communities, they can regulate the composition and function of the microbiome by infecting their bacterial hosts and mediating gene transfer. Recently, metagenomic sequencing, which can sequence all genetic material from a microbiome, has become a popular means of discovering new phages. However, accurate and comprehensive detection of phages from metagenomic data remains difficult. The high diversity and abundance of phages, together with the limited number of reference genomes, pose major challenges for recruiting phage fragments from metagenomic data. Existing alignment-based and learning-based models have either low recall or low precision on metagenomic data.

Results: In this work, we adopt the state-of-the-art language model, Transformer, to conduct contextual embedding for phage contigs. By constructing a protein-cluster vocabulary, we can feed both the protein composition and the proteins' positions from each contig into the Transformer. The Transformer learns the protein organization and associations using the self-attention mechanism and predicts the label for test contigs. We rigorously tested our tool, named PhaMer, on multiple datasets of increasing difficulty, including high-quality RefSeq genomes, short contigs, simulated metagenomic data, mock metagenomic data and the public IMG/VR dataset. All the experimental results show that PhaMer outperforms the state-of-the-art tools. In the real metagenomic data experiment, PhaMer improves the F1-score of phage detection by 27%.
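The idea of a protein-cluster vocabulary can be illustrated with a minimal sketch. It is not PhaMer's actual implementation: the cluster names (`PC_12` and so on), the reserved `<pad>`/`<unk>` tokens and the functions `build_vocab` and `encode` are all hypothetical, and the sketch assumes each contig has already been reduced to an ordered list of protein-cluster labels. It shows only how composition (token ids) and position (1-based indices) could be prepared as Transformer inputs.

```python
from typing import Dict, List, Tuple

# Hypothetical reserved token ids (an assumption, not taken from the paper).
PAD, UNK = 0, 1


def build_vocab(contigs: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every protein cluster seen in the training contigs."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for contig in contigs:
        for cluster in contig:
            if cluster not in vocab:
                vocab[cluster] = len(vocab)
    return vocab


def encode(contig: List[str], vocab: Dict[str, int], max_len: int) -> Tuple[List[int], List[int]]:
    """Turn one contig into fixed-length token ids and position ids.

    Token ids carry the protein composition; position ids carry the order of
    the proteins on the contig. Unknown clusters map to UNK, and both
    sequences are right-padded with 0 up to max_len.
    """
    tokens = [vocab.get(cluster, UNK) for cluster in contig[:max_len]]
    positions = list(range(1, len(tokens) + 1))
    padding = max_len - len(tokens)
    return tokens + [PAD] * padding, positions + [0] * padding


# Toy data: each inner list is one contig, in gene order.
train_contigs = [["PC_12", "PC_7", "PC_12"], ["PC_3", "PC_7"]]
vocab = build_vocab(train_contigs)
token_ids, position_ids = encode(["PC_7", "PC_99", "PC_3"], vocab, max_len=5)
```

In a full model, the two id sequences would typically be passed through embedding layers and summed before the self-attention layers; that part is omitted here.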
