☆ 4.7 Article Proceedings Paper

A New Unsupervised Binning Approach for Metagenomic Sequences Based on N-grams and Automatic Feature Weighting

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (2014)

期刊

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

卷 11, 期 1, 页码 42-54

出版社

IEEE COMPUTER SOC

DOI: 10.1109/TCBB.2013.137

关键词

Metagenomics; binning; N-grams; feature weighting; algorithms

类别

Biochemical Research Methods Computer Science, Interdisciplinary Applications Mathematics, Interdisciplinary Applications Statistics & Probability

资金

China 863 Program [2012AA020403]
National Natural Science Foundation of China (NSFC) [61173118, 61272380]

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

摘要

The rapid development of high-throughput technologies enables researchers to sequence the whole metagenome of a microbial community sampled directly from the environment. The assignment of these sequence reads into different species or taxonomical classes is a crucial step for metagenomic analysis, which is referred to as binning of metagenomic data. Most traditional binning methods rely on known reference genomes for accurate assignment of the sequence reads, therefore cannot classify reads from unknown species without the help of close references. To overcome this drawback, unsupervised learning based approaches have been proposed, which need not any known species' reference genome for help. In this paper, we introduce a novel unsupervised method called MCluster for binning metagenomic sequences. This method uses N-grams to extract sequence features and utilizes automatic feature weighting to improve the performance of the basic K-means clustering algorithm. We evaluate MCluster on a variety of simulated data sets and a real data set, and compare it with three latest binning methods: AbundanceBin, MetaCluster 3.0, and MetaCluster 5.0. Experimental results show that MCluster achieves obviously better overall performance (F-measure) than AbundanceBin and MetaCluster 3.0 on long metagenomic reads (>= 800 bp); while compared with MetaCluster 5.0, MCluster obtains a larger sensitivity, and a comparable yet more stable F-measure on short metagenomic reads (<300 bp). This suggests that MCluster can serve as a promising tool for effectively binning metagenomic sequences.

A New Unsupervised Binning Approach for Metagenomic Sequences Based on N-grams and Automatic Feature Weighting

期刊

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

出版社

IEEE COMPUTER SOC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

A New Unsupervised Binning Approach for Metagenomic Sequences Based on N-grams and Automatic Feature Weighting

期刊

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

出版社

IEEE COMPUTER SOC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文