Article

Constructing benchmark test sets for biological sequence analysis using independent set algorithms

Journal

PLOS COMPUTATIONAL BIOLOGY
Volume 18, Issue 3

Publisher

PUBLIC LIBRARY OF SCIENCE
DOI: 10.1371/journal.pcbi.1009492

Funding

  1. NSF-Simons Center for Mathematical and Statistical Analysis of Biology at Harvard [1764269]
  2. National Human Genome Research Institute of the National Institutes of Health [R01-HG009116]

In benchmarking sequence analysis methods, splitting data into separate training and test sets is crucial. This study proposes two new methods, based on independent set graph algorithms, that successfully split sequence data into dissimilar training and test sets. This enables the construction of more diverse benchmark datasets.
Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
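
The splitting constraint described above can be illustrated with a minimal sketch. This is not either of the authors' two algorithms, which this summary does not spell out; it is a simple greedy heuristic that enforces the same condition, namely that no test sequence is p% or more identical to any training sequence. The names `percent_identity`, `build_similarity_graph`, `greedy_split`, and `test_fraction` are hypothetical, and the identity measure assumes pre-aligned sequences of equal length.

```python
import random
from itertools import combinations


def percent_identity(a, b):
    """Toy identity measure; assumes pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def build_similarity_graph(seqs, p):
    """Connect two sequences by an edge when they are at least p% identical."""
    neighbors = {i: set() for i in range(len(seqs))}
    for i, j in combinations(range(len(seqs)), 2):
        if percent_identity(seqs[i], seqs[j]) >= p:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def greedy_split(seqs, p, test_fraction=0.3, seed=0):
    """Greedy sketch (an assumption, not the published method).

    Visits sequences in random order and keeps the invariant that no
    edge of the similarity graph joins the training and test sets.
    """
    neighbors = build_similarity_graph(seqs, p)
    rng = random.Random(seed)
    order = list(range(len(seqs)))
    rng.shuffle(order)

    train, test, discarded = set(), set(), set()
    for i in order:
        near_train = bool(neighbors[i] & train)
        near_test = bool(neighbors[i] & test)
        if near_train and near_test:
            # Similar to both sides: keeping it would violate the constraint.
            discarded.add(i)
        elif near_train:
            train.add(i)
        elif near_test:
            test.add(i)
        elif rng.random() < test_fraction:
            # No similar sequence placed yet: either side is safe.
            test.add(i)
        else:
            train.add(i)
    return train, test, discarded


if __name__ == "__main__":
    family = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG", "GGGGGGGGGG"]
    train, test, discarded = greedy_split(family, p=50)
    print("train:", sorted(train))
    print("test:", sorted(test))
    print("discarded:", sorted(discarded))
```

A random split offers no such guarantee, which is the failure mode the abstract describes. The cost of the greedy rule is that sequences similar to both sides are discarded, so the result depends on the visiting order.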
