Article; Proceedings Paper

The minimizer Jaccard estimator is biased and inconsistent

Journal

BIOINFORMATICS
Volume 38, Issue SUPPL 1, Pages 169-176

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/btac244

Funding

  1. National Science Foundation [2029170, 1453527, 1931531, 1356529]
  2. Direct For Computer & Info Scie & Enginr
  3. Office of Advanced Cyberinfrastructure (OAC) [1931531] Funding Source: National Science Foundation
  4. Direct For Mathematical & Physical Scien
  5. Division Of Mathematical Sciences [2029170] Funding Source: National Science Foundation
  6. Div Of Biological Infrastructure
  7. Direct For Biological Sciences [1356529] Funding Source: National Science Foundation
  8. Div Of Information & Intelligent Systems
  9. Direct For Computer & Info Scie & Enginr [1453527] Funding Source: National Science Foundation

This article investigates the bias and inconsistency issues of the minimizer sketch in estimating Jaccard similarity, and its impact on data processing accuracy.
Motivation: Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences.

Results: We show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e. the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool.
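
To make the two quantities compared in the abstract concrete, the following Python sketch computes the Jaccard similarity over all k-mers and over minimizer sketches. It is a minimal illustration under stated assumptions, not the authors' code: the function names, the BLAKE2b-based k-mer ordering, and the window parameter w are choices made here, and the estimator is assumed to be the Jaccard similarity of the two minimizer sketches.

```python
import hashlib


def kmers(seq, k):
    """Return the list of k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_hash(kmer):
    """Hypothetical ordering on k-mers: a 64-bit BLAKE2b digest (an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_sketch(seq, k, w):
    """Set of k-mers that have the smallest hash in some window of w consecutive k-mers."""
    km = kmers(seq, k)
    hashes = [kmer_hash(x) for x in km]
    sketch = set()
    for start in range(len(km) - w + 1):
        # Position of the minimum hash in this window (leftmost on ties).
        best = min(range(start, start + w), key=lambda i: hashes[i])
        sketch.add(km[best])
    return sketch


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 here when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def true_jaccard(s, t, k):
    """Jaccard similarity of the full k-mer sets of s and t."""
    return jaccard(set(kmers(s, k)), set(kmers(t, k)))


def minimizer_jaccard(s, t, k, w):
    """Assumed form of the estimator: Jaccard similarity of the minimizer sketches."""
    return jaccard(minimizer_sketch(s, k, w), minimizer_sketch(s if t is None else t, k, w))


if __name__ == "__main__":
    s = "ACGTTGCATGCCGATTACAGGCTTAACGTCAGT"
    t = "ACGTTGCATGCAGATTACAGGCTTTACGTCAGT"  # s with two substitutions
    print("true Jaccard:     ", true_jaccard(s, t, k=5))
    print("minimizer Jaccard:", minimizer_jaccard(s, t, k=5, w=4))
```

On any single pair of sequences the two values may or may not differ; the bias described in the abstract is a statement about the expected value of the estimator, so observing it empirically requires averaging over many sequences or hash orderings.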
