4.6 Article

Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

Journal

PLOS COMPUTATIONAL BIOLOGY
Volume 13, Issue 10, Pages -

Publisher

PUBLIC LIBRARY SCIENCE
DOI: 10.1371/journal.pcbi.1005777

Keywords

-

Funding

  1. Israel Science Foundation as part of the ISF-NSFC
  2. Edmond J. Safra Center for Bioinformatics at Tel-Aviv University
  3. Israel Ministry of Immigrant Absorption
  4. Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative [GBMF4554]
  5. US National Science Foundation [CCF-1256087, CCF-1319998]
  6. US National Institutes of Health [R01HG007104]
  7. Direct For Computer & Info Scie & Enginr
  8. Division of Computing and Communication Foundations [1319998] Funding Source: National Science Foundation

Ask authors/readers for more resources

With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.6
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available