4.7 Article

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

Journal

ACM COMPUTING SURVEYS
Volume 54, Issue 2, Pages -

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/3432999

Keywords

Text indexing; string searching; compressed data structures; repetitive string collections

Funding

  1. ANID Basal Funds [FB0001]
  2. Millennium Science Initiative Program [ICN17_002]
  3. Fondecyt, Chile [1-200038]

Ask authors/readers for more resources

The text discusses a breakthrough in indexing string collections, which allows representing them in a compressed space while offering indexed search functionalities. With the technology permeating through applications like bioinformatics, string collections have experienced rapid growth, particularly due to their high repetitiveness. New techniques have been developed to properly exploit this repetitiveness, leading to the creation of a new generation of data structures capable of handling huge repetitive string collections.
Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability of handling them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, formed by two parts, we cover the algorithmic developments that have led to these data structures. In this second part, we describe the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations. We conclude with the current challenges in this fascinating field.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.7
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available