☆ 4.7 Article

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

ACM COMPUTING SURVEYS (2021)

Journal

ACM COMPUTING SURVEYS

Volume 54, Issue 2, Pages -

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.1145/3432999

Keywords

Text indexing; string searching; compressed data structures; repetitive string collections

Funding

ANID Basal Funds [FB0001]
Millennium Science Initiative Program [ICN17_002]
Fondecyt, Chile [1-200038]

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

The text discusses a breakthrough in indexing string collections, which allows representing them in a compressed space while offering indexed search functionalities. With the technology permeating through applications like bioinformatics, string collections have experienced rapid growth, particularly due to their high repetitiveness. New techniques have been developed to properly exploit this repetitiveness, leading to the creation of a new generation of data structures capable of handling huge repetitive string collections.

Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability of handling them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, formed by two parts, we cover the algorithmic developments that have led to these data structures. In this second part, we describe the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations. We conclude with the current challenges in this fascinating field.

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

Journal

ACM COMPUTING SURVEYS

Publisher

ASSOC COMPUTING MACHINERY

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

Journal

ACM COMPUTING SURVEYS

Publisher

ASSOC COMPUTING MACHINERY

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper