Practical Efficient String Mining

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2012)

Journal

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Volume 24, Issue 4, Pages 735-744

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/TKDE.2010.242

Keywords

String mining; suffix array; suffix tree; data mining; algorithms

Funding

Australian Research Council

Abstract

In recent years, several algorithms for mining frequent and emerging substring patterns from databases of string data (such as proteins and natural language texts) have been proposed, all of which traverse an enhanced suffix array data structure. All of these algorithms lie at either extreme of the efficiency spectrum: they are either fast and use enormous amounts of space, or they are compact and orders of magnitude slower. In this paper, we present an algorithm that achieves the best of both extremes, with runtime comparable to the fastest published algorithms while using less space than the most space-efficient ones. This excellent practical performance is underpinned by theoretical guarantees. Our main mechanism for keeping memory usage low is to build the enhanced suffix array incrementally, in blocks. Once built, a block is traversed to output patterns with the required support before its space is reclaimed for the next block.
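
The block-wise idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm. The abstract does not say how blocks are chosen or how the enhanced suffix array is represented, so the sketch assumes one block per leading character and stores suffixes as plain sorted strings instead of a real suffix array with LCP information. The names `mine_frequent_substrings` and `_traverse`, and the definition of support as the number of database strings containing a pattern, are assumptions made for illustration.

```python
from itertools import groupby


def _traverse(block, depth, min_support):
    """Walk one sorted block and yield (pattern, support) pairs.

    `block` is a sorted list of (suffix, string_id) pairs. Suffixes that
    share a prefix of length `depth` are contiguous in sorted order, so
    each run found by groupby corresponds to one candidate pattern.
    """
    live = [(suf, sid) for suf, sid in block if len(suf) >= depth]
    for prefix, run in groupby(live, key=lambda e: e[0][:depth]):
        run = list(run)
        support = len({sid for _, sid in run})
        if support >= min_support:
            yield prefix, support
            # Extensions of an infrequent pattern cannot be frequent,
            # so we only descend below patterns that met the threshold.
            yield from _traverse(run, depth + 1, min_support)


def mine_frequent_substrings(strings, min_support):
    """Yield every substring occurring in at least `min_support` strings.

    Assumption: one block per leading character. Only one block of
    sorted suffixes is held in memory at a time.
    """
    for c in sorted(set("".join(strings))):
        # Build the block: all suffixes that start with character c.
        block = sorted(
            (s[i:], sid)
            for sid, s in enumerate(strings)
            for i in range(len(s))
            if s[i] == c
        )
        # Traverse the block and emit its patterns.
        yield from _traverse(block, 1, min_support)
        # Release the block before the next one is built.
        del block


if __name__ == "__main__":
    for pattern, support in mine_frequent_substrings(["abab", "bab", "ab"], 2):
        print(pattern, support)
```

Because each block covers a disjoint set of patterns (those starting with a given character), blocks can be processed independently and in any order. A practical implementation would replace the materialised suffix strings with integer offsets and precomputed LCP values.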
