Journal
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE
Volume 87, Pages 66-82
Publisher
ELSEVIER
DOI: 10.1016/j.future.2018.04.094
Keywords
Apache spark; MapReduce; Distributed computing; Big data; Multi-label classification; Nearest neighbors
Funding
- VCU Research Support Funds, USA
- Spanish Ministry of Economy and Competitiveness, Spain
- European Regional Development Fund, Belgium [TIN2017-83445-P]
Modern data is characterized by its ever-increasing volume and complexity, particularly when data instances belong to many categories simultaneously. This learning paradigm is known as multi-label classification, and one of its most renowned methods is the multi-label k nearest neighbor (ML-KNN). Traditional implementations of this method are not feasible for large-scale multi-label data because of their computational complexity and memory requirements. We propose a distributed ML-KNN implementation based on the MapReduce programming model, implemented on Apache Spark. We compare three strategies for distributed nearest neighbor search: 1) iteratively broadcasting instances, 2) using a distributed tree-based index structure, and 3) building hash tables to group instances. The experimental study evaluates the trade-off between prediction quality and runtime on 22 benchmark datasets, and compares scalability across different data sizes. The results indicate that the tree-based index strategy outperforms the other approaches, achieving a speedup of up to 266x on the largest dataset while matching the accuracy of the exact methods. This strategy enables ML-KNN to scale efficiently with the size of the problem. (C) 2018 Elsevier B.V. All rights reserved.
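To make the first strategy concrete, below is a minimal single-process sketch of broadcast-based distributed nearest neighbor search: each partition (a worker in Spark terms) computes its local top-k candidates for a broadcast query, and the driver merges them into the global top-k. The labeling step here is a simplified majority vote, not ML-KNN's full MAP estimation from label priors and posteriors; all function and variable names are illustrative, not taken from the paper's implementation.

```python
import heapq
import math

def local_topk(partition, query, k):
    """Map side: top-k nearest training instances to the query in one partition."""
    return heapq.nsmallest(
        k, ((math.dist(features, query), idx) for idx, features in partition))

def broadcast_knn(partitions, query, k):
    """Driver side: merge per-partition candidates into the global top-k.
    In Spark this would be a mapPartitions over the training RDD followed
    by a collect of the (distance, id) candidates."""
    candidates = []
    for part in partitions:
        candidates.extend(local_topk(part, query, k))
    return [idx for _, idx in heapq.nsmallest(k, candidates)]

def vote_labels(neighbor_ids, labels, k):
    """Simplified multi-label step: predict a label when a majority of the
    k neighbors carry it (ML-KNN proper replaces this vote with a MAP rule)."""
    counts = {}
    for i in neighbor_ids:
        for lab in labels[i]:
            counts[lab] = counts.get(lab, 0) + 1
    return sorted(lab for lab, c in counts.items() if c * 2 > k)

# Toy data: two partitions of (id, features) pairs and a label-set per id.
train = [(0, (0.0, 0.0)), (1, (1.0, 0.0)), (2, (5.0, 5.0)), (3, (6.0, 5.0))]
partitions = [train[:2], train[2:]]
labels = {0: {"a", "b"}, 1: {"a"}, 2: {"c"}, 3: {"c"}}

nn = broadcast_knn(partitions, (0.5, 0.0), 2)   # -> [0, 1]
predicted = vote_labels(nn, labels, 2)          # -> ["a"]
```

Only k candidates per partition reach the driver, so the merge cost grows with the number of partitions rather than the dataset size; the paper's results suggest the tree-based index strategy still scales better than this exhaustive scheme.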
Authors