Journal
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE
Volume 87, Pages 66-82
Publisher
ELSEVIER
DOI: 10.1016/j.future.2018.04.094
Keywords
Apache spark; MapReduce; Distributed computing; Big data; Multi-label classification; Nearest neighbors
Funding
- VCU Research Support Funds, USA
- Spanish Ministry of Economy and Competitiveness, Spain
- European Regional Development Fund, Belgium [TIN2017-83445-P]
Modern data is characterized by its ever-increasing volume and complexity, particularly when data instances belong to many categories simultaneously. This learning paradigm is known as multi-label classification, and one of its most renowned methods is the multi-label k nearest neighbor (ML-KNN). Traditional implementations of this method are not feasible for large-scale multi-label data because of their computational complexity and memory requirements. We propose a distributed ML-KNN implementation based on the MapReduce programming model, implemented on Apache Spark. We compare three strategies for distributed nearest neighbor search: 1) iteratively broadcasting instances, 2) using a distributed tree-based index structure, and 3) building hash tables to group instances. The experimental study evaluates the trade-off between prediction quality and runtime on 22 benchmark datasets, and compares scalability across different data sizes. The results indicate that the tree-based index strategy outperforms the other approaches, achieving a speedup of up to 266x on the largest dataset while matching the accuracy of the exact methods. This strategy enables ML-KNN to scale efficiently with the size of the problem. (C) 2018 Elsevier B.V. All rights reserved.
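To make the first strategy concrete, below is a minimal single-process sketch of broadcast-based distributed nearest neighbor search: each partition (a worker in Spark terms) computes its local top-k candidates for a broadcast query, and the driver merges them into the global top-k. The labeling step here is a simplified majority vote, not ML-KNN's full MAP estimation from label priors and posteriors; all function and variable names are illustrative, not taken from the paper's implementation.

```python
import heapq
import math

def local_topk(partition, query, k):
    """Map side: top-k nearest training instances to the query in one partition."""
    return heapq.nsmallest(
        k, ((math.dist(features, query), idx) for idx, features in partition))

def broadcast_knn(partitions, query, k):
    """Driver side: merge per-partition candidates into the global top-k.
    In Spark this would be a mapPartitions over the training RDD followed
    by a collect of the (distance, id) candidates."""
    candidates = []
    for part in partitions:
        candidates.extend(local_topk(part, query, k))
    return [idx for _, idx in heapq.nsmallest(k, candidates)]

def vote_labels(neighbor_ids, labels, k):
    """Simplified multi-label step: predict a label when a majority of the
    k neighbors carry it (ML-KNN proper replaces this vote with a MAP rule)."""
    counts = {}
    for i in neighbor_ids:
        for lab in labels[i]:
            counts[lab] = counts.get(lab, 0) + 1
    return sorted(lab for lab, c in counts.items() if c * 2 > k)

# Toy data: two partitions of (id, features) pairs and a label-set per id.
train = [(0, (0.0, 0.0)), (1, (1.0, 0.0)), (2, (5.0, 5.0)), (3, (6.0, 5.0))]
partitions = [train[:2], train[2:]]
labels = {0: {"a", "b"}, 1: {"a"}, 2: {"c"}, 3: {"c"}}

nn = broadcast_knn(partitions, (0.5, 0.0), 2)   # -> [0, 1]
predicted = vote_labels(nn, labels, 2)          # -> ["a"]
```

Only k candidates per partition reach the driver, so the merge cost grows with the number of partitions rather than the dataset size; the paper's results suggest the tree-based index strategy still scales better than this exhaustive scheme.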
Authors