Nearest Neighbor Classification for High-Speed Big Data Streams Using Spark

IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS (2017)

Journal

IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS

Volume 47, Issue 10, Pages 2727-2739

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TSMC.2017.2700889

Keywords

Apache Spark; big data; data streams; distributed computing; instance reduction; machine learning; nearest neighbor

Funding

Spanish National Research Project [TIN2014-57251-P, TIN2016-81113-R]
Andalusian Research Plan [P11-TIC-7765, P12-TIC-2958]
FPU Scholarship from the Spanish Ministry of Education and Science [FPU13/00047]
Polish National Science Center [DEC-2013/09/B/ST6/02264]

Abstract

Mining massive and high-speed data streams is among the main contemporary challenges in machine learning. This calls for methods that display high computational efficiency and are able to continuously update their structure and handle a large, ever-arriving number of instances. In this paper, we present a new incremental and distributed classifier based on the popular nearest neighbor algorithm, adapted to such a demanding scenario. This method, implemented in Apache Spark, includes a distributed metric-space ordering to perform faster searches. Additionally, we propose an efficient incremental instance selection method for massive data streams that continuously updates the case-base and removes outdated examples from it. This alleviates the high computational requirements of the original classifier, thus making it suitable for the considered problem. An experimental study conducted on a set of real-life massive data streams proves the usefulness of the proposed solution and shows that we are able to provide the first efficient nearest neighbor solution for high-speed big and streaming data.
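
The general idea of an incremental nearest neighbor classifier with a bounded case-base can be illustrated with a small single-machine sketch. This is not the authors' method: it assumes a brute-force Euclidean search in place of the distributed metric-space ordering, and a simple first-in-first-out eviction in place of the paper's instance selection. The class name `SlidingWindowKNN` and its parameters `k` and `max_size` are hypothetical.

```python
from collections import Counter, deque
from math import dist


class SlidingWindowKNN:
    """Toy incremental k-NN (illustrative assumption, not the paper's algorithm)."""

    def __init__(self, k=3, max_size=1000):
        self.k = k
        # A deque with maxlen drops the oldest example when a new one
        # arrives at capacity, standing in for "removing outdated examples".
        self.case_base = deque(maxlen=max_size)

    def update(self, x, label):
        """Add one labeled instance from the stream to the case-base."""
        self.case_base.append((tuple(x), label))

    def predict(self, x):
        """Return the majority label among the k nearest stored instances."""
        if not self.case_base:
            return None
        query = tuple(x)
        # Brute-force search: sort every stored case by Euclidean distance.
        nearest = sorted(self.case_base, key=lambda c: dist(c[0], query))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]
```

In a streaming setting, such a classifier would typically be evaluated in a test-then-train loop: call `predict` on each arriving instance first, then pass the instance and its true label to `update`. The linear scan in `predict` is exactly the cost that a metric-space index and instance reduction are meant to lower.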

