Article

Nearest Neighbor Classification for High-Speed Big Data Streams Using Spark

Journal

IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS
Volume 47, Issue 10, Pages 2727-2739

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TSMC.2017.2700889

Keywords

Apache Spark; big data; data streams; distributed computing; instance reduction; machine learning; nearest neighbor

Funding

  1. Spanish National Research Project [TIN2014-57251-P, TIN2016-81113-R]
  2. Andalusian Research Plan [P11-TIC-7765, P12-TIC-2958]
  3. FPU Scholarship from the Spanish Ministry of Education and Science [FPU13/00047]
  4. Polish National Science Center [DEC-2013/09/B/ST6/02264]

Mining massive and high-speed data streams is among the main contemporary challenges in machine learning. This calls for methods that display high computational efficiency, with the ability to continuously update their structure and handle an ever-arriving, large number of instances. In this paper, we present a new incremental and distributed classifier based on the popular nearest neighbor algorithm, adapted to such a demanding scenario. This method, implemented in Apache Spark, includes a distributed metric-space ordering to perform faster searches. Additionally, we propose an efficient incremental instance selection method for massive data streams that continuously updates the case-base and removes outdated examples from it. This alleviates the high computational requirements of the original classifier, thus making it suitable for the considered problem. An experimental study conducted on a set of real-life massive data streams demonstrates the usefulness of the proposed solution and shows that we are able to provide the first efficient nearest neighbor solution for high-speed big and streaming data.
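
To make the idea of an incrementally updated case-base concrete, here is a minimal single-machine sketch in Python. It is not the method from the paper: the distributed metric-space ordering and the instance selection step are replaced by a plain linear scan and a simple "forget the oldest example" rule. The class name `SlidingWindowKNN` and the parameters `k` and `capacity` are hypothetical, chosen only for this illustration.

```python
from collections import Counter, deque


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


class SlidingWindowKNN:
    """Toy incremental k-NN that keeps only the most recent examples.

    Assumption: outdated examples are simply the oldest ones. The paper
    describes a more elaborate instance selection scheme.
    """

    def __init__(self, k=3, capacity=1000):
        self.k = k
        # A deque with maxlen drops the oldest entry when it is full.
        self.case_base = deque(maxlen=capacity)

    def update(self, x, y):
        """Add one labeled example to the case-base."""
        self.case_base.append((tuple(x), y))

    def predict(self, x):
        """Majority vote among the k nearest stored examples."""
        if not self.case_base:
            return None
        nearest = sorted(
            self.case_base,
            key=lambda item: squared_distance(item[0], x),
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]
```

On a stream, such a classifier would typically be used in a test-then-train loop: call `predict` on each arriving instance first, then pass the instance and its true label to `update`. The linear scan in `predict` costs time proportional to the case-base size for every query, which is the kind of cost that the search ordering and instance reduction described in the abstract are meant to reduce.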
