Journal
IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS
Volume 47, Issue 10, Pages 2727-2739
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TSMC.2017.2700889
Keywords
Apache Spark; big data; data streams; distributed computing; instance reduction; machine learning; nearest neighbor
Funding
- Spanish National Research Project [TIN2014-57251-P, TIN2016-81113-R]
- Andalusian Research Plan [P11-TIC-7765, P12-TIC-2958]
- FPU Scholarship from the Spanish Ministry of Education and Science [FPU13/00047]
- Polish National Science Center [DEC-2013/09/B/ST6/02264]
Abstract
Mining massive and high-speed data streams is among the main contemporary challenges in machine learning. This calls for methods that display high computational efficiency and can continuously update their structure while handling an ever-growing number of arriving instances. In this paper, we present a new incremental and distributed classifier based on the popular nearest neighbor algorithm, adapted to such a demanding scenario. This method, implemented in Apache Spark, includes a distributed metric-space ordering to perform faster searches. Additionally, we propose an efficient incremental instance selection method for massive data streams that continuously updates the case-base and removes outdated examples from it. This alleviates the high computational requirements of the original classifier, thus making it suitable for the considered problem. An experimental study conducted on a set of real-life massive data streams demonstrates the usefulness of the proposed solution and shows that we are able to provide the first efficient nearest neighbor solution for high-speed big and streaming data.
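To make the idea of an incrementally updated case-base concrete, the following is a minimal single-machine sketch in Python. It is not the method proposed in the paper: it does not use Apache Spark, it has no metric-space ordering, and it replaces the paper's instance selection with a plain sliding window that evicts the oldest examples. The class name `SlidingWindowKNN` and the parameters `k` and `max_instances` are hypothetical and chosen only for illustration.

```python
from collections import Counter, deque


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class SlidingWindowKNN:
    """Toy incremental k-NN whose case-base keeps only the most recent examples.

    Assumption: "outdated" simply means "oldest"; the paper's actual
    instance selection rule is not reproduced here.
    """

    def __init__(self, k=3, max_instances=1000):
        self.k = k
        # A deque with maxlen drops the oldest entry when a new one is
        # appended to a full window.
        self.case_base = deque(maxlen=max_instances)

    def update(self, batch):
        """Insert a batch of (features, label) pairs into the case-base."""
        for features, label in batch:
            self.case_base.append((tuple(features), label))

    def predict(self, query):
        """Return the majority label among the k nearest stored examples."""
        if not self.case_base:
            raise ValueError("case-base is empty")
        # Brute-force linear scan over the whole case-base.
        nearest = sorted(
            self.case_base,
            key=lambda example: squared_euclidean(example[0], query),
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


if __name__ == "__main__":
    model = SlidingWindowKNN(k=3, max_instances=4)
    model.update([((0.0, 0.0), "a"), ((0.1, 0.0), "a"),
                  ((0.0, 0.1), "a"), ((5.0, 5.0), "b")])
    print(model.predict((0.05, 0.05)))
    # A later batch with a changed concept pushes the old examples out.
    model.update([((0.0, 0.0), "b"), ((0.1, 0.1), "b"), ((0.0, 0.2), "b")])
    print(model.predict((0.05, 0.05)))
```

The brute-force scan in `predict` costs time linear in the size of the case-base for every query, which is the kind of overhead that bounding the case-base and using a faster search structure are meant to reduce.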