Article

MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

FRONTIERS OF COMPUTER SCIENCE (2014)

Journal

FRONTIERS OF COMPUTER SCIENCE

Volume 8, Issue 1, Pages 83-99

Publisher

HIGHER EDUCATION PRESS

DOI: 10.1007/s11704-013-3158-3

Keywords

data clustering; parallel algorithm; data mining; load balancing

Categories

Computer Science, Information Systems; Computer Science, Software Engineering; Computer Science, Theory & Methods

Funding

China National Science and Technology Pillar Program [2012BAH07B01]
Knowledge Innovation Project of the Chinese Academy of Sciences [KGCX2-YW-131]
Strategic Priority Research Program of the Chinese Academy of Sciences [XDA06010500]


Abstract

DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As datasets have grown extremely large, parallel processing of complex data analysis such as DBSCAN has become indispensable. However, existing parallel DBSCAN algorithms have three major drawbacks. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, their scalability is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized, so there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. We evaluate MR-DBSCAN on real large datasets with up to 1.2 billion points. The experimental results confirm the efficiency and scalability of MR-DBSCAN.
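
The idea of partitioning by estimated computation cost rather than by point count can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not give the cost model, so the grid cells, the `count**2` cost proxy, and the function names `cell_costs`, `split_balanced`, and `partition` are all assumptions made for illustration. The sketch also ignores boundary handling and the merging of local clusters, which any partitioned DBSCAN would need.

```python
from collections import Counter


def cell_costs(points, cell):
    """Bucket 2D points into square grid cells of side `cell`.

    Assumption: count**2 stands in for the estimated work of a dense cell;
    this proxy is hypothetical, not the cost model from the paper.
    """
    counts = Counter((int(x // cell), int(y // cell)) for x, y in points)
    return {c: n * n for c, n in counts.items()}


def split_balanced(costs):
    """Cut a set of cells in two along x or y so both halves cost about the same.

    Returns (low_half, high_half), or None when the set cannot be split.
    """
    total = sum(costs.values())
    best = None
    for axis in (0, 1):
        coords = sorted({c[axis] for c in costs})
        # Skipping the smallest coordinate guarantees both halves are non-empty.
        for cut in coords[1:]:
            left = sum(v for c, v in costs.items() if c[axis] < cut)
            diff = abs(left - (total - left))
            if best is None or diff < best[0]:
                best = (diff, axis, cut)
    if best is None:
        return None
    _, axis, cut = best
    low = {c: v for c, v in costs.items() if c[axis] < cut}
    high = {c: v for c, v in costs.items() if c[axis] >= cut}
    return low, high


def partition(costs, n_parts):
    """Repeatedly split the most expensive splittable partition."""
    parts = [costs]
    while len(parts) < n_parts:
        parts.sort(key=lambda p: sum(p.values()), reverse=True)
        for i, part in enumerate(parts):
            halves = split_balanced(part)
            if halves is not None:
                parts[i:i + 1] = list(halves)
                break
        else:
            break  # nothing left that can be split
    return parts
```

With skewed input, a dense region ends up in small partitions of its own while sparse regions are grouped into larger ones, which is the load-balancing effect the abstract describes. Each resulting partition could then be handed to one parallel task.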
