☆ 4.4 Article Proceedings Paper

MARIGOLD: Efficient k-means Clustering in High Dimensions

PROCEEDINGS OF THE VLDB ENDOWMENT (2023)

Journal

PROCEEDINGS OF THE VLDB ENDOWMENT

Volume 16, Issue 7, Pages 1740-1748

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.14778/3587136.3587147

Keywords

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

This paper proposes Marigold, a scalable algorithm for k-means clustering in high dimensions, which reduces the need for repeatedly computing Euclidean distances by using a tight distance-bounding scheme, a stepwise calculation over a multiresolution transform, and exploiting the triangle inequality. Experimental results show that Marigold achieves a significant improvement in clustering high-dimensional data.

How can we efficiently and scalably cluster high-dimensional data? The k-means algorithm clusters data by iteratively reducing intra-cluster Euclidean distances until convergence. While it finds applications from recommendation engines to image segmentation, its application to high-dimensional data is hindered by the need to repeatedly compute Euclidean distances among points and centroids. In this paper, we propose Marigold (k-means for high-dimensional data), a scalable algorithm for k-means clustering in high dimensions. Marigold prunes distance calculations by means of (i) a tight distance-bounding scheme; (ii) a stepwise calculation over a multiresolution transform; and (iii) exploiting the triangle inequality. To our knowledge, such an arsenal of pruning techniques has not been hitherto applied to k-means. Our work is motivated by time-critical Angle-Resolved Photoemission Spectroscopy (ARPES) experiments, where it is vital to detect clusters among high-dimensional spectra in real time. In a thorough experimental study with real-world data sets we demonstrate that Marigold efficiently clusters high-dimensional data, achieving approximately one order of magnitude improvement over prior art.

MARIGOLD: Efficient k-means Clustering in High Dimensions

Journal

PROCEEDINGS OF THE VLDB ENDOWMENT

Publisher

ASSOC COMPUTING MACHINERY

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

MARIGOLD: Efficient k-means Clustering in High Dimensions

Journal

PROCEEDINGS OF THE VLDB ENDOWMENT

Publisher

ASSOC COMPUTING MACHINERY

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper