
Benchmarking distance-based partitioning methods for mixed-type data

ADVANCES IN DATA ANALYSIS AND CLASSIFICATION (2023)

Journal

ADVANCES IN DATA ANALYSIS AND CLASSIFICATION

Volume 17, Issue 3, Pages 701-724

Publisher

SPRINGER HEIDELBERG

DOI: 10.1007/s11634-022-00521-7

Keywords

Cluster benchmarking; Partitioning; Mixed-type data; Heterogeneous data; K-Means

Automated Summary

This paper investigates the choice of clustering methods for mixed-type data and compares the performance of eight distance-based partitioning methods through a series of simulation experiments. The study finds that the amount of cluster overlap, the percentage of categorical variables, the number of clusters, and the number of observations have significant effects on cluster recovery.

Abstract

Clustering mixed-type data, that is, observation-by-variable data consisting of both continuous and categorical variables, poses novel challenges. Foremost among these challenges is the choice of the most appropriate clustering method for the data. This paper presents a benchmarking study comparing eight distance-based partitioning methods for mixed-type data in terms of cluster recovery performance. A series of simulations, carried out under a full factorial design, examined the effect of a variety of factors on cluster recovery. The amount of cluster overlap, the percentage of categorical variables in the data set, the number of clusters, and the number of observations had the largest effects on cluster recovery. In most of the tested scenarios, KAMILA, K-Prototypes, and sequential Factor Analysis and K-Means clustering typically performed better than the other methods. The study can serve as a useful reference for practitioners choosing the most appropriate method.
