☆ 4.3 Article

Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING (2016)

Journal

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING

Volume 15, Issue 3, Pages -

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.1145/2812809

Keywords

Information retrieval; K-means; bisect K-means; Arabic stemmers; similarity measures

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Abstract

In this article, I have investigated the performance of the bisect K-means clustering algorithm compared to the standard K-means algorithm in the analysis of Arabic documents. The experiments included five commonly used similarity and distance functions (Pearson correlation coefficient, cosine, Jaccard coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) and three leading stemmers. Using the purity measure, the bisect K-means clearly outperformed the standard K-means in all settings with varying margins. For the bisect K-means, the best purity reached 0.927 when using the Pearson correlation coefficient function, while for the standard K-means, the best purity reached 0.884 when using the Jaccard coefficient function. Removing stop words significantly improved the results of the bisect K-means but produced minor improvements in the results of the standard K-means. Stemming provided additional minor improvement in all settings except the combination of the averaged Kullback-Leibler divergence function and the root-based stemmer, where the purity was deteriorated by more than 10%. These experiments were conducted using a dataset with nine categories, each of which contains 300 documents.

Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents

Journal

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING

Publisher

ASSOC COMPUTING MACHINERY

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents

Journal

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING

Publisher

ASSOC COMPUTING MACHINERY

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper