Article

High performance genetic algorithm based text clustering using parts of speech and outlier elimination

Journal

APPLIED INTELLIGENCE
Volume 38, Issue 4, Pages 511-519

Publisher

SPRINGER
DOI: 10.1007/s10489-012-0382-8

Keywords

Text clustering; Unsupervised categorization; Genetic algorithm; Parts of speech; Outliers; Similarity measurement

Funding

  1. National Natural Science Foundation [60970107, 61073150]
  2. National Incubation Center

Among the typical clustering methods, the K-means algorithm plays the most important role because of its simplicity and efficiency. However, it is sensitive to the choice of initial points and prone to falling into local optima. To avoid this flaw, this paper presents a patented text clustering algorithm, Clustering by Genetic Algorithm Model (CGAM). CGAM constructs the fitness function of the genetic algorithm (GA) and the convergence criterion for the K-means algorithm, because GA simulates the natural evolutionary process and deals with a larger search space. To tackle the rich semantics of Chinese texts, CGAM introduces an innovative method for selecting the initial centers of the GA and accommodates the contribution of characteristics of different parts of speech. Moreover, the impact of outliers is addressed and treated. Its performance is demonstrated by a series of experiments on both Reuters-21578 and a Chinese text corpus. Experimental results show that CGAM achieves better clustering results than other GA-based K-means algorithms, and that it has been successfully applied to a national program for a business intelligence system involving a huge set of contents in both Chinese and English.
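
The abstract does not state CGAM's actual fitness function, part-of-speech weights, or outlier rule, so the following is only a minimal sketch of the general idea: a genetic algorithm searches over sets of K-means centers, with part-of-speech-weighted term vectors as input and a simple outlier filter applied first. Every name and parameter here (`pos_weighted_vector`, `drop_outliers`, `ga_kmeans`, the weights, the `factor` threshold, the use of within-cluster squared error as fitness) is an assumption made for illustration, not the authors' method.

```python
import random

def pos_weighted_vector(tagged_terms, vocab, pos_weights):
    """Term vector in which each occurrence counts by a (hypothetical) POS weight."""
    index = {term: i for i, term in enumerate(vocab)}
    vec = [0.0] * len(vocab)
    for term, pos in tagged_terms:
        if term in index:
            vec[index[term]] += pos_weights.get(pos, 1.0)
    return tuple(vec)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def sse(points, centers):
    """Within-cluster sum of squared errors; used here as the fitness (lower is better)."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def kmeans_step(points, centers):
    """One K-means update; an empty cluster keeps its old center."""
    labels = assign(points, centers)
    updated = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            updated.append(old)
    return updated

def drop_outliers(points, factor=2.0):
    """Assumed rule: drop points farther than factor * mean distance from the global centroid."""
    centroid = tuple(sum(col) / len(points) for col in zip(*points))
    dists = [sq_dist(p, centroid) ** 0.5 for p in points]
    limit = factor * sum(dists) / len(dists)
    return [p for p, d in zip(points, dists) if d <= limit]

def ga_kmeans(points, k, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Evolve candidate center sets; each chromosome is a list of k centers."""
    rng = random.Random(seed)
    population = [rng.sample(points, k) for _ in range(pop_size)]
    for _ in range(generations):
        # Refine every chromosome with one K-means step, then rank by fitness.
        population = [kmeans_step(points, c) for c in population]
        population.sort(key=lambda c: sse(points, c))
        survivors = population[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randint(1, k - 1) if k > 1 else 0
            child = a[:cut] + b[cut:]          # single-point crossover
            if rng.random() < mutation_rate:   # mutation: reset one center to a data point
                child[rng.randrange(k)] = rng.choice(points)
            children.append(child)
        population = survivors + children
    best = min(population, key=lambda c: sse(points, c))
    return best, assign(points, best)
```

In this sketch, documents would first be turned into vectors with `pos_weighted_vector`, filtered with `drop_outliers`, and then clustered with `ga_kmeans`. Keeping the better half of each generation unchanged is one simple way to avoid losing a good set of centers; other selection schemes would work as well.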
