Article

High performance genetic algorithm based text clustering using parts of speech and outlier elimination

APPLIED INTELLIGENCE (2013)

Journal

APPLIED INTELLIGENCE

Volume 38, Issue 4, Pages 511-519

Publisher

SPRINGER

DOI: 10.1007/s10489-012-0382-8

Keywords

Text clustering; Unsupervised categorization; Genetic algorithm; Parts of speech; Outliers; Similarity measurement

Funding

National Natural Science Foundation [60970107, 61073150]
National Incubation Center

Abstract

Among typical clustering methods, the K-means algorithm plays the most important role because of its simplicity and efficiency. However, it is sensitive to the choice of initial points and prone to falling into local optima. To avoid this flaw, this paper presents a patented text clustering algorithm, Clustering by Genetic Algorithm Model (CGAM). CGAM constructs the fitness function of a genetic algorithm (GA) and a convergence criterion for the K-means algorithm, because a GA simulates the natural evolutionary process and can deal with a larger search space. To handle the rich semantics of Chinese texts, CGAM introduces a new method for selecting the initial centers of the GA and accounts for the contribution of features from different parts of speech. Moreover, the impact of outliers is addressed. Its performance is demonstrated by a series of experiments on both Reuters-21578 and a Chinese text corpus. Experimental results show that CGAM achieves better clustering results than other GA-based K-means algorithms, and that it has been successfully applied to a national business intelligence system program handling a huge set of content in both Chinese and English.
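
The abstract does not give CGAM's fitness function, initial-center selection, part-of-speech weighting, or outlier treatment, so none of these are reproduced here. As a rough illustration of the general idea only (a GA searching over sets of K-means centers to reduce sensitivity to initialization), a minimal sketch might look as follows. The function names, the use of negative within-cluster squared error as the fitness, tournament selection, one-point crossover, and all parameter values are assumptions made for this example and are not taken from the paper.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def sse(points, centers):
    """Within-cluster sum of squared distances (assumed fitness; lower is better)."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def kmeans_step(points, centers):
    """One K-means update: move each center to the mean of its members."""
    labels = assign(points, centers)
    updated = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            updated.append(old)  # an empty cluster keeps its previous center
    return updated

def ga_kmeans(points, k, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Hypothetical GA over center sets; requires pop_size >= 3 and len(points) >= k."""
    rng = random.Random(seed)
    # Each chromosome is a list of k centers drawn from the data.
    population = [rng.sample(points, k) for _ in range(pop_size)]
    best = min(population, key=lambda c: sse(points, c))
    for _ in range(generations):
        children = [best]  # elitism: the best chromosome survives unchanged
        while len(children) < pop_size:
            # Tournament selection: best of three random chromosomes, twice.
            p1 = min(rng.sample(population, 3), key=lambda c: sse(points, c))
            p2 = min(rng.sample(population, 3), key=lambda c: sse(points, c))
            # One-point crossover on the list of centers.
            cut = rng.randrange(1, k) if k > 1 else 0
            child = list(p1[:cut]) + list(p2[cut:])
            # Mutation: occasionally replace a center with a random data point.
            child = [rng.choice(points) if rng.random() < mutation_rate else c
                     for c in child]
            # Local refinement: one K-means step per offspring.
            children.append(kmeans_step(points, child))
        population = children
        best = min(population, key=lambda c: sse(points, c))
    return best, assign(points, best)

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, labels = ga_kmeans(data, k=2)
    print(centers, labels)
```

In this sketch the population explores many initializations at once, and elitism guarantees that the best squared error never gets worse from one generation to the next. Text documents would first need to be turned into numeric vectors (for example term weights), which is outside the scope of the example.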
