Article

On ontology-driven document clustering using core semantic features

Journal

KNOWLEDGE AND INFORMATION SYSTEMS
Volume 28, Issue 2, Pages 395-421

Publisher

SPRINGER LONDON LTD
DOI: 10.1007/s10115-010-0370-4

Keywords

Clustering; Information gain; Semantic features; Ontology; Dimensionality reduction

Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
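The abstract does not spell out how information gain is measured in the unsupervised setting, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each document is already reduced to a set of nouns and that provisional cluster labels (for example, from an initial clustering pass) stand in for class labels; the function names and the `keep_fraction` parameter are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of cluster labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Reduction in label entropy from knowing whether `term` occurs.

    `docs` is a list of sets of nouns; `labels` holds one provisional
    cluster label per document (an assumption of this sketch).
    """
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    total = len(labels)
    conditional = ((len(with_term) / total) * entropy(with_term)
                   + (len(without_term) / total) * entropy(without_term))
    return entropy(labels) - conditional


def select_core_features(docs, labels, keep_fraction=0.1):
    """Keep the top `keep_fraction` of terms ranked by information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


# Toy corpus: "bank" appears in both themes, "river" in only one.
docs = [{"bank", "loan"}, {"bank", "money"},
        {"river", "bank"}, {"river", "water"}]
labels = [0, 0, 1, 1]
core = select_core_features(docs, labels)
```

In this toy corpus the ambiguous noun "bank" carries little information about the clusters, while "river" separates them completely, which illustrates why a small, well-chosen feature subset can preserve cluster structure.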

Recommended

Article Ecology

Ecological prediction at macroscales using big data: Does sampling design matter?

Patricia A. Soranno, Kendra Spence Cheruvelil, Boyang Liu, Qi Wang, Pang-Ning Tan, Jiayu Zhou, Katelyn B. S. King, Ian M. McCullough, Jemma Stachelek, Meridith Bartley, Christopher T. Filstrup, Ephraim M. Hanks, Jean-Francois Lapierre, Noah R. Lottig, Erin M. Schliep, Tyler Wagner, Katherine E. Webster

ECOLOGICAL APPLICATIONS (2020)

Article Health Care Sciences & Services

Improving Heart Disease Risk Through Quality-Focused Diet Logging: Pre-Post Study of a Diet Quality Tracking App

Bum Chul Kwon, Courtland VanDam, Stephanie E. Chiuve, Hyung Wook Choi, Paul Entler, Pang-Ning Tan, Jina Huh-Yoo

JMIR MHEALTH AND UHEALTH (2020)

Article Computer Science, Artificial Intelligence

Spatio-Temporal Multi-Task Learning via Tensor Decomposition

Jianpeng Xu, Jiayu Zhou, Pang-Ning Tan, Xi Liu, Lifeng Luo

Summary: Predictive modeling of large-scale spatio-temporal data is a challenging problem that requires training models to predict target variables at multiple locations while preserving spatial and temporal dependencies. This paper explores the effectiveness of using supervised tensor decomposition for multi-task learning in spatio-temporal prediction. The proposed framework, SMART, encodes data as a third-order tensor and trains ensemble models based on interpretable latent factors extracted from the data to make predictions on test instances, incorporating known patterns as constraints.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2021)

Article Public, Environmental & Occupational Health

Using Machine Learning to Compare Provaccine and Antivaccine Discourse Among the Public on Social Media: Algorithm Development Study

Young Anna Argyris, Kafui Monu, Pang-Ning Tan, Colton Aarts, Fan Jiang, Kaleigh Anne Wiseley

Summary: The study compared the discursive topics chosen by pro- and antivaccine advocates in influencing the public, finding that antivaccine topics have greater intertopic distinctiveness and use all four message frames, while provaccine advocates have neglected having a clear problem statement.

JMIR PUBLIC HEALTH AND SURVEILLANCE (2021)

Proceedings Paper Computer Science, Artificial Intelligence

DeepGPD: A Deep Learning Approach for Modeling Geospatio-Temporal Extreme Events

Tyler Wilson, Pang-Ning Tan, Lifeng Luo

Summary: This paper presents a deep learning framework for long-term prediction of the distribution of extreme values at different locations and addresses the computational challenges associated with large-scale geospatio-temporal data.

THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Unsupervised Anomaly Detection by Robust Density Estimation

Boyang Liu, Pang-Ning Tan, Jiayu Zhou

Summary: This paper proposes a robust deep density estimation framework for unsupervised anomaly detection, which improves the performance by discarding data points with low estimated densities and applying Lipschitz regularization.

THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE (2022)

Proceedings Paper Computer Science, Artificial Intelligence

FACS-GCN: Fairness-Aware Cost-Sensitive Boosting of Graph Convolutional Networks

Francisco Santos, Junke Ye, Farzan Masrour, Pang-Ning Tan, Abdol-Hossein Esfahanian

Summary: Graph neural networks (GNNs) are widely used for modeling graph data by integrating node attributes and link information into concise representations. However, node classification using GNNs faces challenges such as imbalanced class distribution and the bias caused by the homophily effect. To address these challenges, we propose a novel framework called Fairness-Aware Cost Sensitive Graph Convolutional Network (FACS-GCN) that combines a cost-sensitive exponential loss and adversarial learning to achieve fair classification.

2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) (2022)

Proceedings Paper Computer Science, Interdisciplinary Applications

Using Deep Learning to Identify Linguistic Features that Facilitate or Inhibit the Propagation of Anti- and Pro-Vaccine Content on Social Media

Young Anna Argyris, Nan Zhang, Bidhan Bashyal, Pang-Ning Tan

Summary: This study aims to investigate the linguistic features of vaccine-related content and their impact on propagation, identifying two sets of features that either facilitate or inhibit the spread of vaccine-related tweets. Results show that anti-vaccine tweets tend to be propagated through retweets, while pro-vaccine tweets mainly receive passive endorsements.

2022 IEEE INTERNATIONAL CONFERENCE ON DIGITAL HEALTH (IEEE ICDH 2022) (2022)

Proceedings Paper Computer Science, Artificial Intelligence

JOHAN: A Joint Online Hurricane Trajectory and Intensity Forecasting Framework

Ding Wang, Pang-Ning Tan

Summary: This paper introduces a novel online learning framework called JOHAN, which simultaneously predicts the trajectory and intensity of hurricanes, generates accurate forecasts of hurricane intensity categories, and uses exponentially-weighted quantile loss functions to improve prediction accuracy for high category hurricanes approaching landfall. Experimental results show the superiority of JOHAN over several state-of-the-art learning approaches using real-world hurricane data.

KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING (2021)

Article Computer Science, Information Systems

Automated Analysis of the US Drought Monitor Maps With Machine Learning and Multiple Drought Indicators

Pouyan Hatami Bahman Beiglou, Lifeng Luo, Pang-Ning Tan, Lisi Pei

Summary: The US Drought Monitor is a vital tool for real-time drought monitoring, but its production involves human judgment, making it difficult for others to reproduce the maps. This study developed a framework using machine learning to automatically generate similar maps, with the support vector machines algorithm and specific data group achieving near-perfect reproduction accuracy.

FRONTIERS IN BIG DATA (2021)

Article Communication

Trick or Drink: Offline and Social Media Hierarchical Normative Influences on Halloween Celebration Drinking

Saleem Alhabash, Duygu Kanver, Chen Lou, Sandi W. Smith, Pang-Ning Tan

Summary: The study found that underage youth's perception of societal and personal celebration drinking norms were related to their close friends' drinking norms, which influenced their alcohol consumption during Halloween. Additionally, social media posting and interaction with alcohol-related content were associated with greater descriptive normative perceptions and self-reported drinking.

HEALTH COMMUNICATION (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Fairness Perception from a Network-Centric Perspective

Farzan Masrour, Pang-Ning Tan, Abdol-Hossein Esfahanian

20TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2020) (2020)

Proceedings Paper Computer Science, Artificial Intelligence

Convolutional Methods for Predictive Modeling of Geospatial Data

Tyler Wilson, Pang-Ning Tan, Lifeng Luo

PROCEEDINGS OF THE 2020 SIAM INTERNATIONAL CONFERENCE ON DATA MINING (SDM) (2020)

Article Limnology

Increasing accuracy of lake nutrient predictions in thousands of lakes by leveraging water clarity data

Tyler Wagner, Noah R. Lottig, Meridith L. Bartley, Ephraim M. Hanks, Erin M. Schliep, Nathan B. Wikle, Katelyn B. S. King, Ian McCullough, Jemma Stachelek, Kendra S. Cheruvelil, Christopher T. Filstrup, Jean Francois Lapierre, Boyang Liu, Patricia A. Soranno, Pang-Ning Tan, Qi Wang, Katherine Webster, Jiayu Zhou

LIMNOLOGY AND OCEANOGRAPHY LETTERS (2020)

Article Communication

Celebration Drinking around the Clock

Sandi W. Smith, Saleem Alhabash, Duygu Kanver, Pang-Ning Tan, Greg Viken

HEALTH COMMUNICATION (2020)
