Article

On ontology-driven document clustering using core semantic features

KNOWLEDGE AND INFORMATION SYSTEMS (2011)

Volume 28, Issue 2, Pages 395-421

Publisher

SPRINGER LONDON LTD

DOI: 10.1007/s10115-010-0370-4

Keywords

Clustering; Information gain; Semantic features; Ontology; Dimensionality reduction

Categories

Computer Science, Artificial Intelligence; Computer Science, Information Systems

Abstract

Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
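
The feature-reduction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the paper measures information gain from disambiguating polysemous and synonymous nouns using an ontology, whereas this sketch only scores plain terms by the standard information gain against existing cluster labels and keeps the top fraction. The function names, the `keep_fraction` parameter, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(has_term, cluster_labels):
    """Entropy reduction over cluster labels from knowing whether a term occurs."""
    base = entropy(cluster_labels)
    total = len(cluster_labels)
    remainder = 0.0
    for value in (True, False):
        subset = [c for t, c in zip(has_term, cluster_labels) if t == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder


def top_terms(doc_terms, cluster_labels, keep_fraction=0.1):
    """Rank terms by information gain and keep roughly the top fraction.

    doc_terms is a list of term sets, one per document (an assumed input
    format; the paper would restrict these to selected nouns).
    """
    vocab = sorted({t for terms in doc_terms for t in terms})
    scored = [
        (information_gain([t in terms for terms in doc_terms], cluster_labels), t)
        for t in vocab
    ]
    # Highest gain first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    keep = max(1, round(len(vocab) * keep_fraction))
    return [t for _, t in scored[:keep]]


# Toy corpus: four documents already assigned to two clusters.
docs = [{"bank", "river"}, {"bank", "loan"}, {"river", "water"}, {"loan", "money"}]
labels = [0, 1, 0, 1]
selected = top_terms(docs, labels, keep_fraction=0.4)
```

In this toy corpus the ambiguous term "bank" appears in both clusters and carries no information gain, while "river" and "loan" separate the clusters perfectly, which loosely mirrors why disambiguating such nouns could matter for clustering.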

Recommended

Article Computer Science, Artificial Intelligence

Semantic decision Trees: A new learning system for the ID3-Based algorithm using a knowledge base

Sirichanya Chanmee, Kraisak Kesorn

Summary: This study introduces a new approach called the Semantic Decision Tree (SDT) to effectively address the multi-value bias selection issue and improve the generation of decision tree nodes. Evaluation results on multiple datasets show that SDT outperforms traditional algorithms in terms of accuracy and aligns more naturally with human decision-making logic.

ADVANCED ENGINEERING INFORMATICS (2023)

Article Engineering, Environmental

Identifying key features in reactive flows: A tutorial on combining dimensionality reduction, unsupervised clustering, and feature correlation

Marc Rovira, Klas Engvall, Christophe Duwig

Summary: This study examines the capabilities of a data-driven workflow for automated key feature identification in reactive flows. The proposed workflow aims to accelerate the analysis of chemical engineering datasets by generating automatic and explainable classification results for regions with distinct physics. The three main steps of the workflow, namely dimensionality reduction, unsupervised clustering, and feature correlation, are discussed. The study demonstrates the theoretical and practical differences between the previous and current algorithms used in the workflow. The updated workflow is shown to have faster, more accurate, and more robust key feature identification capabilities, closer to human intuition than previous methods. The study also serves as a tutorial for researchers interested in applying these algorithms.

CHEMICAL ENGINEERING JOURNAL (2022)

Article Computer Science, Artificial Intelligence

Toward Projected Clustering With Aggregated Mapping

Hongyuan Zhang, Yanan Zhu, Xuelong Li

Summary: This study proposes a novel projected clustering framework to capture the essence of deep clustering by summarizing the core properties of powerful models, especially deep models. The framework introduces an aggregated mapping, consisting of projection learning and neighbor estimation, to obtain clustering-friendly representation. The study also addresses the problem of severe degeneration in simple clustering-friendly representation learning, and develops a self-evolution mechanism to alleviate the risk of over-fitting.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2023)

Article Mathematics, Interdisciplinary Applications

Phases and Their Transitions Characterizing the Dynamics of Global Terrorism: A Multidimensional Scaling and Visualization Approach

Antonio M. Lopes

Summary: This paper proposes an unsupervised machine learning technique to analyze global terrorism dynamics, identifying phases and phase transitions. The study uses a dataset of worldwide terrorist incidents from 1970 to 2019 to generate multidimensional time-series representing casualties and events. The time-series are sliced and the resulting segments are characterized as objects that capture the system dynamics. These objects are compared and categorized using multidimensional scaling (MDS), generating portraits that illustrate the patterns and nature of the dynamics. The results demonstrate the effectiveness of MDS in analyzing global terrorism and its potential for studying other complex systems.

INTERNATIONAL JOURNAL OF BIFURCATION AND CHAOS (2023)

Article Mathematics, Interdisciplinary Applications

Multidimensional scaling and visualization of patterns in global large-scale accidents

Antonio M. Lopes, J. A. Tenreiro Machado

Summary: This paper proposes an approach based on unsupervised machine learning to identify phases and phase transitions in complex systems. By generating multidimensional time-series and analyzing them using multidimensional scaling technique, the study finds that this method is relevant for modeling the behavior of complex systems.

CHAOS SOLITONS & FRACTALS (2022)

Article Computer Science, Artificial Intelligence

Text mining using nonnegative matrix factorization and latent semantic analysis

Ali Hassani, Amir Iranmanesh, Najme Mansouri

Summary: This study introduces a new feature agglomeration method based on nonnegative matrix factorization and proposes a deterministic initialization method for spherical K-means algorithm, which significantly improves the stability and performance of text data clustering.

NEURAL COMPUTING & APPLICATIONS (2021)

Article Chemistry, Multidisciplinary

Unsupervised Deep Embedded Clustering for High-Dimensional Visual Features of Fashion Images

Umar Subhan Malhi, Junfeng Zhou, Cairong Yan, Abdur Rasool, Shahbaz Siddeeq, Ming Du

Summary: This paper proposes a fashion image clustering method based on deep clustering, which uses convolutional neural networks to generate high-dimensional feature vectors and then reduces dimensions through auto-encoders before performing clustering. By jointly learning and optimizing the dimensionality reduction process and the clustering task, the proposed method achieves state-of-the-art performance.

APPLIED SCIENCES-BASEL (2023)

Article Green & Sustainable Science & Technology

Optimized time reduction models applied to power and energy systems planning - Comparison with existing methods

Remy Rigo-Mariani

Summary: The paper proposes a strategy for reducing time horizon in power and energy studies. The proposed method displays smaller errors, is more scalable, and has less impact on system operation compared to conventional approaches.

RENEWABLE & SUSTAINABLE ENERGY REVIEWS (2022)

Article Computer Science, Artificial Intelligence

Learning robust graph for clustering

Zheng Liu, Wei Jin, Ying Mu

Summary: This paper introduces a unified framework that incorporates robust graph learning and dimensionality reduction, as well as clustering task. Two robust graph methods based on Euclidean distance and self-expressiveness are proposed, which are informative, robust, and sparse. Extensive experiments demonstrate their advantages in the task of clustering.

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS (2022)

Article Environmental Sciences

Unsupervised band selection of hyperspectral data based on mutual information derived from weighted cluster entropy for snow classification

Divyesh Varade, Ajay K. Maurya, Onkar Dikshit

Summary: Information on snow cover distribution is important in hydrological processes and climate models. Hyperspectral remote sensing provides opportunities in land cover assessment, but is limited in snow-covered alpine regions due to large dimensionality. A band selection technique based on mutual information is proposed to improve efficiency and accuracy in selecting informative bands.

GEOCARTO INTERNATIONAL (2021)

Article Computer Science, Artificial Intelligence

Feature Selection for Classification using Principal Component Analysis and Information Gain

Erick Odhiambo Omuya, George Onyango Okeyo, Michael Waema Kimwele

Summary: This study investigates the application of feature selection and classification in various fields, addressing the challenges of high dimensionality in datasets and the negative impact of irrelevant and redundant attributes on classification algorithms. To improve classification performance, a hybrid filter model based on principal component analysis and information gain is proposed and applied to machine learning techniques, demonstrating enhanced accuracy, precision, and recall.

EXPERT SYSTEMS WITH APPLICATIONS (2021)

Article Computer Science, Software Engineering

An appearance-driven space to create new BRDFs

Mislene da Silva Nunes, Gastao Florencio Miranda Junior, Beatriz Trinchao Andrade

Summary: The search for realism in renderings has led to an increased interest in tabular BRDF samples captured from real-world materials. This study proposes an approach to generate new BRDFs based on user-selected materials from a database, creating an appearance-driven space using dimensionality reduction and clustering techniques.

COMPUTERS & GRAPHICS-UK (2022)

Article Automation & Control Systems

Validating Clustering Frameworks for Electric Load Demand Profiles

Mayank Jain, Tarek AlSkaif, Soumyabrata Dev

Summary: This article introduces a novel scheme to objectively validate and compare the clustering results of residential electric demand profiles, considering all steps prior to the clustering algorithm. Compared to traditional clustering validity indices, the proposed scheme provides better, unbiased, and uniform recommendations.

IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS (2021)

Article Agriculture, Multidisciplinary

Potential of functional analysis applied to Sentinel-2 time-series to assess relevant agronomic parameters at the within-field level in viticulture

Sergio Velez, Florian Rancon, Enrique Barajas, Guilhem Brunel, Jose Antonio Rubio, Bruno Tisseyre

Summary: This study utilizes Sentinel-2 satellite imagery to extract relevant information from two vineyards in Spain. By employing dimensionality reduction techniques such as Principal Component Analysis (PCA) and Partial Least Square (PLS), the NDVI time-series are decomposed into multiple functional components. The results demonstrate the added value of considering the entire time-series compared to a single image, and establish correlations with seasonal phenology and management practices in the vineyards.

COMPUTERS AND ELECTRONICS IN AGRICULTURE (2022)

Article Engineering, Electrical & Electronic

Self-Supervised Symmetric Nonnegative Matrix Factorization

Yuheng Jia, Hui Liu, Junhui Hou, Sam Kwong, Qingfu Zhang

Summary: This paper introduces a self-supervised symmetric nonnegative matrix factorization (SNMF) method to improve data clustering performance. By exploiting the sensitivity to initialization of SNMF, without relying on additional information, the method progressively enhances clustering results. Experimental results demonstrate its superiority over 14 state-of-the-art methods in terms of multiple quantitative metrics.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2022)

Article Ecology

Ecological prediction at macroscales using big data: Does sampling design matter?

Patricia A. Soranno, Kendra Spence Cheruvelil, Boyang Liu, Qi Wang, Pang-Ning Tan, Jiayu Zhou, Katelyn B. S. King, Ian M. McCullough, Jemma Stachelek, Meridith Bartley, Christopher T. Filstrup, Ephraim M. Hanks, Jean-Francois Lapierre, Noah R. Lottig, Erin M. Schliep, Tyler Wagner, Katherine E. Webster

ECOLOGICAL APPLICATIONS (2020)

Article Health Care Sciences & Services

Improving Heart Disease Risk Through Quality-Focused Diet Logging: Pre-Post Study of a Diet Quality Tracking App

Bum Chul Kwon, Courtland VanDam, Stephanie E. Chiuve, Hyung Wook Choi, Paul Entler, Pang-Ning Tan, Jina Huh-Yoo

JMIR MHEALTH AND UHEALTH (2020)

Article Computer Science, Artificial Intelligence

Spatio-Temporal Multi-Task Learning via Tensor Decomposition

Jianpeng Xu, Jiayu Zhou, Pang-Ning Tan, Xi Liu, Lifeng Luo

Summary: Predictive modeling of large-scale spatio-temporal data is a challenging problem that requires training models to predict target variables at multiple locations while preserving spatial and temporal dependencies. This paper explores the effectiveness of using supervised tensor decomposition for multi-task learning in spatio-temporal prediction. The proposed framework, SMART, encodes data as a third-order tensor and trains ensemble models based on interpretable latent factors extracted from the data to make predictions on test instances, incorporating known patterns as constraints.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2021)

Article Public, Environmental & Occupational Health

Using Machine Learning to Compare Provaccine and Antivaccine Discourse Among the Public on Social Media: Algorithm Development Study

Young Anna Argyris, Kafui Monu, Pang-Ning Tan, Colton Aarts, Fan Jiang, Kaleigh Anne Wiseley

Summary: The study compared the discursive topics chosen by pro- and antivaccine advocates in influencing the public, finding that antivaccine topics have greater intertopic distinctiveness and use all four message frames, while provaccine advocates have neglected having a clear problem statement.

JMIR PUBLIC HEALTH AND SURVEILLANCE (2021)

Proceedings Paper Computer Science, Artificial Intelligence

DeepGPD: A Deep Learning Approach for Modeling Geospatio-Temporal Extreme Events

Tyler Wilson, Pang-Ning Tan, Lifeng Luo

Summary: This paper presents a deep learning framework for long-term prediction of the distribution of extreme values at different locations and addresses the computational challenges associated with large-scale geospatio-temporal data.

THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Unsupervised Anomaly Detection by Robust Density Estimation

Boyang Liu, Pang-Ning Tan, Jiayu Zhou

Summary: This paper proposes a robust deep density estimation framework for unsupervised anomaly detection, which improves the performance by discarding data points with low estimated densities and applying Lipschitz regularization.

THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE (2022)

Proceedings Paper Computer Science, Artificial Intelligence

FACS-GCN: Fairness-Aware Cost-Sensitive Boosting of Graph Convolutional Networks

Francisco Santos, Junke Ye, Farzan Masrour, Pang-Ning Tan, Abdol-Hossein Esfahanian

Summary: Graph neural networks (GNNs) are widely used for modeling graph data by integrating node attributes and link information into concise representations. However, node classification using GNNs faces challenges such as imbalanced class distribution and the bias caused by the homophily effect. To address these challenges, we propose a novel framework called Fairness-Aware Cost Sensitive Graph Convolutional Network (FACS-GCN) that combines a cost-sensitive exponential loss and adversarial learning to achieve fair classification.

2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) (2022)

Proceedings Paper Computer Science, Interdisciplinary Applications

Using Deep Learning to Identify Linguistic Features that Facilitate or Inhibit the Propagation of Anti- and Pro-Vaccine Content on Social Media

Young Anna Argyris, Nan Zhang, Bidhan Bashyal, Pang-Ning Tan

Summary: This study aims to investigate the linguistic features of vaccine-related content and their impact on propagation, identifying two sets of features that either facilitate or inhibit the spread of vaccine-related tweets. Results show that anti-vaccine tweets tend to be propagated through retweets, while pro-vaccine tweets mainly receive passive endorsements.

2022 IEEE INTERNATIONAL CONFERENCE ON DIGITAL HEALTH (IEEE ICDH 2022) (2022)

Proceedings Paper Computer Science, Artificial Intelligence

JOHAN: A Joint Online Hurricane Trajectory and Intensity Forecasting Framework

Ding Wang, Pang-Ning Tan

Summary: This paper introduces a novel online learning framework called JOHAN, which simultaneously predicts the trajectory and intensity of hurricanes, generates accurate forecasts of hurricane intensity categories, and uses exponentially-weighted quantile loss functions to improve prediction accuracy for high category hurricanes approaching landfall. Experimental results show the superiority of JOHAN over several state-of-the-art learning approaches using real-world hurricane data.

KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING (2021)

Article Computer Science, Information Systems

Automated Analysis of the US Drought Monitor Maps With Machine Learning and Multiple Drought Indicators

Pouyan Hatami Bahman Beiglou, Lifeng Luo, Pang-Ning Tan, Lisi Pei

Summary: The US Drought Monitor is a vital tool for real-time drought monitoring, but its production involves human judgment, making it difficult for others to reproduce the maps. This study developed a framework using machine learning to automatically generate similar maps, with the support vector machines algorithm and specific data group achieving near-perfect reproduction accuracy.

FRONTIERS IN BIG DATA (2021)

Article Communication

Trick or Drink: Offline and Social Media Hierarchical Normative Influences on Halloween Celebration Drinking

Saleem Alhabash, Duygu Kanver, Chen Lou, Sandi W. Smith, Pang-Ning Tan

Summary: The study found that underage youth's perception of societal and personal celebration drinking norms were related to their close friends' drinking norms, which influenced their alcohol consumption during Halloween. Additionally, social media posting and interaction with alcohol-related content were associated with greater descriptive normative perceptions and self-reported drinking.

HEALTH COMMUNICATION (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Fairness Perception from a Network-Centric Perspective

Farzan Masrour, Pang-Ning Tan, Abdol-Hossein Esfahanian

20TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2020) (2020)

Proceedings Paper Computer Science, Artificial Intelligence

Convolutional Methods for Predictive Modeling of Geospatial Data

Tyler Wilson, Pang-Ning Tan, Lifeng Luo

PROCEEDINGS OF THE 2020 SIAM INTERNATIONAL CONFERENCE ON DATA MINING (SDM) (2020)

Article Limnology

Increasing accuracy of lake nutrient predictions in thousands of lakes by leveraging water clarity data

Tyler Wagner, Noah R. Lottig, Meridith L. Bartley, Ephraim M. Hanks, Erin M. Schliep, Nathan B. Wikle, Katelyn B. S. King, Ian McCullough, Jemma Stachelek, Kendra S. Cheruvelil, Christopher T. Filstrup, Jean Francois Lapierre, Boyang Liu, Patricia A. Soranno, Pang-Ning Tan, Qi Wang, Katherine Webster, Jiayu Zhou

LIMNOLOGY AND OCEANOGRAPHY LETTERS (2020)

Article Communication

Celebration Drinking around the Clock

Sandi W. Smith, Saleem Alhabash, Duygu Kanver, Pang-Ning Tan, Greg Viken

HEALTH COMMUNICATION (2020)
