Article

Knowledge discovery through directed probabilistic topic models: a survey

FRONTIERS OF COMPUTER SCIENCE IN CHINA (2010)

Journal

FRONTIERS OF COMPUTER SCIENCE IN CHINA

Volume 4, Issue 2, Pages 280-301

Publisher

HIGHER EDUCATION PRESS

DOI: 10.1007/s11704-009-0062-y

Keywords

text corpora; Directed Probabilistic Topic Models (DPTMs); soft clustering; unsupervised learning; knowledge discovery

Categories

Computer Science, Information Systems; Computer Science, Software Engineering; Computer Science, Theory & Methods

Funding

National Natural Science Foundation of China [90604025, 60703059]
Chinese National Key Foundation Research and Development Plan [2007CB310803]
Higher Education Commission (HEC), Pakistan

Abstract

Graphical models have become the basic framework for topic-based probabilistic modeling. Models with latent variables, in particular, have proved effective at capturing hidden structure in data. In this paper, we survey an important subclass, Directed Probabilistic Topic Models (DPTMs), which have soft clustering abilities, and their applications to knowledge discovery in text corpora. From an unsupervised learning perspective, topics are semantically related probabilistic clusters of words in text corpora, and the process of finding these topics is called topic modeling. In topic modeling, a document consists of different hidden topics, and the topic probabilities provide an explicit representation of the document that smooths the data at the semantic level. Topic modeling has been an active area of research during the last decade. Many models have been proposed to handle text corpora with different characteristics, for applications such as document classification, hidden association finding, expert finding, community discovery and temporal trend analysis. We present the basic concepts, with advantages and disadvantages, in chronological order; a classification of existing models into categories; their parameter estimation and inference algorithms; and measures for evaluating model performance. We also discuss applications, open challenges and future directions in this dynamic area of research.
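
To make the idea of a document as a mixture of hidden topics concrete, here is a minimal sketch of collapsed Gibbs sampling for latent Dirichlet allocation (LDA), one well-known directed topic model. The abstract does not single out a specific model, so the choice of LDA, the function name lda_gibbs, the hyperparameter values and the toy corpus are all assumptions made for illustration, not the survey's own method or data.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents.

    Returns (theta, phi, vocab): theta[d][k] is the estimated probability of
    topic k in document d; phi[k][w] is the estimated probability of the
    word with vocabulary index w under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]       # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                     # total tokens assigned to topic k

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k_old = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k_old] -= 1
                topic_word[k_old][wid] -= 1
                topic_total[k_old] -= 1
                # Conditional p(z = k | all other assignments), up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][wid] += 1
                topic_total[k_new] += 1

    # Smoothed point estimates from the final counts.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + V * beta)
         for w in range(V)]
        for k in range(num_topics)
    ]
    return theta, phi, vocab


# Toy corpus, made up for illustration only.
docs = [
    "topic model latent variable inference".split(),
    "latent topic inference gibbs sampling".split(),
    "expert finding community discovery network".split(),
    "community network expert trend analysis".split(),
]
theta, phi, vocab = lda_gibbs(docs, num_topics=2)
for d, row in enumerate(theta):
    print(d, [round(p, 3) for p in row])
```

Each row of theta is a probability vector over topics, which is the soft clustering the abstract refers to: a document belongs to every topic to some degree rather than to exactly one. Each row of phi is a distribution over the vocabulary, so a topic can be read as a probabilistic cluster of words.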

Recommended

Article Computer Science, Artificial Intelligence

A topic modeled unsupervised approach to single document extractive text summarization

Ridam Srivastava, Prabhav Singh, K. P. S. Rana, Vineet Kumar

Summary: Automatic Text Summarization (ATS) is an essential field in natural language processing that helps condense large text documents for users to quickly assimilate information. This study proposed an unsupervised extractive summarization approach combining clustering with topic modeling, which outperformed similar recent works.

KNOWLEDGE-BASED SYSTEMS (2022)

Article Computer Science, Artificial Intelligence

RankSum-An unsupervised extractive text summarization based on rank fusion

Akanksha Joshi, Eduardo Fidalgo, Enrique Alegre, Rocio Alaiz-Rodriguez

Summary: This paper proposes an unsupervised text summarization approach called Ranksum, which utilizes four sentence features for ranking and fusion. Experimental results show that Ranksum outperforms other existing methods.

EXPERT SYSTEMS WITH APPLICATIONS (2022)

Article Computer Science, Information Systems

Unsupervised Graph-Based Tibetan Multi-Document Summarization

Xiaodong Yan, Yiqin Wang, Wei Song, Xiaobing Zhao, A. Run, Yang Yanxing

Summary: This paper proposes an unsupervised graph-based Tibetan multi-document summarization method that divides a large number of Tibetan news documents into topics and extracts the summarization of each topic. The experiment results show that our method can effectively improve the quality of summarization and our method is competitive to previous unsupervised methods.

CMC-COMPUTERS MATERIALS & CONTINUA (2022)

Article Computer Science, Artificial Intelligence

Hierarchical Bayesian text modeling for the unsupervised joint analysis of latent topics and semantic clusters

Gianni Costa, Riccardo Ortale

Summary: This manuscript proposes two innovative approaches for simultaneously conducting topic modeling and document clustering tasks. The effectiveness of these approaches is demonstrated through a comparative empirical evaluation, which also uncovers the underlying semantics of text collections.

INTERNATIONAL JOURNAL OF APPROXIMATE REASONING (2022)

Article Computer Science, Artificial Intelligence

Unsupervised neural networks for automatic Arabic text summarization using document clustering and topic modeling

Nabil Alami, Mohammed Meknassi, Noureddine En-nahnahi, Yassine El Adlouni, Ouafae Ammor

Summary: The paper discusses the challenges in Arabic text summarization and proposes a new approach utilizing document clustering, topic modeling, and unsupervised neural networks to build an efficient document representation model. Experimental results show that the proposed approach outperforms other Arabic text summarization methods, with significant improvements in summarization performance demonstrated particularly by ensemble learning models.

EXPERT SYSTEMS WITH APPLICATIONS (2021)

Article Transportation Science & Technology

Unsupervised hierarchical methodology of maritime traffic pattern extraction for knowledge discovery

Huanhuan Li, Jasmine Siu Lee Lam, Zaili Yang, Jingxian Liu, Ryan Wen Liu, Maohan Liang, Yan Li

Summary: This study develops an unsupervised methodology for feature extraction and knowledge discovery based on AIS data to support trajectory data mining and improve maritime traffic safety. The methodology includes trajectory compression, similarity measure, and trajectory clustering, effectively extracting vessel traffic behavior characteristics and navigation knowledge.

TRANSPORTATION RESEARCH PART C-EMERGING TECHNOLOGIES (2022)

Article Computer Science, Information Systems

Performance evaluation of text-mining models with Hindi stopwords lists

Ruby Rani, D. K. Lobiyal

Summary: This paper attempts to construct corpus specific stopwords lists for Hindi text documents using statistical and knowledge-based methods, and proposes an evaluation method to examine their behavior using text mining models.

JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES (2022)

Review Mathematical & Computational Biology

Unsupervised learning for medical data: A review of probabilistic factorization methods

Dorien Neijzen, Gerton Lunter

Summary: This article reviews popular unsupervised learning methods for analyzing high-dimensional data encountered in various fields. It shows that these methods, including principal component analysis, K-means clustering, nonnegative matrix factorization, and latent Dirichlet allocation, can be considered as probabilistic models based on low-rank matrix factorization. This formulation not only highlights their similarities but also clarifies the assumptions and restrictions of each method, making it easier for applied medical researchers to choose the appropriate method. The article also touches upon the important aspects of inference and model selection when applying these methods to health data.

STATISTICS IN MEDICINE (2023)

Article Computer Science, Artificial Intelligence

A multi-grained aspect vector learning model for unsupervised aspect identification

Jinglei Shi, Junjun Guo, Zhengtao Yu, Yan Xiang

Summary: In this study, we propose an unsupervised aspect identification model based on aspect vector reconstruction, which establishes connections between sentence vectors and multi-grained aspect vectors for better aspect representation learning. Experimental results demonstrate that the model outperforms some baselines in aspect identification and topic coherence of extracted aspect terms.

JOURNAL OF INTELLIGENT & FUZZY SYSTEMS (2021)

Article Geochemistry & Geophysics

Unsupervised Remote Sensing Image Retrieval Using Probabilistic Latent Semantic Hashing

Ruben Fernandez-Beltran, Begum Demir, Filiberto Pla, Antonio Plaza

Summary: In this letter, a novel unsupervised hashing method based on probabilistic topic models is introduced to encapsulate hidden semantic patterns of data into final binary representation. The method effectively learns hash codes through three main steps: data grouping, topic computation, and hash code generation. Experimental results on benchmark archives show that the proposed method significantly outperforms state-of-the-art unsupervised hashing methods.

IEEE GEOSCIENCE AND REMOTE SENSING LETTERS (2021)

Article Computer Science, Information Systems

A two-stage unsupervised sentiment analysis method

Yingqi Wang, Hongyu Han, Xin He, Rui Zhai

Summary: In this paper, the SASC (Sentiment Analysis based on Sentiment Clustering) method is proposed to address the issues of low accuracy and poor stability in review sentiment clustering methods. By utilizing two-stage sentiment clustering, hidden sentiment information in review texts is captured to enhance the accuracy and stability of the results. Specifically, the first stage introduces a review representation vector construction method using LDA topic model, while the second stage employs K-means algorithm for further optimization of sentiment clustering results. Experimental results on widely used datasets showcase that the SASC method outperforms other methods in terms of clustering accuracy and stability.

MULTIMEDIA TOOLS AND APPLICATIONS (2023)

Article Computer Science, Artificial Intelligence

Comparison study of unsupervised paraphrase detection: Deep learning-The key for semantic similarity detection

Tedo Vrbanec, Ana Mestrovic

Summary: Automatic detection of concealed plagiarism in the form of paraphrases is a difficult task. This study identifies the most efficient methods for unsupervised paraphrase detection, including using similarity measures alone or combined with deep learning models. The results show that some deep learning models outperform the best statistical methods, making concealed plagiarism detection achievable.

EXPERT SYSTEMS (2023)

Article Computer Science, Theory & Methods

Contextual topic discovery using unsupervised keyphrase extraction and hierarchical semantic graph model

Hung Du, Srikanth Thudumu, Antonio Giardina, Rajesh Vasa, Kon Mouzakis, Li Jiang, John Chisholm, Sanat Bista

Summary: This paper presents a hybrid unsupervised keyphrase extraction technique called ContextualRank, which embeds contextual information in the keyphrase extraction process. It proposes a hierarchical topic modeling approach for topic discovery based on aggregating the extracted keyphrases from ContextualRank. The evaluation results demonstrate remarkable performance improvements over other baselines.

JOURNAL OF BIG DATA (2023)

Article Computer Science, Artificial Intelligence

A topic discovery approach for unsupervised organization of legal document collections

Daniela Vianna, Edleno Silva de Moura, Altigran Soares da Silva

Summary: Technology has changed the way legal services operate in many countries. This study focuses on organizing and summarizing the growing collection of legal documents, uncovering hidden topics for legal case retrieval and prediction. The proposed approach combines topic discovery techniques, preprocessing methods, and learning-based vector representations. Validation on Portuguese legal documents showed the effectiveness of the method in uncovering relevant and unique topics, supporting legal case retrieval tools and aiding legal specialists in labeling/tagging documents.

ARTIFICIAL INTELLIGENCE AND LAW (2023)

Article Computer Science, Artificial Intelligence

Learning to Purification for Unsupervised Person Re-Identification

Long Lan, Xiao Teng, Jing Zhang, Xiang Zhang, Dacheng Tao

Summary: In this study, an unsupervised person re-identification method is proposed, which has achieved great progress by training with pseudo labels. To purify the feature and label noise, multi-view features and the knowledge of a teacher model are utilized. Experimental results demonstrate the effectiveness of this approach for unsupervised person re-identification.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2023)

Article Information Science & Library Science

Investigating the citation advantage of author-pays charges model in computer science research: a case study of Elsevier and Springer

Tehmina Amjad, Mehwish Sabir, Azra Shamim, Masooma Amjad, Ali Daud

Summary: This study compared the citation advantage of open access and toll access articles in four subfields of computer science, finding that open access articles have a higher citation advantage and the advantage varies among different subfields. The results validate the positive movement towards open access articles in the field of computer science.

LIBRARY HI TECH (2022)

Article Multidisciplinary Sciences

Measuring the impact of COVID-19 surveillance variables over the international oil market

Abdulrahman A. Alshdadi, Malik Khizar Hayat, Ali Daud, Ameen Banjar, Hussain Dawood

Summary: The COVID-19 pandemic has had a significant impact on the international oil market, causing fluctuations in crude oil prices and triggering a global economic crisis. This study aims to investigate the short-term and long-term effects of COVID-19 on the international oil market by analyzing the correlation between surveillance variables and international crude oil prices. The findings will provide important guidance for policymakers in the oil market.

INTERNATIONAL JOURNAL OF ADVANCED AND APPLIED SCIENCES (2022)

Article Computer Science, Information Systems

Reduction of random-valued impulse noise by using multi-structured textons

Hussain Dawood, Ali Daud, Hassan Dawood, Marium Azhar

Summary: This paper presents an iterative two-stage image denoising technique based on multi-structured textons for the denoising of random-valued impulse noise. The proposed method identifies noisy pixels using multiple textons and restores noise-free pixels using spatially linked directional similarity. Experimental results demonstrate the superiority of the proposed method in denoising performance.

MULTIMEDIA TOOLS AND APPLICATIONS (2022)

Article Computer Science, Interdisciplinary Applications

Citation burst prediction in a bibliometric network

Tehmina Amjad, Nafeesa Shahid, Ali Daud, Asma Khatoon

Summary: This study aims to investigate the impact of several features on the number of citations for articles published in journals or conferences, as well as to predict future citations. The findings show that for journal publications, author first-year citations and author total citation are the most important features, while author total citation is more effective for conference publications.

SCIENTOMETRICS (2022)

Article Computer Science, Interdisciplinary Applications

Indexing important drugs from medical literature

Riad Alharbey, Jong In Kim, Ali Daud, Min Song, Abdulrahman A. Alshdadi, Malik Khizar Hayat

Summary: Health maintenance is crucial for society, and the progress in biomedical field has led to a wealth of medical information. Extracting meaningful insights, especially related to gene-drug relationships, is important for recent medicine. This study proposes a new measure, Drug-Index, to detect gene-drug relations, which is useful for drug discovery, diagnoses, and personalized treatment.

SCIENTOMETRICS (2022)

Article Automation & Control Systems

DBP-DeepCNN: Prediction of DNA-binding proteins using wavelet-based denoising and deep learning

Farman Ali, Harish Kumar, Shruti Patil, Aftab Ahmed, Ameen Banjar, Ali Daud

Summary: In this study, a deep learning-based predictor (DBP-DeepCNN) is proposed to improve the prediction of DNA-binding proteins (DBPs). By using a novel feature extraction method and training with various models, the predictor achieved higher accuracies on both training and independent datasets, indicating its potential for large scale DBP prediction and promising therapeutic strategies for chronic diseases.

CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS (2022)

Article Automation & Control Systems

iDBP-PBMD: A machine learning model for detection of DNA-binding proteins by extending compression techniques into evolutionary profile

Ameen Banjar, Farman Ali, Omar Alghushairy, Ali Daud

Summary: DNA-binding proteins (DBPs) play crucial roles in DNA transcription, recombination, and replication, and are associated with diseases like AIDS/HIV, cancer, and asthma. This research encoded DBPs using different feature descriptors and eliminated noisy and redundant features using compression techniques. The resulting features were used to train models with XGBoost and ERT classifiers. The study demonstrated the superiority of this approach over previous methods.

CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS (2022)

Article Automation & Control Systems

Comparative analysis of the existing methods for prediction of antifreeze proteins

Adnan Khan, Jamal Uddin, Farman Ali, Ameen Banjar, Ali Daud

Summary: Antifreeze proteins (AFPs) are found in various organisms and play a crucial role in preventing the formation of ice crystals. The development of accurate predictors for identifying AFPs is essential. This review article provides a comprehensive summary of existing AFP predictors, including their applied datasets, feature descriptors, model training classifiers, performance assessment parameters, and web servers. The drawbacks of current predictors are highlighted, and suggestions for future improvements, such as more effective feature descriptors and efficient classifiers, are discussed.

CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS (2023)

Article Computer Science, Cybernetics

Identifying Rising Stars via Supervised Machine Learning

Ali Daud, Naveed ul Islam, Xin Li, Imran Razzak, Malik Khizar Hayat

Summary: Identifying rising stars is important for the growth of any organization. This article explores the classification of rising business managers (RBMs) by examining the features of co-business managers (Co-BMs), using machine learning techniques. Experimental results show that generative models, particularly Bayesian networks, produce better predictions for the dataset based on average revenue.

IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS (2023)

Article Information Science & Library Science

OpenRank - a novel approach to rank universities using objective and publicly verifiable data sources

Muhammad Sajid Qureshi, Ali Daud, Malik Khizar Hayat, Muhammad Tanvir Afzal

Summary: This research aims to enhance the credibility of academic rankings by using objective indicators based on publicly verifiable data sources. The proposed ranking methodology, OpenRank, uses objective indicators from two well-known data repositories, ArnetMiner and DBpedia. The resulting academic ranking reflects common tendencies of international rankings. Evaluation of the methodology shows its effectiveness and reproducibility with low data collection cost.

LIBRARY HI TECH (2023)

Article Computer Science, Cybernetics

Citation Count Is Not Enough: Citation's Context-Based Scientific Impact Evaluation

Ali Daud, Sehrish Ghaffar, Tehmina Amjad

Summary: Qualitative analysis of citations received by a scientific manuscript is challenging. Most existing approaches for scientific impact evaluation only use quantitative parameters, such as the number of citations, and ignore the qualitative feature of citation context. In this study, a context-based article impact factor (CBAIF) is proposed to evaluate articles based on the context of citations, considering positive, negative, or neutral contexts and the conflict-of-interest relationship between citing and cited authors. Experimental results show that CBAIF provides more accurate rankings compared to the article impact factor (AIF) without considering the context of citations.

IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS (2022)

Article Computer Science, Information Systems

Prediction of Movie Quality via Adaptive Voting Classifier

Muhammad Shahzad Faisal, Atif Rizwan, Khalid Iqbal, Heba Fasihuddin, Ameen Banjar, Ali Daud

Summary: This paper discusses the challenges of information retrieval from social web data and proposes a method to predict high-quality/popular movies using various features. Additionally, an enhanced optimization-based voting classifier is introduced to improve the performance of the proposed features.

IEEE ACCESS (2022)

Article Information Science & Library Science

Measuring the impact of co-author count on citation count of research publications

Ali Daud, Malik Khizar Hayat, Abdulrahman A. Alshdadi, Ameen Banjar, Wael Mansour Alharbi

Summary: Co-authored research work has higher visibility and impact compared to individual published work. This study analyzes the correlation between the number of co-authors in a published paper and the number of times the paper is cited. The analysis is divided into three categories and the results show that most research fields have increasing citability with a greater number of co-authors.

COLLNET JOURNAL OF SCIENTOMETRICS AND INFORMATION MANAGEMENT (2022)

Article Computer Science, Information Systems

Ontological Modeling and Semantic Search in Quran

Ali Daud, Muhammad Hafeez Ullah, Ameen Reda Banjar, Abdulrahman A. Alshdadi

Summary: This paper introduces an ontology development method considering Quran, Hadith, and Tafsir, and performs semantic search on Zakat as a use case. The results show that the proposed method meets the expectations.

INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY (2022)

Article Computer Science, Cybernetics

Advanced Learning Analytics: Aspect Based Course Feedback Analysis of MOOC Forums to Facilitate Instructors

Tehmina Amjad, Zainab Shaheen, Ali Daud

Summary: The use of Massive Online Open Courses (MOOCs) has increased significantly in recent times, particularly after the COVID-19 pandemic. To address the lack of face-to-face interaction, MOOC platforms provide a discussion forum for students to share their thoughts and problems. Instructors must closely monitor student performance and analyze discussion threads to identify specific problem areas. This study proposes a method that categorizes threads using topic modeling and performs sentiment analysis on comments to improve teaching methodology and enhance student understanding.

IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS (2022)
