Article

A simple and effective outlier detection algorithm for categorical data

INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS (2014)

Journal

INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS

Volume 5, Issue 3, Pages 469-477

Publisher

SPRINGER HEIDELBERG

DOI: 10.1007/s13042-013-0202-4

Keywords

Outlier detection; Categorical data; Weighted density; Information entropy

Categories

Computer Science, Artificial Intelligence

Funding

National Natural Science Foundation of China [71031006]
Foundation of Doctoral Program Research of Ministry of Education of China [20101401110002]
Construction Project of the Science and Technology Basic Condition Platform of Shanxi Province [2012091002-0101]
Shanxi Scholarship Council of China [2013-101]

Abstract

Outlier detection is an important data mining task that has attracted substantial attention across diverse research communities and application areas. Many techniques have been developed to detect outliers, but most existing research focuses on numerical data. These techniques cannot be applied directly to categorical data because it is difficult to define a meaningful similarity measure for such data. In this paper, a weighted density is first defined that simultaneously takes into account the density and the uncertainty of objects in every attribute. A simple and effective outlier detection algorithm for categorical data is then proposed based on this weighted density, and its time complexity is analyzed. Experimental results on real and synthetic data sets demonstrate the effectiveness and efficiency of the proposed algorithm.
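
To make the idea of a weighted density concrete, here is a minimal sketch in Python. It is not the paper's definition: the abstract does not give the formulas, so the weighting scheme below is an assumption chosen for illustration. Each attribute's "certainty" is taken as the sum of squared value frequencies (one minus a Gini-style impurity), the weights are those certainties normalized to sum to one, and an object's score is the weighted sum of the relative frequencies of its attribute values. The function names `weighted_density_scores` and `top_outliers` are hypothetical.

```python
from collections import Counter


def weighted_density_scores(rows):
    """Score each row of categorical values; lower scores suggest outliers.

    Assumed scheme (illustrative only, not taken from the paper):
    weight of an attribute is proportional to sum(p^2) over its values.
    """
    n = len(rows)
    columns = list(zip(*rows))
    # Relative frequency of every value, computed per attribute.
    freqs = [{v: c / n for v, c in Counter(col).items()} for col in columns]
    # Certainty of an attribute: 1 - sum(p * (1 - p)) = sum(p^2), always > 0.
    certainty = [sum(p * p for p in f.values()) for f in freqs]
    total = sum(certainty)
    weights = [c / total for c in certainty]
    # Weighted density: weighted sum of the frequencies of the row's values.
    return [sum(w * f[v] for w, f, v in zip(weights, freqs, row)) for row in rows]


def top_outliers(rows, k):
    """Return the indices of the k rows with the lowest weighted density."""
    scores = weighted_density_scores(rows)
    return sorted(range(len(rows)), key=lambda i: scores[i])[:k]


if __name__ == "__main__":
    data = [("a", "x"), ("a", "x"), ("a", "x"), ("b", "y")]
    print(weighted_density_scores(data))
    print(top_outliers(data, 1))
```

This sketch needs no similarity measure between objects, only per-attribute value counts, so it makes a constant number of passes over the data and its cost grows linearly with the number of objects times the number of attributes.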

Recommended

Article Engineering, Multidisciplinary

FAST-ODT: A Lightweight Outlier Detection Scheme for Categorical Data Sets

Hongwei Du, Qiang Ye, Zhipeng Sun, Chuang Liu, Wen Xu

Summary: This study introduces two novel outlier detection algorithms for categorical data sets: Outlier Detection Tree (ODT) and FAST-ODT. ODT uses a classification tree and if-then rules to detect outliers in categorical data, while FAST-ODT achieves high detection accuracy with low time complexity.

IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING (2021)

Article Computer Science, Artificial Intelligence

Attribute-weighted outlier detection for mixed data based on parallel mutual information

Junli Li, Zhanfeng Liu

Summary: Outlier detection plays a crucial role in data mining. However, most existing algorithms focus on either numerical or categorical attributes and neglect the mixture of attributes commonly found in real-world data. In this study, we propose a high-dimensional and massive mixed data outlier detection algorithm called PMIOD, which incorporates attribute weighting using mutual information. We also parallelize the mutual information computation on the Spark platform to improve efficiency. Experimental results on various datasets demonstrate the superior performance of the proposed algorithm.

EXPERT SYSTEMS WITH APPLICATIONS (2024)

Article Computer Science, Artificial Intelligence

An outlier detection algorithm for categorical matrix-object data

Fuyuan Cao, Xiaolin Wu, Liqin Yu, Jiye Liang

Summary: This paper proposes an outlier detection algorithm for matrix-object data sets, which describes and calculates the outlier factor of matrix objects based on their coupling and cohesion. Experimental results have shown that the proposed algorithm effectively detects outliers compared to other algorithms on real and synthetic data sets.

APPLIED SOFT COMPUTING (2021)

Article Computer Science, Artificial Intelligence

Feature selection considering interaction, redundancy and complementarity for outlier detection in categorical data

Lianxi Wang, Yubing Ke

Summary: This paper proposes a feature selection method for outlier detection in categorical data, taking into account the feature relevance, interaction, redundancy, and complementarity. Experimental results demonstrate that the proposed method outperforms five other state-of-the-art feature selection methods on 14 real-world datasets.

KNOWLEDGE-BASED SYSTEMS (2023)

Article Physics, Multidisciplinary

An Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data

Zihao Li, Liumei Zhang

Summary: This paper proposes a new outlier detection algorithm called EOEH, which improves detection performance on high-dimensional data by using random subsampling and information entropy-weighted subspaces. Experiments demonstrate that the EOEH algorithm outperforms popular outlier detection algorithms in terms of detection performance and runtime efficiency.

ENTROPY (2023)

Article Computer Science, Information Systems

New uncertainty measurement for categorical data based on fuzzy information structures: An application in attribute reduction

Qinli Zhang, Yiying Chen, Gangqiang Zhang, Zhaowen Li, Lijun Chen, Ching-Feng Wen

Summary: The paper discusses the handling of categorical data in machine learning, introducing fuzzy information structures and new uncertainty measurements for considering the equality of attribute values. Numerical experiments and statistical tests were conducted to evaluate the performance of the proposed measurements, showing that they outperform traditional measurements based on I-structures. Furthermore, attribute reduction algorithms based on the new measurements were presented and tested in clustering analysis, showing effective performance in reducing attributes.

INFORMATION SCIENCES (2021)

Article Computer Science, Artificial Intelligence

Homophily outlier detection in non-IID categorical data

Guansong Pang, Longbing Cao, Ling Chen

Summary: This study introduces a novel outlier detection framework that identifies outliers in categorical data by capturing non-IID outlier factors. A graph representation and mining approach is employed to capture the rich non-IID characteristics.

DATA MINING AND KNOWLEDGE DISCOVERY (2021)

Article Computer Science, Artificial Intelligence

Coupling learning for feature selection in categorical data

Feng Wang, Jiye Liang, Peng Song

Summary: Feature selection is a widely used data preprocessing technique to improve model performance and efficiency. However, traditional approaches assume that data are independent and identically distributed (IID). This paper introduces new coupled similarity and relevance measures to capture coupling relationships between feature values and features. Based on coupling learning, an effective feature-selection algorithm for categorical data is developed and validated using common classifiers and UCI datasets.

INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS (2023)

Article Computer Science, Information Systems

Incomplete mixed data-driven outlier detection based on local-global neighborhood information

Ran Li, Hongchang Chen, Shuxin Liu, Xing Li, Yingle Li, Biao Wang

Summary: Outlier detection is a challenging task due to the nature of ubiquitous, incomplete, redundant, noisy, and mixed data. To address this challenge, this paper proposes an ILGNI network that considers both local and global information from incomplete mixed data. The network enhances connectivity between similar objects and weakens connectivity between heterogeneous objects, allowing for efficient graph-based outlier detection. Experiments on telecom fraud datasets demonstrate that the proposed algorithm achieves enhanced outlier detection performance with low time complexity and is applicable to various types of datasets.

INFORMATION SCIENCES (2023)

Article Computer Science, Artificial Intelligence

A density estimation approach for detecting and explaining exceptional values in categorical data

Fabrizio Angiulli, Fabio Fassetti, Luigi Palopoli, Cristina Serrao

Summary: This work focuses on the detection and explanation of anomalous values in categorical datasets. The authors propose the concept of frequency occurrence and an outlierness measure for identifying lower and upper outliers. They also provide interpretable explanations and a mechanism for selecting outstanding explanations.

APPLIED INTELLIGENCE (2022)

Article Multidisciplinary Sciences

Cheap robust learning of data anomalies with analytically solvable entropic outlier sparsification

Illia Horenko

Summary: Entropic outlier sparsification (EOS) is a cheap and robust computational strategy for learning in the presence of data anomalies and outliers. EOS solves the expected loss minimization problem with Shannon entropy regularization, providing a closed-form solution that incurs additional costs linearly dependent on statistics size and independent of data dimension. The results explain the optimality of using mixtures of spherically symmetric Gaussians for nonparametric probability distributions in algorithms working with squared Euclidean distances. Experimental results demonstrate that applying EOS to biomedical problems enables accurate prediction of patient mortality after heart failure, outperforming common learning tools.

PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA (2022)

Article Computer Science, Artificial Intelligence

A neighborhood weighted-based method for the detection of outliers

Zhong-Yang Xiong, Hua Long, Yu-Fang Zhang, Xiao-Xia Wang, Qin-Qin Gao, Lin-Tao Li, Min Zhang

Summary: Outlier detection is an important research direction in data mining, and most existing methods are not suitable for complex patterns. To address this, we propose a neighborhood weighted-based outlier detection algorithm that measures the local density of objects using a weighted nearest neighbor graph, and compares the differences in neighborhood weighted local density to determine the degree of being an outlier.

APPLIED INTELLIGENCE (2023)

Article Physics, Multidisciplinary

Categorical Nature of Major Factor Selection via Information Theoretic Measurements

Ting-Li Chen, Elizabeth P. P. Chou, Hsieh Fushing

Summary: This research selects collections of major factors embedded within response-versus-covariate dynamics based on information theoretic measurements through Categorical Exploratory Data Analysis (CEDA) computing paradigm, exploring the relevance to Wiener-Granger causality. The selection task identifies a chief collection and several secondary collections, with reliability checks through algorithmic computations.

ENTROPY (2021)

Article Computer Science, Information Systems

Outlier Detection of Mixed Data Based on Neighborhood Combinatorial Entropy

Lina Wang, Qixiang Zhang, Xiling Niu, Yongjun Ren, Jinyue Xia

Summary: Outlier detection is a crucial area in data mining, aiming to identify inconsistencies in data sets. The proposed method reduces data dimensions to enhance performance, applies effectively to numerical and mixed multidimensional data, and has the potential to improve outlier detection accuracy.

CMC-COMPUTERS MATERIALS & CONTINUA (2021)

Article Computer Science, Artificial Intelligence

A double-weighted outlier detection algorithm considering the neighborhood orientation distribution of data objects

Qiang Gao, Qin-Qin Gao, Zhong-Yang Xiong, Yu-Fang Zhang, Yu-Qin Wang, Min Zhang

Summary: This paper studies the problems of detecting low-density patterns and local outliers in outlier detection algorithms and proposes a double-weighted algorithm that considers the dense direction. The algorithm explores the relationship between data points and their neighbor distribution by considering distance and orientation, designs new point weighting and edge weighting strategies, and achieves a better representation of the potential structural information inside the data.

APPLIED INTELLIGENCE (2023)

