Article

Exploring incomplete data using visualization techniques

ADVANCES IN DATA ANALYSIS AND CLASSIFICATION (2012)

Journal

ADVANCES IN DATA ANALYSIS AND CLASSIFICATION

Volume 6, Issue 1, Pages 29-47

Publisher

SPRINGER HEIDELBERG

DOI: 10.1007/s11634-011-0102-y

Keywords

Visualization; Missing values; Exploring incomplete data; R software

Categories

Statistics & Probability

Funding

European Union [217322]

Abstract

Visualization of incomplete data makes it possible to explore the data and the structure of the missing values simultaneously. This is helpful for learning about the distribution of the incomplete information in the data and for identifying possible structures of the missing values and their relation to the available information. The main goal of this contribution is to stress the importance of exploring missing values using visualization methods and to present a collection of such visualization techniques for incomplete data, all of which are implemented in the R package VIM. With this functionality available in this widely used statistical environment, visualization of missing values, imputation, and data analysis can all be done from within R without the need for additional software.
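As a rough illustration of what "exploring the structure of missing values" can mean in practice, the sketch below tabulates how many values are missing per variable and how often each combination of missing variables occurs. It is plain Python, not the VIM package or its API; the toy data, the variable names, and the helper functions `missing_counts` and `missing_patterns` are all hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical toy data set; None marks a missing value.
rows = [
    {"age": 34, "income": 2100, "weight": 70},
    {"age": 51, "income": None, "weight": 82},
    {"age": None, "income": None, "weight": 65},
    {"age": 29, "income": 1800, "weight": None},
    {"age": 46, "income": None, "weight": 77},
]
variables = ["age", "income", "weight"]


def missing_counts(rows, variables):
    """Return the number of missing values for each variable."""
    return {v: sum(r[v] is None for r in rows) for v in variables}


def missing_patterns(rows, variables):
    """Count how often each missingness pattern occurs (True = missing)."""
    return Counter(tuple(r[v] is None for v in variables) for r in rows)


counts = missing_counts(rows, variables)
patterns = missing_patterns(rows, variables)

# Text "bar chart" of missing values per variable.
for v in variables:
    print(f"{v:<8}{'#' * counts[v]} ({counts[v]})")

# One line per pattern: 'M' = missing, '.' = observed, then the frequency.
for pattern, n in patterns.most_common():
    print("".join("M" if m else "." for m in pattern), n)
```

Summaries of this kind show at a glance whether missing values concentrate in one variable or tend to occur together across several, which is the sort of structure a graphical display of incomplete data is meant to reveal.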

Recommended

Article Ecology

Handling missing values in trait data

Thomas F. Johnson, Nick J. B. Isaac, Agustin Paviolo, Manuela Gonzalez-Suarez

Summary: The study evaluated the performance of approaches for handling missing values in biased datasets and found that imputation can effectively handle missing data in some conditions but is not always the best solution. None of the tested methods could effectively deal with severe biases, highlighting the importance of rigorous data checking and proposing variables to assist researchers in detecting and minimizing errors in incomplete datasets.

GLOBAL ECOLOGY AND BIOGEOGRAPHY (2021)

Article Biochemical Research Methods

Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics

Mengbo Li, Gordon K. Smyth

Summary: Mass spectrometry proteomics in biomedical research suffers from the problem of missing values in peptides. Many analysis strategies have been proposed to distinguish different types of missing values and estimate detection probabilities. A logit-linear function is used to accurately model the detection probability, showing that missing values are related to peptide intensity. A probability model is developed to infer the distribution of unobserved intensities from observed values.

BIOINFORMATICS (2023)

Article Computer Science, Software Engineering

To Explore What Isn't There-Glyph-Based Visualization for Analysis of Missing Values

Sara Johansson Fernstad, Jimmy Johansson Westberg

Summary: This article introduces a novel visualization method called Missingness Glyph for analyzing and exploring missing values in data. The Missingness Glyph helps to identify relevant missingness patterns and performs better than alternative visualization methods in certain cases.

IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS (2022)

Article Biochemical Research Methods

MSnbase, Efficient and Elegant R-Based Processing and Visualization of Raw Mass Spectrometry Data

Laurent Gatto, Sebastian Gibb, Johannes Rainer

Summary: Version 2 of the MSnbase R/Bioconductor package is focused on new on-disk infrastructure for manipulating, processing, and visualizing mass spectrometry data. This update allows handling of large raw mass spectrometry experiments on commodity hardware, showcasing elegant data processing, method development, and visualization capabilities.

JOURNAL OF PROTEOME RESEARCH (2021)

Article Multidisciplinary Sciences

moreThanANOVA: A user-friendly Shiny/R application for exploring and comparing data with interactive visualization

Wanyanhan Jiang, Han Chen, Lian Yang, Xiaoqi Pan

Summary: When comparing means of different groups, it is necessary to explore and compare data for influencing factors or relative indices. This can be a complex and challenging process, especially for users who lack statistical knowledge and coding experience. To address this issue, we developed moreThanANOVA, an interactive, user-friendly, open-source, and cloud-based application that automates distribution tests and correlative significance tests, allowing users to customize post-hoc analysis based on their considerations.

PLOS ONE (2022)

Article Computer Science, Information Systems

Clustering mixed numerical and categorical data with missing values

Duy-Tai Dinh, Van-Nam Huynh, Songsak Sriboonchitta

Summary: This paper introduces a novel clustering algorithm k-CMM for handling missing values in mixed numerical and categorical data, integrating imputation and clustering steps. The algorithm utilizes decision tree and mean and kernel methods for cluster center formation, outperforming other algorithms when the dataset has increasing missing values.

INFORMATION SCIENCES (2021)

Article Computer Science, Information Systems

Exploring granular test coverage and its evolution with matrix visualizations

Kaj Dreef, Vijay Krishna Palepu, James A. Jones

Summary: Current software-development tools make it difficult to understand the test execution of software, both for granular tasks (e.g., identifying test cases for a specific method) and global tasks (e.g., determining the proportion of unit tests to system tests). Existing tools lack global overview and historical information. This paper proposes a novel, interactive, matrix-based visual interface to address these challenges and provides a user study and case studies to demonstrate its effectiveness.

INFORMATION AND SOFTWARE TECHNOLOGY (2023)

Article Biochemistry & Molecular Biology

ggtreeExtra: Compact Visualization of Richly Annotated Phylogenetic Data

Shuangbin Xu, Zehan Dai, Pingfan Guo, Xiaocong Fu, Shanshan Liu, Lang Zhou, Wenli Tang, Tingze Feng, Meijun Chen, Li Zhan, Tianzhi Wu, Erqiang Hu, Yong Jiang, Xiaochen Bo, Guangchuang Yu

Summary: ggtreeExtra is a universal tool for visualizing tree data, supporting various data types and visualization methods. By integrating evolutionary statistics and external data, it extends the applications of phylogenetic trees in different disciplines.

MOLECULAR BIOLOGY AND EVOLUTION (2021)

Article Health Care Sciences & Services

Imputation of missing values for electronic health record laboratory data

Jiang Li, Xiaowei S. Yan, Durgesh Chaudhary, Venkatesh Avula, Satish Mudiganti, Hannah Husby, Shima Shahjouei, Ardavan Afshar, Walter F. Stewart, Mohammed Yeasin, Ramin Zand, Vida Abedi

Summary: Laboratory data from EHR can be used in prediction models to mitigate estimation bias and improve model performance with missingness using imputation methods. The study found that missingness in EHR laboratory variables was associated with patients' comorbidity data, and the multi-level imputation algorithm showed smaller imputation error compared to the cross-sectional method.

NPJ DIGITAL MEDICINE (2021)

Article Computer Science, Software Engineering

SightBi: Exploring Cross-View Data Relationships with Biclusters

Maoyuan Sun, Abdul Rahman Shaikh, Hamed Alhoori, Jian Zhao

Summary: This paper presents SightBi, a visual analytics approach for exploring cross-view data relationships. SightBi formalizes cross-view data relationships, computes them, and utilizes a bi-context design to provide stand-alone relationship views for guiding user exploration. A usage scenario demonstrates the usefulness of SightBi for sensemaking of cross-view data relationships.

IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS (2022)

Article Computer Science, Information Systems

To Tolerate or To Impute Missing Values in V2X Communications Data?

Roozbeh Razavi-Far, Daoming Wan, Mehrdad Saif, Niloofar Mozafari

Summary: This article evaluates main strategies for the treatment of missing values in misbehavior detection using incomplete V2X communications data. It proposes two novel methods for imputing and tolerating missing data and compares them with existing methods. The results show that the proposed missing-tolerant method outperforms others in terms of accuracy and F-measure.

IEEE INTERNET OF THINGS JOURNAL (2022)

Review Biochemical Research Methods

Dealing with missing values in proteomics data

Weijia Kong, Harvard Wai Hann Hui, Hui Peng, Wilson Wen Bin Goh

Summary: Proteomics data often have missing values, which can affect subsequent statistical analyses. Different missing value imputation methods have been developed, and their performance varies when dealing with the same dataset. Choosing the right method is important for satisfactory results, and other factors such as confounders should also be considered.

PROTEOMICS (2022)

Article Computer Science, Interdisciplinary Applications

Lumina: an adaptive, automated and extensible prototype for exploring, enriching and visualizing data

Konstantinos Kagkelidis, Ilias Dimitriadis, Athena Vakali

Summary: This paper discusses the improvement of complex visualization pipelines and introduces Lumina, a visualization framework that aims to simplify user experience and interaction, while enhancing the final visualization results based on semantic analysis of linked data.

JOURNAL OF VISUALIZATION (2021)

Article Computer Science, Artificial Intelligence

Two-stage-neighborhood-based multilabel classification for incomplete data with missing labels

Lin Sun, Tianxiang Wang, Weiping Ding, Jiucheng Xu, Anhui Tan

Summary: This paper presents a neighborhood-based multilabel classification method for dealing with missing labels in real-world multilabel data. By defining the neighborhood radius, restoring missing feature values, and investigating the fuzzy similarity relationship among samples, the classification performance of data with missing labels is improved.

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS (2022)

Article Energy & Fuels

Augmenting energy time-series for data-efficient imputation of missing values

Antonio Liguori, Romana Markovic, Martina Ferrando, Jerome Frisch, Francesco Causone, Christoph van Treeck

Summary: This study investigates the use of data augmentation techniques for reconstructing missing energy time-series in limited data scenarios. A convolutional denoising autoencoder is chosen as the base imputation model, and an optimal augmentation rate is determined based on preliminary results. The results show that augmenting a nine days-long training set 80 times can significantly reduce the initial average RMSE and outperform benchmark methods.

APPLIED ENERGY (2023)

Article Statistics & Probability

A comparison of generalised linear models and compositional models for ordered categorical data

Ondrej Vencalek, Karel Hron, Peter Filzmoser

STATISTICAL MODELLING (2020)

Article Computer Science, Interdisciplinary Applications

Cellwise robust M regression

P. Filzmoser, S. Hoppner, I. Ortner, S. Serneels, T. Verdonck

COMPUTATIONAL STATISTICS & DATA ANALYSIS (2020)

Article Computer Science, Artificial Intelligence

Robust and sparse multigroup classification by the optimal scoring approach

Irene Ortner, Peter Filzmoser, Christophe Croux

DATA MINING AND KNOWLEDGE DISCOVERY (2020)

Article Statistics & Probability

Robust principal component analysis for compositional tables

J. de Sousa, K. Hron, K. Facevicova, P. Filzmoser

Summary: Compositional tables are arranged according to two factors and analyzed by ratios between cells. A special choice of coordinates related to centered logratio coefficients is proposed for interpretation and use in robust principal component analysis. This method enables exploration of relationships between factors while addressing the singularity issue of clr coefficients.

JOURNAL OF APPLIED STATISTICS (2021)

Article Geosciences, Multidisciplinary

Weighted Symmetric Pivot Coordinates for Compositional Data with Geochemical Applications

Karel Hron, Mark Engle, Peter Filzmoser, Eva Fiserova

Summary: Negative correlations between elements, molecules, or minerals can indicate various geochemical processes. Symmetric pivot coordinates are developed to identify positive and negative correlations between different parts in compositional data.

MATHEMATICAL GEOSCIENCES (2021)

Article Geosciences, Multidisciplinary

Classical and Robust Regression Analysis with Compositional Data

K. G. van den Boogaart, P. Filzmoser, K. Hron, M. Templ, R. Tolosana-Delgado

Summary: Compositional data contain valuable information within the relationships between the compositional parts, which can be utilized for regression modeling. Balance coordinates are constructed to interpret regression coefficients and test hypotheses of subcompositional independence. Both classical least-squares regression and robust MM regression were compared within different regression models using a real data set from a geochemical mapping project.

MATHEMATICAL GEOSCIENCES (2021)

Article Geochemistry & Geophysics

pXRF Measurements on Soil Samples for the Exploration of an Antimony Deposit: Example from the Vendean Antimony District (France)

Bruno Lemiere, Jeremie Melleton, Pascal Auger, Virginie Derycke, Eric Gloaguen, Loic Bouat, Dominika Miksova, Peter Filzmoser, Maarit Middleton

MINERALS (2020)

Article Statistics & Probability

Robust regression with compositional covariates including cellwise outliers

Nikola Stefelova, Andreas Alfons, Javier Palarea-Albaladejo, Peter Filzmoser, Karel Hron

Summary: The study presents a robust procedure for estimating a linear regression model with compositional and real-valued explanatory variables, designed to handle outliers and produce results aligned with established scientific knowledge. By filtering and imputing cellwise outliers before performing rowwise robust compositional regression, the proposed procedure outperforms traditional and other robust regression methods.

ADVANCES IN DATA ANALYSIS AND CLASSIFICATION (2021)

Article Biochemistry & Molecular Biology

Statistical Analysis of Chemical Element Compositions in Food Science: Problems and Possibilities

Matthias Templ, Barbara Templ

Summary: Our study compares compositional data analysis (CoDa) with classical statistical analysis to demonstrate how results vary depending on the approach, with importance shown for methods like principal component analysis (PCA) and log-ratio analysis. It emphasizes the need to apply CoDa methods for better separation, interpretability, and classification accuracy in analyzing food chemical elements and characterizing food products.

MOLECULES (2021)

Article Computer Science, Information Systems

A systematic overview on methods to protect sensitive data provided for various analyses

Matthias Templ, Murat Sariyar

Summary: Considering the advancements in protecting sensitive data, especially in privacy-preserving computation and federated learning, there is a need to categorize and compare various methods from different fields. Providing guidance for practice is important, as it helps practitioners have an overview of suitable approaches for specific scenarios. This categorization also contributes to the development of a comprehensive ontology for anonymization.

INTERNATIONAL JOURNAL OF INFORMATION SECURITY (2022)

Article Geochemistry & Geophysics

A new version of the Langelier-Ludwig square diagram under a compositional perspective

Matthias Templ, Caterina Gozzi, Antonella Buccianti

Summary: The Langelier-Ludwig square diagram is a commonly used diagnostic tool in groundwater chemistry, but the classic version may lead to incorrect conclusions. A new version of the diagram is proposed, which provides a better and unbiased understanding of water-environment interactions by describing the intricate relationship between chemical species in aqueous solutions.

JOURNAL OF GEOCHEMICAL EXPLORATION (2022)

Article Public, Environmental & Occupational Health

Privacy of Study Participants in Open-access Health and Demographic Surveillance System Data: Requirements Analysis for Data Anonymization

Matthias Templ, Chifundo Kanjala, Inken Siems

Summary: This study aims to highlight the requirements and solutions for sharing health surveillance event history data. The proposed approaches enable the anonymization of data while preserving utility and reducing the risk of disclosure, making the data shareable as public use data. This is particularly significant for HDSS and medical science research communities in low- and middle-income countries.

JMIR PUBLIC HEALTH AND SURVEILLANCE (2022)

Article Mathematics

Enhancing Precision in Large-Scale Data Analysis: An Innovative Robust Imputation Algorithm for Managing Outliers and Missing Values

Matthias Templ

Summary: In the complex world of data analytics, multiple imputation has emerged as a key tool for addressing missing data, and its powerful variant, robust imputation, further enhances the precision and reliability of its results. Non-robust methods can be influenced by extreme outliers, leading to skewed imputations and biased estimates. Robust imputation methods effectively manage outliers and provide a more reliable approach to dealing with missing data.

MATHEMATICS (2023)

Article Computer Science, Interdisciplinary Applications

Robust Mediation Analysis: The R Package robmed

Andreas Alfons, Nufer Y. Ates, Patrick J. F. Groenen

Summary: Mediation analysis is a widely used statistical technique in social, behavioral, and medical sciences for studying the indirect effects of independent variables on dependent variables through intervening variables. However, existing methods are sensitive to outliers and deviations from normality assumptions, which can threaten the empirical testing of mediation mechanisms. The robmed package in R implements a robust procedure for mediation analysis that addresses these issues and provides various analysis methods and result visualization.

JOURNAL OF STATISTICAL SOFTWARE (2022)

Article Psychology, Applied

A Robust Bootstrap Test for Mediation Analysis

Andreas Alfons, Nufer Yasin Ates, Patrick J. F. Groenen

Summary: Mediation analysis is crucial in organizational sciences, but traditional linear regression analysis based on normal-theory maximum likelihood estimators is sensitive to deviations from normality assumptions. To address this issue, a robust mediation method has been developed, which demonstrates superior estimation of effect size and reliability in assessing significance, along with freely available software for empirical researchers.

ORGANIZATIONAL RESEARCH METHODS (2022)
