4.7 Article

A comparative evaluation of aggregation methods for machine learning over vertically partitioned data

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 152, Issue -, Pages -

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2020.113406

Keywords

Vertical data partitioning; Distributed machine learning; Classification; Predictions aggregation; Attribute-partitioned data

Funding

  1. State Funding Agency of Rio Grande do Sul (FAPERGS) through the Scientific Initiation Scholarship Program [17/2551-00 0 0236-9]

Ask authors/readers for more resources

It is increasingly common applications where data are naturally generated in a distributed fashion, especially after the emergence of technologies like the Internet of Things (IoT). In sensor networks, in collaborative health or genomic projects, in credit risk analysis, among other domains, distinct features are collected from multiple sources, including the use of social media and mobile applications, and due to privacy concerns or communication costs, may not be shared among sites. This scenario of vertical data partitioning poses challenges to traditional machine learning (ML) approaches, as classical algorithms are designed to learn from the complete set of features. A common strategy is to combine predictions from local models trained at each site into a global model, and for this purpose, several aggregation methods have been proposed. In this work we tackle a gap within the related literature, performing a comparative evaluation of elementary and meta-learning-based aggregation methods to reveal their strengths and weakness for 46 datasets with varied characteristics. We show that no method outperforms its counterparts in all domains, emphasizing the need for experimental comparison to ensure a good choice in the domain of interest. Moreover, our experiments provide the first insights into the relations between datasets' properties and aggregators' performance. We show that for low class imbalance and a good instance-to-feature ratio, almost all aggregation methods tend to perform well. The silhouette coefficient (reflecting class separability) and class imbalance coefficient are the most influential properties on aggregators' performance, thus we recommend their analysis in the first step of the methodological design. We found that arithmetic-based methods are not suitable for datasets with poor class separability and a large number of classes, whereas meta-learning approaches are less sensitive for datasets with silhouette coefficient close to 0. Our analyses were summarized as classification and regression trees, which have the impact to serve as practical tools for future research. Taken together, our findings give rise to interesting applications in the domain of intelligent systems, especially regarding their potential to reduce the burden of vast experimental comparisons when training ML models with feature-partitioned data. (C) 2020 Elsevier Ltd. All rights reserved.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.7
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available