Article

De-identifying a public use microdata file from the Canadian national discharge abstract database

Journal

Publisher

BMC
DOI: 10.1186/1472-6947-11-53

Keywords

-

Funding

  1. Canadian Institute for Health Information
  2. Ontario Institute for Cancer Research
  3. Canadian Institutes of Health Research

Background: The Canadian Institute for Health Information (CIHI) collects hospital discharge abstract data (DAD) from Canadian provinces and territories. There are many demands for the disclosure of this data for research and analysis to inform policy making. To expedite the disclosure of data for some of these purposes, the construction of a DAD public use microdata file (PUMF) was considered. Such purposes include: confirming some published results, providing broader feedback to CIHI to improve data quality, training students and fellows, providing an easily accessible data set for researchers to prepare for analyses on the full DAD data set, and serving as a large health data set for computer scientists and statisticians to evaluate analysis and data mining techniques. The objective of this study was to measure the probability of re-identification for records in a PUMF, and to de-identify a national DAD PUMF consisting of 10% of records.

Methods: Plausible attacks on a PUMF were evaluated. Based on these attacks, the 2008-2009 national DAD was de-identified. A new algorithm was developed to minimize the amount of suppression while maximizing the precision of the data. The acceptable threshold for the probability of correct re-identification of a record was set at between 0.04 and 0.05. Information loss was measured in terms of the extent of suppression and entropy.

Results: Two different PUMF files were produced, one with geographic information, and one with no geographic information but more clinical information. At a threshold of 0.05, the maximum proportion of records with the diagnosis code suppressed was 20%, but these suppressions represented only 8-9% of all values in the DAD. Our suppression algorithm has less information loss than a more traditional approach to suppression. Smaller regions, patients with longer stays, and age groups that are infrequently admitted to hospitals tend to be the ones with the highest rates of suppression.

Conclusions: The strategies we used to maximize data utility and minimize information loss can result in a PUMF that would be useful for the specific purposes noted earlier. However, to create a more detailed file with less information loss suitable for more complex health services research, the risk would need to be mitigated by requiring the data recipient to commit to a data sharing agreement.
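
To make the re-identification threshold concrete, the following is a minimal Python sketch and not the authors' algorithm. The abstract does not state how the probability is computed, so the sketch assumes a common model in which a record's risk is 1 divided by the number of records sharing its quasi-identifier values. Under that assumption a threshold of 0.05 corresponds to groups of at least 20 records. The function names, field names, and toy values are all hypothetical.

```python
from collections import Counter


def reid_risk(records, quasi_identifiers):
    """Per-record risk, ASSUMING risk = 1 / size of the record's equivalence class.

    An equivalence class is the set of records that share the same values
    on every quasi-identifier. This risk model is an assumption of the
    sketch, not something stated in the abstract.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]


def flag_for_suppression(records, quasi_identifiers, threshold=0.05):
    """Mark records whose assumed risk exceeds the threshold."""
    return [risk > threshold for risk in reid_risk(records, quasi_identifiers)]


# Toy data (hypothetical): 25 records share one combination, 3 share another.
common = {"age_group": "40-49", "region": "ON", "diagnosis": "J18"}
rare = {"age_group": "90+", "region": "YT", "diagnosis": "C50"}
records = [dict(common) for _ in range(25)] + [dict(rare) for _ in range(3)]

qi = ["age_group", "region", "diagnosis"]
flags = flag_for_suppression(records, qi)
print(sum(flags), "of", len(records), "records exceed the threshold")
```

In this toy case the 25 common records have an assumed risk of 1/25 = 0.04 and pass, while the 3 rare records have a risk of 1/3 and are flagged. A real de-identification procedure would then choose which values to suppress or generalize and recompute the risk, which this sketch does not attempt.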

Recommended

Article Computer Science, Interdisciplinary Applications

A risk-based framework for biomedical data sharing

Fida K. Dankar, Radja Badji

JOURNAL OF BIOMEDICAL INFORMATICS (2017)

Article Medical Informatics

Estimating the re-identification risk of clinical data sets

Fida Kamal Dankar, Khaled El Emam, Angelica Neisa, Tyson Roffey

BMC MEDICAL INFORMATICS AND DECISION MAKING (2012)

Article Medical Informatics

Evaluating the risk of patient re-identification from adverse drug event reports

Khaled El Emam, Fida K. Dankar, Angelica Neisa, Elizabeth Jonker

BMC MEDICAL INFORMATICS AND DECISION MAKING (2013)

Article Multidisciplinary Sciences

A Protocol for the Secure Linking of Registries for HPV Surveillance

Khaled El Emam, Saeed Samet, Jun Hu, Liam Peyton, Craig Earle, Gayatri C. Jayaraman, Tom Wong, Murat Kantarcioglu, Fida Dankar, Aleksander Essex

PLOS ONE (2012)

Review Biochemistry & Molecular Biology

Informed Consent in Biomedical Research

Fida K. Dankar, Marton Gergely, Samar K. Dankar

COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL (2019)

Article Chemistry, Multidisciplinary

Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation

Fida K. Dankar, Mahmoud Ibrahim

Summary: Synthetic data provides a privacy-protecting mechanism for healthcare data, generating artificial datasets without identifiable information for safe sharing. The paper evaluates the impact of different synthetic data generation and usage settings on the utility of the data and models, aiming to provide insights into the best practices when working with synthetic data.

APPLIED SCIENCES-BASEL (2021)

Article Computer Science, Information Systems

A Multi-Dimensional Evaluation of Synthetic Data Generators

Fida K. Dankar, Mahmoud K. Ibrahim, Leila Ismail

Summary: This paper proposes four criteria for masked data evaluation and compares four data synthesizers using representative metrics, while also examining the correlations between the selected metrics.

IEEE ACCESS (2022)

Review Biochemistry & Molecular Biology

Dynamic-informed consent: A potential solution for ethical dilemmas in population sequencing initiatives

Fida K. Dankar, Marton Gergely, Bradley Malin, Radja Badji, Samar K. Dankar, Khaled Shuaib

COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL (2020)

Article Medical Informatics

Privacy-Preserving Analysis of Distributed Biomedical Data: Designing Efficient and Secure Multiparty Computations Using Distributed Statistical Learning Theory

Fida K. Dankar, Nisha Madathil, Samar K. Dankar, Sabri Boughorbel

JMIR MEDICAL INFORMATICS (2019)

Review Genetics & Heredity

The development of large-scale de-identified biomedical databases in the age of genomics-principles and challenges

Fida K. Dankar, Andrey Ptitsyn, Samar K. Dankar

HUMAN GENOMICS (2018)

Proceedings Paper Computer Science, Information Systems

A Theoretical multi-Level Privacy Protection Framework for Biomedical Data Warehouses

Fida K. Dankar, Rashid Al Ali

6TH INTERNATIONAL CONFERENCE ON EMERGING UBIQUITOUS SYSTEMS AND PERVASIVE NETWORKS (EUSPN 2015)/THE 5TH INTERNATIONAL CONFERENCE ON CURRENT AND FUTURE TRENDS OF INFORMATION AND COMMUNICATION TECHNOLOGIES IN HEALTHCARE (ICTH-2015) (2015)

Article Computer Science, Theory & Methods

Privacy Preserving Linear Regression on Distributed Databases

Fida K. Dankar

TRANSACTIONS ON DATA PRIVACY (2015)

Proceedings Paper Computer Science, Information Systems

Efficient Private Information Retrieval for Geographical Aggregation

Fida K. Dankar, Khaled El Emam, Stan Matwin

5TH INTERNATIONAL CONFERENCE ON EMERGING UBIQUITOUS SYSTEMS AND PERVASIVE NETWORKS / THE 4TH INTERNATIONAL CONFERENCE ON CURRENT AND FUTURE TRENDS OF INFORMATION AND COMMUNICATION TECHNOLOGIES IN HEALTHCARE / AFFILIATED WORKSHOPS (2014)

Review Computer Science, Theory & Methods

Practicing Differential Privacy in Health Care: A Review

Fida K. Dankar, Khaled El Emam

TRANSACTIONS ON DATA PRIVACY (2013)
