Article

Effect of lossy compression of quality scores on variant calling

Journal

BRIEFINGS IN BIOINFORMATICS
Volume 18, Issue 2, Pages 183-194

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bib/bbw011

Keywords

Genomic data; lossy compression; quality scores; variant calling

Funding

  1. Stanford Graduate Fellowship Program in Science and Engineering
  2. Basque Government
  3. Center for Science of Information (CSoI)
  4. National Institutes of Health [2014-07364-01, 1 U01 CA198943-01]
  5. National Science Foundation [1157849-1-QAZCC]
  6. National Library of Medicine Training [T15 LM7033]
  7. National Science Foundation

Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms that identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller on the original data differs from the output obtained when the quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we analyze the accuracy of the variant calls, both for the original data and for data that have undergone lossy compression. We show that lossy compression can significantly reduce storage requirements while maintaining variant calling performance comparable to that obtained with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that obtained with the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
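To make the evaluation idea concrete, the following is a minimal Python sketch, not the paper's actual pipeline. It assumes Phred+33 encoded quality strings, uses simple uniform binning as a stand-in lossy compressor (the compressors evaluated in the paper are not named in the abstract), and scores a call set against a gold standard with sensitivity and precision, which are common choices for this kind of comparison. The function names `quantize_qualities` and `compare_calls`, the bin width, and the variant tuples are all hypothetical.

```python
PHRED_OFFSET = 33   # assumption: Sanger / Illumina 1.8+ encoding
MAX_PHRED = 93      # highest score representable in printable ASCII


def quantize_qualities(qual: str, bin_width: int = 10) -> str:
    """Replace each quality score with the midpoint of its bin.

    Illustrative lossy scheme only: fewer distinct symbols make the
    quality string easier to compress with a downstream entropy coder.
    """
    out = []
    for ch in qual:
        q = ord(ch) - PHRED_OFFSET
        bin_start = (q // bin_width) * bin_width
        representative = min(bin_start + bin_width // 2, MAX_PHRED)
        out.append(chr(representative + PHRED_OFFSET))
    return "".join(out)


def compare_calls(called: set, truth: set) -> dict:
    """Score a set of variant calls against a gold-standard set.

    Variants are hashable tuples such as (chrom, pos, ref, alt).
    """
    tp = len(called & truth)   # called and present in the gold standard
    fp = len(called - truth)   # called but absent from the gold standard
    fn = len(truth - called)   # in the gold standard but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "sensitivity": sensitivity}


if __name__ == "__main__":
    print(quantize_qualities("II5#"))

    # Hypothetical call sets; in practice these would be parsed from the
    # VCF files produced with original and lossily compressed qualities.
    truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
    calls_lossy = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 75, "T", "C")}
    print(compare_calls(calls_lossy, truth))
```

Running `compare_calls` once on the calls from the original data and once on the calls from the lossily compressed data, against the same gold standard, gives the side-by-side comparison the abstract describes.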
