4.7 Review

Bioinformatics applications on Apache Spark

Journal

GIGASCIENCE
Volume 7, Issue 8, Pages -

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/gigascience/giy098

Keywords

next-generation sequencing; bioinformatics; Apache Spark; resilient distributed dataset; memory computing

Funding

  1. National Key R&D Program of China [2018YFC090002, 2017YFB0202602, 2017YFC1311003, 2017YFB0202104, 2016YFC1302500, 2016YFB0200400]
  2. National Natural Science Foundation of China [61772543, U1435222, 61625202, 61272056, 61771331]
  3. Funds of State Key Laboratory of Chemo/Biosensing and Chemometrics
  4. Fundamental Research Funds for the Central Universities
  5. Guangdong Provincial Department of Science and Technology [2016B090918122]

With the rapid development of next-generation sequencing technology, ever-increasing quantities of genomic data pose a tremendous challenge to data processing. Therefore, there is an urgent need for highly scalable and powerful computational systems. Among the state-of-the-art parallel computing platforms, Apache Spark is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing that ensures high fault tolerance and high scalability by introducing the resilient distributed dataset abstraction. In terms of performance, Spark can be up to 100 times faster than Hadoop for memory access and 10 times faster for disk access. Moreover, it provides high-level application programming interfaces in Java, Scala, Python, and R. It also supports several advanced components, including Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph computation, and Spark Streaming for stream computing. We surveyed Spark-based applications used in next-generation sequencing and other biological domains, such as epigenetics, phylogeny, and drug discovery. The results of this survey provide a comprehensive guideline to help bioinformatics researchers apply Spark in their own fields.
