Article

SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision

Journal

BIOINFORMATICS
Volume 30, Issue 18, Pages 2652-2653

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/btu343

Funding

  1. Scientific Exchange Programme NMS-CH [12.289]
  2. Swiss National Science Foundation [310000-116502]

Summary: Many time-consuming analyses of next-generation sequencing data can be addressed with modern cloud computing. The Apache Hadoop-based solutions have become popular in genomics because of their scalability in a cloud infrastructure. So far, most of these tools have been used for batch data processing rather than interactive data querying. The SparkSeq software has been created to take advantage of a new MapReduce framework, Apache Spark, for next-generation sequencing data. SparkSeq is a general-purpose, flexible and easily extendable library for genomic cloud computing. It can be used to build genomic analysis pipelines in Scala and run them in an interactive way. SparkSeq opens up the possibility of customized ad hoc secondary analyses and iterative machine learning algorithms. This article demonstrates its scalability and overall fast performance by running the analyses of sequencing datasets. Tests of SparkSeq also prove that the use of cache and HDFS block size can be tuned for the optimal performance on multiple worker nodes.
