Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments

PROCEEDINGS OF THE IEEE (2016)

Journal

PROCEEDINGS OF THE IEEE

Volume 104, Issue 1, Pages 58-92

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/JPROC.2015.2494219

Keywords

Big data; distributed matrix algorithms; least absolute deviation; least squares; preconditioning; randomized linear algebra; subspace embedding

Category

Engineering, Electrical & Electronic

Funding

U.S. Army Research Office
Defense Advanced Research Projects Agency
U.S. Department of Energy

Abstract

In this era of large-scale data, distributed systems built on top of clusters of commodity hardware provide cheap and reliable storage and scalable processing of massive data. With cheap storage, instead of storing only currently relevant data, it is common to store as much data as possible, hoping that its value can be extracted later. In this way, exabytes (10^18 bytes) of data are being created on a daily basis. Extracting value from these data, however, requires scalable implementations of advanced analytical algorithms beyond simple data processing, e.g., statistical regression methods, linear algebra, and optimization algorithms. Most such traditional methods are designed to minimize floating-point operations, which are the dominant cost of in-memory computation on a single machine. In parallel and distributed environments, however, load balancing and communication, including disk and network input/output (I/O), can easily dominate computation. These factors greatly increase the complexity of algorithm design and challenge traditional ways of thinking about the design of parallel and distributed algorithms. Here, we review recent work on developing and implementing randomized matrix algorithms in large-scale parallel and distributed environments. Randomized algorithms for matrix problems have received a great deal of attention in recent years, thus far typically either in theory, in machine learning applications, or with implementations on a single machine. Our main focus is on the underlying theory and practical implementation of random projection and random sampling algorithms for very large, very overdetermined (i.e., over-constrained) ℓ1- and ℓ2-regression problems.
Randomization can be used in one of two related ways: either to construct subsampled problems that can be solved, exactly or approximately, with traditional numerical methods; or to construct preconditioned versions of the original full problem that are easier to solve with traditional iterative algorithms. Theoretical results demonstrate that in near input-sparsity time and with only a few passes through the data one can obtain very strong relative-error approximate solutions, with high probability. Empirical results highlight the importance of various tradeoffs (e.g., between the time to construct an embedding and the conditioning quality of the embedding, between the relative importance of computation versus communication, etc.) and demonstrate that ℓ1- and ℓ2-regression problems can be solved to low, medium, or high precision in existing distributed systems on up to terabyte-sized data.
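
The two uses of randomization described in the abstract can be illustrated on a small, single-machine ℓ2-regression problem. The sketch below is not the paper's implementation. It assumes a dense Gaussian random projection as the embedding, a plain CGLS (conjugate gradient on the normal equations) loop as the iterative solver, and arbitrary problem sizes; the function names `gaussian_sketch`, `sketch_and_solve`, and `sketch_and_precondition` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def gaussian_sketch(A, b, sketch_rows, rng):
    """Apply a dense Gaussian random projection S to A and b (illustrative choice)."""
    S = rng.standard_normal((sketch_rows, A.shape[0])) / np.sqrt(sketch_rows)
    return S @ A, S @ b


def sketch_and_solve(A, b, sketch_rows, rng):
    """Way 1: solve the much smaller sketched problem min ||S A x - S b||_2."""
    SA, Sb = gaussian_sketch(A, b, sketch_rows, rng)
    x, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x


def sketch_and_precondition(A, b, sketch_rows, rng, max_iters=100, tol=1e-12):
    """Way 2: use R from QR(S A) as a right preconditioner for the full problem."""
    SA, _ = gaussian_sketch(A, b, sketch_rows, rng)
    R = np.linalg.qr(SA, mode="r")        # S A = Q R, only R is needed
    B = np.linalg.solve(R.T, A.T).T       # B = A R^{-1}, expected to be well conditioned
    # CGLS on min ||B y - b||_2, starting from y = 0.
    y = np.zeros(A.shape[1])
    r = b.astype(float).copy()            # residual b - B y
    s = B.T @ r                           # gradient direction B^T r
    p = s.copy()
    gamma = s @ s
    gamma0 = gamma
    for _ in range(max_iters):
        if np.sqrt(gamma) <= tol * np.sqrt(gamma0):
            break                         # gradient is negligible, stop early
        q = B @ p
        alpha = gamma / (q @ q)
        y += alpha * p
        r -= alpha * q
        s = B.T @ r
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return np.linalg.solve(R, y)          # map back: x = R^{-1} y


# Small overdetermined test problem with badly scaled columns (sizes are arbitrary).
rng = np.random.default_rng(0)
n, d = 2000, 10
A = rng.standard_normal((n, d)) * np.logspace(0, 3, d)
b = A @ np.ones(d) + 0.1 * rng.standard_normal(n)

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)   # direct solve, used as reference
x_ss = sketch_and_solve(A, b, 200, rng)
x_sp = sketch_and_precondition(A, b, 200, rng)


def residual(x):
    return np.linalg.norm(A @ x - b)


ratio_ss = residual(x_ss) / residual(x_ref)     # low precision: close to, but above, 1
ratio_sp = residual(x_sp) / residual(x_ref)     # high precision: essentially 1
print(ratio_ss, ratio_sp)
```

In this toy setting the subsampled solve touches the full data only once and returns a low-precision answer, while the preconditioned iteration makes several passes over the data and reaches roughly the accuracy of a direct solve. A dense Gaussian projection is used here only for brevity; it does not run in input-sparsity time.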
