Article

A Ubiquitous Machine Learning Accelerator With Automatic Parallelization on FPGA

Journal

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
Volume 31, Issue 10, Pages 2346-2359

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2020.2990924

Keywords

Clustering algorithms; Hardware; Machine learning algorithms; Machine learning; Out-of-order execution; Partitioning algorithms; Parallel processing; Accelerator; FPGA; Automatic parallelization

Funding

  1. National Key Research and Development Program of China [2017YFA0700900]
  2. National Natural Science Foundation of China [61976200]
  3. Jiangsu Provincial Natural Science Foundation [BK20181193]
  4. Youth Innovation Promotion Association CAS [2017497]
  5. Fundamental Research Funds for the Central Universities [WK2150110003]

Abstract

Machine learning has been widely applied in various emerging data-intensive applications and must be optimized and accelerated by powerful engines to process very large-scale data. Recently, instruction-set-based accelerators on Field Programmable Gate Arrays (FPGAs) have become a promising approach for machine learning applications, since the customized instructions can be scheduled to achieve higher instruction-level parallelism. In this article, we design a ubiquitous accelerator with out-of-order automatic parallelization for large-scale data-intensive applications. The accelerator accommodates four representative application classes: clustering algorithms, deep neural networks, genome sequencing, and collaborative filtering. To improve coarse-grained instruction-level parallelism, the accelerator employs an out-of-order scheduling method that enables parallel dataflow computation. We use Colored Petri Net (CPN) tools to analyze the dependencies in the applications, and build a hardware prototype on a real FPGA platform. For clustering applications, the accelerator supports four algorithms: K-Means, SLINK, PAM, and DBSCAN. For collaborative filtering, it accommodates Tanimoto, Euclidean, Cosine, and Pearson correlation as similarity metrics. For deep learning, we implement hardware accelerators for both the training and inference processes. Finally, for genome sequencing, we design a hardware accelerator for the BWA-SW algorithm. Experimental results show that the accelerator architecture reaches up to a 25X speedup over Intel processors with affordable hardware cost, insignificant power consumption, and high flexibility.
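The out-of-order scheduling method mentioned in the abstract is not detailed there. As a rough illustration only, the C sketch below shows a scoreboard-style readiness check that lets a coarse-grained instruction issue ahead of older ones whenever no RAW, WAW, or WAR hazard remains; the Instr record, operand-id encoding, and oldest-first selection policy are all assumptions for the sketch, not the paper's design.

#include <stdbool.h>

/* Hypothetical coarse-grained instruction record; operand ids are small
 * non-negative integers, -1 marks an unused source, and dst is always valid. */
typedef struct {
    int src1, src2;   /* source operand ids (-1 if unused) */
    int dst;          /* destination operand id */
    bool issued;      /* already dispatched to a functional unit */
    bool done;        /* result written back */
} Instr;

/* An instruction may issue out of order when no earlier, unfinished
 * instruction creates a RAW, WAW, or WAR hazard with it. */
static bool can_issue(const Instr *q, int idx) {
    const Instr *c = &q[idx];
    for (int i = 0; i < idx; i++) {
        const Instr *p = &q[i];
        if (p->done) continue;
        if (p->dst == c->src1 || p->dst == c->src2) return false; /* RAW */
        if (p->dst == c->dst)                       return false; /* WAW */
        if (p->src1 == c->dst || p->src2 == c->dst) return false; /* WAR */
    }
    return true;
}

/* Each cycle, pick the oldest unissued instruction that is hazard-free;
 * returns its queue index, or -1 if nothing can issue this cycle. */
int select_next(const Instr *q, int n) {
    for (int i = 0; i < n; i++)
        if (!q[i].issued && can_issue(q, i))
            return i;
    return -1;
}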
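The four collaborative-filtering similarity metrics named in the abstract have standard mathematical definitions. The C reference functions below restate those formulas in software, as a sketch of what the hardware must compute per vector pair; they are not the paper's datapath.

#include <math.h>
#include <stddef.h>

/* Euclidean distance: sqrt of the summed squared differences. */
double euclidean_distance(const double *x, const double *y, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - y[i];
        s += d * d;
    }
    return sqrt(s);
}

/* Cosine similarity: dot product normalized by both vector norms. */
double cosine_similarity(const double *x, const double *y, size_t n) {
    double dot = 0.0, nx = 0.0, ny = 0.0;
    for (size_t i = 0; i < n; i++) {
        dot += x[i] * y[i];
        nx  += x[i] * x[i];
        ny  += y[i] * y[i];
    }
    return dot / (sqrt(nx) * sqrt(ny));
}

/* Tanimoto (extended Jaccard) for real-valued vectors:
 * dot / (|x|^2 + |y|^2 - dot). */
double tanimoto_similarity(const double *x, const double *y, size_t n) {
    double dot = 0.0, nx = 0.0, ny = 0.0;
    for (size_t i = 0; i < n; i++) {
        dot += x[i] * y[i];
        nx  += x[i] * x[i];
        ny  += y[i] * y[i];
    }
    return dot / (nx + ny - dot);
}

/* Pearson correlation: covariance normalized by both standard deviations. */
double pearson_correlation(const double *x, const double *y, size_t n) {
    double mx = 0.0, my = 0.0;
    for (size_t i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= (double)n;
    my /= (double)n;
    double cov = 0.0, vx = 0.0, vy = 0.0;
    for (size_t i = 0; i < n; i++) {
        cov += (x[i] - mx) * (y[i] - my);
        vx  += (x[i] - mx) * (x[i] - mx);
        vy  += (y[i] - my) * (y[i] - my);
    }
    return cov / (sqrt(vx) * sqrt(vy));
}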
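BWA-SW couples a BWT index with Smith-Waterman local alignment, and the dynamic-programming core is the natural target for hardware acceleration. The sketch below is only the textbook score-only Smith-Waterman recurrence with a linear gap penalty; the match/mismatch/gap parameters are placeholders, and the paper's actual pipeline organization is not given in this abstract.

#include <string.h>

/* Score-only Smith-Waterman with linear gaps; returns the best local
 * alignment score between sequences a (length la) and b (length lb).
 * match > 0, mismatch is typically negative, gap is a positive penalty. */
int smith_waterman(const char *a, int la, const char *b, int lb,
                   int match, int mismatch, int gap) {
    /* Two DP rows suffice when only the score is needed (C99 VLAs,
     * so this assumes a modest lb). */
    int prev[lb + 1], curr[lb + 1];
    memset(prev, 0, sizeof prev);
    int best = 0;
    for (int i = 1; i <= la; i++) {
        curr[0] = 0;
        for (int j = 1; j <= lb; j++) {
            int sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int del = prev[j] - gap;
            int ins = curr[j - 1] - gap;
            int h = sub;
            if (del > h) h = del;
            if (ins > h) h = ins;
            if (h < 0) h = 0;   /* local alignment resets at zero */
            curr[j] = h;
            if (h > best) best = h;
        }
        memcpy(prev, curr, sizeof prev);
    }
    return best;
}

Toy parameters such as match = 2, mismatch = -1, gap = 2 exercise the recurrence; production scoring schemes differ.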
