Article

A Ubiquitous Machine Learning Accelerator With Automatic Parallelization on FPGA

Journal

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
Volume 31, Issue 10, Pages 2346-2359

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2020.2990924

Keywords

Clustering algorithms; Hardware; Machine learning algorithms; Machine learning; Out-of-order execution; Partitioning algorithms; Parallel processing; Accelerator; FPGA; Automatic parallelization

Funding

  1. National Key Research and Development Program of China [2017YFA0700900]
  2. National Natural Science Foundation of China [61976200]
  3. Jiangsu Provincial Natural Science Foundation [BK20181193]
  4. Youth Innovation Promotion Association CAS [2017497]
  5. Fundamental Research Funds for the Central Universities [WK2150110003]

Abstract

Machine learning has been widely applied in various emerging data-intensive applications and must be optimized and accelerated by powerful engines to process very large-scale data. Recently, instruction-set-based accelerators on Field Programmable Gate Arrays (FPGAs) have become a promising approach for machine learning applications, since the customized instructions can be scheduled to achieve higher instruction-level parallelism. In this article, we design a ubiquitous accelerator with out-of-order automatic parallelization for large-scale data-intensive applications. The accelerator accommodates four representative application classes: clustering algorithms, deep neural networks, genome sequencing, and collaborative filtering. To improve coarse-grained instruction-level parallelism, the accelerator employs an out-of-order scheduling method that enables parallel dataflow computation. We use Colored Petri Net (CPN) tools to analyze the dependencies in the applications, and build a hardware prototype on a real FPGA platform. For clustering applications, the accelerator supports four algorithms: K-Means, SLINK, PAM, and DBSCAN. For collaborative filtering, it accommodates Tanimoto, Euclidean, Cosine, and Pearson correlation as similarity metrics. For deep learning, we implement hardware accelerators for both the training and inference processes. Finally, for genome sequencing, we design a hardware accelerator for the BWA-SW algorithm. Experimental results show that the accelerator architecture reaches up to a 25X speedup over Intel processors with affordable hardware cost, insignificant power consumption, and high flexibility.
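The out-of-order scheduling method mentioned in the abstract is not detailed there. As a rough illustration only, the C sketch below shows a scoreboard-style readiness check that lets a coarse-grained instruction issue ahead of older ones whenever no RAW, WAW, or WAR hazard remains; the Instr record, operand-id encoding, and oldest-first selection policy are all assumptions for the sketch, not the paper's design.

#include <stdbool.h>

/* Hypothetical coarse-grained instruction record; operand ids are small
 * non-negative integers, -1 marks an unused source, and dst is always valid. */
typedef struct {
    int src1, src2;   /* source operand ids (-1 if unused) */
    int dst;          /* destination operand id */
    bool issued;      /* already dispatched to a functional unit */
    bool done;        /* result written back */
} Instr;

/* An instruction may issue out of order when no earlier, unfinished
 * instruction creates a RAW, WAW, or WAR hazard with it. */
static bool can_issue(const Instr *q, int idx) {
    const Instr *c = &q[idx];
    for (int i = 0; i < idx; i++) {
        const Instr *p = &q[i];
        if (p->done) continue;
        if (p->dst == c->src1 || p->dst == c->src2) return false; /* RAW */
        if (p->dst == c->dst)                       return false; /* WAW */
        if (p->src1 == c->dst || p->src2 == c->dst) return false; /* WAR */
    }
    return true;
}

/* Each cycle, pick the oldest unissued instruction that is hazard-free;
 * returns its queue index, or -1 if nothing can issue this cycle. */
int select_next(const Instr *q, int n) {
    for (int i = 0; i < n; i++)
        if (!q[i].issued && can_issue(q, i))
            return i;
    return -1;
}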
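The four collaborative-filtering similarity metrics named in the abstract have standard mathematical definitions. The C reference functions below restate those formulas in software, as a sketch of what the hardware must compute per vector pair; they are not the paper's datapath.

#include <math.h>
#include <stddef.h>

/* Euclidean distance: sqrt of the summed squared differences. */
double euclidean_distance(const double *x, const double *y, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - y[i];
        s += d * d;
    }
    return sqrt(s);
}

/* Cosine similarity: dot product normalized by both vector norms. */
double cosine_similarity(const double *x, const double *y, size_t n) {
    double dot = 0.0, nx = 0.0, ny = 0.0;
    for (size_t i = 0; i < n; i++) {
        dot += x[i] * y[i];
        nx  += x[i] * x[i];
        ny  += y[i] * y[i];
    }
    return dot / (sqrt(nx) * sqrt(ny));
}

/* Tanimoto (extended Jaccard) for real-valued vectors:
 * dot / (|x|^2 + |y|^2 - dot). */
double tanimoto_similarity(const double *x, const double *y, size_t n) {
    double dot = 0.0, nx = 0.0, ny = 0.0;
    for (size_t i = 0; i < n; i++) {
        dot += x[i] * y[i];
        nx  += x[i] * x[i];
        ny  += y[i] * y[i];
    }
    return dot / (nx + ny - dot);
}

/* Pearson correlation: covariance normalized by both standard deviations. */
double pearson_correlation(const double *x, const double *y, size_t n) {
    double mx = 0.0, my = 0.0;
    for (size_t i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= (double)n;
    my /= (double)n;
    double cov = 0.0, vx = 0.0, vy = 0.0;
    for (size_t i = 0; i < n; i++) {
        cov += (x[i] - mx) * (y[i] - my);
        vx  += (x[i] - mx) * (x[i] - mx);
        vy  += (y[i] - my) * (y[i] - my);
    }
    return cov / (sqrt(vx) * sqrt(vy));
}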
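BWA-SW couples a BWT index with Smith-Waterman local alignment, and the dynamic-programming core is the natural target for hardware acceleration. The sketch below is only the textbook score-only Smith-Waterman recurrence with a linear gap penalty; the match/mismatch/gap parameters are placeholders, and the paper's actual pipeline organization is not given in this abstract.

#include <string.h>

/* Score-only Smith-Waterman with linear gaps; returns the best local
 * alignment score between sequences a (length la) and b (length lb).
 * match > 0, mismatch is typically negative, gap is a positive penalty. */
int smith_waterman(const char *a, int la, const char *b, int lb,
                   int match, int mismatch, int gap) {
    /* Two DP rows suffice when only the score is needed (C99 VLAs,
     * so this assumes a modest lb). */
    int prev[lb + 1], curr[lb + 1];
    memset(prev, 0, sizeof prev);
    int best = 0;
    for (int i = 1; i <= la; i++) {
        curr[0] = 0;
        for (int j = 1; j <= lb; j++) {
            int sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int del = prev[j] - gap;
            int ins = curr[j - 1] - gap;
            int h = sub;
            if (del > h) h = del;
            if (ins > h) h = ins;
            if (h < 0) h = 0;   /* local alignment resets at zero */
            curr[j] = h;
            if (h > best) best = h;
        }
        memcpy(prev, curr, sizeof prev);
    }
    return best;
}

Toy parameters such as match = 2, mismatch = -1, gap = 2 exercise the recurrence; production scoring schemes differ.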
