Article

Accelerating sparse matrix-matrix multiplication with GPU Tensor Cores

Journal

COMPUTERS & ELECTRICAL ENGINEERING
Volume 88

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.compeleceng.2020.106848

Keywords

Sparse matrix multiplication; GPU; Tensor Cores; Parallel computing; SpGEMM

Funding

  1. High Performance Soft-tissue Navigation (HIPERNAV - H2020-MSCA-ITN-2016)
  2. European Union [722068]

Sparse general matrix-matrix multiplication (spGEMM) is an essential component in many scientific and data analytics applications. However, the sparsity pattern of the input matrices and the interaction of their patterns make spGEMM challenging. Modern GPUs include Tensor Core Units (TCUs), which specialize in dense matrix multiplication. Our aim is to re-purpose TCUs for sparse matrices. The key idea of our spGEMM algorithm, tSparse, is to multiply sparse rectangular blocks using the mixed precision mode of TCUs. tSparse partitions the input matrices into tiles and operates only on tiles that contain one or more elements. It creates a task list of the tiles and performs matrix multiplication of these tiles using TCUs. To the best of our knowledge, this is the first time that TCUs have been used in the context of spGEMM. We show that spGEMM, with our tiling approach, benefits from TCUs. Our approach significantly improves the performance of spGEMM in comparison to cuSPARSE, CUSP, RMerge2, Nsparse, AC-SpGEMM and spECK.
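The tiling idea in the abstract can be illustrated with a minimal CPU-only sketch. It is not the authors' implementation. The tile size, the function names (`to_tiles`, `build_tasks`, `multiply_tile`, `tiled_spgemm`) and the dictionary layout are all assumptions made for illustration. A plain dense loop stands in for the Tensor Core call, so mixed precision is not modelled.

```python
TILE = 2  # illustrative tile edge; an assumption, not a value from the paper


def to_tiles(matrix, tile=TILE):
    """Split a list-of-lists matrix into a dict of non-empty dense tiles.

    Keys are (tile_row, tile_col). All-zero tiles are never stored,
    which mirrors "operates only on tiles that contain one or more elements".
    """
    tiles = {}
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value != 0:
                key = (r // tile, c // tile)
                block = tiles.setdefault(key, [[0] * tile for _ in range(tile)])
                block[r % tile][c % tile] = value
    return tiles


def build_tasks(a_tiles, b_tiles):
    """Task list: every (A tile, B tile) pair sharing the inner tile index."""
    b_cols_by_row = {}
    for (k, j) in b_tiles:
        b_cols_by_row.setdefault(k, []).append(j)
    return [((i, k), (k, j))
            for (i, k) in a_tiles
            for j in b_cols_by_row.get(k, [])]


def multiply_tile(a, b, acc, tile=TILE):
    """Dense tile product accumulated into acc (stand-in for a TCU operation)."""
    for i in range(tile):
        for j in range(tile):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(tile))


def tiled_spgemm(a, b, tile=TILE):
    """Multiply a and b tile by tile, returning the output tiles as a dict."""
    a_tiles, b_tiles = to_tiles(a, tile), to_tiles(b, tile)
    c_tiles = {}
    for a_key, b_key in build_tasks(a_tiles, b_tiles):
        out_key = (a_key[0], b_key[1])
        acc = c_tiles.setdefault(out_key, [[0] * tile for _ in range(tile)])
        multiply_tile(a_tiles[a_key], b_tiles[b_key], acc, tile)
    return c_tiles
```

In this sketch the task list grows with the number of matching non-empty tile pairs rather than with the full matrix dimensions, which is the property that lets dense hardware be applied to sparse inputs. A GPU version would replace `multiply_tile` with a Tensor Core matrix-multiply-accumulate on fixed-size fragments.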

Recommended

Article Computer Science, Hardware & Architecture

Lightweight method of shuffling overlapped data-blocks for data integrity and security in WSNs

Francisco Alcaraz Velasco, Jose Manuel Palomares, Joaquin Olivares

Summary: This study introduces a new data integrity method with medium security levels and low energy cost in wireless sensor networks, using a lightweight mechanism with overlapping blocks for data protection, demonstrating its effectiveness through experiments.

COMPUTER NETWORKS (2021)

Article Computer Science, Hardware & Architecture

FPGA-Based Near-Memory Acceleration of Modern Data-Intensive Applications

Gagandeep Singh, Mohammed Alser, Damla Senol Cali, Dionysios Diamantopoulos, Juan Gomez-Luna, Henk Corporaal, Onur Mutlu

Summary: Modern data-intensive applications require high computational capabilities but are limited by strict power constraints. The development of FPGAs with HBM provides a solution to alleviate the bottleneck of data movement, improving efficiency and energy savings in computing systems.

IEEE MICRO (2021)

Article Computer Science, Hardware & Architecture

Accelerating Weather Prediction Using Near-Memory Reconfigurable Fabric

Gagandeep Singh, Dionysios Diamantopoulos, Juan Gomez-Luna, Christoph Hagleitner, Sander Stuijk, Henk Corporaal, Onur Mutlu

Summary: The ongoing climate change requires fast and accurate weather and climate modeling. However, current CPU and GPU implementations face limitations in performance and energy consumption for large-scale weather prediction simulations. To overcome these challenges, near-memory acceleration using high-bandwidth memory (HBM) is proposed and evaluated. Experimental results show significant performance improvement and energy efficiency compared to traditional methods.

ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS (2022)

Article Computer Science, Hardware & Architecture

PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM

Ataberk Olgun, Juan Gomez Luna, Konstantinos Kanellopoulos, Behzad Salami, Hasan Hassan, Oguz Ergin, Onur Mutlu

Summary: This paper introduces commodity DRAM-based processing-using-memory (PuM) techniques that can alleviate the data movement bottleneck at low cost. The challenges of system integration for these techniques are discussed, and a flexible framework called Processing-in-DRAM (PiDRAM) is developed to address these challenges. The authors implement and evaluate two PuM techniques, demonstrating the flexibility and effectiveness of PiDRAM. The potential performance improvement brought by PiDRAM is observed.

ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION (2022)

Article Computer Science, Hardware & Architecture

Accelerating Neural Network Inference With Processing-in-DRAM: From the Edge to the Cloud

Geraldo F. Oliveira, Juan Gomez-Luna, Saugata Ghose, Amirali Boroumand, Onur Mutlu

Summary: Neural networks (NNs) are becoming increasingly important and complex. The processing-in-memory (PIM) paradigm can accelerate memory-bound NNs, but different PIM architectures have different effects on NN performance and energy efficiency.

IEEE MICRO (2022)

Article Computer Science, Artificial Intelligence

3D reconstruction system and multiobject local tracking algorithm designed for billiards

Francisco J. Rodriguez-Lozano, Juan C. Gamez-Granados, Hector Martinez, Jose M. Palomares, Joaquin Olivares

Summary: Virtual reality and augmented reality systems in billiards are useful for entertainment and for improving players' skills. However, tracking multiple small, identical objects such as balls can be challenging. This research proposes a new tracking algorithm called MOLT, which can accurately track balls even with motion blur caused by low-resolution and low-frame-rate devices. The proposed system covers all steps from image capture to 3D reconstruction using computer vision, providing a promising and useful tool for training.

APPLIED INTELLIGENCE (2023)

Article Computer Science, Interdisciplinary Applications

Efficient data dimensionality reduction method for improving road crack classification algorithms

Francisco J. Rodriguez-Lozano, Juan C. Gamez-Granados, Jose M. Palomares, Joaquin Olivares

Summary: Automatic crack classification is important for road maintenance. However, using many features for classification is inefficient for embedded systems with low computational resources. This study proposes a new data dimensionality reduction (DDR) method called DDR4CC, which reduces the required information about cracks to only four features. The effectiveness of DDR4CC is compared with eight other DDR methods using five different classification algorithms and datasets. Results show that DDR4CC improves the classification algorithms, providing highly accurate classifiers with minimal computation time.

COMPUTER-AIDED CIVIL AND INFRASTRUCTURE ENGINEERING (2023)

Article Computer Science, Information Systems

ALP: Alleviating CPU-Memory Data Movement Overheads in Memory-Centric Systems

Nika Mansouri Ghiasi, Nandita Vijaykumar, Geraldo F. Oliveira, Lois Orosa, Ivan Fernandez, Mohammad Sadrosadati, Konstantinos Kanellopoulos, Nastaran Hajinazar, Juan Gomez Luna, Onur Mutlu

Summary: Partitioning applications between near-data processing (NDP) and host CPU cores causes inter-segment data movement overhead, which can be mitigated by ALP, a programmer-transparent technique that proactively and accurately transfers required data between segments based on the invariant instructions. Evaluation on a wide range of workloads demonstrates significant speedup over traditional CPU-only and NDP-only executions.

IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING (2023)

Article Computer Science, Information Systems

Casper: Accelerating Stencil Computations Using Near-Cache Processing

Alain Denzler, Geraldo F. Oliveira, Nastaran Hajinazar, Rahul Bera, Gagandeep Singh, Juan Gomez-Luna, Onur Mutlu

Summary: This paper introduces Casper, a near-cache accelerator that improves the performance of stencil computations and reduces system energy consumption. Casper is designed based on two key ideas: avoiding the cost of moving rarely reused data throughout the cache hierarchy, and exploiting the regularity of data accesses and inherent parallelism of stencil computations. Experimental results show that Casper improves performance by an average of 1.65x (up to 4.16x) compared to commercial high-performance multi-core processors, while reducing system energy consumption by an average of 35% (up to 65%). Casper provides 37x (up to 190x) improvement in performance-per-area compared to a state-of-the-art GPU.

IEEE ACCESS (2023)

Proceedings Paper Computer Science, Artificial Intelligence

A Preliminary Fuzzy Markup Language based Approach for the Queue Buffer Size Optimization in Fog Nodes for Stream Processing

Gregorio Corpas-Prieto, Fernando Leon-Garcia, Juan Carlos Gamez-Granados, Jose Manuel Palomares, Joaquin Olivares, Jose Manuel Soto-Hidalgo

Summary: The Internet of Things (IoT) is divided into edge, fog, and cloud layers. The fog layer enables stream processing by handling data transmission and cascade processing. To optimize network traffic, factors such as connections, delays, and buffer size need to be considered, which are affected by uncertainty and imprecision. Fuzzy rule-based systems are suitable for managing complex data and imprecision. The proposed approach dynamically adjusts buffer size to prevent network collapse.

2022 IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS (FUZZ-IEEE) (2022)

Proceedings Paper Computer Science, Information Systems

Optimum Vessel Segmentation

Joaquin Olivares, Orestis Zachariadis, Nitin Satpute, Juan Gomez-Luna

Summary: Accurate blood vessel segmentation in medical imaging is crucial for surgeries. In this study, we introduce a parallelized region growth algorithm (pSRG) that computes the gradient using Persistence and grid-stride loops. This approach eliminates unnecessary memory transfers, leading to faster computation and more precise segmentation.

2022 17TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI) (2022)

Proceedings Paper Computer Science, Information Systems

Analysis of the random shuffling of message blocks as a low-cost integrity and security measure

Francisco Alcaraz-Velasco, Jose M. Palomares, Joaquin Olivares

Summary: Recently, a mechanism that randomly shuffles the data sent and allows securing the communication without the need to encrypt all the information has been proposed. This proposal is ideal for IoT systems with low computational capacity. It has been demonstrated that obtaining the original message without knowledge of the applied disordering is unfeasible with current technology, ensuring its safety.

2022 17TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI) (2022)

Article Computer Science, Information Systems

Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System

Juan Gomez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu

Summary: This paper provides a comprehensive analysis of the first publicly-available real-world PIM architecture. Experimental characterization and benchmark evaluation on the UPMEM PIM system offer new insights into performance, energy consumption, and suitability for different workloads.

IEEE ACCESS (2022)

Article Computer Science, Information Systems

Cross-Modality Guided Contrast Enhancement for Improved Liver Tumor Image Segmentation

Rabia Naseem, Zohaib Amjad Khan, Nitin Satpute, Azeddine Beghdadi, Faouzi Alaya Cheikh, Joaquin Olivares

Summary: The proposed goal-oriented contrast enhancement method improves tumor segmentation performance by enhancing guided image and controlling image quality through optimization.

IEEE ACCESS (2021)

Article Computer Science, Information Systems

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

Geraldo F. Oliveira, Juan Gomez-Luna, Lois Orosa, Saugata Ghose, Nandita Vijaykumar, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu

Summary: Data movement between the CPU and main memory is a major bottleneck for improving performance, scalability, and energy efficiency in modern computer systems. Various techniques have been employed to reduce this overhead, from traditional cache hierarchies to emerging Near-Data Processing (NDP) methods. However, there is still a lack of understanding regarding the key metrics for identifying data movement bottlenecks and their relation to different mitigation mechanisms.

IEEE ACCESS (2021)

Article Computer Science, Hardware & Architecture

Discovering e-commerce user groups from online comments: An emotional correlation analysis-based clustering method

Jia Ke, Ying Wang, Mingyue Fan, Xiaojun Chen, Wenlong Zhang, Jianping Gou

Summary: This study integrates an emotional correlation analysis model with a Self-organizing Map (SOM) to construct a fine-grained user emotion vector from review text and to perform visual cluster analysis, which helps platform merchants quickly identify user clusters and their characteristics.

COMPUTERS & ELECTRICAL ENGINEERING (2024)

Article Computer Science, Hardware & Architecture

Multilevel-based algorithm for hyperspectral image interpretation

Shi Qiu, Huping Ye, Xiaohan Liao, Benyue Zhang, Miao Zhang, Zimu Zeng

Summary: This paper proposes a multilevel-based algorithm for hyperspectral image interpretation, which achieves semantic segmentation through multidimensional information fusion, and introduces a context interpretation module to improve detection performance.

COMPUTERS & ELECTRICAL ENGINEERING (2024)

Article Computer Science, Hardware & Architecture

Maximizing the profit of omnichannel closed-loop supply chains with mean-variance criteria

Jianteng Xu, Qingguo Bai, Zhiwen Li, Lili Zhao

Summary: This study constructs two optimization models for the omnichannel closed-loop supply chain by leveraging the combined power of leader-follower game and mean-variance theories. The focus is on analyzing the performance of manufacturers who distribute products through physical stores. The results show that the risk-averse attitude of the physical store has a positive impact on the overall system profitability, but if the introduced physical store belongs to another firm, total profit experiences a decline.

COMPUTERS & ELECTRICAL ENGINEERING (2024)

Article Computer Science, Hardware & Architecture

GraphPhys: Facial video-based physiological measurement with graph neural network

Jiahao Xiong, Weihua Ou, Zhonghua Liu, Jianping Gou, Wenjun Xiao, Haitao Liu

Summary: This paper proposes a novel remote photoplethysmography framework, named GraphPhys, which uses a graph neural network to extract physiological signals and introduces Average Relative GraphConv for the task of remote physiological signal measurement. Experimental results show that the methods based on GraphPhys significantly outperform the original methods.

COMPUTERS & ELECTRICAL ENGINEERING (2024)

Article Computer Science, Hardware & Architecture

User financial credit analysis for blockchain regulation

Zhiyao Tong, Yiyi Hu, Chi Jiang, Yin Zhang

Summary: The rise of illicit activities involving blockchain digital currencies has become a growing concern. In order to prevent illegal activities, this study combines financial risk control with machine learning to identify and predict the risks of users with poor credit. Experimental results demonstrate high performance in user financial credit analysis.

COMPUTERS & ELECTRICAL ENGINEERING (2024)