Article
Computer Science, Information Systems
Zhaoyang Du, Yijin Guan, Tianchan Guan, Dimin Niu, Linyong Huang, Hongzhong Zheng, Yuan Xie
Summary: Sparse general matrix multiplication (SpGEMM) is an important computation in many applications, but achieving high-performance SpGEMM on modern processors is challenging. Existing SpGEMM libraries focus on algorithm design but neglect low-level architecture-specific optimizations, resulting in inefficient implementations. This paper proposes a highly optimized SpGEMM library called OpSparse, which improves performance through various optimization techniques such as optimizing memory utilization, reducing access to hash tables, and improving execution parallelism. Evaluation results on an Nvidia Tesla V100 GPU show significant speedups compared to state-of-the-art SpGEMM libraries.
Article
Computer Science, Theory & Methods
Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, Yizhuo Wang
Summary: This article provides a structured and comprehensive overview of the research on General Sparse Matrix-Matrix Multiplication (SpGEMM). It categorizes existing research based on target architectures and design choices, covering topics such as applications, compression formats, formulations, optimizations, and programming models. The article analyzes and summarizes the rationales of different algorithms and presents a thorough performance comparison of existing implementations. Future research directions are also highlighted to encourage better design and implementations in later studies.
ACM COMPUTING SURVEYS
(2023)
Article
Computer Science, Information Systems
Zhaoyang Du, Yijin Guan, Tianchan Guan, Dimin Niu, Hongzhong Zheng, Yuan Xie
Summary: Sparse general matrix multiplication (SpGEMM) is a fundamental building block for many real-world applications. This paper proposes a novel and efficient accumulation method named BRMerge for multi-core CPU architectures. The proposed method demonstrates improved memory access efficiency and outperforms the existing SpGEMM libraries in terms of performance in the evaluations with commonly used benchmarks.
Article
Computer Science, Theory & Methods
Cristobal A. Navarro, Roberto Carrasco, Ricardo J. Barrientos, Javier A. Riquelme, Raimundo Vega
Summary: This article introduces a parallel algorithm for arithmetic reduction using GPU tensor cores, achieving higher performance and better energy efficiency. Experimental results demonstrate that the proposed method outperforms standard GPU reduction and Nvidia's CUB library by approximately 3.2x and 2x, respectively, while maintaining low numerical error.
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
(2021)
Article
Computer Science, Hardware & Architecture
Xiao-Yang Liu, Zeliang Zhang, Zhiyuan Wang, Han Lu, Xiaodong Wang, Anwar Walid
Summary: This paper presents hardware-oriented optimization strategies for tensor learning primitives on GPU tensor cores, resulting in significant speedups for tasks such as tensor decomposition and neural network compression. The proposed optimizations achieve up to 32.25x speedup compared to existing libraries like TensorLab and TensorLy, demonstrating the effectiveness of GPU-based tensor learning.
IEEE TRANSACTIONS ON COMPUTERS
(2023)
Article
Computer Science, Theory & Methods
Haotian Wang, Wangdong Yang, Rong Hu, Renqiu Ouyang, Kenli Li, Keqin Li
Summary: This paper presents a novel approach called SpTMCM and investigates its coupling with the Tensor Core Unit (TCU). The proposed approach offers a uniform storage format and optimization method for SpTMCM, addressing the inefficient memory accesses caused by the irregular distribution of sparse tensors. A TCU-based tensor parallel algorithm is developed to improve memory bandwidth. Experimental results show significant speedups compared to state-of-the-art methods for SpMTTKRP and SpTTMChain on real-world sparse tensors using an NVIDIA A100 GPU.
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
(2023)
Article
Physics, Multidisciplinary
Takuya Okuyama, Andre Rohm, Takatomo Mihana, Makoto Naruse
Summary: Matrix multiplication is important for various applications, and reducing computation time is crucial. Despite the potential of GPUs, research has not focused on accelerating approximate matrix multiplications (AMMs) for general matrices. In this paper, we propose a method to improve Monte Carlo AMMs, with optimal values for hyperparameters. The proposed method enhances matrix product approximation without increasing computation time, and is compatible with parallel operations on GPUs, demonstrating halved computation time compared to the conventional power method.
Article
Chemistry, Multidisciplinary
Javier Fernandez, Jon Perez-Cerrolaza, Irune Agirre, Alejandro J. Calderon, Jaume Abella, Francisco J. Cazorla
Summary: This paper presents a safe matrix-matrix multiplication software implementation for GPUs with random hardware error-detection capabilities, which serves as a foundation for the implementation of safe deep learning libraries for GPUs. The performance impact and achievable diagnostic coverage of these mechanisms are measured with a set of representative matrix dimensions.
APPLIED SCIENCES-BASEL
(2022)
Article
Geochemistry & Geophysics
Zhenlong Hou, Boxuan Sun, Pengbo Qin, Chong Zhang, Zhaohai Meng
Summary: This paper proposes a parallel joint nonlinear inversion method for full tensor gravity gradiometry data, aiming to improve interpretation and computing ability. By utilizing a graphics processing unit (GPU), a parallel solution is implemented. Data tests demonstrate that this method has good anti-noise performance and accuracy, making it suitable for large-scale inversions.
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING
(2022)
Article
Computer Science, Theory & Methods
Jiaquan Gao, Yifei Xia, Renjie Yin, Guixia He
Summary: An adaptive sparse matrix-vector multiplication (SpMV) for diagonal sparse matrices on GPU, named DIA-Adaptive, is presented to automatically choose the ideal storage format and kernel, achieving high performance.
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
(2021)
Article
Computer Science, Software Engineering
Bin Qi, Kazuhiko Komatsu, Masayuki Sato, Hiroaki Kobayashi
Summary: Sparse matrix-matrix multiplication is a fundamental kernel used in many algorithms. This article proposes a dynamic parameter tuning method to balance the load among processes in order to improve the performance of SpMM.
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE
(2023)
Article
Computer Science, Information Systems
Jueon Park, Kyungyong Lee
Summary: In this paper, we propose a model called S-MPEC for predicting and optimizing the latency of sparse matrix multiplication (SPMM) tasks in distributed cloud environments using Apache Spark. By characterizing different distributed SPMM implementation methods and considering the characteristics and hardware specifications of the cloud, we establish an accurate prediction model that recommends the optimal implementation method. The experimental results show that users can expect a 44% reduction in latency compared to native SPMM implementations in Apache Spark.
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS
(2023)
Article
Computer Science, Software Engineering
Gonzalo Berger, Manuel Freire, Renzo Marini, Ernesto Dufrechou, Pablo Ezzatti
Summary: Sparse matrix multiplication has become increasingly important in data science and machine learning applications, leading to research focused on accelerating this kernel on GPUs. By introducing new sparse matrix storage formats that mitigate irregularity, the proposed optimizations significantly outperform existing implementations in experiments and compete with mature algorithms.
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE
(2022)
Article
Computer Science, Artificial Intelligence
Massimiliano Fasi, Nicholas J. Higham, Mantas Mikaitis, Srikara Pranesh
Summary: The study investigates the floating-point arithmetic implemented in NVIDIA tensor cores, determining important details through experiments on different graphics cards. It also provides a test suite that can be easily adapted for testing newer versions of NVIDIA tensor cores and similar accelerators from other vendors.
PEERJ COMPUTER SCIENCE
(2021)
Article
Computer Science, Software Engineering
Guixia He, Qi Chen, Jiaquan Gao
Summary: This paper introduces a new diagonal storage format RBDCS and proposes an efficient SpMV kernel for handling multidiagonal sparse matrices. Experimental results demonstrate that the RBDCS kernel outperforms popular diagonal SpMV kernels.
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE
(2021)
Article
Computer Science, Hardware & Architecture
Francisco Alcaraz Velasco, Jose Manuel Palomares, Joaquin Olivares
Summary: This study introduces a new data integrity method with medium security levels and low energy cost in wireless sensor networks, using a lightweight mechanism with overlapping blocks for data protection, demonstrating its effectiveness through experiments.
Article
Computer Science, Hardware & Architecture
Gagandeep Singh, Mohammed Alser, Damla Senol Cali, Dionysios Diamantopoulos, Juan Gomez-Luna, Henk Corporaal, Onur Mutlu
Summary: Modern data-intensive applications require high computational capabilities but are limited by strict power constraints. The development of FPGAs with HBM provides a solution to alleviate the bottleneck of data movement, improving efficiency and energy savings in computing systems.
Article
Computer Science, Hardware & Architecture
Gagandeep Singh, Dionysios Diamantopoulos, Juan Gomez-Luna, Christoph Hagleitner, Sander Stuijk, Henk Corporaal, Onur Mutlu
Summary: Ongoing climate change requires fast and accurate weather and climate modeling. However, current CPU and GPU implementations face limitations in performance and energy consumption for large-scale weather prediction simulations. To overcome these challenges, near-memory acceleration using high-bandwidth memory (HBM) is proposed and evaluated. Experimental results show significant performance improvement and energy efficiency compared to traditional methods.
ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS
(2022)
Article
Computer Science, Hardware & Architecture
Ataberk Olgun, Juan Gomez Luna, Konstantinos Kanellopoulos, Behzad Salami, Hasan Hassan, Oguz Ergin, Onur Mutlu
Summary: This paper introduces commodity DRAM-based processing-using-memory (PuM) techniques that can alleviate the data movement bottleneck at low cost. The challenges of system integration for these techniques are discussed, and a flexible framework called Processing-in-DRAM (PiDRAM) is developed to address these challenges. The authors implement and evaluate two PuM techniques, demonstrating the flexibility and effectiveness of PiDRAM. The potential performance improvement brought by PiDRAM is observed.
ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION
(2022)
Article
Computer Science, Hardware & Architecture
Geraldo F. Oliveira, Juan Gomez-Luna, Saugata Ghose, Amirali Boroumand, Onur Mutlu
Summary: Neural networks (NNs) are becoming increasingly important and complex. The processing-in-memory (PIM) paradigm can accelerate memory-bound NNs, but different PIM architectures have different effects on NN performance and energy efficiency.
Article
Computer Science, Artificial Intelligence
Francisco J. Rodriguez-Lozano, Juan C. Gamez-Granados, Hector Martinez, Jose M. Palomares, Joaquin Olivares
Summary: The use of virtual reality or augmented reality systems in billiards sports is helpful for entertainment and for improving players' skills. However, tracking multiple small identical objects like balls can be challenging. This research proposes a new tracking algorithm called MOLT, which can accurately track balls even with motion blur caused by low-resolution and low-frame-rate devices. The proposed system covers all steps from image capture to 3D reconstruction using computer vision, providing a promising and useful tool for training.
APPLIED INTELLIGENCE
(2023)
Article
Computer Science, Interdisciplinary Applications
Francisco J. Rodriguez-Lozano, Juan C. Gamez-Granados, Jose M. Palomares, Joaquin Olivares
Summary: Automatic crack classification is important for road maintenance. However, using many features for classification is inefficient for embedded systems with low computational resources. This study proposes a new data dimensionality reduction (DDR) method called DDR4CC, which reduces the required information about cracks to only four features. The effectiveness of DDR4CC is compared with eight other DDR methods using five different classification algorithms and datasets. Results show that DDR4CC improves the classification algorithms, providing highly accurate classifiers with minimal computation time.
COMPUTER-AIDED CIVIL AND INFRASTRUCTURE ENGINEERING
(2023)
Article
Computer Science, Information Systems
Nika Mansouri Ghiasi, Nandita Vijaykumar, Geraldo F. Oliveira, Lois Orosa, Ivan Fernandez, Mohammad Sadrosadati, Konstantinos Kanellopoulos, Nastaran Hajinazar, Juan Gomez Luna, Onur Mutlu
Summary: Partitioning applications between near-data processing (NDP) and host CPU cores causes inter-segment data movement overhead, which can be mitigated by ALP, a programmer-transparent technique that proactively and accurately transfers required data between segments based on the invariant instructions. Evaluation on a wide range of workloads demonstrates significant speedup over traditional CPU-only and NDP-only executions.
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING
(2023)
Article
Computer Science, Information Systems
Alain Denzler, Geraldo F. Oliveira, Nastaran Hajinazar, Rahul Bera, Gagandeep Singh, Juan Gomez-Luna, Onur Mutlu
Summary: This paper introduces Casper, a near-cache accelerator that improves the performance of stencil computations and reduces system energy consumption. Casper is designed based on two key ideas: avoiding the cost of moving rarely reused data throughout the cache hierarchy, and exploiting the regularity of data accesses and inherent parallelism of stencil computations. Experimental results show that Casper improves performance by an average of 1.65x (up to 4.16x) compared to commercial high-performance multi-core processors, while reducing system energy consumption by an average of 35% (up to 65%). Casper provides 37x (up to 190x) improvement in performance-per-area compared to a state-of-the-art GPU.
Proceedings Paper
Computer Science, Artificial Intelligence
Gregorio Corpas-Prieto, Fernando Leon-Garcia, Juan Carlos Gamez-Granados, Jose Manuel Palomares, Joaquin Olivares, Jose Manuel Soto-Hidalgo
Summary: The Internet of Things (IoT) is divided into edge, fog, and cloud layers. The fog layer enables stream processing by handling data transmission and cascade processing. To optimize network traffic, factors such as connections, delays, and buffer size need to be considered, which are affected by uncertainty and imprecision. Fuzzy rule-based systems are suitable for managing complex data and imprecision. The proposed approach dynamically adjusts buffer size to prevent network collapse.
2022 IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS (FUZZ-IEEE)
(2022)
Proceedings Paper
Computer Science, Information Systems
Joaquin Olivares, Orestis Zachariadis, Nitin Satpute, Juan Gomez-Luna
Summary: Accurate blood vessel segmentation in medical imaging is crucial for surgeries. In this study, we introduce a parallelized region growth algorithm (pSRG) that computes the gradient using Persistence and grid-stride loops. This approach eliminates unnecessary memory transfers, leading to faster computation and more precise segmentation.
2022 17TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI)
(2022)
Proceedings Paper
Computer Science, Information Systems
Francisco Alcaraz-Velasco, Jose M. Palomares, Joaquin Olivares
Summary: Recently, a mechanism that randomly shuffles the data sent and allows securing the communication without the need to encrypt all the information has been proposed. This proposal is ideal for IoT systems with low computational capacity. It has been demonstrated that obtaining the original message without knowledge of the applied disordering is unfeasible with current technology, ensuring its safety.
2022 17TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI)
(2022)
Article
Computer Science, Information Systems
Juan Gomez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu
Summary: This paper provides a comprehensive analysis of the first publicly-available real-world PIM architecture. Experimental characterization and benchmark evaluation on the UPMEM PIM system offer new insights into performance, energy consumption, and suitability for different workloads.
Article
Computer Science, Information Systems
Rabia Naseem, Zohaib Amjad Khan, Nitin Satpute, Azeddine Beghdadi, Faouzi Alaya Cheikh, Joaquin Olivares
Summary: The proposed goal-oriented contrast enhancement method improves tumor segmentation performance by enhancing the guided image and controlling image quality through optimization.
Article
Computer Science, Information Systems
Geraldo F. Oliveira, Juan Gomez-Luna, Lois Orosa, Saugata Ghose, Nandita Vijaykumar, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu
Summary: Data movement between the CPU and main memory is a major bottleneck for improving performance, scalability, and energy efficiency in modern computer systems. Various techniques have been employed to reduce this overhead, from traditional cache hierarchies to emerging Near-Data Processing (NDP) methods. However, there is still a lack of understanding regarding the key metrics for identifying data movement bottlenecks and their relation to different mitigation mechanisms.
Article
Computer Science, Hardware & Architecture
Jia Ke, Ying Wang, Mingyue Fan, Xiaojun Chen, Wenlong Zhang, Jianping Gou
Summary: This study integrates an emotional correlation analysis model with a Self-organizing Map (SOM) to construct a fine-grained user emotion vector from review text and perform visual cluster analysis, which helps platform merchants quickly mine user clusters and their characteristics.
COMPUTERS & ELECTRICAL ENGINEERING
(2024)
Article
Computer Science, Hardware & Architecture
Shi Qiu, Huping Ye, Xiaohan Liao, Benyue Zhang, Miao Zhang, Zimu Zeng
Summary: This paper proposes a multilevel-based algorithm for hyperspectral image interpretation, which achieves semantic segmentation through multidimensional information fusion, and introduces a context interpretation module to improve detection performance.
COMPUTERS & ELECTRICAL ENGINEERING
(2024)
Article
Computer Science, Hardware & Architecture
Jianteng Xu, Qingguo Bai, Zhiwen Li, Lili Zhao
Summary: This study constructs two optimization models for the omnichannel closed-loop supply chain by leveraging the combined power of leader-follower game and mean-variance theories. The focus is on analyzing the performance of manufacturers who distribute products through physical stores. The results show that the risk-averse attitude of the physical store has a positive impact on the overall system profitability, but if the introduced physical store belongs to another firm, total profit experiences a decline.
COMPUTERS & ELECTRICAL ENGINEERING
(2024)
Article
Computer Science, Hardware & Architecture
Jiahao Xiong, Weihua Ou, Zhonghua Liu, Jianping Gou, Wenjun Xiao, Haitao Liu
Summary: This paper proposes a novel remote photoplethysmography framework, named GraphPhys, which utilizes a graph neural network to extract physiological signals and introduces Average Relative GraphConv for the task of remote physiological signal measurement. Experimental results show that the methods based on GraphPhys significantly outperform the original methods.
COMPUTERS & ELECTRICAL ENGINEERING
(2024)
Article
Computer Science, Hardware & Architecture
Zhiyao Tong, Yiyi Hu, Chi Jiang, Yin Zhang
Summary: The rise of illicit activities involving blockchain digital currencies has become a growing concern. In order to prevent illegal activities, this study combines financial risk control with machine learning to identify and predict the risks of users with poor credit. Experimental results demonstrate high performance in user financial credit analysis.
COMPUTERS & ELECTRICAL ENGINEERING
(2024)