4.7 Article

Performance-Aware Model for Sparse Matrix-Matrix Multiplication on the Sunway TaihuLight Supercomputer

Journal

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2018.2871189

Keywords

Heterogeneous many-core processor; parallelism; performance analysis; performance-aware; SpGEMM; Sunway TaihuLight supercomputer

Funding

  1. National Key R&D Program of China [2016YFB0200201]
  2. National Outstanding Youth Science Program of National Natural Science Foundation of China [61625202]
  3. International (Regional) Cooperation and Exchange Program of National Natural Science Foundation of China [61661146006, 61860206011]
  4. Program of National Natural Science Foundation of China [61751204, 61806077]

Ask authors/readers for more resources

General sparse matrix-sparse matrix multiplication (SpGEMM) is one of the fundamental linear operations in a wide variety of scientific applications. To implement efficient SpGEMM for many large-scale applications, this paper proposes scalable and optimized SpGEMM kernels based on COO, CSR, ELL, and CSC formats on the Sunway TaihuLight supercomputer. First, a multi-level parallelism design for SpGEMM is proposed to exploit the parallelism of over 10 millions cores and better control memory based on the special Sunway architecture. Optimization strategies, such as load balance, coalesced DMA transmission, data reuse, vectorized computation, and parallel pipeline processing, are applied to further optimize performance of SpGEMM kernels. Second, we thoroughly analyze the performance of the proposed kernels. Third, a performance-aware model for SpGEMM is proposed to select the most appropriate compressed storage formats for the sparse matrices that can achieve the optimal performance of SpGEMM on the Sunway. The experimental results show the SpGEMM kernels have good scalability and meet the challenge of the high-speed computing of large-scale data sets on the Sunway. In addition, the performance-aware model for SpGEMM achieves an absolute value of relative error rate of 8.31 percent on average when the kernels are executed in one single process and achieves 8.59 percent on average when the kernels are executed in multiple processes. It is proved that the proposed performance-aware model can perform at high accuracy and satisfies the precision of selecting the best formats for SpGEMM on the Sunway TaihuLight supercomputer.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.7
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

Article Computer Science, Artificial Intelligence

Joint offloading and scheduling decisions for DAG applications in mobile edge computing

Jie Liang, Kenli Li, Chubo Liu, Keqin Li

Summary: Mobile edge computing (MEC) is a promising technology for supporting computation-intensive tasks on mobile devices. A new algorithm called joint re-ordering and frequency scaling (JRFS) is proposed to optimize makespan by considering offloading tasks with precedence constraints among tasks in a MEC environment with multiple servers. Extensive experiments show that JRFS outperforms several other methods in terms of makespan.

NEUROCOMPUTING (2021)

Article Computer Science, Artificial Intelligence

Distributed matrix factorization based on fast optimization for implicit feedback recommendation

Lian Chen, Wangdong Yang, Kenli Li, Keqin Li

Summary: In this study, the UIWMF and DUIWMF recommendation algorithms based on matrix factorization are proposed to effectively handle large-scale implicit feedback data and improve recommendation accuracy. By utilizing weight strategies and parallel learning algorithms, the issues of negative feedback information retrieval and single machine resource constraints are successfully addressed.

JOURNAL OF INTELLIGENT INFORMATION SYSTEMS (2021)

Article Computer Science, Hardware & Architecture

Task migration optimization for guaranteeing delay deadline with mobility consideration in mobile edge computing

Fan Tang, Chubo Liu, Kenli Li, Zhuo Tang, Keqin Li

Summary: This paper focuses on optimizing task migration considering user mobility in mobile edge computing environment to maximize the number of tasks meeting deadlines. By designing a group migration algorithm, significant performance improvements are achieved compared to other common heuristics.

JOURNAL OF SYSTEMS ARCHITECTURE (2021)

Article Computer Science, Artificial Intelligence

Performance analysis of nonlinear activated zeroing neural networks for time-varying matrix pseudoinversion with application

Zeshan Hu, Lin Xiao, Kenli Li, Keqin Li, Jichun Li

Summary: By utilizing simplified nonlinear activation functions, two new SFTZNN models were designed to efficiently solve the time-varying matrix pseudoinversion problem. Theoretical analysis provided maximum convergence time and upper bounds of steady-state residual error in ideal conditions and with external perturbations. Comparative simulations and an engineering application confirmed the feasibility and superiority of the new SFTZNN models.

APPLIED SOFT COMPUTING (2021)

Article Computer Science, Theory & Methods

CASpMV: A Customized and Accelerative SpMV Framework for the Sunway TaihuLight

Guoqing Xiao, Kenli Li, Yuedan Chen, Wangquan He, Albert Y. Zomaya, Tao Li

Summary: This paper introduces a customized and accelerative framework for SpMV on the Sunway, addressing performance limitations. CASpMV shows significant improvement over generic parallel SpMV on the Sunway and exhibits good scalability on multiple CGs.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2021)

Article Computer Science, Information Systems

Multiple local 3D CNNs for region-based prediction in smart cities

Yibi Chen, Xiaofeng Zou, Kenli Li, Keqin Li, Xulei Yang, Cen Chen

Summary: This paper introduces a deep learning-based method for region-based prediction in smart cities, utilizing multiple local 3D CNN spatial-temporal residual networks (LMST3D-ResNet) to extract various temporal dependencies for predicting future citywide activities.

INFORMATION SCIENCES (2021)

Article Computer Science, Theory & Methods

A Game-Based Approach for Cost-Aware Task Assignment With QoS Constraint in Collaborative Edge and Cloud Environments

Saiqin Long, Weifan Long, Zhetao Li, Kenli Li, Yuanqing Xia, Zhuo Tang

Summary: This study focuses on the task assignment problems in collaborative edge and cloud environments, utilizing a distributed, non-cooperative approach. By establishing queuing models and applying game theory to minimize task costs while meeting QoS constraints, Greedy Energy-aware Algorithm and Best Response Algorithm were proposed. The convergence of the algorithms was discussed, and results demonstrate that the BRA algorithm can quickly reach a solution close to Nash equilibrium.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2021)

Article Computer Science, Theory & Methods

Distributed Task Migration Optimization in MEC by Extending Multi-Agent Deep Reinforcement Learning Approach

Chubo Liu, Fan Tang, Yikun Hu, Kenli Li, Zhuo Tang, Keqin Li

Summary: Mobile edge computing (MEC) provides cloud-like capabilities to mobile users, with research focusing on task migration using reinforcement learning algorithms and distributed approaches. Experimental results show that the distributed task migration algorithm can significantly reduce the average completion time of tasks.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2021)

Article Computer Science, Information Systems

Efficient Approaches to Top-r Influential Community Search

Wensheng Luo, Xu Zhou, Jianye Yang, Peng Peng, Guoqing Xiao, Yunjun Gao

Summary: This article aims to efficiently compute the top-r influential communities in a network where nodes may have arbitrary weights by improving existing techniques and developing efficient algorithms to judge the connectivity of nodes with the same weight. Performance studies on real data sets demonstrate the effectiveness and efficiency of the proposed approaches.

IEEE INTERNET OF THINGS JOURNAL (2021)

Article Computer Science, Artificial Intelligence

Task Allocation on Layered Multiagent Systems: When Evolutionary Many-Objective Optimization Meets Deep Q-Learning

Mincan Li, Zidong Wang, Kenli Li, Xiangke Liao, Kate Hone, Xiaohui Liu

Summary: This article introduces a novel layered MAS model that addresses the multitask multiagent allocation problem using deep Q-learning and MSDE methods. The MSDE-SPEA2-based method is proposed to tackle many-objective optimization problem with various objectives like task allocation, completion time, agent satisfaction, etc.

IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION (2021)

Article Computer Science, Theory & Methods

fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms

Yuedan Chen, Guoqing Xiao, Kenli Li, Francesco Piccialli, Albert Y. Zomaya

Summary: Sparse matrix-sparse vector multiplication is a fundamental and important operation in high-performance scientific and engineering applications. This paper proposes a fine-grained parallel framework to overcome the challenges of scalability in high-performance computing systems. The framework utilizes multi-stage and hybrid parallelism, adaptive parallel execution, and optimization techniques to accelerate computation and utilize computing resources. Experimental results show significant performance improvements with different input sparsity.

ACM TRANSACTIONS ON PARALLEL COMPUTING (2022)

Article Computer Science, Hardware & Architecture

AAPP: An Accelerative and Adaptive Path Planner for Robots on GPU

Juan Liu, Guoqing Xiao, Fan Wu, Xiangke Liao, Kenli Li

Summary: This paper presents AAPP, an accelerative and adaptive path planner based on RRT* algorithm, to overcome the limitations in bandwidth, load imbalance, high computing complexity, and the choice of parameters. AAPP compresses large-scale map data using simplified compressed sparse rows (SCSR), and proposes a two-layer parallel framework named TLRRT* to exploit GPU computing performance. The paper also designs a two-stage parallel framework named TSRRT* to address load imbalance and computing complexity, and presents optimizations to adaptively select execution schemes and parameters. Experimental results show that AAPP achieves a speedup of up to 22.72x over RRT* algorithm, and can handle large-scale datasets with shorter trajectory lengths.

IEEE TRANSACTIONS ON COMPUTERS (2023)

Article Automation & Control Systems

PH-CF: A Phased Hybrid Algorithm for Accelerating Subgraph Matching Based on CPU-FPGA Heterogeneous Platform

Xian Zhang, Guoqing Xiao, Mingxing Duan, Yuedan Chen, Kenli Li

Summary: Nowadays, subgraph matching, a fundamental problem in various applications, is becoming more and more challenging due to the NP-hardness, explosive growth of graph data, high energy consumption, and overhead of CPU and GPU platforms. To address this issue, we propose a phased hybrid algorithm called PH-CF based on the CPU-FPGA heterogeneous platform, which exploits the pipeline and data flow mechanism, low power consumption, and configurable characteristics of FPGA. Experimental results demonstrate that PH-CF outperforms the state-of-the-arts, providing average performance improvements of up to 16.07x, 38.61x, and 11.46x over CFL, CECI, and DP-iso, respectively, while showing good stability and robustness on various datasets.

IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS (2023)

Article Mathematics

An Adaptive Jellyfish Search Algorithm for Packing Items with Conflict

Walaa H. El-Ashmawi, Ahmad Salah, Mahmoud Bekhit, Guoqing Xiao, Khalil Al Ruqeishi, Ahmed Fathalla

Summary: The BPPC is a less-studied variation of the classic combinatorial optimization problem. This work proposes an improved jellyfish metaheuristic algorithm to solve the BPPC by defining jellyfish operations. The proposed method outperforms other comparison methods in terms of the number of bins and the average bin utilization.

MATHEMATICS (2023)

Article Computer Science, Theory & Methods

SGD_Tucker: A Novel Stochastic Optimization Strategy for Parallel Sparse Tucker Decomposition

Hao Li, Zixuan Li, Kenli Li, Jan S. Rellermeyer, Lydia Y. Chen, Keqin Li

Summary: The STD algorithm aims to obtain an optimal low-rank representation feature for sparse tensors but faces the issue of intermediate variables explosion. To address this problem, a novel stochastic optimization strategy called SGD_Tucker is proposed, which shows significant advantages in handling high-dimensional intermediate variables and achieving faster computation speeds in experiments.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2021)

No Data Available