4.5 Article

A Spark-based Apriori algorithm with reduced shuffle overhead

Journal

JOURNAL OF SUPERCOMPUTING
Volume 77, Issue 1, Pages 133-151

Publisher

SPRINGER
DOI: 10.1007/s11227-020-03253-7

Keywords

Apache Spark; Apriori algorithm; Large-scale datasets; Shuffle overhead

Funding

  1. IIT(ISM), Govt. of India, Dhanbad
  2. Department of Computer Science & Engineering, Indian Institute of Technology (ISM), Dhanbad, India

This paper introduces a Spark-based Apriori algorithm called SARSO, which improves efficiency by reducing shuffle overhead caused by RDD operations. The method restricts the movement of key-value pairs across cluster nodes, reducing necessary communication and synchronization overhead incurred by the Spark shuffle operation.
Mining frequent itemsets is considered a core activity for finding association rules in transactional datasets. Among the well-known approaches to finding frequent itemsets, the Apriori algorithm is the earliest proposed. Many attempts have been made to adopt the Apriori algorithm for large-scale datasets. However, the bottlenecks associated with Apriori, such as repeated scans of the input dataset and generation of all candidate itemsets before counting their support values, reduce its effectiveness for large datasets. When the data size is large, even distributed and parallel implementations of Apriori using the MapReduce framework do not perform well. This is due to the iterative nature of the algorithm, which incurs high disk overhead: in each iteration, the input dataset, which resides on disk, is scanned, causing high disk I/O. Apache Spark implementations of Apriori show better performance due to in-memory processing capabilities. Spark makes iterative scanning of a dataset faster by keeping it in a memory abstraction called a resilient distributed dataset (RDD). An RDD keeps datasets in the form of key-value pairs spread across the cluster nodes. RDD operations require these key-value pairs to be redistributed among cluster nodes in the course of processing. This redistribution, or shuffle operation, incurs communication and synchronization overhead. In this manuscript, we propose a novel approach, namely the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO). It utilizes the benefits of Spark's parallel and distributed computing environment and its in-memory processing capabilities. It improves efficiency further by reducing the shuffle overhead caused by RDD operations at each iteration. In other words, it restricts the movement of key-value pairs across the cluster nodes by using a partitioning method and hence reduces the communication and synchronization overhead incurred by the Spark shuffle operation. Extensive experiments have been conducted to measure the performance of SARSO on benchmark datasets and to compare it with an existing algorithm. Experimental results show that SARSO has better performance in terms of running time and scalability.
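
The abstract does not spell out SARSO's partitioning method, so the following is only a rough illustration of the general idea it describes, not the authors' algorithm. It is a plain-Python sketch (no Spark dependency) in which each inner list stands in for one partition of an RDD. Support is counted locally inside each partition, and only one (itemset, local count) pair per distinct itemset leaves a partition, instead of one (itemset, 1) pair per match. All function names and the two pair counters are hypothetical and exist only for illustration.

```python
from collections import Counter
from itertools import combinations


def generate_candidates(frequent, k):
    """Join frequent (k-1)-itemsets into k-itemsets and prune them.

    Pruning uses the Apriori property: a k-itemset can only be frequent
    if every one of its (k-1)-subsets is frequent.
    """
    candidates = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if len(union) == k and all(
            frozenset(sub) in frequent for sub in combinations(union, k - 1)
        ):
            candidates.add(union)
    return candidates


def count_local(partition, candidates):
    """Count candidate occurrences inside one partition only."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:
                counts[candidate] += 1
    return counts


def apriori_partitioned(partitions, min_support):
    """Level-wise Apriori over simulated partitions.

    Returns (frequent, naive_pairs, combined_pairs), where the two
    counters estimate how many key-value pairs would cross partition
    boundaries without and with local pre-aggregation.
    """
    candidates = {
        frozenset([item])
        for partition in partitions
        for transaction in partition
        for item in transaction
    }
    frequent = {}
    naive_pairs = 0
    combined_pairs = 0
    k = 1
    while candidates:
        local_counts = [count_local(p, candidates) for p in partitions]
        # Without pre-aggregation: one (itemset, 1) pair per match.
        naive_pairs += sum(sum(c.values()) for c in local_counts)
        # With pre-aggregation: one pair per distinct itemset per partition.
        combined_pairs += sum(len(c) for c in local_counts)
        total = Counter()
        for c in local_counts:
            total.update(c)
        level = {s: n for s, n in total.items() if n >= min_support}
        frequent.update(level)
        k += 1
        candidates = generate_candidates(set(level), k)
    return frequent, naive_pairs, combined_pairs


if __name__ == "__main__":
    # Toy data: two partitions of transactions (an assumption for the demo).
    demo = [
        [frozenset("abc"), frozenset("ab")],
        [frozenset("ac"), frozenset("abc"), frozenset("bc")],
    ]
    result, naive, combined = apriori_partitioned(demo, min_support=3)
    for itemset, support in sorted(result.items(), key=lambda kv: sorted(kv[0])):
        print(sorted(itemset), support)
    print("pairs without local aggregation:", naive)
    print("pairs with local aggregation:", combined)
```

In Spark itself, a comparable effect comes from operations that combine values on the map side before the shuffle (for example `reduceByKey` rather than `groupByKey`). Whether SARSO relies on that mechanism or on a custom partitioner is not stated in the abstract.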

Recommended

Article Computer Science, Information Systems

A Spark-based high utility itemset mining with multiple external utilities

Krishan Kumar Sethi, Dharavath Ramesh, Munesh Chandra Trivedi

Summary: HUI mining is a data mining technique to discover profitable patterns, and this research proposes new strategies and a distributed algorithm to make it suitable for big data processing. Experimental results demonstrate that the proposed algorithm outperforms existing algorithms.

CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS (2022)

Article Computer Science, Information Systems

Blockchain assisted privacy-preserving public auditable model for cloud environment with efficient user revocation

Rahul Mishra, Dharavath Ramesh, Damodar Reddy Edla, Munesh Chandra Trivedi

Summary: Cloud storage offers efficient data management, but security concerns arise. Public auditing models, utilizing third-party auditors, have been developed to address data integrity issues. However, these models are vulnerable to procrastinating auditors. This paper introduces a blockchain-based methodology, employing a certificateless public auditing model, to combat malicious and procrastinating auditors with efficient user revocation.

CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS (2022)

Article Automation & Control Systems

Machine Learning Regression for RF Path Loss Estimation Over Grass Vegetation in IoWSN Monitoring Infrastructure

Pankaj Pal, Rashmi Priya Sharma, Sachin Tripathi, Chiranjeev Kumar, Dharavath Ramesh

Summary: This proposal investigates the impact of grass vegetation elevation and density on path loss in an IoT-enabled wireless sensor network for crop monitoring. Real-time measurements at different node heights and vegetation depths reveal that using a free-space or tree-based path loss model leads to network disconnections due to changes in vegetation density throughout a crop growth cycle. An empirical path loss model is formulated to estimate signal strength during different development phases of medium grass vegetation. The 2.4 GHz RF path loss coefficient is estimated using collected data, and a generic path loss model is developed through multiple regression analysis. The effectiveness of the model is validated through proof-of-concept experiments.

IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS (2022)

Article Automation & Control Systems

Interaction-Enhanced and Time-Aware Graph Convolutional Network for Successive Point-of-Interest Recommendation in Traveling Enterprises

Yuwen Liu, Huiping Wu, Khosro Rezaee, Mohammad R. Khosravi, Osamah Ibrahim Khalaf, Arif Ali Khan, Dharavath Ramesh, Lianyong Qi

Summary: In this study, an Interaction-enhanced and Time-aware Graph Convolution Network (ITGCN) is proposed for successive point-of-interest (POI) recommendation. By using an improved graph convolution network and a self-attention aggregator, the dynamic representation of users and POIs can be learned, capturing high-order connectivity. Experimental results show that ITGCN outperforms existing methods.

IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS (2023)

Article Computer Science, Information Systems

Enabling Efficient Deduplication and Secure Decentralized Public Auditing for Cloud Storage: A Redactable Blockchain Approach

Rahul Mishra, Dharavath Ramesh, Salil S. Kanhere, Damodar Reddy Edla

Summary: This paper introduces a blockchain-based secure decentralized public auditing model and an efficient deduplication scheme. By using blockchain instead of a centralized third-party auditor, it reduces the waste of computational and storage resources. By employing redactability to address security issues, together with an efficient deduplication scheme, it achieves storage savings and data protection.

ACM TRANSACTIONS ON MANAGEMENT INFORMATION SYSTEMS (2023)

Article Agronomy

Evaluation of metaheuristic optimization algorithms for optimal allocation of surface water and groundwater resources for crop production

Sonal Jain, Dharavath Ramesh, Munesh C. Trivedi, Damodar Reddy Edla

Summary: Given the extensive variability in current climate conditions, it is important to plan water resources optimally to efficiently manage socio-economic and environmental requirements. This study introduced a multi-objective model to maximize crop net return and effectively manage water resources. The model was applied to a case study in the Pennar-Palar-Cauvery link canal command in India, and three meta-heuristic approaches were employed to solve the model and evaluate their performance.

AGRICULTURAL WATER MANAGEMENT (2023)

Article Automation & Control Systems

Intelligent Salp Swarm Scheduler With Fitness Based Quasi-Reflection Method for Scientific Workflows in Hybrid Cloud-Fog Environment

Naela Rizvi, Dharavath Ramesh, P. C. Srinivasa Rao, Koushik Mondal

Summary: This study proposes an intelligent fuzzy scheduler that utilizes the salp swarm algorithm to learn and optimize fuzzy task-resource allocation rules. It addresses complex and uncertain computation offloading problems in fog computing. Experimental results demonstrate that the proposed approach outperforms other classical algorithms in workflow scheduling problems.

IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING (2023)

Article Engineering, Electrical & Electronic

NSGA-III Based Heterogeneous Transmission Range Selection for Node Deployment in IEEE 802.15.4 Infrastructure for Sugarcane and Rice Crop Monitoring in a Humid Sub-Tropical Region

Pankaj Pal, Rashmi Priya Sharma, Sachin Tripathi, Chiranjeev Kumar, Dharavath Ramesh

Summary: This proposal analyzes the impact of varying vegetation density on the Received Signal Strength (RSS), coverage, and energy consumption of an IoT assisted Wireless Sensor Network (IoWSN) through a measurement campaign. The study suggests an empirically formulated Path Loss Model (PLM) to estimate excess attenuation and performs a Non-dominated Sorting Genetic Algorithm (NSGA-III) optimization for initial node deployment with a heterogeneous transmission range. Transmitter output power scheduling is used to minimize over-coverage by dynamically adjusting the power based on changes in the captured RSS. The Proof of Concept validates the improvements in coverage, connectivity, and energy efficiency compared to existing approaches.

IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS (2023)

Article Computer Science, Information Systems

A Workflow Scheduling Approach With Modified Fuzzy Adaptive Genetic Algorithm in IaaS Clouds

Naela Rizvi, Dharavath Ramesh, Lipo Wang, Annappa Basava

Summary: This article introduces an algorithm called MFGA (Modified Fuzzy Adaptive Genetic Algorithm) to minimize the makespan and improve resource utilization of workflows under deadline and budget constraints. The algorithm utilizes a fuzzy logic controller to control crossover and mutation rates and incorporates novel crossover and mutation techniques. Simulation experiments demonstrate that MFGA outperforms other state-of-the-art algorithms.

IEEE TRANSACTIONS ON SERVICES COMPUTING (2023)

Review Agronomy

Land Resources in Organic Agriculture: Trends and Challenges in the Twenty-First Century from Global to Croatian Contexts

Gabrijel Ondrasek, Jelena Horvatinec, Marina Bubalo Kovacic, Marko Reljic, Marko Vincekovic, Santosha Rathod, Nirmala Bandumula, Ramesh Dharavath, Muhammad Imtiaz Rashid, Olga Panfilova, Kodikara Arachchilage Sunanda Kodikara, Jasmina Defterdarovic, Vedran Krevh, Vilim Filipovic, Lana Filipovic, Tajana Cop, Mario Njavro

Summary: Organic agriculture is an increasingly popular global concept that focuses on sustainable and environmentally-friendly practices. It has the potential to improve ecosystems, reduce pollution, and provide safe and nutritious food. This study reviews the global utilization of land resources in organic agriculture, with a focus on EU countries, and highlights the challenges and opportunities for expanding organic farming.

AGRONOMY-BASEL (2023)

Article Computer Science, Cybernetics

FDGNN: Feature-Aware Disentangled Graph Neural Network for Recommendation

Xiao Liu, Shunmei Meng, Qianmu Li, Qiyan Liu, Qiang He, Dharavath Ramesh, Lianyong Qi

Summary: This article proposes a new feature-aware disentangled graph neural network (FDGNN) model for recommendation, aiming to achieve better recommendation performance and model interpretability by learning the relationship between user behavior and important features of items.

IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS (2023)

Article Computer Science, Information Systems

IoFT-FIS: Internet of farm things based prediction for crop pest infestation using optimized fuzzy inference system

Rashmi Priya Sharma, Ramesh Dharavath, Damodar R. Edla

Summary: Advanced farming techniques combined with IoT-compatible crop monitoring and data collection systems can enhance agricultural productivity by understanding environmental conditions, identifying crop diseases, and optimizing planting seasons. By analyzing data collected through an IoT monitoring system, the impact of weather parameters on crop yield and pest breeding conditions can be determined. The proposed fuzzy inference system uses fuzzy rules to find suitable cropping windows and low pest breeding conditions, benefiting farmers in achieving maximum yields.

INTERNET OF THINGS (2023)

Proceedings Paper Computer Science, Interdisciplinary Applications

MySQL Collaboration by Approving and Tracking Updates with Dependencies: A Versioning Approach

Dharavath Ramesh, Munesh Chandra Trivedi

Summary: Data science has a growing demand for efficient collaboration in analyzing and manipulating large-scale datasets. The current ad-hoc versioning mechanism is no longer sufficient, thus a framework implemented on top of relational databases is proposed to enable efficient management and querying of dataset versions.

COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2022 WORKSHOPS, PART V (2022)

Article Computer Science, Interdisciplinary Applications

DS-Chain: A secure and auditable multi-cloud assisted EHR storage model on efficient deletable blockchain

Rahul Mishra, Dharavath Ramesh, Damodar Reddy Edla, Lianyong Qi

Summary: In recent years, cloud storage service has gained popularity in the healthcare industry. Outsourcing EHRs to the cloud provides scalability, flexibility, low-cost operations, and availability, but also raises security concerns. This study proposes a secure EHR storage model based on a consortium blockchain to ensure confidentiality, integrity, and correctness by integrating EHR outsourcing operations into blockchain transactions.

JOURNAL OF INDUSTRIAL INFORMATION INTEGRATION (2022)

Article Computer Science, Hardware & Architecture

NSGA-2 Optimized Fuzzy Inference System for Crop Plantation Correctness Index Identification

Rashmi Priya, Dharavath Ramesh, Venkanna Udutalapally

Summary: Advanced technology in agriculture can increase yield by understanding suitable environmental conditions, soil health status, water and fertilizer requirements, and crop monitoring. This study proposes a rule-based fuzzy classification method for predicting sowing time, optimizing the rule base, and correlating the fuzziness of sowing slots with yield to measure the model's effectiveness.

IEEE TRANSACTIONS ON SUSTAINABLE COMPUTING (2022)