Article

Cloud reliability and efficiency improvement via failure risk based proactive actions

Journal

JOURNAL OF SYSTEMS AND SOFTWARE
Volume 163

Publisher

ELSEVIER SCIENCE INC
DOI: 10.1016/j.jss.2020.110524

Keywords

Cloud computing system; Reliability; Efficiency; Risk identification; Failure mitigation and fault tolerance

Funding

  1. National Basic Research Program (China) [2018YFB1003403]
  2. Natural Science Basic Research Plan in Shaanxi Province of China [2018JM6086]
  3. NSF Net-Centric Software and Systems IUCRC (U.S.)
  4. China Scholarship Council

Abstract

Because of the scale and complexity of cloud computing systems (CCS), failures are inevitable and lead to losses in reliability and efficiency. Failure mitigation, fault tolerance, and recovery actions can be performed to improve CCS reliability and efficiency. Using data collected during CCS operation, failure prediction and risk identification techniques can anticipate such failures. In this paper, we develop a framework that combines risk identification with follow-up proactive actions to improve CCS reliability and efficiency. We start by analyzing cloud failures and the related operational data. A tree-based predictive model is then trained to identify high-risk cloud tasks. By proactively terminating these high-risk tasks, both the number of CCS failures and the resource consumption can be significantly reduced. The impact of these proactive actions can be simulated to quantify the improvement in both system reliability and efficiency. The new approach has been applied to the Google cluster dataset, covering approximately 400 GB of operational data over 29 consecutive days, to demonstrate its viability and effectiveness. © 2020 Published by Elsevier Inc.
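The pipeline outlined in the abstract (train a tree-based model, flag high-risk tasks, terminate them early, then simulate the effect) can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the paper: the Task fields, the single resubmissions feature, the toy records, the depth-1 tree (a decision stump) standing in for the tree-based model, and the observed_fraction parameter. The stump is also fitted and evaluated on the same records, which a real study would avoid.

```python
from dataclasses import dataclass


@dataclass
class Task:
    resubmissions: int  # hypothetical feature: times the task was rescheduled
    cpu_hours: float    # resources the task consumes if left to run to the end
    failed: bool        # outcome recorded in the (toy) operational log


def gini(tasks):
    """Gini impurity of the failed / not-failed labels in a group of tasks."""
    if not tasks:
        return 0.0
    p = sum(t.failed for t in tasks) / len(tasks)
    return 2 * p * (1 - p)


def fit_stump(tasks):
    """Depth-1 decision tree: choose the resubmission threshold that
    minimizes the size-weighted Gini impurity of the two resulting groups."""
    best_threshold, best_score = None, float("inf")
    for threshold in sorted({t.resubmissions for t in tasks}):
        left = [t for t in tasks if t.resubmissions <= threshold]
        right = [t for t in tasks if t.resubmissions > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(tasks)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold


def simulate(tasks, threshold, observed_fraction=0.25):
    """Terminate every task flagged as high risk once observed_fraction of
    its run has elapsed, and report the effect of that proactive action."""
    flagged = [t for t in tasks if t.resubmissions > threshold]
    failures_avoided = sum(t.failed for t in flagged)
    cpu_saved = sum(t.cpu_hours * (1 - observed_fraction) for t in flagged)
    wrongly_terminated = sum(not t.failed for t in flagged)
    return failures_avoided, cpu_saved, wrongly_terminated


# Toy records, invented for this sketch (not drawn from the Google cluster data).
TASKS = [
    Task(0, 4.0, False),
    Task(0, 2.0, False),
    Task(1, 3.0, False),
    Task(1, 5.0, False),
    Task(3, 8.0, True),
    Task(4, 6.0, True),
    Task(5, 4.0, False),
    Task(6, 10.0, True),
]

if __name__ == "__main__":
    threshold = fit_stump(TASKS)
    avoided, saved, wrong = simulate(TASKS, threshold)
    print(f"flag tasks with more than {threshold} resubmissions")
    print(f"failures avoided: {avoided}, CPU-hours saved: {saved}, "
          f"healthy tasks terminated: {wrong}")
```

The third value returned by simulate makes the trade-off explicit: terminating flagged tasks saves resources and avoids failures, but any healthy task that is flagged loses its work, so the choice of threshold balances the two.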
