Reliability Engineering in Cloud Computing: Strategies, Metrics, and Performance Assessment

Karan Anand
International Journal of Multidisciplinary Research in Science, Engineering and Technology (2023)

Abstract

Cloud computing has transformed how organizations compute, store, and share information resources, and it gives them the flexibility to scale those resources on demand. Nevertheless, maintaining high reliability in cloud environments remains an unsolved problem because of hardware failures, network interruptions and slowdowns, and software vulnerabilities. This paper discusses several strategies for reliability engineering in cloud computing, including fault tolerance, redundancy, monitoring, and predictive maintenance. It also elaborates on the basic reliability metrics, namely Mean Time Between Failures (MTBF), Mean Time To Repair (MTTR), service availability, and failure rate, which quantify system reliability and effectiveness. Moreover, the paper considers performance assessment methodologies based on real-time monitoring, machine learning, and reliability assessment methods. It further addresses advances in artificial-intelligence-powered automation and self-healing applications for improved cloud dependability. The work aims to survey the state of the art in cloud service dependability and to propose recommendations for minimizing costs, improving dependability, and reducing unplanned downtime. The findings are intended for cloud service providers (CSPs), IT designers and architects, and system engineers who wish to build fault-tolerant, well-optimized cloud environments.
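
As a brief illustration of the metrics named above, the following minimal sketch computes MTBF, MTTR, service availability, and failure rate from aggregate uptime, downtime, and failure counts. It uses the common textbook definitions (MTBF = uptime / failures, MTTR = downtime / failures, availability = MTBF / (MTBF + MTTR)) and assumes a constant failure rate so that failure rate = 1 / MTBF. The function name, the field names, and the sample figures are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ReliabilityMetrics:
    """Hypothetical container for the four metrics; names are illustrative."""
    mtbf_hours: float
    mttr_hours: float
    availability: float
    failure_rate_per_hour: float


def compute_metrics(uptime_hours: float, downtime_hours: float,
                    failures: int) -> ReliabilityMetrics:
    """Compute basic reliability metrics from aggregate observations.

    Assumes textbook definitions and a constant failure rate.
    """
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF/MTTR")
    if uptime_hours <= 0 or downtime_hours < 0:
        raise ValueError("uptime must be positive and downtime non-negative")

    mtbf = uptime_hours / failures            # mean operating time between failures
    mttr = downtime_hours / failures          # mean time to restore service
    availability = mtbf / (mtbf + mttr)       # fraction of time the service is up
    failure_rate = 1.0 / mtbf                 # failures per operating hour
    return ReliabilityMetrics(mtbf, mttr, availability, failure_rate)


# Illustrative figures only: 990 h up, 10 h down, 5 failures.
sample = compute_metrics(uptime_hours=990.0, downtime_hours=10.0, failures=5)
```

With these sample figures the sketch gives an MTBF of 198 hours, an MTTR of 2 hours, and an availability of 0.99.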