You are here

Developing new power management and High-Reliability Schemes in Data-Intensive Environment

Download pdf | Full Screen View

Date Issued:
2016
Abstract/Description:
With the increasing popularity of data-intensive applications as well as the large-scale computingand storage systems, current data centers and supercomputers are often dealing with extremelylarge data-sets. To store and process this huge amount of data reliably and energy-efficiently,three major challenges should be taken into consideration for the system designers. Firstly, power conservation(-)Multicore processors or CMPs have become a mainstream in the current processormarket because of the tremendous improvement in transistor density and the advancement in semiconductor technology. However, the increasing number of transistors on a single die or chip reveals a super-linear growth in power consumption [4]. Thus, how to balance system performance andpower-saving is a critical issue which needs to be solved effectively. Secondly, system reliability(-)Reliability is a critical metric in the design and development of replication-based big data storagesystems such as Hadoop File System (HDFS). In the system with thousands machines and storagedevices, even in-frequent failures become likely. In Google File System, the annual disk failurerate is 2:88%,which means you were expected to see 8,760 disk failures in a year. Unfortunately,given an increasing number of node failures, how often a cluster starts losing data when beingscaled out is not well investigated. Thirdly, energy efficiency(-)The fast processing speeds of the current generation of supercomputers provide a great convenience to scientists dealing with extremely large data sets. The next generation of (")exascale(") supercomputers could provide accuratesimulation results for the automobile industry, aerospace industry, and even nuclear fusion reactors for the very first time. However, the energy cost of super-computing is extremely high, with a total electricity bill of 9 million dollars per year. Thus, conserving energy and increasing the energy efficiency of supercomputers has become critical in recent years.This dissertation proposes new solutions to address the above three key challenges for currentlarge-scale storage and computing systems. Firstly, we propose a novel power management scheme called MAR (model-free, adaptive, rule-based) in multiprocessor systems to minimize the CPU power consumption subject to performance constraints. By introducing new I/O wait status, MAR is able to accurately describe the relationship between core frequencies, performance and power consumption. Moreover, we adopt a model-free control method to filter out the I/O wait status from the traditional CPU busy/idle model in order to achieve fast responsiveness to burst situations and take full advantage of power saving. Our extensive experiments on a physical testbed demonstrate that, for SPEC benchmarks and data-intensive (TPC-C) benchmarks, an MAR prototype system achieves 95.8-97.8% accuracy of the ideal power saving strategy calculated offline. Compared with baseline solutions, MAR is able to save 12.3-16.1% more power while maintain a comparable performance loss of about 0.78-1.08%. In addition, more simulation results indicate that our design achieved 3.35-14.2% more power saving efficiency and 4.2-10.7% less performance loss under various CMP configurations as compared with various baseline approaches such as LAST, Relax,PID and MPC.Secondly, we create a new reliability model by incorporating the probability of replica loss toinvestigate the system reliability of multi-way declustering data layouts and analyze their potential parallel recovery possibilities. Our comprehensive simulation results on Matlab and SHARPE show that the shifted declustering data layout outperforms the random declustering layout in a multi-way replication scale-out architecture, in terms of data loss probability and system reliability by upto 63% and 85% respectively. Our study on both 5-year and 10-year system reliability equipped with various recovery bandwidth settings shows that, the shifted declustering layout surpasses the two baseline approaches in both cases by consuming up to 79 % and 87% less recovery bandwidth for copyset, as well as 4.8% and 10.2% less recovery bandwidth for random layout.Thirdly, we develop a power-aware job scheduler by applying a rule based control method and takinginto account real world power and speedup profiles to improve power efficiency while adheringto predetermined power constraints. The intensive simulation results shown that our proposed method is able to achieve the maximum utilization of computing resources as compared to baselinescheduling algorithms while keeping the energy cost under the threshold. Moreover, by introducinga Power Performance Factor (PPF) based on the real world power and speedup profiles, we areable to increase the power efficiency by up to 75%.
Title: Developing new power management and High-Reliability Schemes in Data-Intensive Environment.
30 views
8 downloads
Name(s): Wang, Ruijun, Author
Wang, Jun, Committee Chair
Jin, Yier, Committee Member
DeMara, Ronald, Committee Member
Zhang, Shaojie, Committee Member
Ni, Liqiang, Committee Member
University of Central Florida, Degree Grantor
Type of Resource: text
Date Issued: 2016
Publisher: University of Central Florida
Language(s): English
Abstract/Description: With the increasing popularity of data-intensive applications as well as the large-scale computingand storage systems, current data centers and supercomputers are often dealing with extremelylarge data-sets. To store and process this huge amount of data reliably and energy-efficiently,three major challenges should be taken into consideration for the system designers. Firstly, power conservation(-)Multicore processors or CMPs have become a mainstream in the current processormarket because of the tremendous improvement in transistor density and the advancement in semiconductor technology. However, the increasing number of transistors on a single die or chip reveals a super-linear growth in power consumption [4]. Thus, how to balance system performance andpower-saving is a critical issue which needs to be solved effectively. Secondly, system reliability(-)Reliability is a critical metric in the design and development of replication-based big data storagesystems such as Hadoop File System (HDFS). In the system with thousands machines and storagedevices, even in-frequent failures become likely. In Google File System, the annual disk failurerate is 2:88%,which means you were expected to see 8,760 disk failures in a year. Unfortunately,given an increasing number of node failures, how often a cluster starts losing data when beingscaled out is not well investigated. Thirdly, energy efficiency(-)The fast processing speeds of the current generation of supercomputers provide a great convenience to scientists dealing with extremely large data sets. The next generation of (")exascale(") supercomputers could provide accuratesimulation results for the automobile industry, aerospace industry, and even nuclear fusion reactors for the very first time. However, the energy cost of super-computing is extremely high, with a total electricity bill of 9 million dollars per year. Thus, conserving energy and increasing the energy efficiency of supercomputers has become critical in recent years.This dissertation proposes new solutions to address the above three key challenges for currentlarge-scale storage and computing systems. Firstly, we propose a novel power management scheme called MAR (model-free, adaptive, rule-based) in multiprocessor systems to minimize the CPU power consumption subject to performance constraints. By introducing new I/O wait status, MAR is able to accurately describe the relationship between core frequencies, performance and power consumption. Moreover, we adopt a model-free control method to filter out the I/O wait status from the traditional CPU busy/idle model in order to achieve fast responsiveness to burst situations and take full advantage of power saving. Our extensive experiments on a physical testbed demonstrate that, for SPEC benchmarks and data-intensive (TPC-C) benchmarks, an MAR prototype system achieves 95.8-97.8% accuracy of the ideal power saving strategy calculated offline. Compared with baseline solutions, MAR is able to save 12.3-16.1% more power while maintain a comparable performance loss of about 0.78-1.08%. In addition, more simulation results indicate that our design achieved 3.35-14.2% more power saving efficiency and 4.2-10.7% less performance loss under various CMP configurations as compared with various baseline approaches such as LAST, Relax,PID and MPC.Secondly, we create a new reliability model by incorporating the probability of replica loss toinvestigate the system reliability of multi-way declustering data layouts and analyze their potential parallel recovery possibilities. Our comprehensive simulation results on Matlab and SHARPE show that the shifted declustering data layout outperforms the random declustering layout in a multi-way replication scale-out architecture, in terms of data loss probability and system reliability by upto 63% and 85% respectively. Our study on both 5-year and 10-year system reliability equipped with various recovery bandwidth settings shows that, the shifted declustering layout surpasses the two baseline approaches in both cases by consuming up to 79 % and 87% less recovery bandwidth for copyset, as well as 4.8% and 10.2% less recovery bandwidth for random layout.Thirdly, we develop a power-aware job scheduler by applying a rule based control method and takinginto account real world power and speedup profiles to improve power efficiency while adheringto predetermined power constraints. The intensive simulation results shown that our proposed method is able to achieve the maximum utilization of computing resources as compared to baselinescheduling algorithms while keeping the energy cost under the threshold. Moreover, by introducinga Power Performance Factor (PPF) based on the real world power and speedup profiles, we areable to increase the power efficiency by up to 75%.
Identifier: CFE0006704 (IID), ucf:51907 (fedora)
Note(s): 2016-08-01
Ph.D.
Engineering and Computer Science, Electrical Engineering and Computer Engineering
Doctoral
This record was generated from author submitted information.
Subject(s): Power management -- Reliability -- Data-Intensive -- Energy-efficient
Persistent Link to This Record: http://purl.flvc.org/ucf/fd/CFE0006704
Restrictions on Access: public 2017-02-15
Host Institution: UCF

In Collections