You are here

Improving the performance of data-intensive computing on Cloud platforms

Download pdf | Full Screen View

Date Issued:
2017
Abstract/Description:
Big Data such as Terabyte and Petabyte datasets are rapidly becoming the new norm for various organizations across a wide range of industries. The widespread data-intensive computing needs have inspired innovations in parallel and distributed computing, which has been the effective way to tackle massive computing workload for decades. One significant example is MapReduce, which is a programming model for expressing distributed computations on huge datasets, and an execution framework for data-intensive computing on commodity clusters as well. Since it was originally proposed by Google, MapReduce has become the most popular technology for data-intensive computing. While Google owns its proprietary implementation of MapReduce, an open source implementation called Hadoop has gained wide adoption in the rest of the world. The combination of Hadoop and Cloud platforms has made data-intensive computing much more accessible and affordable than ever before.This dissertation addresses the performance issue of data-intensive computing on Cloud platforms from three different aspects: task assignment, replica placement, and straggler identification. Both task assignment and replica placement are subjects closely related to load balancing, which is one of the key issues that can significantly affect the performance of parallel and distributed applications. While task assignment schemes strive to balance data processing load among cluster nodes to achieve minimum job completion time, replica placement policies aim to assign block replicas to cluster nodes according to their processing capabilities to exploit data locality to the maximum extent. Straggler identification is also one of the crucial issues data-intensive computing has to deal with, as the overall performance of parallel and distributed applications is often determined by the node with the lowest performance. The results of extensive evaluation tests confirm that the schemes/policies proposed in this dissertation can improve the performance of data-intensive applications running on Cloud platforms.
Title: Improving the performance of data-intensive computing on Cloud platforms.
38 views
13 downloads
Name(s): Dai, Wei, Author
Bassiouni, Mostafa, Committee Chair
Zou, Changchun, Committee Member
Wang, Jun, Committee Member
Lin, Mingjie, Committee Member
Bai, Yuanli, Committee Member
University of Central Florida, Degree Grantor
Type of Resource: text
Date Issued: 2017
Publisher: University of Central Florida
Language(s): English
Abstract/Description: Big Data such as Terabyte and Petabyte datasets are rapidly becoming the new norm for various organizations across a wide range of industries. The widespread data-intensive computing needs have inspired innovations in parallel and distributed computing, which has been the effective way to tackle massive computing workload for decades. One significant example is MapReduce, which is a programming model for expressing distributed computations on huge datasets, and an execution framework for data-intensive computing on commodity clusters as well. Since it was originally proposed by Google, MapReduce has become the most popular technology for data-intensive computing. While Google owns its proprietary implementation of MapReduce, an open source implementation called Hadoop has gained wide adoption in the rest of the world. The combination of Hadoop and Cloud platforms has made data-intensive computing much more accessible and affordable than ever before.This dissertation addresses the performance issue of data-intensive computing on Cloud platforms from three different aspects: task assignment, replica placement, and straggler identification. Both task assignment and replica placement are subjects closely related to load balancing, which is one of the key issues that can significantly affect the performance of parallel and distributed applications. While task assignment schemes strive to balance data processing load among cluster nodes to achieve minimum job completion time, replica placement policies aim to assign block replicas to cluster nodes according to their processing capabilities to exploit data locality to the maximum extent. Straggler identification is also one of the crucial issues data-intensive computing has to deal with, as the overall performance of parallel and distributed applications is often determined by the node with the lowest performance. The results of extensive evaluation tests confirm that the schemes/policies proposed in this dissertation can improve the performance of data-intensive applications running on Cloud platforms.
Identifier: CFE0006731 (IID), ucf:51896 (fedora)
Note(s): 2017-08-01
Ph.D.
Engineering and Computer Science, Electrical Engineering and Computer Engineering
Doctoral
This record was generated from author submitted information.
Subject(s): data-intensive computing -- Cloud computing -- MapReduce -- Hadoop -- task assignment -- replica placement -- load balancing -- straggler identification -- speculative execution
Persistent Link to This Record: http://purl.flvc.org/ucf/fd/CFE0006731
Restrictions on Access: public 2017-08-15
Host Institution: UCF

In Collections