You are here

CONCENTRIC LAYOUT, A NEW SCIENTIFIC DATA LAYOUT FOR MATRIX DATA SET IN HADOOP FILE SYSTEM

Download pdf | Full Screen View

Date Issued:
2010
Abstract/Description:
The data generated by scientific simulation, sensor, monitor or optical telescope has increased with dramatic speed. In order to analyze the raw data speed and space efficiently, data pre-process operation is needed to achieve better performance in data analysis phase. Current research shows an increasing tread of adopting MapReduce framework for large scale data processing. However, the data access patterns which generally applied to scientific data set are not supported by current MapReduce framework directly. The gap between the requirement from analytics application and the property of MapReduce framework motivates us to provide support for these data access patterns in MapReduce framework. In our work, we studied the data access patterns in matrix files and proposed a new concentric data layout solution to facilitate matrix data access and analysis in MapReduce framework. Concentric data layout is a data layout which maintains the dimensional property in chunk level. Contrary to the continuous data layout which adopted in current Hadoop framework by default, concentric data layout stores the data from the same sub-matrix into one chunk. This matches well with the matrix operations like computation. The concentric data layout preprocesses the data beforehand, and optimizes the afterward run of MapReduce application. The experiments indicate that the concentric data layout improves the overall performance, reduces the execution time by 38% when the file size is 16 GB, also it relieves the data overhead phenomenon and increases the effective data retrieval rate by 32% on average.
Title: CONCENTRIC LAYOUT, A NEW SCIENTIFIC DATA LAYOUT FOR MATRIX DATA SET IN HADOOP FILE SYSTEM.
54 views
20 downloads
Name(s): cheng, lu, Author
wang, jun, Committee Chair
University of Central Florida, Degree Grantor
Type of Resource: text
Date Issued: 2010
Publisher: University of Central Florida
Language(s): English
Abstract/Description: The data generated by scientific simulation, sensor, monitor or optical telescope has increased with dramatic speed. In order to analyze the raw data speed and space efficiently, data pre-process operation is needed to achieve better performance in data analysis phase. Current research shows an increasing tread of adopting MapReduce framework for large scale data processing. However, the data access patterns which generally applied to scientific data set are not supported by current MapReduce framework directly. The gap between the requirement from analytics application and the property of MapReduce framework motivates us to provide support for these data access patterns in MapReduce framework. In our work, we studied the data access patterns in matrix files and proposed a new concentric data layout solution to facilitate matrix data access and analysis in MapReduce framework. Concentric data layout is a data layout which maintains the dimensional property in chunk level. Contrary to the continuous data layout which adopted in current Hadoop framework by default, concentric data layout stores the data from the same sub-matrix into one chunk. This matches well with the matrix operations like computation. The concentric data layout preprocesses the data beforehand, and optimizes the afterward run of MapReduce application. The experiments indicate that the concentric data layout improves the overall performance, reduces the execution time by 38% when the file size is 16 GB, also it relieves the data overhead phenomenon and increases the effective data retrieval rate by 32% on average.
Identifier: CFE0003537 (IID), ucf:48955 (fedora)
Note(s): 2010-12-01
M.S.
Engineering and Computer Science, School of Electrical Engineering and Computer Science
Masters
This record was generated from author submitted information.
Subject(s): Hadoop File System
Data Layout
Persistent Link to This Record: http://purl.flvc.org/ucf/fd/CFE0003537
Restrictions on Access: public
Host Institution: UCF

In Collections