Current Search: Distributed File System
- Title
- Researches on Reverse Lookup Problem in Distributed File System.
- Creator
- Zhang, Junyao, Wang, Jun, University of Central Florida
- Abstract / Description
- Recent years have witnessed an increasing demand for super data clusters. These clusters have reached the petabyte scale and can consist of thousands or tens of thousands of storage nodes at a single site. At this scale, reliability becomes a great concern, and achieving high reliability requires data recovery and node reconstruction. Although extensive research has investigated how to sustain high performance and high reliability in the face of node failures at large scale, the reverse lookup problem, namely finding the list of objects stored on a failed node, remains open. This is especially true for storage systems with strict requirements for data integrity and availability, such as scientific research data clusters. Existing solutions are either time consuming or expensive. Replication-based block placement can be used to realize fast reverse lookup, but existing placement schemes are designed for centralized, small-scale storage architectures. In this thesis, we propose a fast and efficient reverse lookup scheme, the Group-based Shifted Declustering (G-SD) layout, which is able to locate the whole content of a failed node. G-SD extends our previous shifted declustering layout to large-scale file systems. Our mathematical proofs and real-life experiments show that G-SD is a scalable reverse lookup scheme that is up to one order of magnitude faster than existing schemes.
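This record does not reproduce the G-SD equations, so the Python sketch below is only an illustration of the general idea the abstract relies on: when replica locations are a pure function of the block id (a made-up shift-based rule here, not the actual G-SD layout), the contents of a failed node can be recomputed rather than read back from a centralized metadata table. All constants and function names are hypothetical.

```python
# Illustrative sketch only: the placement rule is a stand-in, not G-SD.
# It shows why deterministic placement makes reverse lookup cheap.

NUM_NODES = 8   # hypothetical cluster size
REPLICAS = 3    # hypothetical replication degree

def place(block_id: int) -> list[int]:
    """Deterministic, shift-style placement of a block's replicas.

    Replica r of block b goes to node (b + r * shift) mod NUM_NODES.
    The shift is capped so the REPLICAS offsets land on distinct nodes.
    """
    shift = 1 + (block_id // NUM_NODES) % ((NUM_NODES - 1) // (REPLICAS - 1))
    return [(block_id + r * shift) % NUM_NODES for r in range(REPLICAS)]

def reverse_lookup(failed_node: int, num_blocks: int) -> list[int]:
    """List every block that kept a replica on the failed node.

    This sketch simply re-evaluates place() for each block id; a real
    deterministic layout would invert the placement formula directly.
    Either way, no centralized block map has to be traversed.
    """
    return [b for b in range(num_blocks) if failed_node in place(b)]

if __name__ == "__main__":
    print(reverse_lookup(failed_node=3, num_blocks=32))
```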
- Date Issued
- 2010
- Identifier
- CFE0003504, ucf:48970
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003504
- Title
- Research on Improving Reliability, Energy Efficiency and Scalability in Distributed and Parallel File Systems.
- Creator
- Zhang, Junyao, Wang, Jun, Zhang, Shaojie, Lee, Jooheung, University of Central Florida
- Abstract / Description
- With the increasing popularity of cloud computing and "Big Data" applications, current data centers are often required to manage petabytes or exabytes of data. Storing this huge amount of data requires thousands or tens of thousands of storage nodes at a single site, which imposes three major challenges on storage system designers. (1) Reliability: node failure in these data centers is a normal occurrence rather than a rare event, which makes data reliability a great concern. (2) Energy efficiency: a data center can consume up to 100 times more energy than a standard office building, and more than 10% of this consumption can be attributed to storage systems, so reducing the energy consumption of the storage system is key to reducing the overall consumption of the data center. (3) Scalability: with the continuously increasing size of data, maintaining the scalability of the storage system is essential; that is, the system should expand efficiently and without limits on the total number of storage nodes or on performance. This thesis proposes three ways to improve these three key features of current large-scale storage systems.

Firstly, we define the problem of reverse lookup, namely finding the list of objects (blocks) stored on a failed node. As the first step of failure recovery, this process directly determines the recovery/reconstruction time. Existing solutions use metadata traversal or data-distribution reversing for reverse lookup, which are either time consuming or expensive, whereas deterministic block placement can achieve fast and efficient reverse lookup. However, existing deterministic placement solutions are designed for centralized, small-scale storage architectures such as RAID, and their lack of scalability prevents them from being applied directly in large-scale storage systems. We propose Group-Shifted Declustering (G-SD), a deterministic data layout for multi-way replication. G-SD addresses the scalability issue of our previous Shifted Declustering layout and supports fast and efficient reverse lookup.

Secondly, we ask how to balance performance, energy, and recovery when an energy-efficient storage system operates in degraded mode. While extensive research has traded performance for energy efficiency under normal operation, the system enters degraded mode when a node fails and reconstruction is initiated. Reconstruction requires a number of disks to be spun up and claims a substantial amount of I/O bandwidth, compromising not only energy efficiency but also performance. We find that current energy-proportional solutions, which do not consider the I/O bandwidth contention between recovery and foreground work, cannot answer this question accurately. This thesis presents PERP, a mathematical model that minimizes the energy consumption of a storage system subject to performance and recovery constraints; PERP answers the question by providing the number of active nodes and the recovery bandwidth to assign at each time frame.

Thirdly, current distributed file systems such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS) employ a pseudo-random method for replica distribution and a centralized lookup table (block map) to record all replica locations. This lookup table requires a large amount of memory and consumes a considerable amount of CPU and network resources on the metadata server. With the booming size of "Big Data", the metadata server becomes a scalability and performance bottleneck. While current approaches such as HDFS Federation attempt to extend scalability "horizontally" by allowing multiple metadata servers, we believe a more promising option is to scale each metadata server up "vertically". We propose Deister, a novel block management scheme built on top of a deterministic declustering distribution method, Intersected Shifted Declustering (ISD), so that both replica distribution and location lookup can be achieved without a centralized lookup table.
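The second contribution above, PERP, is described only at a high level: it outputs the number of active nodes and the recovery bandwidth to assign at each time frame. Since the record does not reproduce the model itself, the formulation below is merely one plausible way to write down that trade-off; every symbol (P, B_node, B^perf, D_lost, and so on) is an assumption introduced here, not PERP's actual notation.

```latex
% Illustrative formulation, not the actual PERP model: for each time
% frame t, choose the number of active nodes n_t and the bandwidth
% B^{rec}_t reserved for reconstruction so that energy is minimized,
% a performance floor is met, and the lost data is rebuilt within T
% frames.  P(n) is the power draw with n nodes active, B_node the
% per-node I/O bandwidth, B^{perf}_t the foreground demand, and
% D_lost the amount of data to reconstruct.
\[
\begin{aligned}
\min_{n_t,\; B^{\mathrm{rec}}_t}\quad
  & \sum_{t=1}^{T} P(n_t)\,\Delta t \\
\text{s.t.}\quad
  & n_t B_{\mathrm{node}} - B^{\mathrm{rec}}_t \;\ge\; B^{\mathrm{perf}}_t
    \quad\text{(foreground performance preserved)} \\
  & \sum_{t=1}^{T} B^{\mathrm{rec}}_t\,\Delta t \;\ge\; D_{\mathrm{lost}}
    \quad\text{(failed node fully rebuilt within $T$ frames)} \\
  & 0 \le B^{\mathrm{rec}}_t \le n_t B_{\mathrm{node}}, \qquad
    n_t \in \{1,\dots,N\}
\end{aligned}
\]
```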
- Date Issued
- 2015
- Identifier
- CFE0006238, ucf:51082
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006238
- Title
- Improvement of Data-Intensive Applications Running on Cloud Computing Clusters.
- Creator
- Ibrahim, Ibrahim, Bassiouni, Mostafa, Lin, Mingjie, Zhou, Qun, Ewetz, Rickard, Garibay, Ivan, University of Central Florida
- Abstract / Description
- MapReduce, designed by Google, is the most popular distributed programming model in cloud environments. Hadoop, an open-source implementation of MapReduce, is a data management framework for running data-intensive applications on large clusters of commodity machines. Many well-known enterprises, including Facebook, Twitter, and Adobe, use Hadoop for their data-intensive processing needs. Task stragglers in MapReduce jobs dramatically impede job execution on massive datasets in cloud computing systems. This impedance stems from the uneven distribution of input data and computation load among cluster nodes, heterogeneous data nodes, data skew in the reduce phase, resource contention, and network configuration, all of which can cause delays, failures, and violations of job completion time.

One of the key issues that significantly affects the performance of cloud computing is balancing the computation load among cluster nodes. Replica placement in the Hadoop Distributed File System (HDFS) plays a significant role in data availability and in the balanced utilization of the cluster. Under the current replica placement policy (RPP) of HDFS, replicas of data blocks are not evenly distributed across the cluster's nodes, so HDFS must rely on a load-balancing utility, which incurs extra time and resource overhead. This dissertation addresses the data load-balancing problem and presents an innovative replica placement policy for HDFS that evenly balances the data load among the cluster's nodes. Because the heterogeneity of cluster nodes exacerbates computational load imbalance, a second replica placement algorithm is proposed for heterogeneous cluster environments.

The timing of identifying a straggler map task is critical for straggler mitigation in data-intensive cloud computing. To mitigate straggler map tasks, this dissertation proposes the Present progress and Feedback based Speculative Execution (PFSE) algorithm, a new straggler identification scheme that identifies straggler map tasks based on feedback from completed tasks as well as the progress of the currently running task.

Straggler reduce tasks aggravate violations of MapReduce job completion time and are typically the result of poor data partitioning in the reduce phase: the hash partitioner employed by Hadoop may cause intermediate data skew, which produces straggler reduce tasks. This dissertation proposes a new partitioning scheme, the Balanced Data Clusters Partitioner (BDCP), based on sampling of the input data and feedback about the currently processing task. BDCP assists in straggler mitigation during the reduce phase and minimizes job completion time in MapReduce jobs. The results of extensive experiments corroborate that the algorithms and policies proposed in this dissertation improve the performance of data-intensive applications running on cloud platforms.
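The last paragraph above attributes straggler reduce tasks to skew from Hadoop's hash partitioner and describes BDCP as sampling-based, but BDCP's actual algorithm is not given in this record. The Python sketch below therefore only contrasts plain hash partitioning with a generic sampling-based, load-aware key assignment (greedy bin packing over a key sample); every name and number in it is illustrative, and it is not the BDCP scheme itself.

```python
# Illustrative sketch only, not BDCP: hash partitioning sends each key
# to reducer hash(key) % R regardless of how much data that key carries,
# while a table built from a sample can pack heavy keys so per-reducer
# load stays balanced.

from collections import Counter
import heapq

def hash_partition(key: str, num_reducers: int) -> int:
    """Analogue of Hadoop's default HashPartitioner: modulo of the key hash."""
    return hash(key) % num_reducers

def sampled_partition_table(sample_keys: list[str], num_reducers: int) -> dict[str, int]:
    """Build a key -> reducer table from a sample of intermediate keys.

    Greedy bin packing: take keys from heaviest to lightest and always
    hand the next key to the currently lightest-loaded reducer.
    """
    loads = [(0, r) for r in range(num_reducers)]   # (estimated load, reducer id)
    heapq.heapify(loads)
    table = {}
    for key, weight in Counter(sample_keys).most_common():
        load, r = heapq.heappop(loads)
        table[key] = r
        heapq.heappush(loads, (load + weight, r))
    return table

if __name__ == "__main__":
    # Skewed toy sample: key "a" carries half of the records.
    sample = ["a"] * 50 + ["b"] * 30 + ["c"] * 10 + ["d"] * 10
    table = sampled_partition_table(sample, num_reducers=2)
    hash_load = Counter(hash_partition(k, 2) for k in sample)
    balanced_load = Counter(table[k] for k in sample)
    print("hash partitioner load per reducer:   ", dict(hash_load))
    print("sampled partitioner load per reducer:", dict(balanced_load))
```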
- Date Issued
- 2019
- Identifier
- CFE0007818, ucf:52804
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007818