Current Search: Wang, Liqiang
- Title
- Towards High-Performance Big Data Processing Systems.
- Creator
-
Zhang, Hong, Wang, Liqiang, Turgut, Damla, Wang, Jun, Zhang, Shunpu, University of Central Florida
- Abstract / Description
-
The amount of generated and stored data has been growing rapidly. It is estimated that 2.5 quintillion bytes of data are generated every day, and 90% of the data in the world today has been created in the last two years. How to solve these big data issues has become a hot topic in both industry and academia. Due to the complexity of big data platforms, we stratify them into four layers: storage layer, resource management layer, computing layer, and methodology layer. This dissertation proposes brand-new approaches to address the performance of big data platforms like Hadoop and Spark on these four layers. We first present an improved HDFS design called SMARTH, which optimizes the storage layer. It utilizes asynchronous multi-pipeline data transfers instead of a single-pipeline stop-and-wait mechanism. SMARTH records the actual transfer speed of data blocks and sends this information to the namenode along with periodic heartbeat messages. The namenode sorts datanodes according to their past performance and tracks this information continuously. When a client initiates an upload request, the namenode sends it a list of "high performance" datanodes that it expects will yield the highest throughput for the client. By choosing higher-performance datanodes relative to each client and by taking advantage of the multi-pipeline design, our experiments show that SMARTH significantly improves the performance of data write operations compared to HDFS. Specifically, SMARTH improves data transfer throughput by 27-245% in a heterogeneous virtual cluster on Amazon EC2. Secondly, we propose an optimized Hadoop extension called MRapid, which significantly speeds up the execution of short jobs on the resource management layer. It is completely backward compatible with Hadoop and imposes negligible overhead.
Our experiments on the Microsoft Azure public cloud show that MRapid can improve performance by up to 88% compared to the original Hadoop. Thirdly, we introduce an efficient 3-level sampling performance model, called Hedgehog, that focuses on the relationship between resources and performance. This design is a brand-new white-box model for Spark, which is more complex and challenging than Hadoop. In our tool, we employ ASM, a Java bytecode manipulation and analysis framework, to reduce the profiling overhead dramatically. Fourthly, on the computing layer, we optimize the current implementation of SGD in Spark's MLlib by reusing data partitions multiple times within a single iteration to find better candidate weights more efficiently. Whether to use multiple local iterations within each partition is decided dynamically by the 68-95-99.7 rule. We also design a variant of the momentum algorithm to optimize the step size in every iteration. This method uses a new adaptive rule that decreases the step size whenever neighboring gradients show significantly differing directions. Experiments show that our adaptive algorithm is more efficient and can be 7 times faster than the original MLlib's SGD. Finally, on the application layer, we present a scalable and distributed geographic information system, called Dart, based on Hadoop and HBase. Dart provides a hybrid table schema to store spatial data in HBase so that the Reduce process can be omitted for operations like calculating the mean center and the median center. It employs reasonable pre-splitting and hash techniques to avoid data imbalance and hot-region problems. It also supports massive spatial data analysis like K-Nearest Neighbors (KNN) and Geometric Median Distribution. In our experiments, we evaluate the performance of Dart by processing 160 GB of Twitter data on an Amazon EC2 cluster. The experimental results show that Dart is very scalable and efficient.
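The adaptive step-size rule described above (shrink the step whenever neighboring gradients point in significantly different directions) can be sketched roughly as follows. The decay factor, the negative-dot-product test, and the update form are illustrative assumptions, not the dissertation's actual formulas.

```python
import numpy as np

def adaptive_momentum_sgd(grad_fn, w0, lr=0.1, beta=0.9, decay=0.5, steps=100):
    """Toy SGD with momentum whose step size shrinks when consecutive
    gradients disagree in direction (a hypothetical sketch of the rule
    described in the abstract)."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    prev_g = None
    for _ in range(steps):
        g = grad_fn(w)
        # If the new gradient opposes the previous one (negative dot
        # product), assume we overshot and decrease the step size.
        if prev_g is not None and np.dot(g, prev_g) < 0:
            lr *= decay
        v = beta * v + g          # momentum accumulation
        w = w - lr * v            # parameter update
        prev_g = g
    return w

# Minimize f(w) = ||w||^2, whose gradient is 2w.
w_opt = adaptive_momentum_sgd(lambda w: 2 * w, [5.0, -3.0])
```

On this quadratic, the iterate overshoots, the dot-product test fires, and the learning rate halves at each direction reversal, damping the oscillation.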
- Date Issued
- 2018
- Identifier
- CFE0007271, ucf:52208
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007271
- Title
- Compiler Design of a Policy Specification Language for Conditional Gradual Release.
- Creator
-
Kashyap Harinath, Manasa, Leavens, Gary, Turgut, Damla, Wang, Liqiang, University of Central Florida
- Abstract / Description
-
Securing the confidentiality and integrity of information manipulated by computer software is an old yet increasingly important problem. Current software permission systems, such as those on Android or iOS, provide inadequate support for developing applications with secure information flow policies. To be useful, information flow control policies need to specify declassifications and the conditions under which declassification must occur. Having these declassifications scattered all over the program makes policies hard to find, which makes auditing difficult. To overcome these challenges, a policy specification language, 'Evidently', is presented that allows one to specify information flow control policies separately from the program and that supports conditional gradual releases that can be automatically enforced. I discuss the Evidently grammar and modular semantics in detail. Finally, I discuss the implementation details of the Evidently compiler within the Xtext language development environment and the implementation's enforcement of policies.
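The idea of conditional gradual release described above can be illustrated with a toy two-level information-flow check: a flow from a more confidential source to a less confidential sink is blocked unless an explicit declassification condition holds. This is a generic illustration only, not Evidently's actual syntax or semantics.

```python
# Toy lattice-based flow check with conditional declassification.
# The levels, API, and example condition are illustrative assumptions.
LEVELS = {"public": 0, "secret": 1}

def flow_allowed(src_level, dst_level, declass_condition=None):
    """A flow is allowed if the destination is at least as confidential as
    the source, or if a declassification condition evaluates to True."""
    if LEVELS[src_level] <= LEVELS[dst_level]:
        return True
    return bool(declass_condition and declass_condition())

# Releasing an aggregate is permitted only once enough records are mixed in.
record_count = 150
ok = flow_allowed("secret", "public", lambda: record_count >= 100)
blocked = flow_allowed("secret", "public", lambda: record_count >= 1000)
```

Keeping the condition attached to the policy (rather than scattered through program code) is what makes such declassifications auditable in one place.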
- Date Issued
- 2018
- Identifier
- CFE0007205, ucf:52274
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007205
- Title
- Detecting anomalies from big data system logs.
- Creator
-
Lu, Siyang, Wang, Liqiang, Zhang, Shaojie, Zhang, Wei, Wu, Dazhong, University of Central Florida
- Abstract / Description
-
Nowadays, big data systems (e.g., Hadoop and Spark) are being widely adopted by many domains, such as manufacturing, healthcare, education, and media, for offering effective data solutions. A common problem in big data systems is anomalies, i.e., statuses that deviate from normal execution, which decrease computation performance or kill running programs. Detecting anomalies and analyzing their causes is becoming a necessity. An effective and economical approach is to analyze system logs. Big data systems produce numerous unstructured logs that contain buried valuable information. However, manually detecting anomalies from system logs is a tedious and daunting task. This dissertation proposes four approaches that can accurately and automatically analyze anomalies from big data system logs without extra monitoring overhead. Moreover, to detect abnormal tasks in Spark logs and analyze root causes, we design a utility to conduct fault injection and collect logs from multiple compute nodes. (1) Our first method is a statistics-based approach that can locate abnormal tasks and calculate the weights of factors for analyzing the root causes. In the experiment, four potential root causes are considered, i.e., CPU, memory, network, and disk I/O. The experimental results show that the proposed approach is accurate in detecting abnormal tasks as well as finding the root causes. (2) To give a more reasonable probability result and avoid ad-hoc factor-weight calculation, we propose a neural network approach to analyze the root causes of abnormal tasks. We leverage a General Regression Neural Network (GRNN) to identify root causes for abnormal tasks. The likelihood of reported root causes is presented to users according to the factors weighted by the GRNN. (3) To further improve anomaly detection by avoiding feature extraction, we propose a novel approach leveraging Convolutional Neural Networks (CNN).
Our proposed model can automatically learn event relationships in system logs and detect anomalies with high accuracy. Our deep neural network consists of logkey2vec embeddings, three 1D convolutional layers, a dropout layer, and max pooling. According to our experiments, our CNN-based approach achieves better accuracy than approaches using Long Short-Term Memory (LSTM) and Multilayer Perceptron (MLP) networks in detecting anomalies in Hadoop Distributed File System (HDFS) logs. (4) To analyze system logs more accurately, we extend our CNN-based approach with two attention schemes to detect anomalies in system logs. The two attention schemes focus on different features from the CNN's output. We evaluate our approaches on several benchmarks, and the attention-based CNN model shows the best performance among all state-of-the-art methods.
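A statistics-based detector of abnormal tasks like approach (1) can be sketched with a simple outlier rule over task durations extracted from logs: flag tasks whose duration is far from the mean in standard-deviation units (the 95% band of the 68-95-99.7 rule). The actual dissertation method weighs multiple factors; this is a minimal stand-in.

```python
import statistics

def abnormal_tasks(durations, z_threshold=2.0):
    """Flag task indices whose duration lies more than z_threshold standard
    deviations from the mean -- a simple statistical outlier rule in the
    spirit of the statistics-based detection described above (the actual
    dissertation method is more involved)."""
    mean = statistics.fmean(durations)
    stdev = statistics.pstdev(durations)
    if stdev == 0:
        return []
    return [i for i, d in enumerate(durations)
            if abs(d - mean) / stdev > z_threshold]

# Task 5 runs far longer than its peers and is flagged as abnormal.
runs = [10.1, 9.8, 10.3, 10.0, 9.9, 60.0, 10.2, 10.1]
flagged = abnormal_tasks(runs)
```

Note that with a small number of tasks a single extreme value inflates the standard deviation, which is one reason the dissertation moves beyond ad-hoc thresholds to learned models (GRNN, CNN).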
- Date Issued
- 2019
- Identifier
- CFE0007673, ucf:52499
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007673
- Title
- Managing IO Resource for Co-running Data Intensive Applications in Virtual Clusters.
- Creator
-
Huang, Dan, Wang, Jun, Zhou, Qun, Sun, Wei, Zhang, Shaojie, Wang, Liqiang, University of Central Florida
- Abstract / Description
-
Today, big data computing platforms employ resource management systems such as YARN, Torque, Mesos, and Google Borg to enable sharing physical computing resources among many users or applications. Given virtualization and resource management systems, users are able to launch their applications on the same node with low mutual interference and low management overhead on CPU and memory. However, there are still challenges to be addressed before these systems can be fully adopted to manage the IO resources in Big Data File Systems (BDFS) and shared network facilities. In this study, we systematically study three IO management problems: the proportional sharing of block IO in container-based virtualization, network IO contention in MPI-based HPC applications, and data migration overhead in HPC workflows. To improve proportional sharing, we develop a prototype system called BDFS-Container by containerizing BDFS at the Linux block IO level. Central to BDFS-Container, we propose and design a proactive IOPS-throttling mechanism named IOPS Regulator, which improves proportional IO sharing under the BDFS IO pattern by 74.4% on average. For network IO resource management, we exploit virtual switches to facilitate network traffic manipulation and reduce mutual interference on the network for in-situ applications. In order to allocate network bandwidth dynamically when it is needed, we adopt SARIMA-based techniques to analyze and predict the MPI traffic issued from simulations. Third, to solve the data migration problem in small-to-medium-sized HPC clusters, we propose constructing a sided IO path, named SideIO, to explicitly direct analysis data to a BDFS that co-locates computation with data. By experimenting with two real-world scientific workflows, SideIO completely avoids the most expensive data movement overhead and achieves up to 3x speedups compared with current solutions.
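The general idea of proportional IOPS throttling can be sketched with a per-container token bucket: each container receives a per-second token budget in proportion to its share weight, and an IO request is admitted only if a token remains. This is a generic illustration, not the BDFS-Container / IOPS Regulator design; the class and its numbers are hypothetical.

```python
class IopsRegulator:
    """Minimal token-bucket sketch of proportional IOPS sharing between
    containers (illustrative only; not the dissertation's mechanism)."""

    def __init__(self, total_iops, shares):
        total = sum(shares.values())
        # Per-second token budget proportional to each container's share.
        self.budget = {c: total_iops * s / total for c, s in shares.items()}
        self.tokens = dict(self.budget)

    def refill(self):
        """Called once per second to restore every container's budget."""
        self.tokens = dict(self.budget)

    def admit(self, container):
        """Admit one IO request if the container still has budget."""
        if self.tokens[container] >= 1:
            self.tokens[container] -= 1
            return True
        return False

reg = IopsRegulator(total_iops=1000, shares={"a": 3, "b": 1})
# Container "a" gets a 750-IOPS budget, "b" gets 250.
```

Throttling at admission time is what makes the sharing *proactive*: a noisy container is slowed before it floods the shared block layer, rather than after contention is observed.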
- Date Issued
- 2018
- Identifier
- CFE0007195, ucf:52268
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007195
- Title
- Developing new power management and High-Reliability Schemes in Data-Intensive Environment.
- Creator
-
Wang, Ruijun, Wang, Jun, Jin, Yier, DeMara, Ronald, Zhang, Shaojie, Ni, Liqiang, University of Central Florida
- Abstract / Description
-
With the increasing popularity of data-intensive applications as well as large-scale computing and storage systems, current data centers and supercomputers often deal with extremely large datasets. To store and process this huge amount of data reliably and energy-efficiently, three major challenges should be taken into consideration by system designers. Firstly, power conservation: multicore processors (CMPs) have become mainstream in the current processor market because of the tremendous improvement in transistor density and the advancement of semiconductor technology. However, the increasing number of transistors on a single die or chip reveals a super-linear growth in power consumption [4]. Thus, how to balance system performance and power saving is a critical issue that needs to be solved effectively. Secondly, system reliability: reliability is a critical metric in the design and development of replication-based big data storage systems such as the Hadoop File System (HDFS). In a system with thousands of machines and storage devices, even infrequent failures become likely. In the Google File System, the annual disk failure rate is 2.88%, which means one would expect to see 8,760 disk failures in a year. Unfortunately, given an increasing number of node failures, how often a cluster starts losing data when being scaled out is not well investigated. Thirdly, energy efficiency: the fast processing speeds of the current generation of supercomputers provide a great convenience to scientists dealing with extremely large datasets. The next generation of "exascale" supercomputers could provide accurate simulation results for the automobile industry, aerospace industry, and even nuclear fusion reactors for the very first time. However, the energy cost of supercomputing is extremely high, with a total electricity bill of 9 million dollars per year.
Thus, conserving energy and increasing the energy efficiency of supercomputers has become critical in recent years. This dissertation proposes new solutions to address the above three key challenges for current large-scale storage and computing systems. Firstly, we propose a novel power management scheme called MAR (model-free, adaptive, rule-based) for multiprocessor systems to minimize CPU power consumption subject to performance constraints. By introducing a new I/O wait status, MAR is able to accurately describe the relationship between core frequencies, performance, and power consumption. Moreover, we adopt a model-free control method to filter out the I/O wait status from the traditional CPU busy/idle model in order to achieve fast responsiveness to burst situations and take full advantage of power saving. Our extensive experiments on a physical testbed demonstrate that, for SPEC benchmarks and data-intensive (TPC-C) benchmarks, an MAR prototype system achieves 95.8-97.8% of the accuracy of the ideal power-saving strategy calculated offline. Compared with baseline solutions, MAR is able to save 12.3-16.1% more power while maintaining a comparable performance loss of about 0.78-1.08%. In addition, further simulation results indicate that our design achieves 3.35-14.2% more power-saving efficiency and 4.2-10.7% less performance loss under various CMP configurations as compared with baseline approaches such as LAST, Relax, PID, and MPC. Secondly, we create a new reliability model that incorporates the probability of replica loss to investigate the system reliability of multi-way declustering data layouts and analyze their potential for parallel recovery. Our comprehensive simulation results in Matlab and SHARPE show that the shifted declustering data layout outperforms the random declustering layout in a multi-way replication scale-out architecture, in terms of data loss probability and system reliability, by up to 63% and 85% respectively.
Our study of both 5-year and 10-year system reliability under various recovery bandwidth settings shows that the shifted declustering layout surpasses the two baseline approaches in both cases, consuming up to 79% and 87% less recovery bandwidth than the copyset layout, as well as 4.8% and 10.2% less recovery bandwidth than the random layout. Thirdly, we develop a power-aware job scheduler by applying a rule-based control method and taking into account real-world power and speedup profiles to improve power efficiency while adhering to predetermined power constraints. Intensive simulation results show that our proposed method achieves the maximum utilization of computing resources as compared to baseline scheduling algorithms while keeping the energy cost under the threshold. Moreover, by introducing a Power Performance Factor (PPF) based on real-world power and speedup profiles, we are able to increase power efficiency by up to 75%.
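The power-aware scheduling idea above can be sketched as a greedy admission policy: rank jobs by speedup delivered per watt (a rough stand-in for a Power Performance Factor) and admit them until a power cap is reached. The function, job data, and ranking rule are illustrative assumptions; the dissertation's scheduler is rule-based and more sophisticated.

```python
def schedule_under_power_cap(jobs, power_cap):
    """Greedily admit jobs by speedup-per-watt under a power budget.
    jobs: list of (name, power_watts, speedup). Returns admitted names.
    A hypothetical sketch, not the dissertation's scheduler."""
    # Rank by speedup delivered per watt consumed (PPF-like ratio).
    ranked = sorted(jobs, key=lambda j: j[2] / j[1], reverse=True)
    admitted, used = [], 0.0
    for name, watts, _speedup in ranked:
        if used + watts <= power_cap:   # admit only if the cap still holds
            admitted.append(name)
            used += watts
    return admitted

jobs = [("sim", 300, 4.0), ("etl", 150, 3.0), ("ml", 400, 4.5)]
chosen = schedule_under_power_cap(jobs, power_cap=500)
```

Here "etl" (0.020 speedup/W) and "sim" (0.013 speedup/W) fit under the 500 W cap, while "ml" is deferred even though it has the largest raw speedup.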
- Date Issued
- 2016
- Identifier
- CFE0006704, ucf:51907
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006704
- Title
- Learning Robust Sequence Features via Dynamic Temporal Pattern Discovery.
- Creator
-
Hu, Hao, Wang, Liqiang, Zhang, Shaojie, Liu, Fei, Qi, GuoJun, Zhou, Qun, University of Central Florida
- Abstract / Description
-
As a major type of data, time series possess invaluable latent knowledge for describing the real world and human society. In order to improve the ability of intelligent systems to understand the world and people, it is critical to design sophisticated machine learning algorithms for extracting robust time series features from such latent knowledge. Motivated by the successful applications of deep learning in computer vision, more and more machine learning researchers have turned their attention to applying deep learning techniques to time series data. However, directly employing current deep models in most time series domains can be problematic. A major reason is that the temporal pattern types current deep models target are very limited, which cannot meet the requirement of modeling the different underlying patterns of data coming from various sources. In this study, we address this problem by designing different network structures explicitly based on specific domain knowledge, so that we can extract features via the most salient temporal patterns. More specifically, we mainly focus on two types of temporal patterns: order patterns and frequency patterns. For order patterns, which are usually related to brain and human activities, we design a hashing-based neural network layer to globally encode the ordinal pattern information into the resultant features. It is further generalized into a specially designed Recurrent Neural Network (RNN) cell which can learn order patterns in an online fashion. On the other hand, we believe audio-related data such as music and speech can benefit from modeling frequency patterns. We do so by developing two types of RNN cells. The first type tries to directly learn long-term dependencies in the frequency domain rather than the time domain. The second aims to dynamically filter out the "noise" frequencies based on temporal contexts.
By proposing various deep models based on different domain knowledge and evaluating them on extensive time series tasks, we hope this work can provide inspiration for others and increase the community's interest in applying deep learning techniques to more time series tasks.
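The "order patterns" the abstract refers to can be made concrete with the classic ordinal-pattern encoding: each sliding window is replaced by the permutation describing the relative order of its values, and a histogram of those permutations serves as a feature. This is the standard technique the hashing layer builds on, not the dissertation's neural layer itself.

```python
from itertools import permutations

def ordinal_pattern(window):
    """Return the permutation of ranks in a window, e.g. (0, 2, 1) means the
    first value is smallest, the second largest, the third in between."""
    ranks = sorted(range(len(window)), key=lambda i: window[i])
    pattern = [0] * len(window)
    for rank, idx in enumerate(ranks):
        pattern[idx] = rank
    return tuple(pattern)

def ordinal_histogram(series, m=3):
    """Histogram of order-m ordinal patterns over all sliding windows --
    a classic order-pattern feature (not the dissertation's hashing layer)."""
    counts = {p: 0 for p in permutations(range(m))}
    for i in range(len(series) - m + 1):
        counts[ordinal_pattern(series[i:i + m])] += 1
    return counts

hist = ordinal_histogram([1, 3, 2, 4, 6, 5], m=3)
```

Because only relative order matters, the feature is invariant to monotone rescaling of the signal, which is what makes ordinal patterns robust for noisy physiological and activity data.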
- Date Issued
- 2019
- Identifier
- CFE0007470, ucf:52679
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007470
- Title
- Arterial-level real-time safety evaluation in the context of proactive traffic management.
- Creator
-
Yuan, Jinghui, Abdel-Aty, Mohamed, Eluru, Naveen, Hasan, Samiul, Cai, Qing, Wang, Liqiang, University of Central Florida
- Abstract / Description
-
In the context of proactive traffic management, real-time safety evaluation is one of the most important components. Previous studies on real-time safety analysis mainly focused on freeways, seldom on arterials. With the advancement of sensing technologies and smart-city initiatives, more and more real-time traffic data sources are available on arterials, which enables us to evaluate real-time crash risk on arterials. However, there exist substantial differences between arterials and freeways in terms of traffic flow characteristics, data availability, and even crash mechanisms. Therefore, this study aims to evaluate real-time crash risk on arterials in depth, from multiple aspects, by integrating all available data sources. First, Bayesian conditional logistic (BCL) models were developed to examine the relationship between crash occurrence on arterial segments and real-time traffic and signal timing characteristics by incorporating Bluetooth, adaptive signal control, and weather data extracted from four urban arterials in Central Florida. Second, real-time intersection-approach-level crash risk was investigated by considering the effects of real-time traffic, signal timing, and weather characteristics based on 23 signalized intersections in Orange County. Third, a deep learning algorithm for real-time crash risk prediction at signalized intersections was proposed based on Long Short-Term Memory (LSTM) and the Synthetic Minority Over-sampling Technique (SMOTE). Moreover, in-depth cycle-level real-time crash risk at signalized intersections was explored based on high-resolution event-based data (i.e., Automated Traffic Signal Performance Measures (ATSPM)). All possible real-time cycle-level factors were considered, including traffic volume, signal timing, headway and occupancy, traffic variation between upstream and downstream detectors, shockwave characteristics, and weather conditions.
Above all, comprehensive real-time safety evaluation algorithms were developed for arterials, which would be key components of future real-time safety applications (e.g., a real-time crash risk prediction and visualization system) in the context of proactive traffic management.
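SMOTE, mentioned above for handling the rarity of crash events, generates synthetic minority samples by interpolating between a minority sample and one of its nearest neighbors. A minimal sketch follows; production implementations (e.g. imbalanced-learn) handle edge cases this toy version ignores.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Minimal SMOTE sketch: create synthetic minority samples by linear
    interpolation between a sample and one of its k nearest neighbors.
    Illustrates the oversampling idea only."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x by squared Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbors)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return synthetic

crashes = [(0.9, 0.1), (1.0, 0.2), (0.8, 0.15)]  # rare crash-cycle samples
new_samples = smote(crashes, n_new=5)
```

Each synthetic point lies on a segment between two real crash samples, so the oversampled training set stays inside the minority class's local region rather than duplicating points outright.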
- Date Issued
- 2019
- Identifier
- CFE0007743, ucf:52398
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007743
- Title
- Color-Ratio Based Strawberry Plant Localization and Nutrition Deficiency Detection.
- Creator
-
Kong, Xiangling, Xu, Yunjun, Elgohary, Tarek, Fu, Qiushi, Wu, Dazhong, Wang, Liqiang, University of Central Florida
- Abstract / Description
-
In recent years, precision agriculture has become popular, promising to partially meet the needs of an ever-growing population with limited resources. Plant localization and nutrient deficiency detection are two important tasks in precision agriculture. In this dissertation, these two tasks are studied using a new color-ratio (C-R) index technique. Firstly, a low-cost and light-scene-invariant approach is proposed to detect green and yellow leaves based on the C-R indices. A plant localization approach is then developed using the relative pixel relationships of adjacent plants. Secondly, the Sobel operator and morphology techniques are applied to segment the target strawberry leaf from a field image. The characteristic color for a specific nutrient deficiency is detected by the C-R indices. The pattern of the detected color on the leaf is then examined to determine the specific nutrient deficiency. The proposed approaches are validated in a commercial strawberry farm.
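The intuition behind color-ratio indices is that channel *ratios* are less sensitive to overall illumination than raw channel values, since uniform brightening scales all channels together. The abstract does not give the actual C-R index definitions, so the ratio forms and thresholds below are illustrative assumptions only.

```python
def classify_pixel(r, g, b, eps=1e-6):
    """Label an RGB pixel as 'green', 'yellow', or 'other' using simple
    channel ratios (hypothetical thresholds, not the dissertation's C-R
    indices)."""
    gr = g / (r + eps)   # green-to-red ratio
    gb = g / (b + eps)   # green-to-blue ratio
    rb = r / (b + eps)   # red-to-blue ratio
    if gr > 1.2 and gb > 1.2:
        return "green"   # green dominates both red and blue: healthy leaf
    if gb > 1.5 and rb > 1.5 and 0.8 < gr < 1.2:
        return "yellow"  # red and green comparable, both well above blue
    return "other"

label_leaf = classify_pixel(60, 140, 50)       # a green leaf pixel
label_deficient = classify_pixel(200, 190, 60) # a yellowed leaf pixel
```

Doubling all three channels leaves every ratio, and hence the label, unchanged, which is the sense in which ratio-based detection tolerates varying lighting.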
- Date Issued
- 2019
- Identifier
- CFE0007666, ucf:52482
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007666