Current Search: Zhang, Shaojie
- Title
- IMPROVING FMRI CLASSIFICATION THROUGH NETWORK DECONVOLUTION.
- Creator
-
Martinek, Jacob, Zhang, Shaojie, University of Central Florida
- Abstract / Description
-
The structure of regional correlation graphs built from fMRI-derived data is frequently used in algorithms to automatically classify brain data. Transformation on the data is performed during pre-processing to remove irrelevant or inaccurate information to ensure that an accurate representation of the subject's resting-state connectivity is attained. Our research suggests and confirms that such pre-processed data still exhibits inherent transitivity, which is expected to obscure the true relationships between regions. This obfuscation prevents known solutions from developing an accurate understanding of a subject's functional connectivity. By removing correlative transitivity, connectivity between regions is made more specific and automated classification is expected to improve. The task of utilizing fMRI to automatically diagnose Attention Deficit/Hyperactivity Disorder was posed by the ADHD-200 Consortium in a competition to draw in researchers and new ideas from outside of the neuroimaging discipline. Researchers have since worked with the competition dataset to produce ever-increasing detection rates. Our approach was empirically tested with a known solution to this problem to compare processing of treated and untreated data, and the detection rates were shown to improve in all cases with a weighted average increase of 5.88%.
- Date Issued
- 2015
- Identifier
- CFH0004895, ucf:45410
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFH0004895
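The deconvolution step behind this entry is not spelled out in the abstract, so the sketch below only illustrates the general idea of removing correlative transitivity, using the standard closed form for network deconvolution (G_dir = G_obs(I + G_obs)^-1, applied to the eigenvalues of a symmetric correlation matrix). The synthetic time series, matrix sizes, and the zeroed diagonal are illustrative assumptions rather than the thesis's actual fMRI pre-processing pipeline.

```python
import numpy as np

def network_deconvolution(g_obs):
    # Closed-form deconvolution: assuming the observed matrix is the sum of the
    # direct dependency matrix and all of its transitive powers,
    # G_obs = G_dir + G_dir^2 + ..., the direct part is G_dir = G_obs (I + G_obs)^{-1},
    # computed here on the eigenvalues of the symmetric input
    # (eigenvalues equal to -1 would need special handling).
    vals, vecs = np.linalg.eigh(g_obs)
    vals_dir = vals / (1.0 + vals)
    return vecs @ np.diag(vals_dir) @ vecs.T

# Illustrative use on a synthetic region-by-region correlation matrix.
rng = np.random.default_rng(0)
ts = rng.standard_normal((200, 90))          # 200 time points, 90 brain regions
corr = np.corrcoef(ts, rowvar=False)
np.fill_diagonal(corr, 0.0)                  # deconvolve off-diagonal dependencies only
direct = network_deconvolution(corr)
```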
- Title
- NEW COMPUTATIONAL APPROACHES FOR MULTIPLE RNA ALIGNMENT AND RNA SEARCH.
- Creator
-
DeBlasio, Daniel, Zhang, Shaojie, University of Central Florida
- Abstract / Description
-
In this thesis we explore the theory and history behind RNA alignment. Normal sequence alignments as studied by computer scientists can be completed in $O(n^2)$ time in the naive case. The process involves taking two input sequences and finding the list of edits that can transform one sequence into the other. This process is applied to biology in many forms, such as the creation of multiple alignments and the search of genomic sequences. When the RNA sequence structure is taken into account, the problem becomes even harder. Multiple RNA structure alignment is particularly challenging because covarying mutations make sequence information alone insufficient. Existing tools for multiple RNA alignments first generate pair-wise RNA structure alignments and then build the multiple alignment using only the sequence information. Here we present PMFastR, an algorithm which iteratively uses a sequence-structure alignment procedure to build a multiple RNA structure alignment. PMFastR also has low memory consumption, allowing for the alignment of large sequences such as 16S and 23S rRNA. Specifically, we reduce the memory consumption to $\sim O(band^2 \cdot m)$, where $band$ is the banding size. Other solutions are $\sim O(n^2 \cdot m)$, where $n$ and $m$ are the lengths of the target and query respectively. The algorithm also provides a method to utilize a multi-core environment. We present results on benchmark data sets from BRAliBase, which show that PMFastR outperforms other state-of-the-art programs. Furthermore, we regenerate 607 Rfam seed alignments and show that our automated process creates multiple alignments similar to the manually curated Rfam seed alignments. While these methods can also be applied directly to genome sequence search, the abundance of new multiple-species genome alignments presents a new area for exploration. Many multiple alignments of whole genomes are available and these alignments keep growing in size. These alignments can provide more information to the searcher than just a single sequence. Using the methodology from sequence-structure alignment, we developed AlnAlign, which searches an entire genome alignment using RNA sequence structure. While programs have been readily available to align alignments, this is the first to our knowledge that is specifically designed for RNA sequences. This algorithm is presented only in theory and is yet to be tested.
- Date Issued
- 2009
- Identifier
- CFE0002736, ucf:48166
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002736
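The abstract above refers to the naive $O(n^2)$ dynamic program for sequence alignment as the baseline that PMFastR builds on. A minimal sketch of that baseline (plain edit distance, without the banding, structure information, or multi-core support the thesis adds) is shown below; the function name and toy sequences are only for illustration.

```python
def edit_distance(s, t):
    # Naive O(len(s) * len(t)) dynamic program for the minimum number of
    # single-character insertions, deletions, and substitutions turning s into t.
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete s[i-1]
                           dp[i][j - 1] + 1,         # insert t[j-1]
                           dp[i - 1][j - 1] + cost)  # match / substitute
    return dp[n][m]

assert edit_distance("ACGU", "AGGU") == 1
```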
- Title
- Predicting Students' Academic Performance with Decision Tree and Neural Network.
- Creator
-
Feng, Junshuai, Jha, Sumit Kumar, Zhang, Wei, Zhang, Shaojie, University of Central Florida
- Abstract / Description
-
Educational Data Mining (EDM) is a developing research field that involves many techniques to explore data relating to educational background. EDM can analyze and resolve educational data with computational methods to address educational questions. Similar to EDM, neural networks have been utilized in widespread and successful data mining applications. In this paper, synthetic datasets are employed since this paper aims to explore methodologies such as decision tree classifiers and neural networks to predict student performance in the context of EDM. Firstly, it introduces EDM and some related works that have been accomplished previously in this field, along with their datasets and computational results. Then, it demonstrates how the synthetic student dataset is generated, analyzes some input attributes from the dataset such as gender and high school GPA, and presents some visualization results to determine which classification approaches are the most efficient. After testing the data with decision tree classifiers and neural network methodologies, it assesses the effectiveness of both approaches in terms of model evaluation performance and discusses some of the most promising future work of this research.
- Date Issued
- 2019
- Identifier
- CFE0007455, ucf:52680
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007455
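As an illustration of the decision-tree side of this thesis, the sketch below trains a scikit-learn classifier on a made-up synthetic student dataset. The attribute names (high school GPA, gender, study hours), the label rule, and all parameter choices are assumptions for demonstration only, not the dataset generator or models evaluated in the thesis.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical synthetic student records (attributes and label rule invented here).
rng = np.random.default_rng(42)
n = 1000
hs_gpa = rng.uniform(2.0, 4.0, n)
gender = rng.integers(0, 2, n)
study_hours = rng.uniform(0, 30, n)
X = np.column_stack([hs_gpa, gender, study_hours])
# "Good performance" driven mostly by GPA and study time, plus noise.
y = ((hs_gpa + study_hours / 15 + rng.normal(0, 0.3, n)) > 3.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```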
- Title
- Decision-making for Vehicle Path Planning.
- Creator
-
Xu, Jun, Turgut, Damla, Zhang, Shaojie, Zhang, Wei, Hasan, Samiul, University of Central Florida
- Abstract / Description
-
This dissertation presents novel algorithms for vehicle path planning in scenarios where the environment changes. In these dynamic scenarios the path of the vehicle needs to adapt to changes in the real world. In these scenarios, higher-performance paths can be achieved if we are able to predict the future state of the world by learning the way it evolves from historical data. We rely on recent advances in the fields of deep learning and reinforcement learning to learn appropriate world models and path planning behaviors. There are many different practical applications that map to this model. In this dissertation we propose algorithms for two applications that are very different in domain but share important formal similarities: the scheduling of taxi services in a large city and tracking wild animals with an unmanned aerial vehicle. The first application models a centralized taxi dispatch center in a big city. It is a multivariate optimization problem for taxi time scheduling and path planning. The first goal here is to balance the taxi service demand and supply ratio in the city. The second goal is to minimize passenger waiting time and taxi idle driving distance. We design different learning models that capture taxi demand and destination distribution patterns from historical taxi data. The predictions are evaluated with real-world taxi trip records. The predicted taxi demand and destination is used to build a taxi dispatch model. The taxi assignment and re-balance is optimized by solving a Mixed Integer Programming (MIP) problem. The second application concerns animal monitoring using an unmanned aerial vehicle (UAV) to search and track wild animals in a large geographic area. We propose two different path planning approaches for the UAV. The first one is based on the UAV controller solving a Markov decision process (MDP). The second algorithm relies on past recorded animal appearances. We designed a learning model that captures animal appearance patterns and predicts the distribution of future animal appearances. We compare the proposed path planning approaches with traditional methods and evaluate them in terms of collected value of information (VoI), message delay and percentage of events collected.
- Date Issued
- 2019
- Identifier
- CFE0007557, ucf:52606
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007557
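The taxi dispatch model above is solved as a Mixed Integer Program; the sketch below shows only its simplest core, matching idle taxis to predicted pickup requests as a minimum-cost assignment with SciPy. The coordinates and the straight-line cost are invented, and the dissertation's full formulation (demand prediction, re-balancing, time scheduling) is considerably richer than this.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical idle taxis and predicted pickup requests on a 10 x 10 grid.
rng = np.random.default_rng(1)
taxis = rng.uniform(0, 10, (5, 2))
requests = rng.uniform(0, 10, (5, 2))

# Cost = straight-line driving distance from each taxi to each request.
cost = np.linalg.norm(taxis[:, None, :] - requests[None, :, :], axis=2)

# Minimum-cost one-to-one assignment of taxis to requests.
taxi_idx, request_idx = linear_sum_assignment(cost)
print(list(zip(taxi_idx, request_idx)), cost[taxi_idx, request_idx].sum())
```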
- Title
- Detecting Anomalies from Big Data System Logs.
- Creator
-
Lu, Siyang, Wang, Liqiang, Zhang, Shaojie, Zhang, Wei, Wu, Dazhong, University of Central Florida
- Abstract / Description
-
Nowadays, big data systems (e.g., Hadoop and Spark) are being widely adopted by many domains, such as manufacturing, healthcare, education, and media, for offering effective data solutions. A common problem in big data systems is the anomaly, i.e., a status that deviates from normal execution, which decreases the performance of computation or kills running programs. It is becoming a necessity to detect anomalies and analyze their causes. An effective and economical approach is to analyze system logs. Big data systems produce numerous unstructured logs that contain buried valuable information. However, manually detecting anomalies from system logs is a tedious and daunting task. This dissertation proposes four approaches that can accurately and automatically analyze anomalies from big data system logs without extra monitoring overhead. Moreover, to detect abnormal tasks in Spark logs and analyze root causes, we design a utility to conduct fault injection and collect logs from multiple compute nodes. (1) Our first method is a statistical-based approach that can locate those abnormal tasks and calculate the weights of factors for analyzing the root causes. In the experiment, four potential root causes are considered, i.e., CPU, memory, network, and disk I/O. The experimental results show that the proposed approach is accurate in detecting abnormal tasks as well as finding the root causes. (2) To give a more reasonable probability result and avoid ad hoc calculation of factor weights, we propose a neural network approach to analyze root causes of abnormal tasks. We leverage a General Regression Neural Network (GRNN) to identify root causes for abnormal tasks. The likelihood of reported root causes is presented to users according to the weighted factors by GRNN. (3) To further improve anomaly detection by avoiding feature extraction, we propose a novel approach by leveraging Convolutional Neural Networks (CNN). Our proposed model can automatically learn event relationships in system logs and detect anomalies with high accuracy. Our deep neural network consists of logkey2vec embeddings, three 1D convolutional layers, a dropout layer, and max pooling. According to our experiments, our CNN-based approach has better accuracy compared to other approaches using Long Short-Term Memory (LSTM) and Multilayer Perceptron (MLP) for detecting anomalies in Hadoop Distributed File System (HDFS) logs. (4) To analyze system logs more accurately, we extend our CNN-based approach with two attention schemes to detect anomalies in system logs. The proposed two attention schemes focus on different features from the CNN's output. We evaluate our approaches with several benchmarks, and the attention-based CNN model shows the best performance among all state-of-the-art methods.
- Date Issued
- 2019
- Identifier
- CFE0007673, ucf:52499
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007673
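Approach (3) above combines logkey2vec embeddings, three 1D convolutional layers, dropout, and max pooling. The PyTorch sketch below reads that description in a TextCNN-like way, with three parallel convolutions of different kernel sizes; the vocabulary size, layer widths, kernel sizes, and two-class output are assumptions, not the dissertation's exact architecture.

```python
import torch
import torch.nn as nn

class LogCNN(nn.Module):
    """Sketch: log-key embedding, three parallel 1D convolutions, max pooling
    over time, dropout, and a binary normal/anomaly classifier."""
    def __init__(self, vocab_size, embed_dim=32, num_filters=64,
                 kernel_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, log_keys):                    # log_keys: (batch, seq_len) int ids
        x = self.embed(log_keys).transpose(1, 2)    # (batch, embed_dim, seq_len)
        # Max-pool each convolution's output over the time dimension.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        z = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(z)                           # logits: (batch, num_classes)

# Usage sketch: 50 log-key ids per session, batch of 8 sessions.
model = LogCNN(vocab_size=100)
logits = model(torch.randint(0, 100, (8, 50)))
```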
- Title
- Student Community Detection and Recommendation of Customized Paths to Reinforce Academic Success.
- Creator
-
Shao, Yuan, Jha, Sumit Kumar, Zhang, Wei, Zhang, Shaojie, University of Central Florida
- Abstract / Description
-
Educational Data Mining (EDM) is a research area that analyzes educational data and extracts interesting and unique information to address education issues. EDM implements computational methods to explore data for the purpose of studying questions related to educational achievements. A common task in an educational environment is the grouping of students and the identification of communities that have common features. These communities of students may then be studied by a course developer to build a personalized learning system, promote effective group learning, provide adaptive content, etc. The objective of this thesis is to find an approach to detect student communities and analyze students who do well academically with particular sequences of classes in each community. Then, we compute one or more sequences of courses that a student in a community may pursue to improve their chances of achieving good academic performance.
- Date Issued
- 2019
- Identifier
- CFE0007529, ucf:52623
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007529
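The abstract does not name a specific community detection algorithm, so the sketch below is only one plausible illustration: a small co-enrollment-style student graph clustered with NetworkX's modularity-based method. The graph, the edge weights, and the choice of greedy modularity maximization are all assumptions.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical graph: students connected when they took many courses together.
G = nx.Graph()
G.add_weighted_edges_from([
    ("s1", "s2", 5), ("s1", "s3", 4), ("s2", "s3", 6),   # one study cluster
    ("s4", "s5", 7), ("s5", "s6", 3), ("s4", "s6", 4),   # another cluster
    ("s3", "s4", 1),                                      # weak bridge between them
])

communities = greedy_modularity_communities(G, weight="weight")
for i, c in enumerate(communities):
    print(f"community {i}: {sorted(c)}")
```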
- Title
- Reducing the Overhead of Memory Space, Network Communication and Disk I/O for Analytic Frameworks in Big Data Ecosystem.
- Creator
-
Zhang, Xuhong, Wang, Jun, Fan, Deliang, Lin, Mingjie, Zhang, Shaojie, University of Central Florida
- Abstract / Description
-
To facilitate big data processing, many distributed analytic frameworks and storage systems such as Apache Hadoop, Apache Hama, Apache Spark and Hadoop Distributed File System (HDFS) have been developed. Currently, many researchers are conducting research to either make them more scalable or enable them to support more analysis applications. In my PhD study, I conducted three main works on this topic: minimizing the communication delay in Apache Hama, minimizing the memory space and computational overhead in HDFS, and minimizing the disk I/O overhead for approximation applications in the Hadoop ecosystem. Specifically, in Apache Hama, communication delay makes up a large percentage of the overall graph processing time. While most recent research has focused on reducing the number of network messages, we add a runtime communication and computation scheduler to overlap them as much as possible. As a result, communication delay can be mitigated. In HDFS, the block location table and its corresponding maintenance can occupy more than half of the memory space and 30% of the processing capacity of the master node, which severely limits the scalability and performance of the master node. We propose Deister, which uses deterministic mathematical calculations to eliminate the huge table for storing the block locations and its corresponding maintenance. My third work proposes to enable both efficient and accurate approximations on arbitrary sub-datasets of a large dataset. Existing offline-sampling-based approximation systems are not adaptive to dynamic query workloads, and online-sampling-based approximation systems suffer from low I/O efficiency and poor estimation accuracy. Therefore, we develop a distribution-aware method called Sapprox. Our idea is to collect the occurrences of a sub-dataset at each logical partition of a dataset (storage distribution) in the distributed system at a very small cost, and make good use of such information to facilitate online sampling.
- Date Issued
- 2017
- Identifier
- CFE0007299, ucf:52149
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007299
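The third work described above collects per-partition occurrence counts of a sub-dataset and uses them to drive online sampling. The sketch below illustrates one textbook way such counts can be used: an inverse-probability (Hansen-Hurwitz style) estimate over partitions sampled proportionally to their occurrences. The partition contents, the "target" key, and the estimator itself are illustrative assumptions and not necessarily Sapprox's actual mechanism.

```python
import random

def occurrence_weighted_estimate(partitions, occurrences, n_samples,
                                 rng=random.Random(0)):
    # Sample partitions with probability proportional to how often the
    # sub-dataset occurs in them, then weight each sampled partition's exact
    # sub-total by the inverse of its sampling probability (unbiased for the
    # true sub-dataset total).
    total_occ = sum(occurrences)
    probs = [c / total_occ for c in occurrences]
    estimates = []
    for _ in range(n_samples):
        i = rng.choices(range(len(partitions)), weights=probs, k=1)[0]
        sub_total = sum(v for key, v in partitions[i] if key == "target")
        estimates.append(sub_total / probs[i])
    return sum(estimates) / n_samples

# Hypothetical partitions of (key, value) records; true "target" total is 20.
partitions = [
    [("target", 3), ("other", 9), ("target", 5)],
    [("other", 1)],
    [("target", 2), ("target", 4), ("target", 6), ("other", 7)],
]
occurrences = [sum(1 for k, _ in p if k == "target") for p in partitions]
print(occurrence_weighted_estimate(partitions, occurrences, n_samples=200))
```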
- Title
- Research on Improving Reliability, Energy Efficiency and Scalability in Distributed and Parallel File Systems.
- Creator
-
Zhang, Junyao, Wang, Jun, Zhang, Shaojie, Lee, Jooheung, University of Central Florida
- Abstract / Description
-
With the increasing popularity of cloud computing and "Big data" applications, current data centers are often required to manage petabytes or exabytes of data. To store this huge amount of data, thousands or tens of thousands of storage nodes are required at a single site. This imposes three major challenges for storage system designers: (1) Reliability---node failure in these datacenters is a normal occurrence rather than a rare situation. This makes data reliability a great concern. (2) Energy efficiency---a data center can consume up to 100 times more energy than a standard office building. More than 10% of this energy consumption can be attributed to storage systems. Thus, reducing the energy consumption of the storage system is key to reducing the overall consumption of the data center. (3) Scalability---with the continuously increasing size of data, maintaining the scalability of the storage systems is essential. That is, the expansion of the storage system should be completed efficiently and without limitations on the total number of storage nodes or performance. This thesis proposes three ways to improve the above three key features for current large-scale storage systems. Firstly, we define the problem of "reverse lookup", namely finding the list of objects (blocks) for a failed node. As the first step of failure recovery, this process is directly related to the recovery/reconstruction time. While existing solutions use metadata traversal or data distribution reversing methods for reverse lookup, which are either time consuming or expensive, a deterministic block placement can achieve fast and efficient reverse lookup. However, deterministic placement solutions are designed for centralized, small-scale storage architectures such as RAID. Due to their lack of scalability, they cannot be directly applied in large-scale storage systems. In this thesis, we propose Group-Shifted Declustering (G-SD), a deterministic data layout for multi-way replication. G-SD addresses the scalability issue of our previous Shifted Declustering layout and supports fast and efficient reverse lookup. Secondly, we define a problem: "how to balance performance, energy, and recovery in degradation mode for an energy-efficient storage system?" While extensive research has been proposed to trade off performance for energy efficiency under normal mode, the system enters degradation mode when node failure occurs, in which node reconstruction is initiated. This very process requires a number of disks to be spun up and requires a substantial amount of I/O bandwidth, which will not only compromise energy efficiency but also performance. Without considering the I/O bandwidth contention between recovery and performance, we find that the current energy-proportional solutions cannot answer this question accurately. This thesis presents PERP, a mathematical model to minimize the energy consumption of a storage system with respect to performance and recovery. PERP answers this problem by providing the accurate number of nodes and the assigned recovery bandwidth at each time frame. Thirdly, current distributed file systems such as Google File System (GFS) and Hadoop Distributed File System (HDFS) employ a pseudo-random method for replica distribution and a centralized lookup table (block map) to record all replica locations. This lookup table requires a large amount of memory and consumes a considerable amount of CPU/network resources on the metadata server. With the booming size of "Big Data", the metadata server becomes a scalability and performance bottleneck. While current approaches such as HDFS Federation attempt to "horizontally" extend scalability by allowing multiple metadata servers, we believe a more promising optimization option is to "vertically" scale up each metadata server. We propose Deister, a novel block management scheme that builds on top of a deterministic declustering distribution method, Intersected Shifted Declustering (ISD). Thus both replica distribution and location lookup can be achieved without a centralized lookup table.
- Date Issued
- 2015
- Identifier
- CFE0006238, ucf:51082
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006238
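The reverse-lookup idea in the first contribution relies on a deterministic data layout, so a failed node's blocks can be recomputed instead of read from a metadata table. The sketch below uses a deliberately simple modular placement to illustrate that property; it is not the Shifted Declustering or G-SD layout, and the shift, replica count, and node count are arbitrary.

```python
def replica_nodes(block_id, num_nodes, replicas=3, shift=1):
    # Toy deterministic layout: replica r of a block sits on a node computed
    # purely from the block id, so no central block-location table is needed.
    base = block_id % num_nodes
    return [(base + r * shift) % num_nodes for r in range(replicas)]

def reverse_lookup(failed_node, num_blocks, num_nodes, replicas=3, shift=1):
    # A block has a replica on the failed node iff its base position falls on
    # one of these residues; membership is O(1) per block with no metadata scan.
    residues = {(failed_node - r * shift) % num_nodes for r in range(replicas)}
    return [b for b in range(num_blocks) if b % num_nodes in residues]

print(replica_nodes(block_id=9, num_nodes=7))                     # [2, 3, 4]
print(reverse_lookup(failed_node=2, num_blocks=20, num_nodes=7))  # blocks hitting node 2
```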
- Title
- Identification and Functional Characterization of a Long Non-coding RNA associated with Prostate Cancer.
- Creator
-
Hasan, Md Faqrul, Chakrabarti, Ratna, Zhao, Jihe, Zhang, Shaojie, University of Central Florida
- Abstract / Description
-
Prostate cancer is the most common cancer in men in the western world. Although early-stage prostate cancer is treatable, late-stage, more specifically metastatic and drug-resistant, prostate cancers are mostly incurable. The failure of current treatments obligates the research community to explore novel areas in prostate cancer biology and find better therapeutic targets. Emerging evidence shows that non-coding RNAs, specifically long non-coding RNAs (lncRNAs), play regulatory roles in various cellular processes and are frequently dysregulated in cancer, including prostate cancer. These aberrantly expressed lncRNAs, mostly with unexplored genetic information, may drive cancer progression. Previous studies done in our laboratory showed a tumor suppressor role of a cluster of small non-coding RNAs or microRNAs (miRNAs), miR-17-92a, in PC-3 prostate cancer cells. To learn the underlying mechanism, transcriptome analysis with or without expression of miR-17-92a was conducted in our laboratory. RNA-sequencing data analysis identified reduced expression of a set of lncRNAs and oncogenes, and upregulation of several tumor suppressor genes, upon expression of miR-17-92a cluster miRNAs. One of the downregulated intergenic lncRNAs, PAINT (Prostate Cancer Associated Intergenic Non-coding Transcript) (LINC00888), was selected for determining its functional role in prostate cancer. TCGA and GEO profile analyses revealed upregulation of PAINT in prostate tumors with higher Gleason scores, in highly aggressive metastatic prostate cancer cell lines, and upon androgen deprivation therapy of prostate cancer cells. This observation was supported by our studies on expression analysis of PAINT in prostate tumor tissues using RNA in-situ hybridization in tissue microarrays (TMA) containing tissues from different stages of prostate cancer and normal prostate tissues, which showed higher expression of PAINT in prostate cancer tissues compared to normal tissues. Furthermore, late-stage (stage III and stage IV) prostate tumors showed significant overexpression of PAINT compared to early-stage (stage II) prostate cancer tissues. We next examined the functional relevance of PAINT in promoting tumor progression using different prostate cancer cell lines. Silencing of PAINT using siRNAs showed decreased cell proliferation, reduced S-phase progression and activation of the pro-apoptotic proteins PARP and Caspase-3. Silencing of PAINT also showed decreased cell migration, increased expression of the epithelial marker E-cadherin, and reduced expression of the mesenchymal markers Slug and Vimentin. Ectopic expression of PAINT reversed the effects observed upon silencing of PAINT. Increased cell proliferation, cell cycle progression and cell migration were noted in prostate cancer cells overexpressing PAINT. Additionally, a cancer-promoting phenotype, such as larger colony formation and higher expression of the mesenchymal marker Slug, was detected upon overexpression of PAINT. Our study also determined the therapeutic benefit of inhibiting PAINT expression, showing an increased sensitivity of metastatic prostate cancer cells to the chemotherapeutic agent docetaxel (DTX) and the selective Aurora kinase inhibitor VX-680. Taken together, our study establishes an oncogenic function of PAINT, its clinical relevance as a marker for advanced-stage prostate cancer and its potential as a therapeutic target for metastatic prostate cancer.
- Date Issued
- 2019
- Identifier
- CFE0007466, ucf:52681
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007466
- Title
- Analysis of large-scale population genetic data using efficient algorithms and data structures.
- Creator
-
Naseri, Ardalan, Zhang, Shaojie, Hughes, Charles, Yooseph, Shibu, Zhi, Degui, University of Central Florida
- Abstract / Description
-
With the availability of genotyping data of very large samples, there is an increasing need for tools that can efficiently identify genetic relationships among all individuals in a sample. Modern biobanks cover genotypes of up to 0.1%-1% of an entire large population. At this scale, genetic relatedness among samples is ubiquitous. However, current methods are not efficient for uncovering genetic relatedness at such a scale. We developed a new method, Random Projection for IBD Detection (RaPID), for detecting Identical-by-Descent (IBD) segments, a fundamental concept in genetics, in large panels. RaPID detects all IBD segments over a certain length in time linear in the sample size. We take advantage of an efficient population genotype index, the Positional BWT (PBWT), by Richard Durbin. PBWT achieves linear-time query of perfectly identical subsequences among all samples. However, the original PBWT is not tolerant to genotyping errors, which often interrupt long IBD segments into short fragments. The key idea of RaPID is that the problem of approximate high-resolution matching over a long range can be mapped to the problem of exact matching of low-resolution subsampled sequences with high probability. PBWT provides an appropriate data structure for bi-allelic data. With increasing sample sizes, more multi-allelic sites are expected to be observed. Hence, there is a necessity to handle multi-allelic genotype data. We also introduce a multi-allelic version of the original Positional Burrows-Wheeler Transform (mPBWT). The increasingly large cohorts of whole-genome genotype data present an opportunity for searching a large cohort for people genetically related to a given individual. At the same time, doing so efficiently presents a challenge. The PBWT algorithm offers constant-time matching between one haplotype and an arbitrarily large panel at each position, but only for the maximal matches. We used the PBWT data structure to develop a method to search for all matches of a given query in a panel. The matches larger than a given length correspond to all shared IBD segments of certain lengths between the query and other individuals in the panel. The time complexity of the proposed method is independent of the number of individuals in the panel. In order to achieve a time complexity independent of the number of haplotypes, additional data structures are introduced. Some regions of the genome may be shared by multiple individuals rather than only a pair. Clusters of identical haplotypes could reveal information about the history of intermarriage and isolation of a population, and could also be medically important. We propose an efficient method, called cPBWT, to find clusters of identical segments among individuals in a large panel using the PBWT data structure. The time complexity of finding all clusters of identical matches is linear in the sample size. The human genome harbors runs of homozygous sites (ROHs), where identical haplotypes are inherited from each parent. We applied cPBWT to UK Biobank data and searched for clusters of ROH regions that are shared among multiple individuals. We discovered strong associations between ROH regions and some non-cancerous diseases, specifically autoimmune disorders.
- Date Issued
- 2018
- Identifier
- CFE0007764, ucf:52393
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007764
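RaPID and cPBWT build on Durbin's Positional Burrows-Wheeler Transform; the sketch below shows only the core positional prefix-array construction (Durbin 2014, Algorithm 1, without the divergence arrays that the matching queries also need). The toy haplotype panel is made up for illustration.

```python
def pbwt_prefix_arrays(haplotypes):
    # After processing site k, haplotype indices are ordered by their reversed
    # prefixes ending at k, so identical prefixes become adjacent. Each site is
    # a stable counting sort on the allele at that site.
    M = len(haplotypes)
    order = list(range(M))
    arrays = []
    for k in range(len(haplotypes[0])):
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        arrays.append(order)
    return arrays

haps = [
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
]
print(pbwt_prefix_arrays(haps)[-1])   # identical haplotypes 0 and 3 end up adjacent
```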
- Title
- Efficient String Graph Construction Algorithm.
- Creator
-
Morshed, S.M. Iqbal, Yooseph, Shibu, Zhang, Shaojie, Valliyil Thankachan, Sharma, University of Central Florida
- Abstract / Description
-
In the field of genome assembly research, where assemblers are dominated by de Bruijn graph-based approaches, the string graph-based assembly approach is getting more attention because of its ability to losslessly retain information from sequence data. Despite the advantages provided by a string graph in repeat detection and in maintaining read coherence, the high computational cost of constructing a string graph hinders its usability for genome assembly. Even though different algorithms have been proposed over the last decade for string graph construction, efficiency is still a challenge due to the demand for processing a large amount of sequence data generated by NGS technologies. Therefore, in this thesis, we provide a novel, linear-time and alphabet-size-independent algorithm, SOF, which uses the properties of irreducible edges and transitive edges to efficiently construct a string graph from an overlap graph. Experimental results show that SOF is at least 2 times faster than the string graph construction algorithm provided in SGA, one of the most popular string graph-based assemblers, while maintaining almost the same memory footprint as SGA. Moreover, the availability of SOF as a subprogram in the SGA assembly pipeline gives users access to the preprocessing and postprocessing steps for genome assembly provided in SGA.
- Date Issued
- 2019
- Identifier
- CFE0007504, ucf:52635
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007504
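SOF's linear-time, alphabet-size-independent construction is not described in the abstract, so the sketch below only illustrates what a transitive edge is in an overlap graph, using a deliberately naive quadratic pass. The read names and graph are invented, and real string-graph builders (SGA, Myers-style algorithms, SOF itself) exploit overlap lengths and irreducible-edge properties to avoid this brute-force scan.

```python
def remove_transitive_edges(overlap_graph):
    # An edge u -> w is transitive (and removable from the string graph) if some
    # v gives a two-step path u -> v -> w in the original overlap graph.
    reduced = {u: set(vs) for u, vs in overlap_graph.items()}
    for u, vs in overlap_graph.items():
        for v in vs:
            for w in overlap_graph.get(v, ()):
                if w in reduced[u] and w != u:
                    reduced[u].discard(w)
    return reduced

g = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}
print(remove_transitive_edges(g))   # drops the transitive edge r1 -> r3
```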
- Title
- Computational Methods for Comparative Non-coding RNA Analysis: From Structural Motif Identification to Genome-wide Functional Classification.
- Creator
-
Zhong, Cuncong, Zhang, Shaojie, Hu, Haiyan, Hua, Kien, Li, Xiaoman, University of Central Florida
- Abstract / Description
-
Non-coding RNA (ncRNA) plays critical functional roles, such as regulation, catalysis, and modification, in the biological system. Non-coding RNAs exert their functions based on their specific structures, which makes the thorough understanding of their structures a key step towards their complete functional annotation. In this dissertation, we will cover a suite of computational methods for the comparison of ncRNA secondary and 3D structures, and their applications to ncRNA molecular structural annotation and their genome-wide functional survey. Specifically, we have contributed the following five computational methods. First, we have developed an alignment algorithm to compare RNA structural motifs, which are recurrent RNA 3D structural fragments. Second, we have improved upon the previous alignment algorithm by incorporating base-stacking information and devising a new branch-and-bound algorithm. Third, we have developed a clustering pipeline for RNA structural motif classification using the above alignment methods. Fourth, we have generalized the clustering pipeline to a genome-wide analysis of RNA secondary structures. Finally, we have devised an ultra-fast alignment algorithm for RNA secondary structures by using the sparse dynamic programming technique. A large number of novel RNA structural motif instances and ncRNA elements have been discovered throughout these studies. We anticipate that these computational methods will significantly facilitate the analysis of ncRNA structures in the future.
- Date Issued
- 2013
- Identifier
- CFE0004966, ucf:49580
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004966
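The third contribution above is a clustering pipeline over pairwise structural-motif alignments. As a generic illustration of that step only, the sketch below applies SciPy's agglomerative clustering to a made-up pairwise distance matrix; the distances, linkage choice, and cut threshold are assumptions and do not reproduce the dissertation's alignment-based pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical pairwise alignment distances between five motif instances
# (in the dissertation these would come from the structural alignment scores).
dist = np.array([
    [0.0, 0.2, 0.9, 0.8, 0.3],
    [0.2, 0.0, 0.8, 0.9, 0.25],
    [0.9, 0.8, 0.0, 0.1, 0.85],
    [0.8, 0.9, 0.1, 0.0, 0.8],
    [0.3, 0.25, 0.85, 0.8, 0.0],
])

Z = linkage(squareform(dist), method="average")      # agglomerative clustering
labels = fcluster(Z, t=0.5, criterion="distance")    # cut the tree at distance 0.5
print(labels)   # motifs {0, 1, 4} and {2, 3} fall into two classes
```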
- Title
- Finding Consensus Energy Folding Landscapes Between RNA Sequences.
- Creator
-
Burbridge, Joshua, Zhang, Shaojie, Hu, Haiyan, Jha, Sumit, University of Central Florida
- Abstract / Description
-
In molecular biology, the secondary structure of a ribonucleic acid (RNA) molecule is closely related to its biological function. One problem in structural bioinformatics is to determine the two- and three-dimensional structure of RNA using only sequencing information, which can be obtained at low cost. This entails designing sophisticated algorithms to simulate the process of RNA folding using detailed sets of thermodynamic parameters. The set of all chemically feasible structures an RNA molecule can assume, as well as the energy associated with each structure, is called its energy folding landscape. This research focuses on defining and solving the problem of finding the consensus landscape between multiple RNA molecules. Specifically, we discuss how this problem is equivalent to the problem of Balanced Global Network Alignment, and what effect a solution to this problem would have on our understanding of RNA. Because this problem is known to be NP-hard, we instead define an approximate consensus on a landscape of reduced size, which dramatically reduces the search space associated with the problem. We use the program RNASLOpt to enumerate all stable local optimal secondary structures in multiple landscapes within a certain energy and stability range of the minimum free energy (MFE) structure. We then encode these using an extended structural alphabet and perform sequence alignment using a structural substitution matrix to find and rank the best matches between the sets based on stability, energy, and structural distance. We apply this method to twenty landscapes from four sets of riboswitches from Bacillus subtilis in order to predict their native "on" and "off" structures. We find that this method significantly reduces the size of the list of candidate structures, as well as increasing the ranking of previously obscure secondary structures, resulting in more accurate predictions overall. Advances in the field of structural bioinformatics can help elucidate the underlying mechanisms of many genetic diseases.
- Date Issued
- 2015
- Identifier
- CFE0006210, ucf:51109
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006210
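The thesis ranks candidate structures partly by structural distance after encoding them in an extended structural alphabet. The sketch below shows only the standard base-pair distance between two dot-bracket secondary structures as one simple notion of structural distance; the example structures are invented, and the thesis's encoding and alignment scheme is not reproduced here.

```python
def base_pairs(dot_bracket):
    # Extract the set of base pairs (i, j) from a dot-bracket secondary structure.
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.add((stack.pop(), i))
    return pairs

def base_pair_distance(s1, s2):
    # Number of base pairs present in exactly one of the two structures.
    return len(base_pairs(s1) ^ base_pairs(s2))

print(base_pair_distance("((..((...))..))", "((..((...)).))."))   # prints 4
```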
- Title
- Security of Autonomous Systems under Physical Attacks: With application to Self-Driving Cars.
- Creator
-
Dutta, Raj, Jin, Yier, Sundaram, Kalpathy, DeMara, Ronald, Zhang, Shaojie, Zhang, Teng, University of Central Florida
- Abstract / Description
-
The drive to achieve trustworthy autonomous cyber-physical systems (CPS), which can attain goals independently in the presence of significant uncertainties and for long periods of time without any human intervention, has always been enticing. Significant progress has been made in the avenues of both software and hardware for fulfilling these objectives. However, technological challenges still exist, particularly in terms of decision making under uncertainty. In an autonomous system, uncertainties can arise from the operating environment, adversarial attacks, and from within the system. As a result of these concerns, human beings lack trust in these systems and hesitate to use them for day-to-day use. In this dissertation, we develop algorithms to enhance trust by mitigating physical attacks targeting the integrity and security of the sensing units of autonomous CPS. The sensors of these systems are responsible for gathering data of the physical processes. Lack of measures for securing their information can enable malicious attackers to cause life-threatening situations. This serves as a motivation for developing attack-resilient solutions. Among various security solutions, attention has recently been paid toward developing system-level countermeasures for CPS whose sensor measurements are corrupted by an attacker. Our methods are along this direction, as we develop one active and multiple passive algorithms to detect attacks and minimize their effect on the internal state estimates of the system. In the active approach, we leverage a challenge authentication technique for the detection of two types of attacks: Denial of Service (DoS) and delay injection on active sensors of the systems. Furthermore, we develop a recursive least squares estimator for recovery of the system from attacks. The majority of the dissertation focuses on designing passive approaches for sensor attacks. In the first method, we focus on a linear stochastic system with multiple sensors, where measurements are fused in a central unit to estimate the state of the CPS. By leveraging the Bayesian interpretation of the Kalman filter and combining it with a chi-squared detector, we recursively estimate states within an error bound and detect the DoS and False Data Injection attacks. We also analyze the asymptotic performance of the estimator and provide conditions for resilience of the state estimate. Next, we propose a novel distributed estimator based on l1-norm optimization, which can recursively estimate states within an error bound without restricting the number of agents of the distributed system that can be compromised. We also extend this estimator to a vehicle platoon scenario which is subjected to sparse attacks. Furthermore, we analyze the resiliency and asymptotic properties of both estimators. Finally, at the end of the dissertation, we make an initial effort to formally verify the control system of autonomous CPS using the statistical model checking method. This is done to ensure that a real-time and resource-constrained system such as a self-driving car, with controllers and security solutions, adheres to strict timing constraints.
- Date Issued
- 2018
- Identifier
- CFE0007174, ucf:52253
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007174
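The first passive approach above pairs a Kalman filter with a chi-squared detector on the innovations. The sketch below shows that generic test for a single measurement; the two-state system, noise covariances, and significance level are illustrative assumptions rather than the dissertation's models.

```python
import numpy as np
from scipy.stats import chi2

def chi_square_detector(z, x_pred, P_pred, H, R, alpha=0.01):
    # Flag the measurement z as suspicious (e.g., false data injection) when the
    # normalized Kalman-filter innovation exceeds the chi-squared threshold.
    innovation = z - H @ x_pred
    S = H @ P_pred @ H.T + R                              # innovation covariance
    stat = innovation.T @ np.linalg.inv(S) @ innovation
    return stat > chi2.ppf(1 - alpha, df=len(z)), float(stat)

# Hypothetical 2-state system (position, velocity) with a position sensor.
x_pred = np.array([1.0, 0.5])
P_pred = np.eye(2) * 0.1
H = np.array([[1.0, 0.0]])
R = np.array([[0.05]])
print(chi_square_detector(np.array([1.1]), x_pred, P_pred, H, R))   # nominal reading
print(chi_square_detector(np.array([4.0]), x_pred, P_pred, H, R))   # injected reading
```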
- Title
- Developing new power management and High-Reliability Schemes in Data-Intensive Environment.
- Creator
-
Wang, Ruijun, Wang, Jun, Jin, Yier, DeMara, Ronald, Zhang, Shaojie, Ni, Liqiang, University of Central Florida
- Abstract / Description
-
With the increasing popularity of data-intensive applications as well as large-scale computing and storage systems, current data centers and supercomputers are often dealing with extremely large data-sets. To store and process this huge amount of data reliably and energy-efficiently, three major challenges should be taken into consideration by system designers. Firstly, power conservation---multicore processors or CMPs have become mainstream in the current processor market because of the tremendous improvement in transistor density and the advancement in semiconductor technology. However, the increasing number of transistors on a single die or chip reveals a super-linear growth in power consumption [4]. Thus, how to balance system performance and power saving is a critical issue which needs to be solved effectively. Secondly, system reliability---reliability is a critical metric in the design and development of replication-based big data storage systems such as the Hadoop File System (HDFS). In systems with thousands of machines and storage devices, even infrequent failures become likely. In the Google File System, the annual disk failure rate is 2.88%, which means 8,760 disk failures were expected in a year. Unfortunately, given an increasing number of node failures, how often a cluster starts losing data when being scaled out is not well investigated. Thirdly, energy efficiency---the fast processing speeds of the current generation of supercomputers provide a great convenience to scientists dealing with extremely large data sets. The next generation of "exascale" supercomputers could provide accurate simulation results for the automobile industry, aerospace industry, and even nuclear fusion reactors for the very first time. However, the energy cost of supercomputing is extremely high, with a total electricity bill of 9 million dollars per year. Thus, conserving energy and increasing the energy efficiency of supercomputers has become critical in recent years. This dissertation proposes new solutions to address the above three key challenges for current large-scale storage and computing systems. Firstly, we propose a novel power management scheme called MAR (model-free, adaptive, rule-based) in multiprocessor systems to minimize the CPU power consumption subject to performance constraints. By introducing a new I/O wait status, MAR is able to accurately describe the relationship between core frequencies, performance and power consumption. Moreover, we adopt a model-free control method to filter out the I/O wait status from the traditional CPU busy/idle model in order to achieve fast responsiveness to burst situations and take full advantage of power saving. Our extensive experiments on a physical testbed demonstrate that, for SPEC benchmarks and data-intensive (TPC-C) benchmarks, an MAR prototype system achieves 95.8-97.8% accuracy of the ideal power-saving strategy calculated offline. Compared with baseline solutions, MAR is able to save 12.3-16.1% more power while maintaining a comparable performance loss of about 0.78-1.08%. In addition, more simulation results indicate that our design achieved 3.35-14.2% more power-saving efficiency and 4.2-10.7% less performance loss under various CMP configurations as compared with various baseline approaches such as LAST, Relax, PID and MPC. Secondly, we create a new reliability model by incorporating the probability of replica loss to investigate the system reliability of multi-way declustering data layouts and analyze their potential parallel recovery possibilities. Our comprehensive simulation results on Matlab and SHARPE show that the shifted declustering data layout outperforms the random declustering layout in a multi-way replication scale-out architecture, in terms of data loss probability and system reliability, by up to 63% and 85% respectively. Our study of both 5-year and 10-year system reliability with various recovery bandwidth settings shows that the shifted declustering layout surpasses the two baseline approaches in both cases, consuming up to 79% and 87% less recovery bandwidth than the copyset layout, as well as 4.8% and 10.2% less recovery bandwidth than the random layout. Thirdly, we develop a power-aware job scheduler by applying a rule-based control method and taking into account real-world power and speedup profiles to improve power efficiency while adhering to predetermined power constraints. The intensive simulation results show that our proposed method is able to achieve the maximum utilization of computing resources as compared to baseline scheduling algorithms while keeping the energy cost under the threshold. Moreover, by introducing a Power Performance Factor (PPF) based on the real-world power and speedup profiles, we are able to increase the power efficiency by up to 75%.
- Date Issued
- 2016
- Identifier
- CFE0006704, ucf:51907
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006704
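The second contribution above builds an analytic reliability model around the probability of replica loss. As a loose illustration of the quantity involved, the sketch below estimates by Monte Carlo how often a purely random three-way replica layout loses at least one block when several nodes fail at once; the cluster size, block count, and failure count are invented, and the dissertation's shifted declustering and copyset comparisons are not reproduced here.

```python
import random

def data_loss_probability(num_nodes, num_blocks, replicas, failed,
                          trials=500, rng=random.Random(0)):
    # Monte Carlo estimate: with each block's replicas placed on uniformly
    # random distinct nodes, how often does a simultaneous failure of `failed`
    # nodes wipe out every replica of at least one block?
    losses = 0
    for _ in range(trials):
        down = set(rng.sample(range(num_nodes), failed))
        lost = any(set(rng.sample(range(num_nodes), replicas)) <= down
                   for _ in range(num_blocks))
        losses += lost
    return losses / trials

print(data_loss_probability(num_nodes=100, num_blocks=5000, replicas=3, failed=3))
```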
- Title
- Learning Robust Sequence Features via Dynamic Temporal Pattern Discovery.
- Creator
-
Hu, Hao, Wang, Liqiang, Zhang, Shaojie, Liu, Fei, Qi, GuoJun, Zhou, Qun, University of Central Florida
- Abstract / Description
-
As a major type of data, time series possess invaluable latent knowledge for describing the real world and human society. In order to improve the ability of intelligent systems to understand the world and people, it is critical to design sophisticated machine learning algorithms for extracting robust time series features from such latent knowledge. Motivated by the successful applications of deep learning in computer vision, more and more machine learning researchers have put their attention on the topic of applying deep learning techniques to time series data. However, directly employing current deep models in most time series domains could be problematic. A major reason is that the temporal pattern types that current deep models are aiming at are very limited, which cannot meet the requirement of modeling different underlying patterns of data coming from various sources. In this study we address this problem by designing different network structures explicitly based on specific domain knowledge such that we can extract features via the most salient temporal patterns. More specifically, we mainly focus on two types of temporal patterns: order patterns and frequency patterns. For order patterns, which are usually related to brain and human activities, we design a hashing-based neural network layer to globally encode the ordinal pattern information into the resultant features. It is further generalized into a specially designed Recurrent Neural Network (RNN) cell which can learn order patterns in an online fashion. On the other hand, we believe audio-related data such as music and speech can benefit from modeling frequency patterns. We do so by developing two types of RNN cells. The first type tries to directly learn long-term dependencies in the frequency domain rather than the time domain. The second one aims to dynamically filter out the "noise" frequencies based on temporal contexts. By proposing various deep models based on different domain knowledge and evaluating them on extensive time series tasks, we hope this work can provide inspiration for others and increase the community's interest in the problem of applying deep learning techniques to more time series tasks.
- Date Issued
- 2019
- Identifier
- CFE0007470, ucf:52679
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007470
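For the order patterns discussed above, the raw ingredient is the ordinal (permutation) pattern of each sliding window of a time series. The sketch below extracts those patterns and counts them; the hashing-based neural layer and RNN cells described in the dissertation are not reproduced, and the window length and example series are arbitrary.

```python
import numpy as np

def ordinal_patterns(series, order=3):
    # Map each sliding window of length `order` to its ordinal (permutation)
    # pattern, i.e., the ranking of the values inside the window.
    x = np.asarray(series, dtype=float)
    return [tuple(np.argsort(x[i:i + order])) for i in range(len(x) - order + 1)]

def pattern_histogram(series, order=3):
    # Count ordinal patterns: a fixed-size summary that a downstream model
    # could hash or embed.
    counts = {}
    for p in ordinal_patterns(series, order):
        counts[p] = counts.get(p, 0) + 1
    return counts

print(pattern_histogram([4.0, 7.0, 9.0, 10.0, 6.0, 11.0, 3.0]))
```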
- Title
- Managing IO Resource for Co-running Data Intensive Applications in Virtual Clusters.
- Creator
-
Huang, Dan, Wang, Jun, Zhou, Qun, Sun, Wei, Zhang, Shaojie, Wang, Liqiang, University of Central Florida
- Abstract / Description
-
Today, Big Data computing platforms employ resource management systems such as Yarn, Torque, Mesos, and Google Borg to enable sharing of physical computing resources among many users or applications. Given virtualization and resource management systems, users are able to launch their applications on the same node with low mutual interference and management overhead on CPU and memory. However, there are still challenges to be addressed before these systems can be fully adopted to manage the IO resources in Big Data File Systems (BDFS) and shared network facilities. In this study, we systematically study three IO management problems: the proportional sharing of block IO in container-based virtualization, network IO contention in MPI-based HPC applications, and data migration overhead in HPC workflows. To improve proportional sharing, we develop a prototype system called BDFS-Container, by containerizing BDFS at the Linux block IO level. Central to BDFS-Container, we propose and design a proactive IOPS-throttling-based mechanism named IOPS Regulator, which improves proportional IO sharing under the BDFS IO pattern by 74.4% on average. For network IO resource management, we exploit virtual switches to facilitate network traffic manipulation and reduce mutual interference on the network for in-situ applications. In order to dynamically allocate network bandwidth when it is needed, we adopt SARIMA-based techniques to analyze and predict MPI traffic issued from simulations. Third, to solve the data migration problem in small- to medium-sized HPC clusters, we propose to construct a sided IO path, named SideIO, to explicitly direct analysis data to a BDFS that co-locates computation with data. By experimenting with two real-world scientific workflows, SideIO completely avoids the most expensive data movement overhead and achieves up to 3x speedups compared with current solutions.
- Date Issued
- 2018
- Identifier
- CFE0007195, ucf:52268
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007195
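The IOPS Regulator described above throttles block IO proactively at the Linux block IO layer. As a generic illustration of the throttling primitive only, the sketch below implements a token-bucket admission check per container; the rates, the class name, and the admission policy are assumptions and not BDFS-Container's actual mechanism.

```python
import time

class IopsThrottle:
    # Token-bucket sketch: each container gets a refill rate proportional to
    # its share, and an IO request is admitted only when a token is available.
    def __init__(self, iops_limit, burst=None):
        self.rate = float(iops_limit)
        self.capacity = float(burst if burst is not None else iops_limit)
        self.tokens = self.capacity
        self.last = time.monotonic()

    def admit(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Usage sketch: a 200-IOPS container sharing a disk with a 100-IOPS container.
fast, slow = IopsThrottle(200), IopsThrottle(100)
print(fast.admit(), slow.admit())
```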
- Title
- Exploring Natural User Abstractions For Shared Perceptual Manipulator Task Modeling & Recovery.
- Creator
-
Koh, Senglee, Laviola II, Joseph, Foroosh, Hassan, Zhang, Shaojie, Kim, Si Jung, University of Central Florida
- Abstract / Description
-
State-of-the-art domestic robot assistants are essentially autonomous mobile manipulators capable of exerting human-scale precision grasps. To maximize utility and economy, non-technical end-users would need to be nearly as efficient as trained roboticists in control and collaboration of manipulation task behaviors. However, it remains a significant challenge given that many WIMP-style tools require superficial proficiency in robotics, 3D graphics, and computer science for rapid task modeling...
Show moreState-of-the-art domestic robot assistants are essentially autonomous mobile manipulators capable of exerting human-scale precision grasps. To maximize utility and economy, non-technical end-users would need to be nearly as efficient as trained roboticists in control and collaboration of manipulation task behaviors. However, it remains a significant challenge given that many WIMP-style tools require superficial proficiency in robotics, 3D graphics, and computer science for rapid task modeling and recovery. But research on robot-centric collaboration has garnered momentum in recent years; robots are now planning in partially observable environments that maintain geometries and semantic maps, presenting opportunities for non-experts to cooperatively control task behavior with autonomous-planning agents exploiting the knowledge. However, as autonomous systems are not immune to errors under perceptual difficulty, a human-in-the-loop is needed to bias autonomous-planning towards recovery conditions that resume the task and avoid similar errors.In this work, we explore interactive techniques allowing non-technical users to model task behaviors and perceive cooperatively with a service robot under robot-centric collaboration. We evaluate stylus and touch modalities that users can intuitively and effectively convey natural abstractions of high-level tasks, semantic revisions, and geometries about the world. Experiments are conducted with `pick-and-place' tasks in an ideal `Blocks World' environment using a Kinova JACO six degree-of-freedom manipulator. Possibilities for the architecture and interface are demonstrated with the following features; (1) Semantic `Object' and `Location' grounding that describe function and ambiguous geometries (2) Task specification with an unordered list of goal predicates, and (3) Guiding task recovery with implied scene geometries and trajectory via symmetry cues and configuration space abstraction. Empirical results from four user studies show our interface was much preferred than the control condition, demonstrating high learnability and ease-of-use that enable our non-technical participants to model complex tasks, provide effective recovery assistance, and teleoperative control.
- Date Issued
- 2018
- Identifier
- CFE0007209, ucf:52292
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007209
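The abstract above describes task specification as an unordered list of goal predicates for 'pick-and-place' tasks in a Blocks World setting. The short Python sketch below illustrates that idea only in outline, under simplified assumptions; the On predicate, the state format, and the greedy plan are hypothetical and are not the interface used in the thesis.

# Minimal sketch (hypothetical interface): an unordered set of goal predicates
# for a simplified Blocks World pick-and-place task.
from typing import NamedTuple, Set, List

class On(NamedTuple):
    obj: str
    location: str

def satisfied(goal: Set[On], state: Set[On]) -> Set[On]:
    # Predicates already true in the perceived scene.
    return goal & state

def remaining(goal: Set[On], state: Set[On]) -> Set[On]:
    # Predicates still to be achieved, in no particular order.
    return goal - state

def plan_pick_and_place(goal: Set[On], state: Set[On]) -> List[str]:
    # Greedy, order-agnostic plan: one pick-and-place action per unmet predicate.
    return [f"pick({p.obj}); place({p.obj}, {p.location})"
            for p in sorted(remaining(goal, state))]

if __name__ == "__main__":
    goal = {On("red_block", "bin_A"), On("blue_block", "table"), On("green_block", "bin_B")}
    state = {On("red_block", "bin_A"), On("blue_block", "shelf"), On("green_block", "table")}
    print("already satisfied:", satisfied(goal, state))
    print("plan:", plan_pick_and_place(goal, state))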
- Title
- Computational Methods for Analyzing RNA Folding Landscapes and its Applications.
- Creator
-
Li, Yuan, Zhang, Shaojie, Hua, Kien, Jha, Sumit, Hu, Haiyan, Li, Xiaoman, University of Central Florida
- Abstract / Description
-
Non-protein-coding RNAs play critical regulatory roles in cellular life. Many ncRNAs fold into specific structures in order to perform their biological functions. Some of these RNAs, such as riboswitches, can even fold into alternative structural conformations in order to participate in different biological processes. In addition, these RNAs can transit dynamically between different functional structures along folding pathways on their energy landscapes. These alternative functional structures are usually energetically favored and are stable in their local energy landscapes. Moreover, conformational transitions between any pair of alternate structures usually involve high energy barriers, such that RNAs can become kinetically trapped by these stable and local optimal structures. We have proposed a suite of computational approaches for analyzing and discovering regulatory RNAs through studying the folding pathways, alternative structures, and energy landscapes associated with conformational transitions of regulatory RNAs. First, we developed an approach, RNAEAPath, which can predict low-barrier folding pathways between two conformational structures of a single RNA molecule. Using RNAEAPath, we can analyze folding pathways between two functional RNA structures, and therefore study the mechanism behind RNA functional transitions from a thermodynamic perspective. Second, we introduced an approach, RNASLOpt, for finding all the stable and local optimal structures on the energy landscape of a single RNA molecule. We can use the generated stable and local optimal structures to represent the RNA energy landscape in a compact manner. In addition, we applied RNASLOpt to several known riboswitches and predicted their alternate functional structures accurately. Third, we integrated a comparative approach with RNASLOpt and developed RNAConSLOpt, which can find all the consensus stable and local optimal structures that are conserved among a set of homologous regulatory RNAs. We can use RNAConSLOpt to predict alternate functional structures for regulatory RNA families. Finally, we have proposed a pipeline that makes use of RNAConSLOpt to computationally discover novel riboswitches in bacterial genomes. An application of the proposed pipeline to a set of bacteria in the Bacillus genus results in the re-discovery of many known riboswitches and the detection of several novel putative riboswitch elements.
- Date Issued
- 2012
- Identifier
- CFE0004400, ucf:49365
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004400
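The abstract above centers on representing an RNA energy landscape by its stable and local optimal structures. The toy Python sketch below illustrates only the notion of a locally optimal secondary structure under a deliberately simplified energy model (one unit per base pair, no nearest-neighbor terms); it brute-forces a short hypothetical sequence and is not the RNASLOpt algorithm.

# Toy sketch (not RNASLOpt): enumerate secondary structures of a short sequence
# and keep those that are local optima under a trivial -1-per-pair energy model.
SEQ = "GGGAAAUCCC"                       # hypothetical toy sequence
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3                             # minimum hairpin loop length

def candidate_pairs(seq):
    n = len(seq)
    return [(i, j) for i in range(n) for j in range(i + MIN_LOOP + 1, n)
            if (seq[i], seq[j]) in PAIRS]

def compatible(p, q):
    # Two pairs may not share a base and may not cross (pseudoknot-free).
    i, j = p
    k, l = q
    if len({i, j, k, l}) < 4:
        return False
    return not (i < k < j < l or k < i < l < j)

def all_structures(seq):
    cand = candidate_pairs(seq)
    structures = set()
    def grow(chosen, rest):
        structures.add(frozenset(chosen))
        for idx, p in enumerate(rest):
            if all(compatible(p, q) for q in chosen):
                grow(chosen + [p], rest[idx + 1:])
    grow([], cand)
    return structures

def energy(struct):
    # Toy energy: every base pair contributes -1 (real tools use nearest-neighbor models).
    return -len(struct)

def neighbors(struct, seq):
    cand = candidate_pairs(seq)
    out = [frozenset(struct - {p}) for p in struct]                          # remove a pair
    out += [frozenset(struct | {p}) for p in cand
            if p not in struct and all(compatible(p, q) for q in struct)]    # add a pair
    return out

def local_optima(seq):
    return [s for s in all_structures(seq)
            if all(energy(n) >= energy(s) for n in neighbors(s, seq))]

if __name__ == "__main__":
    for s in sorted(local_optima(SEQ), key=energy):
        print(sorted(s), "energy:", energy(s))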
- Title
- Human Action Localization and Recognition in Unconstrained Videos.
- Creator
-
Boyraz, Hakan, Tappen, Marshall, Foroosh, Hassan, Lin, Mingjie, Zhang, Shaojie, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
As imaging systems become ubiquitous, the ability to recognize human actions is becoming increasingly important. Just as in the object detection and recognition literature, action recognition can be roughly divided into classification tasks, where the goal is to classify a video according to the action depicted in it, and detection tasks, where the goal is to detect and localize a human performing a particular action. A growing literature demonstrates the benefits of localizing discriminative sub-regions of images and videos when performing recognition tasks. In this thesis, we address the action detection and recognition problems. Action detection in video is a particularly difficult problem because actions must not only be recognized correctly, but must also be localized in the 3D spatio-temporal volume. We introduce a technique that transforms the 3D localization problem into a series of 2D detection tasks. This is accomplished by dividing the video into overlapping segments, then representing each segment with a 2D video projection. The advantage of the 2D projection is that it makes it convenient to apply the best techniques from object detection to the action detection problem. We also introduce a novel, straightforward method for searching the 2D projections to localize actions, termed Two-Point Subwindow Search (TPSS). Finally, we show how to connect the local detections in time using a chaining algorithm to identify the entire extent of the action. Our experiments show that video projection outperforms the latest results on action detection in a direct comparison. Second, we present a probabilistic model that learns to identify discriminative regions in videos from weakly-supervised data, where each video clip is assigned only a label describing what action is present in the frame or clip. While our first system requires every action to be manually outlined in every frame of the video, this second system requires only that the video be given a single high-level tag. From this data, the system is able to identify discriminative regions that correspond well to the regions containing the actual actions. Our experiments on both the MSR Action Dataset II and the UCF Sports Dataset show that the localizations produced by this weakly supervised system are comparable in quality to those produced by systems that require each frame to be manually annotated. This system handles both 1) action detection in non-temporally segmented videos and 2) recognition tasks where a single label is assigned to the clip. We also demonstrate the action recognition performance of our method on two complex datasets, HMDB and UCF101. Third, we extend our weakly-supervised framework by replacing the recognition stage with a two-stage neural network and applying dropout to prevent overfitting of the parameters on the training data. The dropout technique was recently introduced to prevent overfitting of the parameters in deep neural networks and has been applied successfully to the object recognition problem. To our knowledge, this is the first system to use dropout for the action recognition problem. We demonstrate that using dropout improves action recognition accuracy on the HMDB and UCF101 datasets.
- Date Issued
- 2013
- Identifier
- CFE0004977, ucf:49562
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004977
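The abstract above describes dividing a video into overlapping segments, representing each segment with a 2D projection, and chaining per-segment detections over time. The Python sketch below is a rough, assumption-laden illustration of that pipeline shape only: it uses a max-over-time projection as a stand-in for the thesis's projection, fakes the per-segment detector output, and links boxes greedily by IoU rather than reproducing the thesis's chaining algorithm.

# Minimal sketch (illustrative stand-in): segment a clip, project each segment
# to 2D, and chain per-segment boxes over time.
import numpy as np

def overlapping_segments(num_frames, seg_len=16, stride=8):
    # Yield (start, end) indices of overlapping temporal segments.
    start = 0
    while start < num_frames:
        yield start, min(start + seg_len, num_frames)
        if start + seg_len >= num_frames:
            break
        start += stride

def project_segment(frames):
    # Collapse a (T, H, W) segment into a single 2D image via max over time
    # (a stand-in; the thesis defines its own projection).
    return frames.max(axis=0)

def chain_detections(per_segment_boxes, iou_thresh=0.3):
    # Greedily link boxes from consecutive segments into temporal chains by IoU.
    def iou(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        ix1, iy1 = max(ax1, bx1), max(ay1, by1)
        ix2, iy2 = min(ax2, bx2), min(ay2, by2)
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (ax2 - ax1) * (ay2 - ay1)
        area_b = (bx2 - bx1) * (by2 - by1)
        return inter / float(area_a + area_b - inter + 1e-9)

    chains = []
    for seg_idx, boxes in enumerate(per_segment_boxes):
        for box in boxes:
            for chain in chains:
                last_seg, last_box = chain[-1]
                if last_seg == seg_idx - 1 and iou(last_box, box) >= iou_thresh:
                    chain.append((seg_idx, box))
                    break
            else:
                chains.append([(seg_idx, box)])
    return chains

if __name__ == "__main__":
    video = np.random.rand(40, 64, 64)           # synthetic (T, H, W) clip
    projections = [project_segment(video[s:e]) for s, e in overlapping_segments(len(video))]
    # A real pipeline would run a 2D detector on each projection; boxes are faked here.
    fake_boxes = [[(10, 10, 30, 30)] for _ in projections]
    print(len(projections), "projections,", len(chain_detections(fake_boxes)), "chain(s)")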