Current Search: Shah, Mubarak
- Title
- PATTERNS OF MOTION: DISCOVERY AND GENERALIZED REPRESENTATION.
- Creator
-
Saleemi, Imran, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
In this dissertation, we address the problem of discovery and representation of motion patterns in a variety of scenarios commonly encountered in vision applications. The overarching goal is to devise a generic representation that captures any kind of object motion observable in video sequences. Such motion is a significant source of information typically employed for diverse applications such as tracking, anomaly detection, and action and event recognition. We present statistical frameworks for representation of motion characteristics of objects, learned from tracks or optical flow, for static as well as moving cameras, and propose algorithms for their application to a variety of problems. The proposed motion pattern models and learning methods are general enough to be employed in a variety of problems, as we demonstrate experimentally. We first propose a novel method to model and learn the scene activity observed by a static camera. The motion patterns of objects in the scene are modeled in the form of a multivariate non-parametric probability density function of spatiotemporal variables (object locations and transition times between them). Kernel Density Estimation (KDE) is used to learn this model in a completely unsupervised fashion. Learning is accomplished by observing the trajectories of objects by a static camera over extended periods of time. The model encodes the probabilistic nature of the behavior of moving objects in the scene and is useful for activity analysis applications, such as persistent tracking and anomalous motion detection. In addition, the model also captures salient scene features, such as the areas of occlusion and most likely paths.
Once the model is learned, we use a unified Markov Chain Monte-Carlo (MCMC) based framework for generating the most likely paths in the scene, improving foreground detection, persistent labelling of objects during tracking and deciding whether a given trajectory represents an anomaly to the observed motion patterns. Experiments with real world videos are reported which validate the proposed approach. The representation and estimation framework proposed above, however, has a few limitations. This algorithm proposes to use a single global statistical distribution to represent all kinds of motion observed in a particular scene. It therefore does not find a separation between multiple semantically distinct motion patterns in the scene. Instead, the learned model is a joint distribution over all possible patterns followed by objects. To overcome this limitation, we then propose a superior method for the discovery and statistical representation of motion patterns in a scene. The advantages of this approach over the first one are two-fold: first, this model is applicable to scenes of dense crowded motion where tracking may not be feasible, and second, it distinguishes between motion patterns that are distinct at a semantic level of abstraction. We propose a mixture model representation of salient patterns of optical flow, and present an algorithm for learning these patterns from dense optical flow in a hierarchical, unsupervised fashion. Using low level cues of noisy optical flow, K-means is employed to initialize a Gaussian mixture model for temporally segmented clips of video. The components of this mixture are then filtered and instances of motion patterns are computed using a simple motion model, by linking components across space and time. Motion patterns are then initialized and membership of instances in different motion patterns is established by using KL divergence between mixture distributions of pattern instances.
Finally, a pixel level representation of motion patterns is proposed by deriving conditional expectation of optical flow. Results of extensive experiments are presented for multiple surveillance sequences containing numerous patterns involving both pedestrian and vehicular traffic. The proposed method exploits optical flow as the low level feature and performs a hierarchical clustering to obtain motion patterns; we observe that optical flow is also an integral part of a variety of other vision applications, for example, as a feature-based representation of human actions. We, therefore, propose a new representation for articulated human actions using the motion patterns. The representation is based on hierarchical clustering of observed optical flow in a four-dimensional spatial and motion flow space. The automatically discovered motion patterns are the primitive actions, representative of flow at salient regions on the human body, much like trajectories of body joints, which are notoriously difficult to obtain automatically. The proposed method works in a completely unsupervised fashion, and in sharp contrast to state of the art representations like bag of video words, provides a truly semantically meaningful representation. Each primitive action depicts the most atomic sub-action, like left arm moving upwards, or right leg moving downward and leftward, and is represented by a mixture of four-dimensional Gaussian distributions. A sequence of primitive actions is discovered in the test video, and labelled by computing the KL divergence between mixtures. The entire video sequence containing the human action is thus reduced to a simple string, which is matched against similar strings of training videos to classify the action. The string matching is performed by global alignment, using the well-known Needleman-Wunsch algorithm.
Experiments reported on multiple human action data sets confirm the validity, simplicity, and semantically meaningful nature of the proposed representation. Results obtained are encouraging and comparable to the state of the art.
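For readers unfamiliar with the global alignment step mentioned in the abstract above, the following is a minimal sketch of Needleman-Wunsch scoring applied to strings of primitive-action labels. The scoring values (match +1, mismatch -1, gap -1), the single-letter labels, and the nearest-neighbour `classify` helper are illustrative assumptions and are not taken from the dissertation.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    The scoring parameters are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the dynamic-programming table row by row.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap inserted in b
            left = score[i][j - 1] + gap    # gap inserted in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]


def classify(test_string, training):
    """Hypothetical helper: return the label of the best-aligned training string.

    `training` is a list of (string, label) pairs.
    """
    best = max(training, key=lambda pair: needleman_wunsch_score(test_string, pair[0]))
    return best[1]
```

With the hypothetical training set `[("UDUD", "wave"), ("LRLR", "walk")]`, calling `classify("UDU", ...)` picks the label whose string aligns with the highest score.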
- Date Issued
- 2011
- Identifier
- CFE0003646, ucf:48836
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003646
- Title
- Shape reconstruction from shading using linear approximation.
- Creator
-
Tsai, Ping Sing, Shah, Mubarak, Arts and Sciences
- Abstract / Description
-
University of Central Florida College of Arts and Sciences Thesis; Shape from shading (SFS) deals with the recovery of 3D shape from a single monocular image. This problem was formally introduced by Horn in the early 1970s. Since then it has received considerable attention, and several efforts have been made to improve the shape recovery. In this thesis, we present a fast SFS algorithm, which is a purely local method and is highly parallelizable. In our approach, we first use the discrete approximations for surface gradients, p and q, using finite differences, then linearize the reflectance function in depth, Z(x, y), instead of p and q. This method is simple and efficient, and yields better results for images with central illumination or low-angle illumination. Furthermore, our method is more general, and can be applied to either Lambertian surfaces or specular surfaces. The algorithm has been tested on several synthetic and real images of both Lambertian and specular surfaces, and good results have been obtained. However, our method assumes that the input image contains only a single object with uniform albedo values, which is commonly assumed in most SFS methods. Our algorithm performs poorly on images with nonuniform albedo values and produces incorrect shape for images containing objects with scale ambiguity, because those images violate the basic assumptions made by our SFS method. Therefore, we extended our method for images with nonuniform albedo values. We first estimate the albedo values for each pixel, and segment the scene into regions with uniform albedo values. Then we adjust the intensity value for each pixel by dividing by the corresponding albedo value before applying our linear shape from shading method. This way our modified method is able to deal with nonuniform albedo values. When multiple objects differing only in scale are present in a scene, there may be points with the same surface orientation but different depth values.
No existing SFS methods can solve this kind of ambiguity directly. We also present a new approach to deal with images containing multiple objects with scale ambiguity. A depth estimate is derived from patches using a minimum downhill approach and re-aligned based on the background information to get the correct depth map. Experimental results are presented for several synthetic and real images. Finally, this thesis also investigates the problem of the discrete approximation under perspective projection. The straightforward finite difference approximation for surface gradients used under orthographic projection is no longer applicable here, because the image position components are in fact functions of the depth. In this thesis, we provide a direct solution for the discrete approximation under perspective projection. The surface gradient is derived mathematically by relating the depth value of the surface point with the depth value of the corresponding image point. We also demonstrate how we can apply the new discrete approximation to a more complicated and realistic reflectance model for the SFS problem.
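As a rough illustration of the discretization step described in the abstract above, the sketch below approximates the surface gradients p and q with backward finite differences of a depth map and evaluates the standard Lambertian reflectance in gradient space. The choice of backward differences, the gradient-space light direction `(ps, qs)`, and the function names are assumptions made for illustration. The sketch covers only the forward model under orthographic projection, not the thesis's linearization or solver.

```python
import math


def surface_gradients(Z, x, y):
    """Backward finite-difference approximation of p = dZ/dx and q = dZ/dy.

    Z is a depth map indexed as Z[y][x]; x and y must be >= 1.
    """
    p = Z[y][x] - Z[y][x - 1]
    q = Z[y][x] - Z[y - 1][x]
    return p, q


def lambertian_reflectance(p, q, ps, qs):
    """Lambertian reflectance for surface gradient (p, q) and light gradient (ps, qs).

    Uniform albedo of 1 is assumed; negative values (self-shadowing) are clamped to 0.
    """
    num = 1.0 + p * ps + q * qs
    den = math.sqrt(1.0 + p * p + q * q) * math.sqrt(1.0 + ps * ps + qs * qs)
    return max(0.0, num / den)


# Example: a plane tilted along x (Z = x), lit from the viewing direction.
plane = [[float(x) for x in range(4)] for _ in range(4)]
p, q = surface_gradients(plane, 2, 2)
brightness = lambertian_reflectance(p, q, ps=0.0, qs=0.0)
```

Because the reflectance is written in terms of differences of Z, it becomes a function of the depth values themselves, which is what makes linearizing in Z (rather than in p and q) possible.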
- Date Issued
- 1995
- Identifier
- CFR0000191, ucf:53139
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFR0000191
- Title
- Human Action Detection, Tracking and Segmentation in Videos.
- Creator
-
Tian, Yicong, Shah, Mubarak, Bagci, Ulas, Liu, Fei, Walker, John, University of Central Florida
- Abstract / Description
-
This dissertation addresses the problem of human action detection, human tracking and segmentation in videos. They are fundamental tasks in computer vision and are extremely challenging to solve in realistic videos. We first propose a novel approach for action detection by exploring the generalization of deformable part models from 2D images to 3D spatiotemporal volumes. By focusing on the most distinctive parts of each action, our models adapt to intra-class variation and show robustness to clutter. This approach deals with detecting actions performed by a single person. When there are multiple humans in the scene, humans need to be segmented and tracked from frame to frame before action recognition can be performed. Next, we propose a novel approach for multiple object tracking (MOT) by formulating detection and data association in one framework. Our method allows us to overcome the confinements of data association based MOT approaches, where the performance is dependent on the object detection results provided at input level. We show that automatically detecting and tracking targets in a single framework can help resolve the ambiguities due to frequent occlusion and heavy articulation of targets. In this tracker, targets are represented by bounding boxes, which is a coarse representation. However, pixel-wise object segmentation provides fine level information, which is desirable for later tasks. Finally, we propose a tracker that simultaneously solves three main problems: detection, data association and segmentation. This is especially important because the outputs of those three problems are highly correlated and the solution of one can greatly help improve the others. The proposed approach achieves more accurate segmentation results and also helps better resolve typical difficulties in multiple target tracking, such as occlusion, ID-switch and track drifting.
- Date Issued
- 2018
- Identifier
- CFE0007378, ucf:52069
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007378
- Title
- End to End Brain Fiber Orientation Estimation Using Deep Learning.
- Creator
-
Puttashamachar, Nandakishore, Bagci, Ulas, Shah, Mubarak, Rahnavard, Nazanin, Sundaram, Kalpathy, University of Central Florida
- Abstract / Description
-
In this work, we explore the various Brain Neuron tracking techniques, one of the most significant applications of Diffusion Tensor Imaging. Tractography is a non-invasive method to analyze underlying tissue micro-structure. Understanding the structure and organization of the tissues facilitates a diagnosis method to identify any aberrations which can occur within tissues due to loss of cell functionalities, provides acute information on the occurrences of brain ischemia or stroke, the mutation of certain neurological diseases such as Alzheimer's, multiple sclerosis and so on. Under all these circumstances, accurate localization of the aberrations in an efficient manner can help save a life. Following up with the limitations introduced by the current Tractography techniques such as computational complexity, reconstruction errors during tensor estimation and standardization, we aim to elucidate these limitations through our research findings. We introduce an End to End Deep Learning framework which can accurately estimate the most probable likelihood orientation at each voxel along a neuronal pathway. We use Probabilistic Tractography as our baseline model to obtain the training data, which also serves as a Tractography Gold Standard for our evaluations. Through experiments we show that our Deep Network can achieve a significant improvement over current Tractography implementations by reducing the run-time complexity to a significant new level. Our architecture also allows for variable sized input DWI signals, eliminating the need to worry about memory issues as seen with the traditional techniques. The advantage of this architecture is that it is perfectly desirable to be processed on a cloud setup and utilize the existing multi GPU frameworks to perform whole brain Tractography in minutes rather than hours. The proposed method is a good alternative to the current state of the art orientation estimation technique, which we demonstrate across multiple benchmarks.
- Date Issued
- 2017
- Identifier
- CFE0007292, ucf:52156
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007292
- Title
- Training Neural Networks Through the Integration of Evolution and Gradient Descent.
- Creator
-
Morse, Gregory, Stanley, Kenneth, Wu, Annie, Shah, Mubarak, Wiegand, Rudolf, University of Central Florida
- Abstract / Description
-
Neural networks have achieved widespread adoption due to both their applicability to a wide range of problems and their success relative to other machine learning algorithms. The training of neural networks is achieved through any of several paradigms, most prominently gradient-based approaches (including deep learning), but also through up-and-coming approaches like neuroevolution. However, while both of these neural network training paradigms have seen major improvements over the past decade, little work has been invested in developing algorithms that incorporate the advances from both deep learning and neuroevolution. This dissertation introduces two new algorithms that are steps towards the integration of gradient descent and neuroevolution for training neural networks. The first is (1) the Limited Evaluation Evolutionary Algorithm (LEEA), which implements a novel form of evolution where individuals are partially evaluated, allowing rapid learning and enabling the evolutionary algorithm to behave more like gradient descent. This conception provides a critical stepping stone to future algorithms that more tightly couple evolutionary and gradient descent components. The second major algorithm (2) is Divergent Discriminative Feature Accumulation (DDFA), which combines a neuroevolution phase, where features are collected in an unsupervised manner, with a gradient descent phase for fine tuning of the neural network weights. The neuroevolution phase of DDFA utilizes an indirect encoding and novelty search, which are sophisticated neuroevolution components rarely incorporated into gradient descent-based systems. Further contributions of this work that build on DDFA include (3) an empirical analysis to identify an effective distance function for novelty search in high dimensions and (4) the extension of DDFA for the purpose of discovering convolutional features.
The results of these DDFA experiments together show that DDFA discovers features that are effective as a starting point for gradient descent, with significant improvement over gradient descent alone. Additionally, the method of collecting features in an unsupervised manner allows DDFA to be applied to domains with abundant unlabeled data and relatively sparse labeled data. This ability is highlighted in the STL-10 domain, where DDFA is shown to make effective use of unlabeled data.
- Date Issued
- 2019
- Identifier
- CFE0007840, ucf:52819
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007840
- Title
- Optimization Algorithms for Deep Learning Based Medical Image Segmentations.
- Creator
-
Mortazi, Aliasghar, Bagci, Ulas, Shah, Mubarak, Mahalanobis, Abhijit, Pensky, Marianna, University of Central Florida
- Abstract / Description
-
Medical image segmentation is one of the fundamental processes to understand and assess the functionality of different organs and tissues as well as quantifying diseases and helping treatment planning. With an ever-increasing number of medical scans, automated, accurate, and efficient medical image segmentation is an unmet need for improving healthcare. Recently, deep learning has emerged as one of the most powerful methods for almost all image analysis tasks such as segmentation, detection, and classification and so on in medical imaging. In this regard, this dissertation introduces new algorithms to perform medical image segmentation for different (a) imaging modalities, (b) number of objects, (c) dimensionality of images, and (d) under varying labeling conditions. First, we study the dimensionality problem by introducing a new 2.5D segmentation engine that can be used in single and multi-object settings. We propose new fusion strategies and loss functions for deep neural networks to generate improved delineations. Later, we expand the proposed idea into 3D and 4D medical images and develop a "budget (computational) friendly" architecture search algorithm to make this process self-contained and fully automated without sacrificing accuracy. Instead of manual architecture design, which is often based on plug-in and out and expert experience, the new algorithm provides an automated search of successful segmentation architecture within a short period of time. Finally, we study further optimization algorithms on the label noise issue and improve the overall segmentation problem by incorporating prior information about label noise and object shape information. We conclude the thesis work by studying different network and hyperparameter optimization settings that are fine-tuned for varying conditions for medical images.
Applications are chosen from cardiac scans (images), and the efficacy of the proposed algorithms is demonstrated on several publicly available data sets and independently validated by blind evaluations.
- Date Issued
- 2019
- Identifier
- CFE0007841, ucf:52825
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007841
- Title
- Learning Algorithms for Fat Quantification and Tumor Characterization.
- Creator
-
Hussein, Sarfaraz, Bagci, Ulas, Shah, Mubarak, Heinrich, Mark, Pensky, Marianna, University of Central Florida
- Abstract / Description
-
Obesity is one of the most prevalent health conditions. About 30% of the world's and over 70% of the United States' adult populations are either overweight or obese, causing an increased risk for cardiovascular diseases, diabetes, and certain types of cancer. Among all cancers, lung cancer is the leading cause of death, whereas pancreatic cancer has the poorest prognosis among all major cancers. Early diagnosis of these cancers can save lives. This dissertation contributes towards the development of computer-aided diagnosis tools in order to aid clinicians in establishing the quantitative relationship between obesity and cancers. With respect to obesity and metabolism, in the first part of the dissertation, we specifically focus on the segmentation and quantification of white and brown adipose tissue. For cancer diagnosis, we perform analysis on two important cases: lung cancer and Intraductal Papillary Mucinous Neoplasm (IPMN), a precursor to pancreatic cancer. This dissertation proposes an automatic body region detection method trained with only a single example. Then a new fat quantification approach is proposed which is based on geometric and appearance characteristics. For the segmentation of brown fat, a PET-guided CT co-segmentation method is presented. With different variants of Convolutional Neural Networks (CNN), supervised learning strategies are proposed for the automatic diagnosis of lung nodules and IPMN. In order to address the unavailability of a large number of labeled examples required for training, unsupervised learning approaches for cancer diagnosis without explicit labeling are proposed. We evaluate our proposed approaches (both supervised and unsupervised) on two different tumor diagnosis challenges: lung and pancreas with 1018 CT and 171 MRI scans respectively.
The proposed segmentation, quantification and diagnosis approaches explore the important adiposity-cancer association and help pave the way towards improved diagnostic decision making in routine clinical practice.
- Date Issued
- 2018
- Identifier
- CFE0007196, ucf:52288
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007196
- Title
- Task Focused Robotic Imitation Learning.
- Creator
-
Abolghasemi, Pooya, Boloni, Ladislau, Sukthankar, Gita, Shah, Mubarak, Willenberg, Bradley, University of Central Florida
- Abstract / Description
-
For many years, successful applications of robotics were the domain of controlled environments, such as industrial assembly lines. Such environments are custom designed for the convenience of the robot and separated from human operators. In recent years, advances in artificial intelligence, in particular, deep learning and computer vision, allowed researchers to successfully demonstrate robots that operate in unstructured environments and directly interact with humans. One of the major applications of such robots is in assistive robotics. For instance, a wheelchair mounted robotic arm can help disabled users in the performance of activities of daily living (ADLs) such as feeding and personal grooming. Early systems relied entirely on the control of the human operator, something that is difficult to accomplish by a user with motor and/or cognitive disabilities. In this dissertation, we describe research results that advance the field of assistive robotics. The overall goal is to improve the ability of the wheelchair / robotic arm assembly to help the user with the performance of the ADLs by requiring only high-level commands from the user. Let us consider an ADL involving the manipulation of an object in the user's home. This task can be naturally decomposed into two components: the movement of the wheelchair in such a way that the manipulator can conveniently grasp the object and the movement of the manipulator itself. In this dissertation we provide an approach for addressing the challenge of finding the position appropriate for the required manipulation. We introduce the ease-of-reach score (ERS), a metric that quantifies the preferences for the positioning of the base while taking into consideration the shape and position of obstacles and clutter in the environment. As the brute force computation of ERS is computationally expensive, we propose a machine learning approach to estimate the ERS based on features and characteristics of the obstacles.
This dissertation addresses the second component as well, the ability of the robotic arm to manipulate objects. Recent work in end-to-end learning of robotic manipulation has demonstrated that a deep learning-based controller of vision-enabled robotic arms can be taught to manipulate objects from a moderate number of demonstrations. However, the current state of the art systems are limited in robustness to physical and visual disturbances and do not generalize well to new objects. We describe new techniques based on task-focused attention that show significant improvement in the robustness of manipulation and performance in clutter.
- Date Issued
- 2019
- Identifier
- CFE0007771, ucf:52392
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007771
- Title
- A Psychophysical Approach to Standardizing Texture Compression for Virtual Environments.
- Creator
-
Flynn, Jeremy, Szalma, James, Fidopiastis, Cali, Jentsch, Florian, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Image compression is a technique to reduce overall data size, but its effects on human perception have not been clearly established. The purpose of this effort was to determine the most effective psychophysical method for subjective image quality assessment, and to apply those findings to an objective algorithm. This algorithm was used to identify the minimum level of texture compression noticeable to the human, in order to determine whether compression-induced texture distortion impacted game-play outcomes. Four experiments tested several hypotheses. The first hypothesis evaluated which of three magnitude estimation (ME) methods (absolute ME, absolute ME plus, or ME with a standard) for image quality assessment was the most reliable. The just noticeable difference (JND) point for texture compression against the Feature Similarity Index for color was determined. The second hypothesis tested whether human participants perceived the same amount of distortion differently when textures were presented in three ways: when textures were displayed as flat images; when textures were wrapped around a model; and when textures were wrapped around models and in a virtual environment. The last set of hypotheses examined whether compression affected both subjective (immersion, technology acceptance, usability) and objective (performance) gameplay outcomes. The results were: the absolute magnitude estimation method was the most reliable; no difference was observed in the JND threshold between flat textures and textures placed on models, but textures embedded within the virtual environment were more noticeable than in the other two presentation formats. There were no differences in subjective gameplay outcomes when textures were compressed to below the JND thresholds; and those who played a game with uncompressed textures performed better on in-game tasks than those with the textures compressed, but only on the first in-game day.
Practitioners and researchers can use these findings to guide their approaches to texture compression and experimental design.
- Date Issued
- 2018
- Identifier
- CFE0007178, ucf:52250
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007178
- Title
- Action Recognition, Temporal Localization and Detection in Trimmed and Untrimmed Video.
- Creator
-
Hou, Rui, Shah, Mubarak, Mahalanobis, Abhijit, Hua, Kien, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
Automatic understanding of videos is one of the most active areas of computer vision research. It has applications in video surveillance, human computer interaction, video sports analysis, virtual and augmented reality, video retrieval etc. In this dissertation, we address four important tasks in video understanding, namely action recognition, temporal action localization, spatial-temporal action detection and video object/action segmentation. This dissertation makes contributions to above...
Automatic understanding of videos is one of the most active areas of computer vision research. It has applications in video surveillance, human-computer interaction, video sports analysis, virtual and augmented reality, video retrieval, etc. In this dissertation, we address four important tasks in video understanding, namely action recognition, temporal action localization, spatial-temporal action detection, and video object/action segmentation. This dissertation contributes to each of these tasks. First, for video action recognition, we propose a category-level feature learning method. Our proposed method automatically identifies such pairs of categories using a criterion of mutual pairwise proximity in the (kernelized) feature space, and a category-level similarity matrix where each entry corresponds to the one-vs-one SVM margin for pairs of categories. Second, for temporal action localization, we propose to exploit the temporal structure of actions by modeling an action as a sequence of sub-actions and present a computationally efficient approach. Third, we propose a 3D Tube Convolutional Neural Network (TCNN) based pipeline for action detection. The proposed architecture is a unified deep network that is able to recognize and localize actions based on 3D convolution features. It generalizes the popular Faster R-CNN framework from images to videos. Last, an end-to-end encoder-decoder based 3D convolutional neural network pipeline is proposed, which is able to segment out the foreground objects from the background. Moreover, the action label can be obtained as well by passing the foreground object into an action classifier. Extensive experiments on several video datasets demonstrate the superior performance of the proposed approach for video understanding compared to the state of the art.
- Date Issued
- 2019
- Identifier
- CFE0007655, ucf:52502
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007655
- Title
- Describing Images by Semantic Modeling using Attributes and Tags.
- Creator
-
Mahmoudkalayeh, Mahdi, Shah, Mubarak, Sukthankar, Gita, Rahnavard, Nazanin, Zhang, Teng, University of Central Florida
- Abstract / Description
-
This dissertation addresses the problem of describing images using visual attributes and textual tags, a fundamental task that narrows down the semantic gap between the visual reasoning of humans and machines. Automatic image annotation assigns relevant textual tags to the images. In this dissertation, we propose a query-specific formulation based on Weighted Multi-view Non-negative Matrix Factorization to perform automatic image annotation. Our proposed technique seamlessly adapts to changes in the training data, naturally solves the problem of feature fusion, and handles the challenge of rare tags. Unlike tags, attributes are category-agnostic, hence their combination models an exponential number of semantic labels. Motivated by the fact that most attributes describe local properties, we propose exploiting localization cues, through semantic parsing of the human face and body, to improve person-related attribute prediction. We also demonstrate that image-level attribute labels can be effectively used as weak supervision for the task of semantic segmentation. Next, we analyze selfie images by utilizing tags and attributes. We collect the first large-scale selfie dataset and annotate it with different attributes covering characteristics such as gender, age, race, facial gestures, and hairstyle. We then study the popularity and sentiments of the selfies given an estimated appearance of various semantic concepts. In brief, we automatically infer what makes a good selfie. Despite its extensive usage, the deep learning literature falls short in understanding the characteristics and behavior of batch normalization. We conclude this dissertation by providing a fresh view, in light of information geometry and Fisher kernels, on why batch normalization works.
We propose Mixture Normalization that disentangles modes of variation in the underlying distribution of the layer outputs and confirm that it effectively accelerates training of different batch-normalized architectures including Inception-V3, Densely Connected Networks, and Deep Convolutional Generative Adversarial Networks while achieving better generalization error.
- Date Issued
- 2019
- Identifier
- CFE0007493, ucf:52640
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007493
- Title
- Detecting, Tracking, and Recognizing Activities in Aerial Video.
- Creator
-
Reilly, Vladimir, Shah, Mubarak, Georgiopoulos, Michael, Stanley, Kenneth, Dogariu, Aristide, University of Central Florida
- Abstract / Description
-
In this dissertation we address the problem of detecting humans and vehicles, tracking their identities in crowded scenes, and finally determining human activities. First, we tackle the problem of detecting moving as well as stationary objects in scenes that contain parallax and shadows. We constrain the search for pedestrians and vehicles by representing them as shadow casting out of plane (SCOOP) objects. Next, we propose a novel method for tracking a large number of densely moving objects in aerial video. We divide the scene into grid cells to define a set of local scene constraints, which we use as part of the matching cost function to solve the tracking problem; this allows us to track fast-moving objects in low frame rate videos. Finally, we propose a method for recognizing human actions from few examples. We use the bag of words action representation, assume that most of the classes have many examples, and construct Support Vector Machine models for each class. We then use Support Vector Machines for classes with many examples to improve the decision function of the Support Vector Machine that was trained using few examples via late fusion of weighted decision values.
- Date Issued
- 2012
- Identifier
- CFE0004627, ucf:49935
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004627
- Title
- ACTION RECOGNITION USING PARTICLE FLOW FIELDS.
- Creator
-
Reddy, Kishore, Shah, Mubarak, Sukthankar, Gita, Wei, Lei, Moore, Brian, University of Central Florida
- Abstract / Description
-
In recent years, research in human action recognition has advanced on multiple fronts to address various types of actions including simple, isolated actions in staged data (e.g., KTH dataset), complex actions (e.g., Hollywood dataset), and naturally occurring actions in surveillance videos (e.g., VIRAT dataset). Several techniques, including those based on gradient, flow, and interest points, have been developed for their recognition. Most perform very well on standard action recognition datasets, but fail to produce similar results on more complex, large-scale datasets. Action recognition on large categories of unconstrained videos taken from the web is a very challenging problem compared to datasets like KTH (six actions), IXMAS (thirteen actions), and Weizmann (ten actions). Challenges such as camera motion, different viewpoints, huge interclass variations, cluttered background, occlusions, bad illumination conditions, and poor quality of web videos cause the majority of the state-of-the-art action recognition approaches to fail. An increasing number of categories and the inclusion of actions with high confusion also increase the difficulty of the problem. The approach taken to solve this action recognition problem depends primarily on the dataset and the possibility of detecting and tracking the object of interest. In this dissertation, a new method for video representation is proposed and three new approaches to perform action recognition in different scenarios using varying prerequisites are presented. The prerequisites have decreasing levels of difficulty to obtain: 1) Scenario requires human detection and tracking to perform action recognition; 2) Scenario requires background and foreground separation to perform action recognition; and 3) No pre-processing is required for action recognition. First, we propose a new video representation using optical flow and particle advection.
The proposed "Particle Flow Field" (PFF) representation has been used to generate motion descriptors and tested in a Bag of Video Words (BoVW) framework on the KTH dataset. We show that particle flow fields have better performance than other low-level video representations, such as 2D-Gradients, 3D-Gradients and optical flow. Second, we analyze the performance of the state-of-the-art technique based on the histogram of oriented 3D-Gradients in spatio-temporal volumes, where human detection and tracking are required. We use the proposed particle flow field and show superior results compared to the histogram of oriented 3D-Gradients in spatio-temporal volumes. The proposed method, when used for human action recognition, just needs human detection and does not necessarily require human tracking and figure-centric bounding boxes. It has been tested on KTH (six actions), Weizmann (ten actions), and IXMAS (thirteen actions, 4 different views) action recognition datasets. Third, we propose using the scene context information obtained from moving and stationary pixels in the key frames, in conjunction with motion descriptors obtained using the Bag of Words framework, to solve the action recognition problem on a large (50 actions) dataset with videos from the web. We perform a combination of early and late fusion on multiple features to handle the huge number of categories. We demonstrate that scene context is a very important feature for performing action recognition on huge datasets. The proposed method needs separation of moving and stationary pixels, and does not require any kind of video stabilization, person detection, or tracking and pruning of features. Our approach obtains good performance on a huge number of action categories. It has been tested on the UCF50 dataset with 50 action categories, which is an extension of the UCF YouTube Action (UCF11) Dataset containing 11 action categories.
We also tested our approach on the KTH and HMDB51 datasets for comparison. Finally, we focus on solving practical problems in representing actions by a bag of spatio-temporal features (i.e., cuboids), which has proven valuable for action recognition in recent literature. We observed that the visual vocabulary based (bag of video words) method suffers from many drawbacks in practice: (i) it requires an intensive training stage to obtain good performance; (ii) it is sensitive to the vocabulary size; (iii) it is unable to cope with incremental recognition problems; (iv) it is unable to recognize simultaneous multiple actions; (v) it is unable to perform recognition frame by frame. In order to overcome these drawbacks, we propose a framework to index large-scale motion features using a Sphere/Rectangle-tree (SR-tree) for incremental action detection and recognition. The recognition comprises the following two steps: 1) recognizing the local features by non-parametric nearest neighbor (NN), and 2) using a simple voting strategy to label the action. It can also provide localization of the action. Since it does not require feature quantization, it can efficiently grow the feature tree by adding features from new training actions or categories. Our method provides an effective way for practical incremental action recognition. Furthermore, it can handle large-scale datasets because the SR-tree is a disk-based data structure. We tested our approach on two publicly available datasets, the KTH dataset and the IXMAS multi-view dataset, and achieved promising results.
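The two-step recognition described at the end of this abstract (nearest-neighbor labeling of local features, then voting) can be illustrated with a minimal sketch. It assumes a brute-force linear scan in place of the SR-tree index, toy 2-D feature vectors, and hypothetical function names (`nearest_label`, `classify`); none of these come from the dissertation.

```python
import math
from collections import Counter

def nearest_label(feature, index):
    """Label of the stored feature closest to `feature`.

    `index` is a list of (vector, label) pairs. A linear scan stands in
    for the SR-tree lookup (assumption made for brevity).
    """
    return min(index, key=lambda item: math.dist(feature, item[0]))[1]

def classify(features, index):
    """Step 1: label each local feature by its nearest neighbor.
    Step 2: return the label with the most votes."""
    votes = Counter(nearest_label(f, index) for f in features)
    return votes.most_common(1)[0][0]

# Toy index of labeled local features (hypothetical values).
index = [((0.0, 0.0), "walk"), ((0.1, 0.2), "walk"), ((5.0, 5.0), "run")]
query = [(0.2, 0.1), (0.0, 0.3), (4.8, 5.1)]
label = classify(query, index)

# Incremental update: a new category is added by appending features,
# with no vocabulary or quantizer to retrain.
index.append(((9.0, 0.0), "jump"))
```

Because no quantization step exists, growing the index is just an append, which is the property the abstract highlights for incremental recognition.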
- Date Issued
- 2012
- Identifier
- CFE0004626, ucf:49923
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004626
- Title
- Visual-Textual Video Synopsis Generation.
- Creator
-
Sharghi Karganroodi, Aidean, Shah, Mubarak, Da Vitoria Lobo, Niels, Rahnavard, Nazanin, Atia, George, University of Central Florida
- Abstract / Description
-
In this dissertation we tackle the problem of automatic video summarization. Automatic summarization techniques enable faster browsing and indexing of large video databases. However, due to the inherent subjectivity of the task, no single video summarizer fits all users unless it adapts to each user's needs. To address this issue, we introduce a fresh view on the task called "query-focused" extractive video summarization. We develop a supervised model that takes as input a video and a user's preference in the form of a query, and creates a summary video by selecting key shots from the original video. We model the problem as subset selection via a determinantal point process (DPP), a stochastic point process that assigns a probability value to each subset of any given set. Next, we develop a second model that exploits the capabilities of memory networks in the framework and concomitantly reduces the level of supervision required to train the model. To automatically evaluate system summaries, we contend that a good metric for video summarization should focus on the semantic information that humans can perceive rather than the visual features or temporal overlaps. To this end, we collect dense per-video-shot concept annotations, compile a new dataset, and suggest an efficient evaluation method defined upon the concept annotations. To enable better summarization of videos, we improve the sequential DPP in two ways. In terms of learning, we propose a large-margin algorithm to address the exposure bias that is common in many sequence-to-sequence learning methods. In terms of modeling, we integrate a new probabilistic distribution into SeqDPP; the resulting model accepts user input about the expected length of the summary. We conclude this dissertation by developing a framework to generate a textual synopsis for a video, thus enabling users to quickly browse a large video database without watching the videos.
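As a rough illustration of how a DPP assigns a probability to each subset, the sketch below uses the standard L-ensemble form, P(Y) = det(L_Y) / det(L + I). The 3x3 kernel, in which shots 0 and 1 are near-duplicates, and the helper names are assumptions for the example, not the model from the dissertation.

```python
from itertools import combinations

def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def subset_probability(kernel, subset):
    """P(Y) = det(L_Y) / det(L + I) for an L-ensemble DPP."""
    n = len(kernel)
    sub = [[kernel[i][j] for j in subset] for i in subset]
    plus_identity = [[kernel[i][j] + (1.0 if i == j else 0.0)
                      for j in range(n)] for i in range(n)]
    return det(sub) / det(plus_identity)

# Hypothetical similarity kernel over three video shots.
L = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.1],
     [0.1, 0.1, 1.0]]

p_redundant = subset_probability(L, (0, 1))  # two near-duplicate shots
p_diverse = subset_probability(L, (0, 2))    # two dissimilar shots
total = sum(subset_probability(L, s)
            for k in range(4) for s in combinations(range(3), k))
```

The determinant shrinks when the selected items are similar, so the diverse pair receives the higher probability, which is why DPPs suit summary selection.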
- Date Issued
- 2019
- Identifier
- CFE0007862, ucf:52756
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007862
- Title
- Relating First-person and Third-person Vision.
- Creator
-
Ardeshir Behrostaghi, Shervin, Borji, Ali, Shah, Mubarak, Hu, Haiyan, Atia, George, University of Central Florida
- Abstract / Description
-
Thanks to the availability and increasing popularity of wearable devices such as GoPro cameras, smart phones and glasses, we have access to a plethora of videos captured from the first-person (egocentric) perspective. Capturing the world from the perspective of one's self, egocentric videos bear characteristics distinct from the more traditional third-person (exocentric) videos. In many computer vision tasks (e.g., identification, action recognition, face recognition, pose estimation, etc.), the human actors are the main focus. Hence, detecting, localizing, and recognizing the human actor is often incorporated as a vital component. In an egocentric video, however, the person behind the camera is often the person of interest. This changes the nature of the task at hand, given that the camera holder is usually not visible in the content of his/her egocentric video. In other words, our knowledge about the visual appearance, pose, etc. of the egocentric camera holder is very limited, suggesting reliance on other cues in first-person videos. First- and third-person videos have been separately studied in the past in the computer vision community. However, the relationship between first- and third-person vision has yet to be fully explored. Relating these two views systematically could potentially benefit many computer vision tasks and applications. This thesis studies this relationship in several aspects. We explore supervised and unsupervised approaches for relating these two views, seeking different objectives such as identification, temporal alignment, and action classification. We believe that this exploration could lead to a better understanding of the relationship between these two drastically different sources of information.
- Date Issued
- 2018
- Identifier
- CFE0007151, ucf:52322
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007151
- Title
- Analysis of Behaviors in Crowd Videos.
- Creator
-
Mehran, Ramin, Shah, Mubarak, Sukthankar, Gita, Behal, Aman, Tappen, Marshall, Moore, Brian, University of Central Florida
- Abstract / Description
-
In this dissertation, we address the problem of discovery and representation of group activity of humans and objects in a variety of scenarios, commonly encountered in vision applications. The overarching goal is to devise a discriminative representation of human motion in social settings, which captures a wide variety of human activities observable in video sequences. Such motion emerges from the collective behavior of individuals and their interactions and is a significant source of information typically employed for applications such as event detection, behavior recognition, and activity recognition. We present new representations of human group motion for static cameras, and propose algorithms for their application to a variety of problems. We first propose a method to model and learn the scene activity of a crowd using the Social Force Model for the first time in the computer vision community. We present a method to densely estimate the interaction forces between people in a crowd, observed by a static camera. Latent Dirichlet Allocation (LDA) is used to learn the model of the normal activities over extended periods of time. Randomly selected spatio-temporal volumes of interaction forces are used to learn the model of normal behavior of the scene. The model encodes the latent topics of social interaction forces in the scene for normal behaviors. We classify a short video sequence of $n$ frames as normal or abnormal by using the learnt model. Once a sequence of frames is classified as abnormal, the regions of anomalies in the abnormal frames are localized using the magnitude of interaction forces. The representation and estimation framework proposed above, however, has a few limitations. This algorithm proposes to use a global estimation of the interaction forces within the crowd. It is therefore incapable of identifying different groups of objects based on motion or behavior in the scene.
Although the algorithm is capable of learning the normal behavior and detecting abnormality, it is incapable of capturing the dynamics of different behaviors. To overcome these limitations, we then propose a method based on the Lagrangian framework for fluid dynamics, by introducing a streakline representation of flow. Streaklines are traced in a fluid flow by injecting color material, such as smoke or dye, which is transported with the flow and used for visualization. In the context of computer vision, streaklines may be used in a similar way to transport information about a scene, and they are obtained by repeatedly initializing a fixed grid of particles at each frame, then moving both current and past particles using optical flow. Streaklines are the locus of points that connect particles which originated from the same initial position. This approach is advantageous over the previous representations in two aspects: first, its rich representation captures the dynamics of the crowd and changes in space and time in the scene where the optical flow representation is not enough, and second, this model is capable of discovering groups of similar behavior within a crowd scene by performing motion segmentation. We propose a method to distinguish different group behaviors such as divergent/convergent motion and lanes using this framework. Finally, we introduce flow potentials as a discriminative feature to recognize crowd behaviors in a scene. Results of extensive experiments are presented for multiple real-life crowd sequences involving pedestrian and vehicular traffic. The proposed method exploits optical flow as the low-level feature and performs integration and clustering to obtain coherent group motion patterns. However, we observe that in crowd video sequences, as well as a variety of other vision applications, the co-occurrence and inter-relation of motion patterns are the main characteristics of group behaviors.
In other words, the group behavior of objects is a mixture of individual actions or behaviors in a specific geometrical layout and temporal order. We therefore propose a new representation for group behaviors of humans using the inter-relation of motion patterns in a scene. The representation is based on a bag of visual phrases of spatio-temporal visual words. We present a method to match the high-order spatial layout of visual words that preserves the geometry of the visual words under similarity transformations. To perform the experiments we collected a dataset of group choreography performances from the YouTube website. The dataset currently contains four categories of group dances.
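The streakline construction described in this abstract (release a particle from each fixed seed at every frame, then move all current and past particles with the flow) can be sketched as follows. The sketch assumes the optical flow of each frame is available as a Python function returning a velocity (u, v) at a point, and uses a one-step Euler update; both are simplifications chosen for illustration, and the names are hypothetical.

```python
def advect(point, flow, dt=1.0):
    """Move one particle by the flow velocity at its position (Euler step)."""
    x, y = point
    u, v = flow(x, y)
    return (x + dt * u, y + dt * v)

def streaklines(seeds, flows):
    """Build one streakline per seed.

    `seeds` is the fixed grid of release positions; `flows` holds one flow
    function per frame. Each returned list contains the particles released
    from the same seed, oldest first.
    """
    lines = [[] for _ in seeds]
    for flow in flows:
        for line, seed in zip(lines, seeds):
            line.append(seed)                          # inject a new particle
            line[:] = [advect(p, flow) for p in line]  # move current and past ones
    return lines

# Uniform rightward flow for three frames (toy stand-in for optical flow).
uniform = lambda x, y: (1.0, 0.0)
result = streaklines([(0.0, 0.0), (0.0, 2.0)], [uniform] * 3)
```

In a steady flow like this toy case the streakline coincides with the particle path; the representation becomes informative when the flow changes from frame to frame, as in crowd video.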
- Date Issued
- 2011
- Identifier
- CFE0004482, ucf:49317
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004482
- Title
- Learning Hierarchical Representations for Video Analysis Using Deep Learning.
- Creator
-
Yang, Yang, Shah, Mubarak, Sukthankar, Gita, Da Vitoria Lobo, Niels, Stanley, Kenneth, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
With the exponential growth of digital data, video content analysis (e.g., action and event recognition) has been drawing increasing attention from computer vision researchers. Effective modeling of the objects, scenes, and motions is critical for visual understanding. Recently there has been growing interest in bio-inspired deep learning models, which have shown impressive results in speech and object recognition. Deep learning models are formed by the composition of multiple non-linear transformations of the data, with the goal of yielding more abstract and ultimately more useful representations. The advantages of the deep models are threefold: 1) They learn the features directly from the raw signal, in contrast to hand-designed features. 2) The learning can be unsupervised, which is suitable for large data where labeling all the data is expensive and impractical. 3) They learn a hierarchy of features one level at a time, and the layerwise stacking of feature extraction often yields better representations. However, not many deep learning models have been proposed to solve the problems in video analysis, especially for videos "in the wild". Most of them either deal with simple datasets or are limited to low-level local spatial-temporal feature descriptors for action recognition. Moreover, as the learning algorithms are unsupervised, the learned features preserve generative properties rather than the discriminative ones, which are more favorable in classification tasks. In this context, the thesis makes two major contributions. First, we propose several formulations and extensions of deep learning methods which learn hierarchical representations for three challenging video analysis tasks, including complex event recognition, object detection in videos, and measuring action similarity. The proposed methods are extensively demonstrated for each work on state-of-the-art challenging datasets.
Besides learning the low-level local features, higher-level representations are further designed to be learned in the context of applications. The data-driven concept representations and sparse representation of the events are learned for complex event recognition; the representations for object body parts and structures are learned for object detection in videos; and the relational motion features and similarity metrics between video pairs are learned simultaneously for action verification. Second, in order to learn discriminative and compact features, we propose a new feature learning method using a deep neural network based on autoencoders. It differs from the existing unsupervised feature learning methods in two ways: first, it optimizes both discriminative and generative properties of the features simultaneously, which gives our features a better discriminative ability. Second, our learned features are more compact, while the unsupervised feature learning methods usually learn a redundant set of over-complete features. Extensive experiments with quantitative and qualitative results on the tasks of human detection and action verification demonstrate the superiority of our proposed models.
- Date Issued
- 2013
- Identifier
- CFE0004964, ucf:49593
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004964
- Title
- Spatiotemporal Graphs for Object Segmentation and Human Pose Estimation in Videos.
- Creator
-
Zhang, Dong, Shah, Mubarak, Qi, GuoJun, Bagci, Ulas, Yun, Hae-Bum, University of Central Florida
- Abstract / Description
-
Images and videos can be naturally represented by graphs, with spatial graphs for images and spatiotemporal graphs for videos. However, for different applications, there are usually different formulations of the graphs, and algorithms for each formulation have different complexities. Therefore, wisely formulating the problem to ensure an accurate and efficient solution is one of the core issues in Computer Vision research. We explore three problems in this domain to demonstrate how to formulate all of these problems in terms of spatiotemporal graphs and obtain good and efficient solutions. The first problem we explore is video object segmentation. The goal is to segment the primary moving objects in the videos. This problem is important for many applications, such as content-based video retrieval, video summarization, activity understanding, and targeted content replacement. In our framework, we use object proposals, which are object-like regions obtained by low-level visual cues. Each object proposal has an objectness score associated with it, which indicates how likely this object proposal corresponds to an object. The problem is formulated as a directed acyclic graph, for which nodes represent the object proposals and edges represent the spatiotemporal relationship between nodes. A dynamic programming solution is employed to select one object proposal from each video frame, while ensuring their consistency throughout the video frames. Gaussian mixture models (GMMs) are used for modeling the background and foreground, and Markov Random Fields (MRFs) are employed to smooth the pixel-level segmentation. In the above spatiotemporal graph formulation, we consider object segmentation in only a single video. Next, we consider multiple videos and model the video co-segmentation problem as a spatiotemporal graph. The goal here is to simultaneously segment the moving objects from multiple videos and assign common objects the same labels.
The problem is formulated as a regulated maximum clique problem using object proposals. The object proposals are tracked in adjacent frames to generate a pool of candidate tracklets. Then an undirected graph is built with the nodes corresponding to the tracklets from all the videos and edges representing the similarities between the tracklets. A modified Bron-Kerbosch Algorithm is applied to the graph in order to select the prominent objects contained in these videos, and hence relate the segmentation of each object in different videos. In online and surveillance videos, the most important object class is the human. In contrast to generic video object segmentation and co-segmentation, specific knowledge about humans, which is defined by a pose (i.e., human skeleton), can be employed to help the segmentation and tracking of people in the videos. We formulate the problem of human pose estimation in videos using the spatiotemporal graph. In this formulation, the nodes represent different body parts in the video frames and edges represent the spatiotemporal relationship between body parts in adjacent frames. The graph is carefully designed to ensure an exact and efficient solution. The overall objective for the new formulation is to remove the simple cycles from the traditional graph-based formulations. Dynamic programming is employed in different stages in the method to select the best tracklets and human pose configurations.
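The first formulation in this abstract, selecting one object proposal per frame by dynamic programming over a layered directed acyclic graph, can be sketched as below. The objective used here (sum of objectness scores plus a pairwise consistency term between adjacent frames) and the function names are assumptions for illustration; the abstract does not spell out the exact scoring.

```python
def select_proposals(scores, similarity):
    """Pick one proposal index per frame, maximizing total score.

    scores[t][i]        objectness of proposal i in frame t
    similarity(t, i, j) consistency between proposal i in frame t
                        and proposal j in frame t + 1
    """
    best = list(scores[0])  # best path score ending at each proposal
    back = []               # back-pointers, one list per transition
    for t in range(1, len(scores)):
        new_best, pointers = [], []
        for j, score in enumerate(scores[t]):
            i = max(range(len(best)),
                    key=lambda k: best[k] + similarity(t - 1, k, j))
            new_best.append(best[i] + similarity(t - 1, i, j) + score)
            pointers.append(i)
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the last frame.
    j = max(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return path[::-1]

# Toy example: three frames, two proposals each (hypothetical scores).
scores = [[0.9, 0.2], [0.1, 0.8], [0.7, 0.3]]
independent = select_proposals(scores, lambda t, i, j: 0.0)
consistent = select_proposals(scores, lambda t, i, j: 1.0 if i == j else 0.0)
```

With no pairwise term the selection simply takes the top proposal in each frame; adding the consistency term trades per-frame score for a temporally coherent track.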
- Date Issued
- 2016
- Identifier
- CFE0006429, ucf:51488
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006429
- Title
- Weighted Low-Rank Approximation of Matrices: Some Analytical and Numerical Aspects.
- Creator
-
Dutta, Aritra, Li, Xin, Sun, Qiyu, Mohapatra, Ram, Nashed, M, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
This dissertation addresses some analytical and numerical aspects of a problem of weighted low-rank approximation of matrices. We propose and solve two different versions of weighted low-rank approximation problems. We demonstrate, in addition, how these formulations can be efficiently used to solve some classic problems in computer vision. We also present the superior performance of our algorithms over the existing state-of-the-art unweighted and weighted low-rank approximation algorithms. Classical principal component analysis (PCA) is constrained to have equal weighting on the elements of the matrix, which might lead to a degraded design in some problems. To address this fundamental flaw in PCA, Golub, Hoffman, and Stewart proposed and solved a problem of constrained low-rank approximation of matrices: For a given matrix $A = (A_1\;A_2)$, find a low-rank matrix $X = (A_1\;X_2)$ such that ${\rm rank}(X)$ is less than $r$, a prescribed bound, and $\|A-X\|$ is small. Motivated by the above formulation, we propose a weighted low-rank approximation problem that generalizes the constrained low-rank approximation problem of Golub, Hoffman, and Stewart. We study a general framework obtained by pointwise multiplication with the weight matrix and consider the following problem: For a given matrix $A\in\mathbb{R}^{m\times n}$, solve
\begin{equation*}\min_{X}\|(A-X)\odot W\|_F^2\quad{\rm subject~to}\quad{\rm rank}(X)\le r,\end{equation*}
where $\odot$ denotes the pointwise multiplication and $\|\cdot\|_F$ is the Frobenius norm of matrices. In the first part, we study a special version of the above general weighted low-rank approximation problem. Instead of using pointwise multiplication with the weight matrix, we use the regular matrix multiplication and replace the rank constraint by its convex surrogate, the nuclear norm, and consider the following problem:
\begin{equation*}\hat{X} = \arg\min_X \left\{\frac{1}{2}\|(A-X)W\|_F^2 + \tau\|X\|_\ast\right\},\end{equation*}
where $\|\cdot\|_\ast$ denotes the nuclear norm. Considering its resemblance to the classic singular value thresholding problem, we call it the weighted singular value thresholding (WSVT) problem. As expected, the WSVT problem has no closed-form analytical solution in general, and a numerical procedure is needed to solve it. We introduce auxiliary variables and apply a simple and fast alternating direction method to solve WSVT numerically. Moreover, we present a convergence analysis of the algorithm and propose a mechanism for estimating the weight from the data. We demonstrate the performance of WSVT on two computer vision applications: background estimation from video sequences and facial shadow removal. In both cases, WSVT shows superior performance to all other models traditionally used. In the second part, we study the general framework of the proposed problem. For the special case of weight, we study the limiting behavior of the solution to our problem, both analytically and numerically. In the limiting case of weights, as $(W_1)_{ij}\to\infty$ and $W_2=\mathbbm{1}$, a matrix of ones, we show that the solutions to our weighted problem converge, and that the limit is the solution to the constrained low-rank approximation problem of Golub et al.
Additionally, by asymptotic analysis of the solution to our problem, we propose a rate of convergence. By doing this, we make explicit connections between a vast genre of weighted and unweighted low-rank approximation problems. In addition to these, we devise a novel and efficient numerical algorithm based on the alternating direction method for the special case of weight and present a detailed convergence analysis. Our approach improves substantially over the existing weighted low-rank approximation algorithms proposed in the literature. We also explore the application of our algorithm to real-world problems in a variety of domains, such as computer vision and machine learning. Finally, for a special family of weights, we demonstrate an interesting property of the solution to the general weighted low-rank approximation problem. Additionally, we devise two accelerated algorithms by using this property and present their effectiveness compared to the algorithm proposed in Chapter 4.
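As a point of reference for the two problems stated in this abstract, and not as part of the dissertation's own results, the unweighted special cases have well-known closed-form solutions. With $W = I$, the WSVT objective reduces to classical singular value thresholding; with $W$ equal to the all-ones matrix, the pointwise-weighted problem reduces to the standard best rank-$r$ approximation in the Frobenius norm. The symbols $U$, $\Sigma$, $V$, $\sigma_i$, and $p$ below are notation assumed here for the singular value decomposition of $A$.

```latex
% Unweighted special cases only; U, \Sigma, V, \sigma_i, p are assumed notation.
\begin{align*}
A &= U \Sigma V^{T}, \qquad
  \Sigma = \mathrm{diag}(\sigma_1,\dots,\sigma_p), \quad
  \sigma_1 \ge \dots \ge \sigma_p \ge 0, \quad p = \min(m,n), \\
% Classical singular value thresholding (W = I):
\arg\min_X \Big\{ \tfrac{1}{2}\|A-X\|_F^2 + \tau\|X\|_\ast \Big\}
  &= U\,\mathrm{diag}\big(\max(\sigma_i-\tau,\,0)\big)\,V^{T}, \\
% Best rank-r approximation (W = all-ones matrix), a minimizer:
\min_{\mathrm{rank}(X)\le r} \|A-X\|_F^2
  &= \sum_{i>r} \sigma_i^2,
  \quad \text{attained at } X = \sum_{i=1}^{r} \sigma_i u_i v_i^{T}.
\end{align*}
```

For a general weight matrix neither formula applies directly, which is consistent with the abstract's remark that WSVT needs a numerical procedure.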
- Date Issued
- 2016
- Identifier
- CFE0006833, ucf:51789
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006833
- Title
- Weakly Labeled Action Recognition and Detection.
- Creator
-
Sultani, Waqas, Shah, Mubarak, Bagci, Ulas, Qi, GuoJun, Yun, Hae-Bum, University of Central Florida
- Abstract / Description
-
Research in human action recognition strives to develop increasingly generalized methods that are robust to intra-class variability and inter-class ambiguity. Recent years have seen tremendous strides in improving recognition accuracy on ever larger and more complex benchmark datasets, comprising realistic "in the wild" action videos. Unfortunately, the all-encompassing, dense, global representations that bring about such improvements often benefit from the inherent characteristics, specific to datasets and classes, that do not necessarily reflect knowledge about the entity to be recognized. This results in specific models that perform well within datasets but generalize poorly. Furthermore, training of supervised action recognition and detection methods needs several precise spatio-temporal manual annotations to achieve good recognition and detection accuracy. For instance, current deep learning architectures require millions of accurately annotated videos to learn robust action classifiers. However, these annotations are quite difficult to obtain. In the first part of this dissertation, we explore the reasons for poor classifier performance when tested on novel datasets, and quantify the effect of scene backgrounds on action representations and recognition. We attempt to address the problem of recognizing human actions while training and testing on distinct datasets when test videos are neither labeled nor available during training. In this scenario, learning of a joint vocabulary and domain transfer techniques are not applicable. We perform different types of partitioning of the GIST feature space for several datasets and compute measures of background scene complexity, as well as of the extent to which scenes are helpful in action classification. We then propose a new process to obtain a measure of confidence in each pixel of the video being a foreground region using motion, appearance, and saliency together in a 3D-Markov Random Field (MRF) based framework.
We also propose multiple ways to exploit the foreground confidence: to improve bag-of-words vocabulary, histogram representation of a video, and a novel histogram decomposition based representation and kernel. The above-mentioned work provides the probability of each pixel belonging to the actor; however, it does not give the precise spatio-temporal location of the actor. Furthermore, the above framework would require precise spatio-temporal manual annotations to train an action detector. However, manual annotations in videos are laborious, require several annotators and contain human biases. Therefore, in the second part of this dissertation, we propose a weakly labeled approach to automatically obtain spatio-temporal annotations of actors in action videos. We first obtain a large number of action proposals in each video. To capture a few most representative action proposals in each video and avoid processing thousands of them, we rank them using optical flow and saliency in a 3D-MRF based framework and select a few proposals using a MAP based proposal subset selection method. We demonstrate that this ranking preserves the high-quality action proposals. Several such proposals are generated for each video of the same action. Our next challenge is to iteratively select one proposal from each video so that all proposals are globally consistent. We formulate this as a Generalized Maximum Clique Graph problem (GMCP) using shape, global and fine-grained similarity of proposals across the videos. The output of our method is the most action representative proposals from each video. Our method can also annotate multiple instances of the same action in a video. Moreover, action detection experiments using annotations obtained by our method and several baselines demonstrate the superiority of our approach. The above-mentioned annotation method uses multiple videos of the same action.
Therefore, in the third part of this dissertation, we tackle the problem of spatio-temporal action localization in a video, without assuming the availability of multiple videos or any prior annotations. The action is localized by employing images downloaded from the Internet using the action label. Given web images, we first dampen image noise using random walk and evade distracting backgrounds within images using image action proposals. Then, given a video, we generate multiple spatio-temporal action proposals. We suppress camera and background generated proposals by exploiting optical flow gradients within proposals. To obtain the most action representative proposals, we propose to reconstruct action proposals in the video by leveraging the action proposals in images. Moreover, we preserve the temporal smoothness of the video and reconstruct all proposal bounding boxes jointly using the constraints that push the coefficients for each bounding box toward a common consensus, thus enforcing the coefficient similarity across multiple frames. We solve this optimization problem using a variant of the two-metric projection algorithm. Finally, the video proposal that has the lowest reconstruction cost and is motion salient is used to localize the action. Our method is not only applicable to trimmed videos, but it can also be used for action localization in untrimmed videos, which is a very challenging problem. Finally, in the fourth part of this dissertation, we propose a novel approach to generate a few properly ranked action proposals from a large number of noisy proposals. The proposed approach begins with dividing each proposal into sub-proposals. We assume that the quality of a proposal remains the same within each sub-proposal. We then employ a graph optimization method to recombine the sub-proposals in all action proposals in a single video in order to optimally build new action proposals and rank them by the combined node and edge scores.
For an untrimmed video, we first divide the video into shots and then build the above-mentioned graph within each shot. Our method generates a few ranked proposals that can be better than all the existing underlying proposals. Our experimental results validate that the properly ranked action proposals can significantly boost action detection results. Our extensive experimental results on different challenging and realistic action datasets, comparisons with several competitive baselines and detailed analysis of each step of the proposed methods validate the proposed ideas and frameworks.
- Date Issued
- 2017
- Identifier
- CFE0006801, ucf:51809
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006801