Current Search: action recognition (x)
View All Items
- Title
- ACTION RECOGNITION USING PARTICLE FLOW FIELDS.
- Creator
-
Reddy, Kishore, Shah, Mubarak, Sukthankar, Gita, Wei, Lei, Moore, Brian, University of Central Florida
- Abstract / Description
-
In recent years, research in human action recognition has advanced on multiple fronts to address various types of actions including simple, isolated actions in staged data (e.g., KTH dataset), complex actions (e.g., Hollywood dataset), and naturally occurring actions in surveillance videos (e.g, VIRAT dataset). Several techniques including those based on gradient, flow, and interest-points, have been developed for their recognition. Most perform very well in standard action recognition...
Show moreIn recent years, research in human action recognition has advanced on multiple fronts to address various types of actions including simple, isolated actions in staged data (e.g., KTH dataset), complex actions (e.g., Hollywood dataset), and naturally occurring actions in surveillance videos (e.g, VIRAT dataset). Several techniques including those based on gradient, flow, and interest-points, have been developed for their recognition. Most perform very well in standard action recognition datasets, but fail to produce similar results in more complex, large-scale datasets. Action recognition on large categories of unconstrained videos taken from the web is a very challenging problem compared to datasets like KTH (six actions), IXMAS (thirteen actions), and Weizmann (ten actions). Challenges such as camera motion, different viewpoints, huge interclass variations, cluttered background, occlusions, bad illumination conditions, and poor quality of web videos cause the majority of the state-of-the-art action recognition approaches to fail. An increasing number of categories and the inclusion of actions with high confusion also increase the difficulty of the problem. The approach taken to solve this action recognition problem depends primarily on the dataset and the possibility of detecting and tracking the object of interest. In this dissertation, a new method for video representation is proposed and three new approaches to perform action recognition in different scenarios using varying prerequisites are presented. The prerequisites have decreasing levels of difficulty to obtain: 1) Scenario requires human detection and tracking to perform action recognition; 2) Scenario requires background and foreground separation to perform action recognition; and 3) No pre-processing is required for action recognition.First, we propose a new video representation using optical flow and particle advection. The proposed ``Particle Flow Field'' (PFF) representation has been used to generate motion descriptors and tested in a Bag of Video Words (BoVW) framework on the KTH dataset. We show that particle flow fields has better performance than other low-level video representations, such as 2D-Gradients, 3D-Gradients and optical flow. Second, we analyze the performance of the state-of-the-art technique based on the histogram of oriented 3D-Gradients in spatio temporal volumes, where human detection and tracking are required. We use the proposed particle flow field and show superior results compared to the histogram of oriented 3D-Gradients in spatio temporal volumes. The proposed method, when used for human action recognition, just needs human detection and does not necessarily require human tracking and figure centric bounding boxes. It has been tested on KTH (six actions), Weizmann (ten actions), and IXMAS (thirteen actions, 4 different views) action recognition datasets.Third, we propose using the scene context information obtained from moving and stationary pixels in the key frames, in conjunction with motion descriptors obtained using Bag of Words framework, to solve the action recognition problem on a large (50 actions) dataset with videos from the web. We perform a combination of early and late fusion on multiple features to handle the huge number of categories. We demonstrate that scene context is a very important feature for performing action recognition on huge datasets.The proposed method needs separation of moving and stationary pixels, and does not require any kind of video stabilization, person detection, or tracking and pruning of features. Our approach obtains good performance on a huge number of action categories. It has been tested on the UCF50 dataset with 50 action categories, which is an extension of the UCF YouTube Action (UCF11) Dataset containing 11 action categories. We also tested our approach on the KTH and HMDB51 datasets for comparison.Finally, we focus on solving practice problems in representing actions by bag of spatio temporal features (i.e. cuboids), which has proven valuable for action recognition in recent literature. We observed that the visual vocabulary based (bag of video words) method suffers from many drawbacks in practice, such as: (i) It requires an intensive training stage to obtain good performance; (ii) it is sensitive to the vocabulary size; (iii) it is unable to cope with incremental recognition problems; (iv) it is unable to recognize simultaneous multiple actions; (v) it is unable to perform recognition frame by frame. In order to overcome these drawbacks, we propose a framework to index large scale motion features using Sphere/Rectangle-tree (SR-tree) for incremental action detection and recognition. The recognition comprises of the following two steps: 1) recognizing the local features by non-parametric nearest neighbor (NN), and 2) using a simple voting strategy to label the action. It can also provide localization of the action. Since it does not require feature quantization it can efficiently grow the feature-tree by adding features from new training actions or categories. Our method provides an effective way for practical incremental action recognition. Furthermore, it can handle large scale datasets because the SR-tree is a disk-based data structure. We tested our approach on two publicly available datasets, the KTH dataset and the IXMAS multi-view dataset, and achieved promising results.
Show less - Date Issued
- 2012
- Identifier
- CFE0004626, ucf:49923
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004626
- Title
- LEARNING SEMANTIC FEATURES FOR VISUAL RECOGNITION.
- Creator
-
Liu, Jingen, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Visual recognition (e.g., object, scene and action recognition) is an active area of research in computer vision due to its increasing number of real-world applications such as video (image) indexing and search, intelligent surveillance, human-machine interaction, robot navigation, etc. Effective modeling of the objects, scenes and actions is critical for visual recognition. Recently, bag of visual words (BoVW) representation, in which the image patches or video cuboids are quantized into...
Show moreVisual recognition (e.g., object, scene and action recognition) is an active area of research in computer vision due to its increasing number of real-world applications such as video (image) indexing and search, intelligent surveillance, human-machine interaction, robot navigation, etc. Effective modeling of the objects, scenes and actions is critical for visual recognition. Recently, bag of visual words (BoVW) representation, in which the image patches or video cuboids are quantized into visual words (i.e., mid-level features) based on their appearance similarity using clustering, has been widely and successfully explored. The advantages of this representation are: no explicit detection of objects or object parts and their tracking are required; the representation is somewhat tolerant to within-class deformations, and it is e±cient for matching. However, the performance of the BoVW is sensitive to the size of the visual vocabulary. Therefore, computationally expensive cross-validation is needed to find the appropriate quantization granularity. This limitation is partially due to the fact that the visual words are not semantically meaningful. This limits the effectiveness and compactness of the representation. To overcome these shortcomings, in this thesis we present principled approach to learn a semantic vocabulary (i.e. high-level features) from a large amount of visual words (mid-level features). In this context, the thesis makes two major contributions. First, we have developed an algorithm to discover a compact yet discriminative semantic vocabulary. This vocabulary is obtained by grouping the visual-words based on their distribution in videos (images) into visual-word clusters. The mutual information (MI) be- tween the clusters and the videos (images) depicts the discriminative power of the semantic vocabulary, while the MI between visual-words and visual-word clusters measures the compactness of the vocabulary. We apply the information bottleneck (IB) algorithm to find the optimal number of visual-word clusters by ¯nding the good tradeoff between compactness and discriminative power. We tested our proposed approach on the state-of-the-art KTH dataset, and obtained average accuracy of 94.2%. However, this approach performs one-side clustering, because only visual words are clustered regardless of which video they appear in. In order to leverage the co-occurrence of visual words and images, we have developed the co-clustering algorithm to simultaneously group the visual words and images. We tested our approach on the publicly available ¯fteen scene dataset and have obtained about 4% increase in the average accuracy compared to the one side clustering approaches. Second, instead of grouping the mid-level features, we first embed the features into a low-dimensional semantic space by manifold learning, and then perform the clustering. We apply Diffusion Maps (DM) to capture the local geometric structure of the mid-level feature space. The DM embedding is able to preserve the explicitly defined diffusion distance, which reflects the semantic similarity between any two features. Furthermore, the DM provides multi-scale analysis capability by adjusting the time steps in the Markov transition matrix. The experiments on KTH dataset show that DM can perform much better (about 3% to 6% improvement in average accuracy) than other manifold learning approaches and IB method. Above methods use only single type of features. In order to combine multiple heterogeneous features for visual recognition, we further propose the Fielder Embedding to capture the complicated semantic relationships between all entities (i.e., videos, images,heterogeneous features). The discovered relationships are then employed to further increase the recognition rate. We tested our approach on Weizmann dataset, and achieved about 17% 21% improvements in the average accuracy.
Show less - Date Issued
- 2009
- Identifier
- CFE0002936, ucf:47961
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002936
- Title
- MULTIZOOM ACTIVITY RECOGNITION USING MACHINE LEARNING.
- Creator
-
Smith, Raymond, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
In this thesis we present a system for detection of events in video. First a multiview approach to automatically detect and track heads and hands in a scene is described. Then, by making use of epipolar, spatial, trajectory, and appearance constraints, objects are labeled consistently across cameras (zooms). Finally, we demonstrate a new machine learning paradigm, TemporalBoost, that can recognize events in video. One aspect of any machine learning algorithm is in the feature set used. The...
Show moreIn this thesis we present a system for detection of events in video. First a multiview approach to automatically detect and track heads and hands in a scene is described. Then, by making use of epipolar, spatial, trajectory, and appearance constraints, objects are labeled consistently across cameras (zooms). Finally, we demonstrate a new machine learning paradigm, TemporalBoost, that can recognize events in video. One aspect of any machine learning algorithm is in the feature set used. The approach taken here is to build a large set of activity features, though TemporalBoost itself is able to work with any feature set other boosting algorithms use. We also show how multiple levels of zoom can cooperate to solve problems related to activity recognition.
Show less - Date Issued
- 2005
- Identifier
- CFE0000865, ucf:46658
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000865
- Title
- Action Recognition, Temporal Localization and Detection in Trimmed and Untrimmed Video.
- Creator
-
Hou, Rui, Shah, Mubarak, Mahalanobis, Abhijit, Hua, Kien, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
Automatic understanding of videos is one of the most active areas of computer vision research. It has applications in video surveillance, human computer interaction, video sports analysis, virtual and augmented reality, video retrieval etc. In this dissertation, we address four important tasks in video understanding, namely action recognition, temporal action localization, spatial-temporal action detection and video object/action segmentation. This dissertation makes contributions to above...
Show moreAutomatic understanding of videos is one of the most active areas of computer vision research. It has applications in video surveillance, human computer interaction, video sports analysis, virtual and augmented reality, video retrieval etc. In this dissertation, we address four important tasks in video understanding, namely action recognition, temporal action localization, spatial-temporal action detection and video object/action segmentation. This dissertation makes contributions to above tasks by proposing. First, for video action recognition, we propose a category level feature learning method. Our proposed method automatically identifies such pairs of categories using a criterion of mutual pairwise proximity in the (kernelized) feature space, and a category-level similarity matrix where each entry corresponds to the one-vs-one SVM margin for pairs of categories. Second, for temporal action localization, we propose to exploit the temporal structure of actions by modeling an action as a sequence of sub-actions and present a computationally efficient approach. Third, we propose 3D Tube Convolutional Neural Network (TCNN) based pipeline for action detection. The proposed architecture is a unified deep network that is able to recognize and localize action based on 3D convolution features. It generalizes the popular faster R-CNN framework from images to videos. Last, an end-to-end encoder-decoder based 3D convolutional neural network pipeline is proposed, which is able to segment out the foreground objects from the background. Moreover, the action label can be obtained as well by passing the foreground object into an action classifier. Extensive experiments on several video datasets demonstrate the superior performance of the proposed approach for video understanding compared to the state-of-the-art.
Show less - Date Issued
- 2019
- Identifier
- CFE0007655, ucf:52502
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007655
- Title
- Exploring sparsity, self-similarity, and low rank approximation in action recognition, motion retrieval, and action spotting.
- Creator
-
Sun, Chuan, Foroosh, Hassan, Hughes, Charles, Tappen, Marshall, Sukthankar, Rahul, Moshell, Jack, University of Central Florida
- Abstract / Description
-
This thesis consists of $4$ major parts. In the first part (Chapters $1-2$), we introduce the overview, motivation, and contribution of our works, and extensively survey the current literature for $6$ related topics. In the second part (Chapters $3-7$), we explore the concept of ``Self-Similarity" in two challenging scenarios, namely, the Action Recognition and the Motion Retrieval. We build three-dimensional volume representations for both scenarios, and devise effective techniques that can...
Show moreThis thesis consists of $4$ major parts. In the first part (Chapters $1-2$), we introduce the overview, motivation, and contribution of our works, and extensively survey the current literature for $6$ related topics. In the second part (Chapters $3-7$), we explore the concept of ``Self-Similarity" in two challenging scenarios, namely, the Action Recognition and the Motion Retrieval. We build three-dimensional volume representations for both scenarios, and devise effective techniques that can produce compact representations encoding the internal dynamics of data. In the third part (Chapter $8$), we explore the challenging action spotting problem, and propose a feature-independent unsupervised framework that is effective in spotting action under various real situations, even under heavily perturbed conditions. The final part (Chapters $9$) is dedicated to conclusions and future works.For action recognition, we introduce a generic method that does not depend on one particular type of input feature vector. We make three main contributions: (i) We introduce the concept of Joint Self-Similarity Volume (Joint SSV) for modeling dynamical systems, and show that by using a new optimized rank-1 tensor approximation of Joint SSV one can obtain compact low-dimensional descriptors that very accurately preserve the dynamics of the original system, e.g. an action video sequence; (ii) The descriptor vectors derived from the optimized rank-1 approximation make it possible to recognize actions without explicitly aligning the action sequences of varying speed of execution or difference frame rates; (iii) The method is generic and can be applied using different low-level features such as silhouettes, histogram of oriented gradients (HOG), etc. Hence, it does not necessarily require explicit tracking of features in the space-time volume. Our experimental results on five public datasets demonstrate that our method produces very good results and outperforms many baseline methods.For action recognition for incomplete videos, we determine whether incomplete videos that are often discarded carry useful information for action recognition, and if so, how one can represent such mixed collection of video data (complete versus incomplete, and labeled versus unlabeled) in a unified manner. We propose a novel framework to handle incomplete videos in action classification, and make three main contributions: (i) We cast the action classification problem for a mixture of complete and incomplete data as a semi-supervised learning problem of labeled and unlabeled data. (ii) We introduce a two-step approach to convert the input mixed data into a uniform compact representation. (iii) Exhaustively scrutinizing $280$ configurations, we experimentally show on our two created benchmarks that, even when the videos are extremely sparse and incomplete, it is still possible to recover useful information from them, and classify unknown actions by a graph based semi-supervised learning framework.For motion retrieval, we present a framework that allows for a flexible and an efficient retrieval of motion capture data in huge databases. The method first converts an action sequence into a self-similarity matrix (SSM), which is based on the notion of self-similarity. This conversion of the motion sequences into compact and low-rank subspace representations greatly reduces the spatiotemporal dimensionality of the sequences. The SSMs are then used to construct order-3 tensors, and we propose a low-rank decomposition scheme that allows for converting the motion sequence volumes into compact lower dimensional representations, without losing the nonlinear dynamics of the motion manifold. Thus, unlike existing linear dimensionality reduction methods that distort the motion manifold and lose very critical and discriminative components, the proposed method performs well, even when inter-class differences are small or intra-class differences are large. In addition, the method allows for an efficient retrieval and does not require the time-alignment of the motion sequences. We evaluate the performance of our retrieval framework on the CMU mocap dataset under two experimental settings, both demonstrating very good retrieval rates.For action spotting, our framework does not depend on any specific feature (e.g. HOG/HOF, STIP, silhouette, bag-of-words, etc.), and requires no human localization, segmentation, or framewise tracking. This is achieved by treating the problem holistically as that of extracting the internal dynamics of video cuboids by modeling them in their natural form as multilinear tensors. To extract their internal dynamics, we devised a novel Two-Phase Decomposition (TP-Decomp) of a tensor that generates very compact and discriminative representations that are robust to even heavily perturbed data. Technically, a Rank-based Tensor Core Pyramid (Rank-TCP) descriptor is generated by combining multiple tensor cores under multiple ranks, allowing to represent video cuboids in a hierarchical tensor pyramid. The problem then reduces to a template matching problem, which is solved efficiently by using two boosting strategies: (i) to reduce the search space, we filter the dense trajectory cloud extracted from the target video; (ii) to boost the matching speed, we perform matching in an iterative coarse-to-fine manner. Experiments on 5 benchmarks show that our method outperforms current state-of-the-art under various challenging conditions. We also created a challenging dataset called Heavily Perturbed Video Arrays (HPVA) to validate the robustness of our framework under heavily perturbed situations.
Show less - Date Issued
- 2014
- Identifier
- CFE0005554, ucf:50290
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005554
- Title
- STUDY OF HUMAN ACTIVITY IN VIDEO DATA WITH AN EMPHASIS ON VIEW-INVARIANCE.
- Creator
-
Ashraf, Nazim, Foroosh, Hassan, Hughes, Charles, Tappen, Marshall, Moshell, Jack, University of Central Florida
- Abstract / Description
-
The perception and understanding of human motion and action is an important area of research in computer vision that plays a crucial role in various applications such as surveillance, HCI, ergonomics, etc. In this thesis, we focus on the recognition of actions in the case of varying viewpoints and different and unknown camera intrinsic parameters. The challenges to be addressed include perspective distortions, differences in viewpoints, anthropometric variations,and the large degrees of...
Show moreThe perception and understanding of human motion and action is an important area of research in computer vision that plays a crucial role in various applications such as surveillance, HCI, ergonomics, etc. In this thesis, we focus on the recognition of actions in the case of varying viewpoints and different and unknown camera intrinsic parameters. The challenges to be addressed include perspective distortions, differences in viewpoints, anthropometric variations,and the large degrees of freedom of articulated bodies. In addition, we are interested in methods that require little or no training. The current solutions to action recognition usually assume that there is a huge dataset of actions available so that a classifier can be trained. However, thismeans that in order to define a new action, the user has to record a number of videos fromdifferent viewpoints with varying camera intrinsic parameters and then retrain the classifier, which is not very practical from a development point of view. We propose algorithms that overcome these challenges and require just a few instances of the action from any viewpoint with any intrinsic camera parameters. Our first algorithm is based on the rank constraint on the family of planar homographies associated with triplets of body points. We represent action as a sequence of poses, and decompose the pose into triplets. Therefore, the pose transition is brokendown into a set of movement of body point planes. In this way, we transform the non-rigid motion of the body points into a rigid motion of body point planes. We use the fact that the family of homographies associated with two identical poses would have rank 4 to gauge similarity of the pose between two subjects, observed by different perspective cameras and from different viewpoints. This method requires only one instance of the action. We then show that it is possible to extend the concept of triplets to line segments. In particular, we establish that if we look at the movement of line segments instead of triplets, we have more redundancy in data thus leading to better results. We demonstrate this concept on (")fundamental ratios.(") We decompose a human body pose into line segments instead of triplets and look at set of movement of line segments. This method needs only three instances of the action. If a larger dataset is available, we can also apply weighting on line segments for better accuracy. The last method is based onthe concept of (")Projective Depth("). Given a plane, we can find the relative depth of a point relative to the given plane. We propose three different ways of using (")projective depth:(") (i) Triplets - the three points of a triplet along with the epipole defines the plane and the movement of points relative to these body planes can be used to recognize actions; (ii) Ground plane - if we are able to extract the ground plane, we can find the (")projective depth(") of the body points withrespect to it. Therefore, the problem of action recognition would translate to curve matching; and (iii) Mirror person (-) We can use the mirror view of the person to extract mirror symmetric planes. This method also needs only one instance of the action. Extensive experiments are reported on testing view invariance, robustness to noisy localization and occlusions of bodypoints, and action recognition. The experimental results are very promising and demonstrate the efficiency of our proposed invariants.
Show less - Date Issued
- 2012
- Identifier
- CFE0004352, ucf:49449
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004352
- Title
- SPATIO-TEMPORAL MAXIMUM AVERAGE CORRELATION HEIGHT TEMPLATES IN ACTION RECOGNITION AND VIDEO SUMMARIZATION.
- Creator
-
Rodriguez, Mikel, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Action recognition represents one of the most difficult problems in computer vision given that it embodies the combination of several uncertain attributes, such as the subtle variability associated with individual human behavior and the challenges that come with viewpoint variations, scale changes and different temporal extents. Nevertheless, action recognition solutions are critical in a great number of domains, such video surveillance, assisted living environments, video search, interfaces,...
Show moreAction recognition represents one of the most difficult problems in computer vision given that it embodies the combination of several uncertain attributes, such as the subtle variability associated with individual human behavior and the challenges that come with viewpoint variations, scale changes and different temporal extents. Nevertheless, action recognition solutions are critical in a great number of domains, such video surveillance, assisted living environments, video search, interfaces, and virtual reality. In this dissertation, we investigate template-based action recognition algorithms that can incorporate the information contained in a set of training examples, and we explore how these algorithms perform in action recognition and video summarization. First, we introduce a template-based method for recognizing human actions called Action MACH. Our approach is based on a Maximum Average Correlation Height (MACH) filter. MACH is capable of capturing intra-class variability by synthesizing a single Action MACH filter for a given action class. We generalize the traditional MACH filter to video (3D spatiotemporal volume), and vector valued data. By analyzing the response of the filter in the frequency domain, we avoid the high computational cost commonly incurred in template-based approaches. Vector valued data is analyzed using the Clifford Fourier transform, a generalization of the Fourier transform intended for both scalar and vector-valued data. Next, we address three seldom explored challenges in template-based action recognition. The first is the recognition and localization of human actions in aerial videos obtained from unmanned aerial vehicles (UAVs), a new medium which presents unique challenges due to the small number of pixels per human, pose, and moving camera. The second issue we address is the incorporation of multiple positive and negative examples of a target action class when generating an action template. We address this issue by employing the Fukunaga-Koontz Transform as a means of generating a single quadratic template which, unlike traditional temporal templates (which rely on positive examples alone), effectively captures the variability associated with an action class by including both positive and negative examples in the template training process. Third, we explore the problem of generating video summaries that include specific actions of interest as opposed to all moving objects. In doing so, we explore the role of action templates in video summarization in an effort to provide a means of generating a compact video representation based on a set of activities of interest. We introduce an approach in which a user specifies the activities that interest him and the video is automatically condensed to a short clip which captures the most relevant events based on the user's preference. We follow the output summary video format of non-chronological video synopsis approaches, in which different events which occur at different times may be displayed concurrently, even though they never occur simultaneously in the original video. However, instead of assuming that all moving objects are interesting, priority is given to specific activities of interest which pertain to a user's query. This provides an efficient means of browsing through large collections of video for events of interest.
Show less - Date Issued
- 2010
- Identifier
- CFE0003313, ucf:48507
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003313
- Title
- Online, Supervised and Unsupervised Action Localization in Videos.
- Creator
-
Soomro, Khurram, Shah, Mubarak, Heinrich, Mark, Hu, Haiyan, Bagci, Ulas, Yun, Hae-Bum, University of Central Florida
- Abstract / Description
-
Action recognition classifies a given video among a set of action labels, whereas action localization determines the location of an action in addition to its class. The overall aim of this dissertation is action localization. Many of the existing action localization approaches exhaustively search (spatially and temporally) for an action in a video. However, as the search space increases with high resolution and longer duration videos, it becomes impractical to use such sliding window...
Show moreAction recognition classifies a given video among a set of action labels, whereas action localization determines the location of an action in addition to its class. The overall aim of this dissertation is action localization. Many of the existing action localization approaches exhaustively search (spatially and temporally) for an action in a video. However, as the search space increases with high resolution and longer duration videos, it becomes impractical to use such sliding window techniques. The first part of this dissertation presents an efficient approach for localizing actions by learning contextual relations between different video regions in training. In testing, we use the context information to estimate the probability of each supervoxel belonging to the foreground action and use Conditional Random Field (CRF) to localize actions. In the above method and typical approaches to this problem, localization is performed in an offline manner where all the video frames are processed together. This prevents timely localization and prediction of actions/interactions - an important consideration for many tasks including surveillance and human-machine interaction. Therefore, in the second part of this dissertation we propose an online approach to the challenging problem of localization and prediction of actions/interactions in videos. In this approach, we use human poses and superpixels in each frame to train discriminative appearance models and perform online prediction of actions/interactions with Structural SVM. Above two approaches rely on human supervision in the form of assigning action class labels to videos and annotating actor bounding boxes in each frame of training videos. Therefore, in the third part of this dissertation we address the problem of unsupervised action localization. Given unlabeled videos without annotations, this approach aims at: 1) Discovering action classes using a discriminative clustering approach, and 2) Localizing actions using a variant of Knapsack problem.
Show less - Date Issued
- 2017
- Identifier
- CFE0006917, ucf:51685
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006917
- Title
- MULTI-VIEW GEOMETRIC CONSTRAINTS FOR HUMAN ACTION RECOGNITION AND TRACKING.
- Creator
-
GRITAI, ALEXEI, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Human actions are the essence of a human life and a natural product of the human mind. Analysis of human activities by a machine has attracted the attention of many researchers. This analysis is very important in a variety of domains including surveillance, video retrieval, human-computer interaction, athlete performance investigation, etc. This dissertation makes three major contributions to automatic analysis of human actions. First, we conjecture that the relationship between body joints...
Show moreHuman actions are the essence of a human life and a natural product of the human mind. Analysis of human activities by a machine has attracted the attention of many researchers. This analysis is very important in a variety of domains including surveillance, video retrieval, human-computer interaction, athlete performance investigation, etc. This dissertation makes three major contributions to automatic analysis of human actions. First, we conjecture that the relationship between body joints of two actors in the same posture can be described by a 3D rigid transformation. This transformation simultaneously captures different poses and various sizes and proportions. As a consequence of this conjecture, we show that there exists a fundamental matrix between the imaged positions of the body joints of two actors, if they are in the same posture. Second, we propose a novel projection model for cameras moving at a constant velocity in 3D space, \emph cameras, and derive the Galilean fundamental matrix and apply it to human action recognition. Third, we propose a novel use for the invariant ratio of areas under an affine transformation and utilizing the epipolar geometry between two cameras for 2D model-based tracking of human body joints. In the first part of the thesis, we propose an approach to match human actions using semantic correspondences between human bodies. These correspondences are used to provide geometric constraints between multiple anatomical landmarks ( e.g. hands, shoulders, and feet) to match actions observed from different viewpoints and performed at different rates by actors of differing anthropometric proportions. The fact that the human body has approximate anthropometric proportion allows for innovative use of the machinery of epipolar geometry to provide constraints for analyzing actions performed by people of different anthropometric sizes, while ensuring that changes in viewpoint do not affect matching. A novel measure in terms of rank of matrix constructed only from image measurements of the locations of anatomical landmarks is proposed to ensure that similar actions are accurately recognized. Finally, we describe how dynamic time warping can be used in conjunction with the proposed measure to match actions in the presence of nonlinear time warps. We demonstrate the versatility of our algorithm in a number of challenging sequences and applications including action synchronization , odd one out, following the leader, analyzing periodicity etc. Next, we extend the conventional model of image projection to video captured by a camera moving at constant velocity. We term such moving camera Galilean camera. To that end, we derive the spacetime projection and develop the corresponding epipolar geometry between two Galilean cameras. Both perspective imaging and linear pushbroom imaging form specializations of the proposed model and we show how six different ``fundamental" matrices including the classic fundamental matrix, the Linear Pushbroom (LP) fundamental matrix, and a fundamental matrix relating Epipolar Plane Images (EPIs) are related and can be directly recovered from a Galilean fundamental matrix. We provide linear algorithms for estimating the parameters of the the mapping between videos in the case of planar scenes. For applying fundamental matrix between Galilean cameras to human action recognition, we propose a measure that has two important properties. First property makes it possible to recognize similar actions, if their execution rates are linearly related. Second property allows recognizing actions in video captured by Galilean cameras. Thus, the proposed algorithm guarantees that actions can be correctly matched despite changes in view, execution rate, anthropometric proportions of the actor, and even if the camera moves with constant velocity. Finally, we also propose a novel 2D model based approach for tracking human body parts during articulated motion. The human body is modeled as a 2D stick figure of thirteen body joints and an action is considered as a sequence of these stick figures. Given the locations of these joints in every frame of a model video and the first frame of a test video, the joint locations are automatically estimated throughout the test video using two geometric constraints. First, invariance of the ratio of areas under an affine transformation is used for initial estimation of the joint locations in the test video. Second, the epipolar geometry between the two cameras is used to refine these estimates. Using these estimated joint locations, the tracking algorithm determines the exact location of each landmark in the test video using the foreground silhouettes. The novelty of the proposed approach lies in the geometric formulation of human action models, the combination of the two geometric constraints for body joints prediction, and the handling of deviations in anthropometry of individuals, viewpoints, execution rate, and style of performing action. The proposed approach does not require extensive training and can easily adapt to a wide variety of articulated actions.
Show less - Date Issued
- 2007
- Identifier
- CFE0001692, ucf:47199
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001692
- Title
- Weakly Labeled Action Recognition and Detection.
- Creator
-
Sultani, Waqas, Shah, Mubarak, Bagci, Ulas, Qi, GuoJun, Yun, Hae-Bum, University of Central Florida
- Abstract / Description
-
Research in human action recognition strives to develop increasingly generalized methods thatare robust to intra-class variability and inter-class ambiguity. Recent years have seen tremendousstrides in improving recognition accuracy on ever larger and complex benchmark datasets, comprisingrealistic actions (")in the wild(") videos. Unfortunately, the all-encompassing, dense, globalrepresentations that bring about such improvements often benefit from the inherent characteristics,specific to...
Show moreResearch in human action recognition strives to develop increasingly generalized methods thatare robust to intra-class variability and inter-class ambiguity. Recent years have seen tremendousstrides in improving recognition accuracy on ever larger and complex benchmark datasets, comprisingrealistic actions (")in the wild(") videos. Unfortunately, the all-encompassing, dense, globalrepresentations that bring about such improvements often benefit from the inherent characteristics,specific to datasets and classes, that do not necessarily reflect knowledge about the entity to berecognized. This results in specific models that perform well within datasets but generalize poorly.Furthermore, training of supervised action recognition and detection methods need several precisespatio-temporal manual annotations to achieve good recognition and detection accuracy. For instance,current deep learning architectures require millions of accurately annotated videos to learnrobust action classifiers. However, these annotations are quite difficult to achieve.In the first part of this dissertation, we explore the reasons for poor classifier performance whentested on novel datasets, and quantify the effect of scene backgrounds on action representationsand recognition. We attempt to address the problem of recognizing human actions while trainingand testing on distinct datasets when test videos are neither labeled nor available during training. Inthis scenario, learning of a joint vocabulary, or domain transfer techniques are not applicable. Weperform different types of partitioning of the GIST feature space for several datasets and computemeasures of background scene complexity, as well as, for the extent to which scenes are helpfulin action classification. We then propose a new process to obtain a measure of confidence in eachpixel of the video being a foreground region using motion, appearance, and saliency together in a3D-Markov Random Field (MRF) based framework. We also propose multiple ways to exploit theforeground confidence: to improve bag-of-words vocabulary, histogram representation of a video,and a novel histogram decomposition based representation and kernel.iiiThe above-mentioned work provides probability of each pixel being belonging to the actor, however,it does not give the precise spatio-temporal location of the actor. Furthermore, above frameworkwould require precise spatio-temporal manual annotations to train an action detector. However,manual annotations in videos are laborious, require several annotators and contain humanbiases. Therefore, in the second part of this dissertation, we propose a weakly labeled approachto automatically obtain spatio-temporal annotations of actors in action videos. We first obtain alarge number of action proposals in each video. To capture a few most representative action proposalsin each video and evade processing thousands of them, we rank them using optical flow andsaliency in a 3D-MRF based framework and select a few proposals using MAP based proposal subsetselection method. We demonstrate that this ranking preserves the high-quality action proposals.Several such proposals are generated for each video of the same action. Our next challenge is toiteratively select one proposal from each video so that all proposals are globally consistent. Weformulate this as Generalized Maximum Clique Graph problem (GMCP) using shape, global andfine-grained similarity of proposals across the videos. The output of our method is the most actionrepresentative proposals from each video. Using our method can also annotate multiple instancesof the same action in a video can also be annotated. Moreover, action detection experiments usingannotations obtained by our method and several baselines demonstrate the superiority of ourapproach.The above-mentioned annotation method uses multiple videos of the same action. Therefore, inthe third part of this dissertation, we tackle the problem of spatio-temporal action localization in avideo, without assuming the availability of multiple videos or any prior annotations. The action islocalized by employing images downloaded from the Internet using action label. Given web images,we first dampen image noise using random walk and evade distracting backgrounds withinimages using image action proposals. Then, given a video, we generate multiple spatio-temporalaction proposals. We suppress camera and background generated proposals by exploiting opticalivflow gradients within proposals. To obtain the most action representative proposals, we propose toreconstruct action proposals in the video by leveraging the action proposals in images. Moreover,we preserve the temporal smoothness of the video and reconstruct all proposal bounding boxesjointly using the constraints that push the coefficients for each bounding box toward a commonconsensus, thus enforcing the coefficient similarity across multiple frames. We solve this optimizationproblem using the variant of two-metric projection algorithm. Finally, the video proposalthat has the lowest reconstruction cost and is motion salient is used to localize the action. Ourmethod is not only applicable to the trimmed videos, but it can also be used for action localizationin untrimmed videos, which is a very challenging problem.Finally, in the third part of this dissertation, we propose a novel approach to generate a few properlyranked action proposals from a large number of noisy proposals. The proposed approach beginswith dividing each proposal into sub-proposals. We assume that the quality of proposal remainsthe same within each sub-proposal. We, then employ a graph optimization method to recombinethe sub-proposals in all action proposals in a single video in order to optimally build new actionproposals and rank them by the combined node and edge scores. For an untrimmed video, we firstdivide the video into shots and then make the above-mentioned graph within each shot. Our methodgenerates a few ranked proposals that can be better than all the existing underlying proposals. Ourexperimental results validated that the properly ranked action proposals can significantly boostaction detection results.Our extensive experimental results on different challenging and realistic action datasets, comparisonswith several competitive baselines and detailed analysis of each step of proposed methodsvalidate the proposed ideas and frameworks.
Show less - Date Issued
- 2017
- Identifier
- CFE0006801, ucf:51809
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006801
- Title
- Holistic Representations for Activities and Crowd Behaviors.
- Creator
-
Solmaz, Berkan, Shah, Mubarak, Da Vitoria Lobo, Niels, Jha, Sumit, Ilie, Marcel, Moore, Brian, University of Central Florida
- Abstract / Description
-
In this dissertation, we address the problem of analyzing the activities of people in a variety of scenarios, this is commonly encountered in vision applications. The overarching goal is to devise new representations for the activities, in settings where individuals or a number of people may take a part in specific activities. Different types of activities can be performed by either an individual at the fine level or by several people constituting a crowd at the coarse level. We take into...
Show moreIn this dissertation, we address the problem of analyzing the activities of people in a variety of scenarios, this is commonly encountered in vision applications. The overarching goal is to devise new representations for the activities, in settings where individuals or a number of people may take a part in specific activities. Different types of activities can be performed by either an individual at the fine level or by several people constituting a crowd at the coarse level. We take into account the domain specific information for modeling these activities. The summary of the proposed solutions is presented in the following.The holistic description of videos is appealing for visual detection and classification tasks for several reasons including capturing the spatial relations between the scene components, simplicity, and performance [1, 2, 3]. First, we present a holistic (global) frequency spectrum based descriptor for representing the atomic actions performed by individuals such as: bench pressing, diving, hand waving, boxing, playing guitar, mixing, jumping, horse riding, hula hooping etc. We model and learn these individual actions for classifying complex user uploaded videos. Our method bypasses the detection of interest points, the extraction of local video descriptors and the quantization of local descriptors into a code book; it represents each video sequence as a single feature vector. This holistic feature vector is computed by applying a bank of 3-D spatio-temporal filters on the frequency spectrum of a video sequence; hence it integrates the information about the motion and scene structure. We tested our approach on two of the most challenging datasets, UCF50 [4] and HMDB51 [5], and obtained promising results which demonstrates the robustness and the discriminative power of our holistic video descriptor for classifying videos of various realistic actions.In the above approach, a holistic feature vector of a video clip is acquired by dividing the video into spatio-temporal blocks then concatenating the features of the individual blocks together. However, such a holistic representation blindly incorporates all the video regions regardless of their contribution in classification. Next, we present an approach which improves the performance of the holistic descriptors for activity recognition. In our novel method, we improve the holistic descriptors by discovering the discriminative video blocks. We measure the discriminativity of a block by examining its response to a pre-learned support vector machine model. In particular, a block is considered discriminative if it responds positively for positive training samples, and negatively for negative training samples. We pose the problem of finding the optimal blocks as a problem of selecting a sparse set of blocks, which maximizes the total classifier discriminativity. Through a detailed set of experiments on benchmark datasets [6, 7, 8, 9, 5, 10], we show that our method discovers the useful regions in the videos and eliminates the ones which are confusing for classification, which results in significant performance improvement over the state-of-the-art.In contrast to the scenes where an individual performs a primitive action, there may be scenes with several people, where crowd behaviors may take place. For these types of scenes the traditional approaches for recognition will not work due to severe occlusion and computational requirements. The number of videos is limited and the scenes are complicated, hence learning these behaviors is not feasible. For this problem, we present a novel approach, based on the optical flow in a video sequence, for identifying five specific and common crowd behaviors in visual scenes. In the algorithm, the scene is overlaid by a grid of particles, initializing a dynamical system which is derived from the optical flow. Numerical integration of the optical flow provides particle trajectories that represent the motion in the scene. Linearization of the dynamical system allows a simple and practical analysis and classification of the behavior through the Jacobian matrix. Essentially, the eigenvalues of this matrix are used to determine the dynamic stability of points in the flow and each type of stability corresponds to one of the five crowd behaviors. The identified crowd behaviors are (1) bottlenecks: where many pedestrians/vehicles from various points in the scene are entering through one narrow passage, (2) fountainheads: where many pedestrians/vehicles are emerging from a narrow passage only to separate in many directions, (3) lanes: where many pedestrians/vehicles are moving at the same speeds in the same direction, (4) arches or rings: where the collective motion is curved or circular, and (5) blocking: where there is a opposing motion and desired movement of groups of pedestrians is somehow prohibited. The implementation requires identifying a region of interest in the scene, and checking the eigenvalues of the Jacobian matrix in that region to determine the type of flow, that corresponds to various well-defined crowd behaviors. The eigenvalues are only considered in these regions of interest, consistent with the linear approximation and the implied behaviors. Since changes in eigenvalues can mean changes in stability, corresponding to changes in behavior, we can repeat the algorithm over clips of long video sequences to locate changes in behavior. This method was tested on over real videos representing crowd and traffic scenes.
Show less - Date Issued
- 2013
- Identifier
- CFE0004941, ucf:49638
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004941
- Title
- Human Action Localization and Recognition in Unconstrained Videos.
- Creator
-
Boyraz, Hakan, Tappen, Marshall, Foroosh, Hassan, Lin, Mingjie, Zhang, Shaojie, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
As imaging systems become ubiquitous, the ability to recognize human actions is becoming increasingly important. Just as in the object detection and recognition literature, action recognition can be roughly divided into classification tasks, where the goal is to classify a video according to the action depicted in the video, and detection tasks, where the goal is to detect and localize a human performing a particular action. A growing literature is demonstrating the benefits of localizing...
Show moreAs imaging systems become ubiquitous, the ability to recognize human actions is becoming increasingly important. Just as in the object detection and recognition literature, action recognition can be roughly divided into classification tasks, where the goal is to classify a video according to the action depicted in the video, and detection tasks, where the goal is to detect and localize a human performing a particular action. A growing literature is demonstrating the benefits of localizing discriminative sub-regions of images and videos when performing recognition tasks. In this thesis, we address the action detection and recognition problems. Action detection in video is a particularly difficult problem because actions must not only be recognized correctly, but must also be localized in the 3D spatio-temporal volume. We introduce a technique that transforms the 3D localization problem into a series of 2D detection tasks. This is accomplished by dividing the video into overlapping segments, then representing each segment with a 2D video projection. The advantage of the 2D projection is that it makes it convenient to apply the best techniques from object detection to the action detection problem. We also introduce a novel, straightforward method for searching the 2D projections to localize actions, termed Two-Point Subwindow Search (TPSS). Finally, we show how to connect the local detections in time using a chaining algorithm to identify the entire extent of the action. Our experiments show that video projection outperforms the latest results on action detection in a direct comparison.Second, we present a probabilistic model learning to identify discriminative regions in videos from weakly-supervised data where each video clip is only assigned a label describing what action is present in the frame or clip. While our first system requires every action to be manually outlined in every frame of the video, this second system only requires that the video be given a single high-level tag. From this data, the system is able to identify discriminative regions that correspond well to the regions containing the actual actions. Our experiments on both the MSR Action Dataset II and UCF Sports Dataset show that the localizations produced by this weakly supervised system are comparable in quality to localizations produced by systems that require each frame to be manually annotated. This system is able to detect actions in both 1) non-temporally segmented action videos and 2) recognition tasks where a single label is assigned to the clip. We also demonstrate the action recognition performance of our method on two complex datasets, i.e. HMDB and UCF101. Third, we extend our weakly-supervised framework by replacing the recognition stage with a two-stage neural network and apply dropout for preventing overfitting of the parameters on the training data. Dropout technique has been recently introduced to prevent overfitting of the parameters in deep neural networks and it has been applied successfully to object recognition problem. To our knowledge, this is the first system using dropout for action recognition problem. We demonstrate that using dropout improves the action recognition accuracies on HMDB and UCF101 datasets.
Show less - Date Issued
- 2013
- Identifier
- CFE0004977, ucf:49562
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004977
- Title
- GEOMETRIC INVARIANCE IN THE ANALYSIS OF HUMAN MOTION IN VIDEO DATA.
- Creator
-
Shen, Yuping, Foroosh, Hassan, University of Central Florida
- Abstract / Description
-
Human motion analysis is one of the major problems in computer vision research. It deals with the study of the motion of human body in video data from different aspects, ranging from the tracking of body parts and reconstruction of 3D human body configuration, to higher level of interpretation of human action and activities in image sequences. When human motion is observed through video camera, it is perspectively distorted and may appear totally different from different viewpoints. Therefore...
Show moreHuman motion analysis is one of the major problems in computer vision research. It deals with the study of the motion of human body in video data from different aspects, ranging from the tracking of body parts and reconstruction of 3D human body configuration, to higher level of interpretation of human action and activities in image sequences. When human motion is observed through video camera, it is perspectively distorted and may appear totally different from different viewpoints. Therefore it is highly challenging to establish correct relationships between human motions across video sequences with different camera settings. In this work, we investigate the geometric invariance in the motion of human body, which is critical to accurately understand human motion in video data regardless of variations in camera parameters and viewpoints. In human action analysis, the representation of human action is a very important issue, and it usually determines the nature of the solutions, including their limits in resolving the problem. Unlike existing research that study human motion as a whole 2D/3D object or a sequence of postures, we study human motion as a sequence of body pose transitions. We also decompose a human body pose further into a number of body point triplets, and break down a pose transition into the transition of a set of body point triplets. In this way the study of complex non-rigid motion of human body is reduced to that of the motion of rigid body point triplets, i.e. a collection of planes in motion. As a result, projective geometry and linear algebra can be applied to explore the geometric invariance in human motion. Based on this formulation, we have discovered the fundamental ratio invariant and the eigenvalue equality invariant in human motion. We also propose solutions based on these geometric invariants to the problems of view-invariant recognition of human postures and actions, as well as analysis of human motion styles. These invariants and their applicability have been validated by experimental results supporting that their effectiveness in understanding human motion with various camera parameters and viewpoints.
Show less - Date Issued
- 2009
- Identifier
- CFE0002945, ucf:47970
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002945
- Title
- Spatial and Temporal Modeling for Human Activity Recognition from Multimodal Sequential Data.
- Creator
-
Ye, Jun, Hua, Kien, Foroosh, Hassan, Zou, Changchun, Karwowski, Waldemar, University of Central Florida
- Abstract / Description
-
Human Activity Recognition (HAR) has been an intense research area for more than a decade. Different sensors, ranging from 2D and 3D cameras to accelerometers, gyroscopes, and magnetometers, have been employed to generate multimodal signals to detect various human activities. With the advancement of sensing technology and the popularity of mobile devices, depth cameras and wearable devices, such as Microsoft Kinect and smart wristbands, open a unprecedented opportunity to solve the...
Show moreHuman Activity Recognition (HAR) has been an intense research area for more than a decade. Different sensors, ranging from 2D and 3D cameras to accelerometers, gyroscopes, and magnetometers, have been employed to generate multimodal signals to detect various human activities. With the advancement of sensing technology and the popularity of mobile devices, depth cameras and wearable devices, such as Microsoft Kinect and smart wristbands, open a unprecedented opportunity to solve the challenging HAR problem by learning expressive representations from the multimodal signals recording huge amounts of daily activities which comprise a rich set of categories.Although competitive performance has been reported, existing methods focus on the statistical or spatial representation of the human activity sequence;while the internal temporal dynamics of the human activity sequence arenot sufficiently exploited. As a result, they often face the challenge of recognizing visually similar activities composed of dynamic patterns in different temporal order. In addition, many model-driven methods based on sophisticated features and carefully-designed classifiers are computationally demanding and unable to scale to a large dataset. In this dissertation, we propose to address these challenges from three different perspectives; namely, 3D spatial relationship modeling, dynamic temporal quantization, and temporal order encoding.We propose a novel octree-based algorithm for computing the 3D spatial relationships between objects from a 3D point cloud captured by a Kinect sensor. A set of 26 3D spatial directions are defined to describe the spatial relationship of an object with respect to a reference object. These 3D directions are implemented as a set of spatial operators, such as "AboveSouthEast" and "BelowNorthWest," of an event query language to query human activities in an indoor environment; for example, "A person walks in the hallway from north to south." The performance is quantitatively evaluated in a public RGBD object dataset and qualitatively investigated in a live video computing platform.In order to address the challenge of temporal modeling in human action recognition, we introduce the dynamic temporal quantization, a clustering-like algorithm to quantize human action sequences of varied lengths into fixed-size quantized vectors. A two-step optimization algorithm is proposed to jointly optimize the quantization of the original sequence. In the aggregation step, frames falling into the sample segment are aggregated by max-polling and produce the quantized representation of the segment. During the assignment step, frame-segment assignment is updated according to dynamic time warping, while the temporal order of the entire sequence is preserved. The proposed technique is evaluated on three public 3D human action datasets and achieves state-of-the-art performance.Finally, we propose a novel temporal order encoding approach that models the temporal dynamics of the sequential data for human activity recognition. The algorithm encodes the temporal order of the latent patterns extracted by the subspace projection and generates a highly compact First-Take-All (FTA) feature vector representing the entire sequential data. An optimization algorithm is further introduced to learn the optimized projections in order to increase the discriminative power of the FTA feature. The compactness of the FTA feature makes it extremely efficient for human activity recognition with nearest neighbor search based on Hamming distance. Experimental results on two public human activity datasets demonstrate the advantages of the FTA feature over state-of-the-art methods in both accuracy and efficiency.
Show less - Date Issued
- 2016
- Identifier
- CFE0006516, ucf:51367
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006516
- Title
- OBJECT TRACKING AND ACTIVITY RECOGNITION IN VIDEO ACQUIRED USING MOBILE CAMERAS.
- Creator
-
Yilmaz, Alper, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Due to increasing demand on deployable surveillance systems in recent years, object tracking and activity recognition are receiving considerable attention in the research community. This thesis contributes to both the tracking and the activity recognition components of a surveillance system. In particular, for the tracking component, we propose two different approaches for tracking objects in video acquired by mobile cameras, each of which uses a different object shape representation. The...
Show moreDue to increasing demand on deployable surveillance systems in recent years, object tracking and activity recognition are receiving considerable attention in the research community. This thesis contributes to both the tracking and the activity recognition components of a surveillance system. In particular, for the tracking component, we propose two different approaches for tracking objects in video acquired by mobile cameras, each of which uses a different object shape representation. The first approach tracks the centroids of the objects in Forward Looking Infrared Imagery (FLIR) and is suitable for tracking objects that appear small in airborne video. The second approach tracks the complete contours of the objects, and is suitable for higher level vision problems, such as activity recognition, identification and classification. Using the contours tracked by the contour tracker, we propose a novel representation, called the action sketch, for recognizing human activities.Object Tracking in Airborne Imagery: Images obtained from an airborne vehicle generally appear small and can be represented by geometric shapes such as circle or rectangle. After detecting the object position in the first frame, the proposed object tracker models the intensity and the local standard deviation of the object region defined by the shape model. It then tracks the objects by computing the mean-shift vector that minimizes the distance between the kernel distribution for the hypothesized object and its prior. In cases when the ego-motion of the sensor causes the object to move more than the operational limits of the tracking module, a multi-resolution global motion compensation using the Gabor responses of consecutive frames is performed. The experiments performed on the AMCOM FLIR data set show the robustness of the proposed method, which combines automatic model update and global motion compensation into one framework.Contour Tracker: Contour tracking is performed by evolving an initial contour toward the correct object boundaries based on discriminant analysis, which is formulated as a variational calculus problem. Once the contour is initialized, the method generates an online shape model for the object along with the color and the texture priors for both the object and the background regions. A priori texture and color PDFs of the regions are then fused based on the discrimination properties of the features between the object and the background models. The models are then used to compute the posteriori contour likelihood and the evolution is obtained by the Maximum a Posteriori Estimation process, which updates the contour in the gradient ascent direction of the proposed energy functional. During occlusion, the online shape model is used to complete the missing object region. The proposed energy functional unifies commonly used boundary and region based contour approaches into a single framework through a support region defined around the hypothesized object contour. We tested the robustness of the proposed contour tracker using several real sequences and have verified qualitatively that the contours of the objects are perfectly tracked.Behavior Analysis: We propose a novel approach to represent human actions by modeling the dynamics (motion) and the structure (shape) of the objects in video. Both the motion and the shape are modeled using a compact representation, which is called the ``action sketch''. An action sketch is a view invariant representation obtained by analyzing important changes that occur during the motion of the objects. When an actor performs an action in 3D, the points on the actor generate space-time trajectories in four dimensions $(x,y,z,t)$. Projection of the world to the imaging coordinates converts the space-time trajectories into the spatio-temporal trajectories in three dimensions $(x,y,t)$. A set of spatio-temporal trajectories constitute a 3D volume, which we call an ``action volume''. This volum
Show less - Date Issued
- 2004
- Identifier
- CFE0000101, ucf:52858
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000101
- Title
- A Study of Localization and Latency Reduction for Action Recognition.
- Creator
-
Masood, Syed, Tappen, Marshall, Foroosh, Hassan, Stanley, Kenneth, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
The success of recognizing periodic actions in single-person-simple-background datasets, such as Weizmann and KTH, has created a need for more complex datasets to push the performance of action recognition systems. In this work, we create a new synthetic action dataset and use it to highlight weaknesses in current recognition systems. Experiments show that introducing background complexity to action video sequences causes a significant degradation in recognition performance. Moreover, this...
Show moreThe success of recognizing periodic actions in single-person-simple-background datasets, such as Weizmann and KTH, has created a need for more complex datasets to push the performance of action recognition systems. In this work, we create a new synthetic action dataset and use it to highlight weaknesses in current recognition systems. Experiments show that introducing background complexity to action video sequences causes a significant degradation in recognition performance. Moreover, this degradation cannot be fixed by fine-tuning system parameters or by selecting better feature points. Instead, we show that the problem lies in the spatio-temporal cuboid volume extracted from the interest point locations. Having identified the problem, we show how improved results can be achieved by simple modifications to the cuboids.For the above method however, one requires near-perfect localization of the action within a video sequence. To achieve this objective, we present a two stage weakly supervised probabilistic model for simultaneous localization and recognition of actions in videos. Different from previous approaches, our method is novel in that it (1) eliminates the need for manual annotations for the training procedure and (2) does not require any human detection or tracking in the classification stage. The first stage of our framework is a probabilistic action localization model which extracts the most promising sub-windows in a video sequence where an action can take place. We use a non-linear classifier in the second stage of our framework for the final classification task. We show the effectiveness of our proposed model on two well known real-world datasets: UCF Sports and UCF11 datasets.Another application of the weakly supervised probablistic model proposed above is in the gaming environment. An important aspect in designing interactive, action-based interfaces is reliably recognizing actions with minimal latency. High latency causes the system's feedback to lag behind and thus significantly degrade the interactivity of the user experience. With slight modification to the weakly supervised probablistic model we proposed for action localization, we show how it can be used for reducing latency when recognizing actions in Human Computer Interaction (HCI) environments. This latency-aware learning formulation trains a logistic regression-based classifier that automatically determines distinctive canonical poses from the data and uses these to robustly recognize actions in the presence of ambiguous poses. We introduce a novel (publicly released) dataset for the purpose of our experiments. Comparisons of our method against both a Bag of Words and a Conditional Random Field (CRF) classifier show improved recognition performance for both pre-segmented and online classification tasks.
Show less - Date Issued
- 2012
- Identifier
- CFE0004575, ucf:49210
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004575
- Title
- FEATURE PRUNING FOR ACTION RECOGNITION IN COMPLEX ENVIRONMENT.
- Creator
-
Nagaraja, Adarsh, Tappen, Marshall, University of Central Florida
- Abstract / Description
-
A significant number of action recognition research efforts use spatio-temporal interest point detectors for feature extraction. Although the extracted features provide useful information for recognizing actions, a significant number of them contain irrelevant motion and background clutter. In many cases, the extracted features are included as is in the classification pipeline, and sophisticated noise removal techniques are subsequently used to alleviate their effect on classification. We...
Show moreA significant number of action recognition research efforts use spatio-temporal interest point detectors for feature extraction. Although the extracted features provide useful information for recognizing actions, a significant number of them contain irrelevant motion and background clutter. In many cases, the extracted features are included as is in the classification pipeline, and sophisticated noise removal techniques are subsequently used to alleviate their effect on classification. We introduce a new action database, created from the Weizmann database, that reveals a significant weakness in systems based on popular cuboid descriptors. Experiments show that introducing complex backgrounds, stationary or dynamic, into the video causes a significant degradation in recognition performance. Moreover, this degradation cannot be fixed by fine-tuning the system or selecting better interest points. Instead, we show that the problem lies at the descriptor level and must be addressed by modifying descriptors.
Show less - Date Issued
- 2011
- Identifier
- CFE0003882, ucf:48721
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003882
- Title
- MODELING SCENES AND HUMAN ACTIVITIES IN VIDEOS.
- Creator
-
Basharat, Arslan, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
In this dissertation, we address the problem of understanding human activities in videos by developing a two-pronged approach: coarse level modeling of scene activities and fine level modeling of individual activities. At the coarse level, where the resolution of the video is low, we rely on person tracks. At the fine level, richer features are available to identify different parts of the human body, therefore we rely on the body joint tracks. There are three main goals of this dissertation: ...
Show moreIn this dissertation, we address the problem of understanding human activities in videos by developing a two-pronged approach: coarse level modeling of scene activities and fine level modeling of individual activities. At the coarse level, where the resolution of the video is low, we rely on person tracks. At the fine level, richer features are available to identify different parts of the human body, therefore we rely on the body joint tracks. There are three main goals of this dissertation: (1) identify unusual activities at the coarse level, (2) recognize different activities at the fine level, and (3) predict the behavior for synthesizing and tracking activities at the fine level. The first goal is addressed by modeling activities at the coarse level through two novel and complementing approaches. The first approach learns the behavior of individuals by capturing the patterns of motion and size of objects in a compact model. Probability density function (pdf) at each pixel is modeled as a multivariate Gaussian Mixture Model (GMM), which is learnt using unsupervised expectation maximization (EM). In contrast, the second approach learns the interaction of object pairs concurrently present in the scene. This can be useful in detecting more complex activities than those modeled by the first approach. We use a 14-dimensional Kernel Density Estimation (KDE) that captures motion and size of concurrently tracked objects. The proposed models have been successfully used to automatically detect activities like unusual person drop-off and pickup, jaywalking, etc. The second and third goals of modeling human activities at the fine level are addressed by employing concepts from theory of chaos and non-linear dynamical systems. We show that the proposed model is useful for recognition and prediction of the underlying dynamics of human activities. We treat the trajectories of human body joints as the observed time series generated from an underlying dynamical system. The observed data is used to reconstruct a phase (or state) space of appropriate dimension by employing the delay-embedding technique. This transformation is performed without assuming an exact model of the underlying dynamics and provides a characteristic representation that will prove to be vital for recognition and prediction tasks. For recognition, properties of phase space are captured in terms of dynamical and metric invariants, which include the Lyapunov exponent, correlation integral, and correlation dimension. A composite feature vector containing these invariants represents the action and will be used for classification. For prediction, kernel regression is used in the phase space to compute predictions with a specified initial condition. This approach has the advantage of modeling dynamics without making any assumptions about the exact form (polynomial, radial basis, etc.) of the mapping function. We demonstrate the utility of these predictions for human activity synthesis and tracking.
Show less - Date Issued
- 2009
- Identifier
- CFE0002897, ucf:48042
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002897
- Title
- Detecting, Tracking, and Recognizing Activities in Aerial Video.
- Creator
-
Reilly, Vladimir, Shah, Mubarak, Georgiopoulos, Michael, Stanley, Kenneth, Dogariu, Aristide, University of Central Florida
- Abstract / Description
-
In this dissertation we address the problem of detecting humans and vehicles, tracking their identities in crowded scenes, and finally determining human activities. First, we tackle the problem of detecting moving as well as stationary objects in scenes that contain parallax and shadows. We constrain the search of pedestrians and vehicles by representing them as shadow casting out of plane or (SCOOP) objects.Next, we propose a novel method for tracking a large number of densely moving objects...
Show moreIn this dissertation we address the problem of detecting humans and vehicles, tracking their identities in crowded scenes, and finally determining human activities. First, we tackle the problem of detecting moving as well as stationary objects in scenes that contain parallax and shadows. We constrain the search of pedestrians and vehicles by representing them as shadow casting out of plane or (SCOOP) objects.Next, we propose a novel method for tracking a large number of densely moving objects in aerial video. We divide the scene into grid cells to define a set of local scene constraints which we use as part of the matching cost function to solve the tracking problem which allows us to track fast-moving objects in low frame rate videos.Finally, we propose a method for recognizing human actions from few examples. We use the bag of words action representation, assume that most of the classes have many examples, and construct Support Vector Machine models for each class. We then use Support Vector Machines for classes with many examples to improve the decision function of the Support Vector Machine that was trained using few examples via late fusion of weighted decision values.
Show less - Date Issued
- 2012
- Identifier
- CFE0004627, ucf:49935
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004627
- Title
- PATTERNS OF MOTION: DISCOVERY AND GENERALIZED REPRESENTATION.
- Creator
-
Saleemi, Imran, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
In this dissertation, we address the problem of discovery and representation of motion patterns in a variety of scenarios, commonly encountered in vision applications. The overarching goal is to devise a generic representation, that captures any kind of object motion observable in video sequences. Such motion is a significant source of information typically employed for diverse applications such as tracking, anomaly detection, and action and event recognition. We present statistical...
Show moreIn this dissertation, we address the problem of discovery and representation of motion patterns in a variety of scenarios, commonly encountered in vision applications. The overarching goal is to devise a generic representation, that captures any kind of object motion observable in video sequences. Such motion is a significant source of information typically employed for diverse applications such as tracking, anomaly detection, and action and event recognition. We present statistical frameworks for representation of motion characteristics of objects, learned from tracks or optical flow, for static as well as moving cameras, and propose algorithms for their application to a variety of problems. The proposed motion pattern models and learning methods are general enough to be employed in a variety of problems as we demonstrate experimentally. We first propose a novel method to model and learn the scene activity, observed by a static camera. The motion patterns of objects in the scene are modeled in the form of a multivariate non-parametric probability density function of spatiotemporal variables (object locations and transition times between them). Kernel Density Estimation (KDE) is used to learn this model in a completely unsupervised fashion. Learning is accomplished by observing the trajectories of objects by a static camera over extended periods of time. The model encodes the probabilistic nature of the behavior of moving objects in the scene and is useful for activity analysis applications, such as persistent tracking and anomalous motion detection. In addition, the model also captures salient scene features, such as, the areas of occlusion and most likely paths. Once the model is learned, we use a unified Markov Chain Monte-Carlo (MCMC) based framework for generating the most likely paths in the scene, improving foreground detection, persistent labelling of objects during tracking and deciding whether a given trajectory represents an anomaly to the observed motion patterns. Experiments with real world videos are reported which validate the proposed approach. The representation and estimation framework proposed above, however, has a few limitations. This algorithm proposes to use a single global statistical distribution to represent all kinds of motion observed in a particular scene. It therefore, does not find a separation between multiple semantically distinct motion patterns in the scene. Instead, the learned model is a joint distribution over all possible patterns followed by objects. To overcome this limitation, we then propose a superior method for the discovery and statistical representation of motion patterns in a scene. The advantages of this approach over the first one are two-fold: first, this model is applicable to scenes of dense crowded motion where tracking may not be feasible, and second, it distinguishes between motion patterns that are distinct at a semantic level of abstraction. We propose a mixture model representation of salient patterns of optical flow, and present an algorithm for learning these patterns from dense optical flow in a hierarchical, unsupervised fashion. Using low level cues of noisy optical flow, K-means is employed to initialize a Gaussian mixture model for temporally segmented clips of video. The components of this mixture are then filtered and instances of motion patterns are computed using a simple motion model, by linking components across space and time. Motion patterns are then initialized and membership of instances in different motion patterns is established by using KL divergence between mixture distributions of pattern instances. Finally, a pixel level representation of motion patterns is proposed by deriving conditional expectation of optical flow. Results of extensive experiments are presented for multiple surveillance sequences containing numerous patterns involving both pedestrian and vehicular traffic. The proposed method exploits optical flow as the low level feature and performs a hierarchical clustering to obtain motion patterns; and we observe that the use of optical flow is also an integral part of a variety of other vision applications, for example, as features based representation of human actions. We, therefore, propose a new representation for articulated human actions using the motion patterns. The representation is based on hierarchical clustering of observed optical flow in four dimensional, spatial and motion flow space. The automatically discovered motion patterns, are the primitive actions, representative of flow at salient regions on the human body, much like trajectories of body joints, which are notoriously difficult to obtain automatically. The proposed method works in a completely unsupervised fashion, and in sharp contrast to state of the art representations like bag of video words, provides a truly semantically meaningful representation. Each primitive action depicts the most atomic sub-action, like left arm moving upwards, or right leg moving downward and leftward, and is represented by a mixture of four dimensional Gaussian distributions. A sequence of primitive actions are discovered in the test video, and labelled by computing the KL divergence between mixtures. The entire video sequence containing the human action, is thus reduced to a simple string, which is matched against similar strings of training videos to classify the action. The string matching is performed by global alignment, using the well-known Needleman-Wunsch algorithm. Experiments reported on multiple human actions data sets, confirm the validity, simplicity, and semantically meaningful nature of the proposed representation. Results obtained are encouraging and comparable to the state of the art.
Show less - Date Issued
- 2011
- Identifier
- CFE0003646, ucf:48836
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003646