Current Search: human action recognition
- Title
- MULTI-VIEW GEOMETRIC CONSTRAINTS FOR HUMAN ACTION RECOGNITION AND TRACKING.
- Creator
-
Gritai, Alexei, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Human actions are the essence of a human life and a natural product of the human mind. Analysis of human activities by a machine has attracted the attention of many researchers. This analysis is important in a variety of domains, including surveillance, video retrieval, human-computer interaction, and athlete performance investigation. This dissertation makes three major contributions to the automatic analysis of human actions. First, we conjecture that the relationship between the body joints of two actors in the same posture can be described by a 3D rigid transformation. This transformation simultaneously captures different poses and various sizes and proportions. As a consequence of this conjecture, we show that there exists a fundamental matrix between the imaged positions of the body joints of two actors if they are in the same posture. Second, we propose a novel projection model for cameras moving at a constant velocity in 3D space, termed Galilean cameras, derive the Galilean fundamental matrix, and apply it to human action recognition. Third, we propose a novel use of the invariance of the ratio of areas under an affine transformation, together with the epipolar geometry between two cameras, for 2D model-based tracking of human body joints.

In the first part of the thesis, we propose an approach to match human actions using semantic correspondences between human bodies. These correspondences are used to provide geometric constraints between multiple anatomical landmarks (e.g., hands, shoulders, and feet) to match actions observed from different viewpoints and performed at different rates by actors of differing anthropometric proportions. The fact that the human body has approximate anthropometric proportions allows for an innovative use of the machinery of epipolar geometry to provide constraints for analyzing actions performed by people of different anthropometric sizes, while ensuring that changes in viewpoint do not affect matching. A novel measure, in terms of the rank of a matrix constructed only from image measurements of the locations of anatomical landmarks, is proposed to ensure that similar actions are accurately recognized. Finally, we describe how dynamic time warping can be used in conjunction with the proposed measure to match actions in the presence of nonlinear time warps. We demonstrate the versatility of our algorithm on a number of challenging sequences and applications, including action synchronization, finding the odd one out, following the leader, and analyzing periodicity.

Next, we extend the conventional model of image projection to video captured by a camera moving at constant velocity. We term such a moving camera a Galilean camera. To that end, we derive the space-time projection and develop the corresponding epipolar geometry between two Galilean cameras. Both perspective imaging and linear pushbroom imaging are specializations of the proposed model, and we show how six different "fundamental" matrices, including the classic fundamental matrix, the Linear Pushbroom (LP) fundamental matrix, and a fundamental matrix relating Epipolar Plane Images (EPIs), are related and can be directly recovered from a Galilean fundamental matrix. We provide linear algorithms for estimating the parameters of the mapping between videos in the case of planar scenes. To apply the fundamental matrix between Galilean cameras to human action recognition, we propose a measure that has two important properties. The first property makes it possible to recognize similar actions if their execution rates are linearly related. The second property allows recognizing actions in video captured by Galilean cameras. Thus, the proposed algorithm guarantees that actions can be correctly matched despite changes in view, execution rate, and anthropometric proportions of the actor, even if the camera moves with constant velocity.

Finally, we propose a novel 2D model-based approach for tracking human body parts during articulated motion. The human body is modeled as a 2D stick figure of thirteen body joints, and an action is considered a sequence of these stick figures. Given the locations of these joints in every frame of a model video and in the first frame of a test video, the joint locations are automatically estimated throughout the test video using two geometric constraints. First, the invariance of the ratio of areas under an affine transformation is used for the initial estimation of the joint locations in the test video. Second, the epipolar geometry between the two cameras is used to refine these estimates. Using these estimated joint locations, the tracking algorithm determines the exact location of each landmark in the test video using the foreground silhouettes. The novelty of the proposed approach lies in the geometric formulation of human action models, the combination of two geometric constraints for body-joint prediction, and the handling of deviations in the anthropometry of individuals, viewpoints, execution rates, and styles of performing an action. The proposed approach does not require extensive training and can easily adapt to a wide variety of articulated actions.
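To make the first contribution concrete: if two actors are in the same posture, their corresponding imaged joints should satisfy an epipolar constraint, which can be tested through the rank (in practice, the smallest singular value) of a matrix built purely from image measurements. The sketch below is one plausible way to compute such a rank-based consistency score; the eight-point-style construction and the singular-value ratio are illustrative assumptions rather than the dissertation's exact measure.

```python
import numpy as np

def posture_consistency(joints_a, joints_b):
    """Rank-style epipolar consistency between two actors' imaged joints.

    joints_a, joints_b : (N, 2) arrays of corresponding 2D joint positions
    (e.g. N = 13 anatomical landmarks) seen in two different views.

    If the actors are in the same posture (up to the conjectured 3D rigid
    transformation), the stacked epipolar design matrix is rank deficient,
    so the ratio of its smallest to largest singular value is small.
    """
    a = np.asarray(joints_a, dtype=float)
    b = np.asarray(joints_b, dtype=float)
    assert a.shape == b.shape and a.shape[1] == 2 and a.shape[0] >= 8

    # One row per correspondence: the classic eight-point constraint
    # x_b^T F x_a = 0 written as A f = 0, with f the stacked entries of F.
    xa, ya = a[:, 0], a[:, 1]
    xb, yb = b[:, 0], b[:, 1]
    A = np.stack([xb * xa, xb * ya, xb,
                  yb * xa, yb * ya, yb,
                  xa,      ya,      np.ones(len(a))], axis=1)

    s = np.linalg.svd(A, compute_uv=False)
    return s[-1] / s[0]          # lower = more consistent with one F
```

A lower score indicates that the two joint sets are consistent with a single fundamental matrix, i.e., the same posture seen from two views; dynamic time warping can then align entire sequences of such per-frame scores.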
- Date Issued
- 2007
- Identifier
- CFE0001692, ucf:47199
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001692
- Title
- Weakly Labeled Action Recognition and Detection.
- Creator
-
Sultani, Waqas, Shah, Mubarak, Bagci, Ulas, Qi, GuoJun, Yun, Hae-Bum, University of Central Florida
- Abstract / Description
-
Research in human action recognition strives to develop increasingly generalized methods that are robust to intra-class variability and inter-class ambiguity. Recent years have seen tremendous strides in improving recognition accuracy on ever larger and more complex benchmark datasets comprising realistic "in the wild" videos. Unfortunately, the all-encompassing, dense, global representations that bring about such improvements often benefit from inherent characteristics, specific to datasets and classes, that do not necessarily reflect knowledge about the entity to be recognized. This results in specific models that perform well within datasets but generalize poorly. Furthermore, training supervised action recognition and detection methods requires many precise spatio-temporal manual annotations to achieve good recognition and detection accuracy. For instance, current deep learning architectures require millions of accurately annotated videos to learn robust action classifiers. However, these annotations are quite difficult to obtain.

In the first part of this dissertation, we explore the reasons for poor classifier performance when tested on novel datasets, and quantify the effect of scene backgrounds on action representations and recognition. We attempt to address the problem of recognizing human actions while training and testing on distinct datasets, when test videos are neither labeled nor available during training. In this scenario, learning a joint vocabulary or applying domain transfer techniques is not applicable. We perform different types of partitioning of the GIST feature space for several datasets and compute measures of background scene complexity, as well as the extent to which scenes are helpful in action classification. We then propose a new process to obtain a measure of confidence in each pixel of the video being a foreground region, using motion, appearance, and saliency together in a 3D Markov Random Field (MRF) based framework. We also propose multiple ways to exploit the foreground confidence: to improve the bag-of-words vocabulary and the histogram representation of a video, and a novel histogram-decomposition-based representation and kernel.

The above-mentioned work provides the probability of each pixel belonging to the actor; however, it does not give the precise spatio-temporal location of the actor. Furthermore, the above framework would require precise spatio-temporal manual annotations to train an action detector. However, manual annotations in videos are laborious, require several annotators, and contain human biases. Therefore, in the second part of this dissertation, we propose a weakly labeled approach to automatically obtain spatio-temporal annotations of actors in action videos. We first obtain a large number of action proposals in each video. To capture a few of the most representative action proposals in each video and avoid processing thousands of them, we rank them using optical flow and saliency in a 3D-MRF based framework and select a few proposals using a MAP-based proposal subset selection method. We demonstrate that this ranking preserves the high-quality action proposals. Several such proposals are generated for each video of the same action. Our next challenge is to iteratively select one proposal from each video so that all proposals are globally consistent. We formulate this as a Generalized Maximum Clique Problem (GMCP) using shape, global, and fine-grained similarity of proposals across the videos. The output of our method is the most action-representative proposal from each video. Using our method, multiple instances of the same action in a video can also be annotated. Moreover, action detection experiments using annotations obtained by our method and several baselines demonstrate the superiority of our approach.

The above-mentioned annotation method uses multiple videos of the same action. Therefore, in the third part of this dissertation, we tackle the problem of spatio-temporal action localization in a video without assuming the availability of multiple videos or any prior annotations. The action is localized by employing images downloaded from the Internet using the action label. Given web images, we first dampen image noise using a random walk and avoid distracting backgrounds within images using image action proposals. Then, given a video, we generate multiple spatio-temporal action proposals. We suppress camera- and background-generated proposals by exploiting optical flow gradients within proposals. To obtain the most action-representative proposals, we propose to reconstruct action proposals in the video by leveraging the action proposals in images. Moreover, we preserve the temporal smoothness of the video and reconstruct all proposal bounding boxes jointly, using constraints that push the coefficients for each bounding box toward a common consensus, thus enforcing coefficient similarity across multiple frames. We solve this optimization problem using a variant of the two-metric projection algorithm. Finally, the video proposal that has the lowest reconstruction cost and is motion salient is used to localize the action. Our method is not only applicable to trimmed videos, but can also be used for action localization in untrimmed videos, which is a very challenging problem.

Finally, we propose a novel approach to generate a few properly ranked action proposals from a large number of noisy proposals. The proposed approach begins by dividing each proposal into sub-proposals. We assume that the quality of a proposal remains the same within each sub-proposal. We then employ a graph optimization method to recombine the sub-proposals of all action proposals in a single video in order to optimally build new action proposals, and rank them by the combined node and edge scores. For an untrimmed video, we first divide the video into shots and then build the above-mentioned graph within each shot. Our method generates a few ranked proposals that can be better than all the existing underlying proposals. Our experimental results validate that properly ranked action proposals can significantly boost action detection results.

Our extensive experimental results on different challenging and realistic action datasets, comparisons with several competitive baselines, and detailed analysis of each step of the proposed methods validate the proposed ideas and frameworks.
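The key step of the second part, selecting one proposal per video so that the chosen set is globally consistent, is formulated in the dissertation as a Generalized Maximum Clique Problem. The snippet below is only a greedy coordinate-ascent stand-in for that optimization, under the assumption that each proposal is summarized by one feature vector and compared with cosine similarity; the actual method combines shape, global, and fine-grained similarities.

```python
import numpy as np

def select_consistent_proposals(features, n_iters=10, seed=0):
    """Greedy GMCP-style selection: pick one proposal per video so the
    chosen set is maximally mutually similar.

    features : list of (k_i, d) arrays, one per video, holding one feature
               vector per action proposal.
    Returns  : list of selected proposal indices, one per video.
    """
    rng = np.random.default_rng(seed)
    # L2-normalize so dot products behave like cosine similarity.
    feats = [f / np.linalg.norm(f, axis=1, keepdims=True) for f in features]
    chosen = [int(rng.integers(len(f))) for f in feats]

    def score(sel):
        vecs = np.stack([feats[v][i] for v, i in enumerate(sel)])
        sim = vecs @ vecs.T
        return (sim.sum() - np.trace(sim)) / 2.0     # sum over distinct pairs

    for _ in range(n_iters):                         # coordinate-ascent sweeps
        for v in range(len(feats)):
            best_i, best_s = chosen[v], -np.inf
            for i in range(len(feats[v])):
                trial = chosen.copy()
                trial[v] = i
                s = score(trial)
                if s > best_s:
                    best_i, best_s = i, s
            chosen[v] = best_i
    return chosen
```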
- Date Issued
- 2017
- Identifier
- CFE0006801, ucf:51809
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006801
- Title
- GEOMETRIC INVARIANCE IN THE ANALYSIS OF HUMAN MOTION IN VIDEO DATA.
- Creator
-
Shen, Yuping, Foroosh, Hassan, University of Central Florida
- Abstract / Description
-
Human motion analysis is one of the major problems in computer vision research. It deals with the study of the motion of the human body in video data from different aspects, ranging from the tracking of body parts and the reconstruction of 3D human body configuration to higher-level interpretation of human actions and activities in image sequences. When human motion is observed through a video camera, it is perspectively distorted and may appear totally different from different viewpoints. It is therefore highly challenging to establish correct relationships between human motions across video sequences with different camera settings. In this work, we investigate geometric invariance in the motion of the human body, which is critical to accurately understanding human motion in video data regardless of variations in camera parameters and viewpoints. In human action analysis, the representation of human action is a very important issue, and it usually determines the nature of the solutions, including their limits in resolving the problem. Unlike existing research that studies human motion as a whole 2D/3D object or a sequence of postures, we study human motion as a sequence of body pose transitions. We further decompose a human body pose into a number of body point triplets, and break down a pose transition into the transitions of a set of body point triplets. In this way, the study of the complex non-rigid motion of the human body is reduced to that of the motion of rigid body point triplets, i.e., a collection of planes in motion. As a result, projective geometry and linear algebra can be applied to explore geometric invariance in human motion. Based on this formulation, we have discovered the fundamental ratio invariant and the eigenvalue equality invariant in human motion. We also propose solutions based on these geometric invariants to the problems of view-invariant recognition of human postures and actions, as well as the analysis of human motion styles. These invariants and their applicability have been validated by experimental results supporting their effectiveness in understanding human motion under various camera parameters and viewpoints.
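The central reduction, from non-rigid full-body motion to the motion of rigid body-point triplets, can be illustrated with a short sketch. Each triplet's frame-to-frame motion is summarized by the affine transform exactly determined by its three correspondences, whose scale-normalized eigenvalues can then be compared across views. This is an assumed simplification in the spirit of the eigenvalue-equality invariant, not the dissertation's exact derivation, which works with plane homographies; degenerate (collinear) triplets would also need to be skipped.

```python
import itertools
import numpy as np

def pose_triplets(joints):
    """Decompose a pose (N, 2) into all body-point triplets, each (3, 2)."""
    joints = np.asarray(joints, dtype=float)
    return [joints[list(idx)]
            for idx in itertools.combinations(range(len(joints)), 3)]

def triplet_affine(p_t, p_t1):
    """Affine map (as a 3x3 homogeneous matrix) taking a triplet at time t to
    the same triplet at time t+1; exactly determined by the 3 correspondences.
    Assumes the triplet is non-degenerate (points not collinear)."""
    P = np.hstack([p_t, np.ones((3, 1))])      # rows are [x, y, 1]
    A = np.linalg.solve(P, p_t1).T             # 2x3 so that A @ [x, y, 1] = p'
    return np.vstack([A, [0.0, 0.0, 1.0]])

def normalized_eigenvalues(M):
    """Eigenvalues scaled to unit determinant, so they can be compared
    across views up to an overall scale factor."""
    w = np.linalg.eigvals(M)
    return np.sort_complex(w / np.prod(w) ** (1.0 / len(w)))
```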
- Date Issued
- 2009
- Identifier
- CFE0002945, ucf:47970
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002945
- Title
- Human Action Localization and Recognition in Unconstrained Videos.
- Creator
-
Boyraz, Hakan, Tappen, Marshall, Foroosh, Hassan, Lin, Mingjie, Zhang, Shaojie, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
As imaging systems become ubiquitous, the ability to recognize human actions is becoming increasingly important. Just as in the object detection and recognition literature, action recognition can be roughly divided into classification tasks, where the goal is to classify a video according to the action depicted in it, and detection tasks, where the goal is to detect and localize a human performing a particular action. A growing literature is demonstrating the benefits of localizing discriminative sub-regions of images and videos when performing recognition tasks. In this thesis, we address the action detection and recognition problems. Action detection in video is a particularly difficult problem because actions must not only be recognized correctly, but must also be localized in the 3D spatio-temporal volume. We introduce a technique that transforms the 3D localization problem into a series of 2D detection tasks. This is accomplished by dividing the video into overlapping segments, then representing each segment with a 2D video projection. The advantage of the 2D projection is that it makes it convenient to apply the best techniques from object detection to the action detection problem. We also introduce a novel, straightforward method for searching the 2D projections to localize actions, termed Two-Point Subwindow Search (TPSS). Finally, we show how to connect the local detections in time using a chaining algorithm to identify the entire extent of the action. Our experiments show that video projection outperforms the latest results on action detection in a direct comparison.

Second, we present a probabilistic model that learns to identify discriminative regions in videos from weakly supervised data, where each video clip is only assigned a label describing what action is present in the frame or clip. While our first system requires every action to be manually outlined in every frame of the video, this second system only requires that the video be given a single high-level tag. From this data, the system is able to identify discriminative regions that correspond well to the regions containing the actual actions. Our experiments on both the MSR Action Dataset II and the UCF Sports Dataset show that the localizations produced by this weakly supervised system are comparable in quality to localizations produced by systems that require each frame to be manually annotated. This system is able to detect actions in both 1) non-temporally segmented action videos and 2) recognition tasks where a single label is assigned to the clip. We also demonstrate the action recognition performance of our method on two complex datasets, HMDB and UCF101.

Third, we extend our weakly supervised framework by replacing the recognition stage with a two-stage neural network and applying dropout to prevent overfitting of the parameters on the training data. The dropout technique was recently introduced to prevent overfitting of the parameters in deep neural networks and has been applied successfully to the object recognition problem. To our knowledge, this is the first system using dropout for the action recognition problem. We demonstrate that using dropout improves action recognition accuracies on the HMDB and UCF101 datasets.
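The segment-to-image idea behind the first system can be sketched as follows: the video is cut into overlapping temporal segments and each segment is collapsed into a single 2D image on which standard 2D detectors can be run. The segment length, stride, and max-over-time reduction below are illustrative assumptions; the abstract does not specify the exact projection used.

```python
import numpy as np

def video_projections(frames, seg_len=15, stride=8, reduce=np.max):
    """Split a video into overlapping temporal segments and collapse each
    segment into a single 2D image, so per-frame 2D detectors can be reused.

    frames : (T, H, W) array, e.g. per-frame motion magnitude or grayscale
    yields : (start, end, projection) with projection of shape (H, W)
    """
    T = len(frames)
    for start in range(0, max(T - seg_len, 0) + 1, stride):
        end = min(start + seg_len, T)
        # Collapse the temporal axis of the segment into one 2D image.
        yield start, end, reduce(frames[start:end], axis=0)
```

Detections found in consecutive projections can then be linked by a chaining step over time to recover the full temporal extent of the action, as described in the abstract.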
- Date Issued
- 2013
- Identifier
- CFE0004977, ucf:49562
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004977
- Title
- MODELING SCENES AND HUMAN ACTIVITIES IN VIDEOS.
- Creator
-
Basharat, Arslan, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
In this dissertation, we address the problem of understanding human activities in videos by developing a two-pronged approach: coarse-level modeling of scene activities and fine-level modeling of individual activities. At the coarse level, where the resolution of the video is low, we rely on person tracks. At the fine level, richer features are available to identify different parts of the human body, so we rely on body joint tracks. There are three main goals of this dissertation: (1) identify unusual activities at the coarse level, (2) recognize different activities at the fine level, and (3) predict behavior for synthesizing and tracking activities at the fine level.

The first goal is addressed by modeling activities at the coarse level through two novel and complementary approaches. The first approach learns the behavior of individuals by capturing the patterns of motion and size of objects in a compact model. The probability density function (pdf) at each pixel is modeled as a multivariate Gaussian Mixture Model (GMM), which is learnt using unsupervised expectation maximization (EM). In contrast, the second approach learns the interaction of object pairs concurrently present in the scene. This can be useful in detecting more complex activities than those modeled by the first approach. We use a 14-dimensional Kernel Density Estimation (KDE) that captures the motion and size of concurrently tracked objects. The proposed models have been successfully used to automatically detect activities such as unusual person drop-off and pickup, jaywalking, etc.

The second and third goals of modeling human activities at the fine level are addressed by employing concepts from the theory of chaos and non-linear dynamical systems. We show that the proposed model is useful for recognition and prediction of the underlying dynamics of human activities. We treat the trajectories of human body joints as the observed time series generated by an underlying dynamical system. The observed data is used to reconstruct a phase (or state) space of appropriate dimension by employing the delay-embedding technique. This transformation is performed without assuming an exact model of the underlying dynamics and provides a characteristic representation that will prove to be vital for recognition and prediction tasks. For recognition, properties of the phase space are captured in terms of dynamical and metric invariants, which include the Lyapunov exponent, the correlation integral, and the correlation dimension. A composite feature vector containing these invariants represents the action and is used for classification. For prediction, kernel regression is used in the phase space to compute predictions from a specified initial condition. This approach has the advantage of modeling dynamics without making any assumptions about the exact form (polynomial, radial basis, etc.) of the mapping function. We demonstrate the utility of these predictions for human activity synthesis and tracking.
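The fine-level representation hinges on delay embedding: a joint-trajectory time series is unfolded into a phase space whose geometry is then summarized by invariants such as the correlation integral and correlation dimension. The sketch below shows the embedding and a naive correlation-integral estimate; the embedding dimension and delay are illustrative assumptions (in practice they are selected from the data).

```python
import numpy as np

def delay_embed(x, dim=3, tau=5):
    """Reconstruct a phase space from a scalar joint-trajectory time series
    via delay embedding: row t is [x(t), x(t+tau), ..., x(t+(dim-1)*tau)]."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau: i * tau + n] for i in range(dim)], axis=1)

def correlation_integral(points, r):
    """Fraction of point pairs in the reconstructed phase space closer than r.
    The slope of log C(r) vs. log r estimates the correlation dimension."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    iu = np.triu_indices(len(points), k=1)
    return float(np.mean(d[iu] < r))
```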
- Date Issued
- 2009
- Identifier
- CFE0002897, ucf:48042
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002897
- Title
- Spatial and Temporal Modeling for Human Activity Recognition from Multimodal Sequential Data.
- Creator
-
Ye, Jun, Hua, Kien, Foroosh, Hassan, Zou, Changchun, Karwowski, Waldemar, University of Central Florida
- Abstract / Description
-
Human Activity Recognition (HAR) has been an intense research area for more than a decade. Different sensors, ranging from 2D and 3D cameras to accelerometers, gyroscopes, and magnetometers, have been employed to generate multimodal signals to detect various human activities. With the advancement of sensing technology and the popularity of mobile devices, depth cameras and wearable devices, such as the Microsoft Kinect and smart wristbands, open an unprecedented opportunity to solve the challenging HAR problem by learning expressive representations from multimodal signals recording huge amounts of daily activities that comprise a rich set of categories.

Although competitive performance has been reported, existing methods focus on the statistical or spatial representation of the human activity sequence, while the internal temporal dynamics of the sequence are not sufficiently exploited. As a result, they often face the challenge of recognizing visually similar activities composed of dynamic patterns in different temporal order. In addition, many model-driven methods based on sophisticated features and carefully designed classifiers are computationally demanding and unable to scale to large datasets. In this dissertation, we propose to address these challenges from three different perspectives: 3D spatial relationship modeling, dynamic temporal quantization, and temporal order encoding.

We propose a novel octree-based algorithm for computing the 3D spatial relationships between objects from a 3D point cloud captured by a Kinect sensor. A set of 26 3D spatial directions is defined to describe the spatial relationship of an object with respect to a reference object. These 3D directions are implemented as a set of spatial operators, such as "AboveSouthEast" and "BelowNorthWest," of an event query language to query human activities in an indoor environment; for example, "A person walks in the hallway from north to south." The performance is quantitatively evaluated on a public RGBD object dataset and qualitatively investigated in a live video computing platform.

To address the challenge of temporal modeling in human action recognition, we introduce dynamic temporal quantization, a clustering-like algorithm that quantizes human action sequences of varied lengths into fixed-size quantized vectors. A two-step optimization algorithm is proposed to jointly optimize the quantization of the original sequence. In the aggregation step, frames falling into the same segment are aggregated by max-pooling to produce the quantized representation of the segment. In the assignment step, the frame-segment assignment is updated according to dynamic time warping, while the temporal order of the entire sequence is preserved. The proposed technique is evaluated on three public 3D human action datasets and achieves state-of-the-art performance.

Finally, we propose a novel temporal order encoding approach that models the temporal dynamics of sequential data for human activity recognition. The algorithm encodes the temporal order of the latent patterns extracted by subspace projection and generates a highly compact First-Take-All (FTA) feature vector representing the entire sequence. An optimization algorithm is further introduced to learn the optimized projections in order to increase the discriminative power of the FTA feature. The compactness of the FTA feature makes it extremely efficient for human activity recognition, with nearest neighbor search based on Hamming distance. Experimental results on two public human activity datasets demonstrate the advantages of the FTA feature over state-of-the-art methods in both accuracy and efficiency.
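A rough sketch of the First-Take-All idea: each frame is projected onto a set of directions, the time at which each projection peaks is recorded, and the code stores, for pairs of projections, which one peaks first, yielding a binary vector comparable with Hamming distance. The random projections and random pairing below are assumptions made for brevity; the dissertation learns the projections with an optimization algorithm.

```python
import numpy as np

def fta_encode(seq, n_proj=64, seed=0):
    """First-Take-All-style temporal-order code (an assumed simplification).

    seq    : (T, d) multimodal feature sequence.
    n_proj : number of projection directions (must be even here, since they
             are paired off at random).
    Returns a uint8 bit vector of length n_proj // 2.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((seq.shape[1], n_proj))      # random projections
    peak_t = np.argmax(seq @ W, axis=0)                  # peak time per projection
    pairs = rng.permutation(n_proj).reshape(-1, 2)       # disjoint random pairs
    return (peak_t[pairs[:, 0]] < peak_t[pairs[:, 1]]).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two FTA codes, for nearest neighbor search."""
    return int(np.count_nonzero(a != b))
```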
- Date Issued
- 2016
- Identifier
- CFE0006516, ucf:51367
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006516
- Title
- Detecting, Tracking, and Recognizing Activities in Aerial Video.
- Creator
-
Reilly, Vladimir, Shah, Mubarak, Georgiopoulos, Michael, Stanley, Kenneth, Dogariu, Aristide, University of Central Florida
- Abstract / Description
-
In this dissertation we address the problem of detecting humans and vehicles, tracking their identities in crowded scenes, and finally determining human activities. First, we tackle the problem of detecting moving as well as stationary objects in scenes that contain parallax and shadows. We constrain the search for pedestrians and vehicles by representing them as shadow casting out of plane (SCOOP) objects. Next, we propose a novel method for tracking a large number of densely moving objects in aerial video. We divide the scene into grid cells to define a set of local scene constraints, which we use as part of the matching cost function to solve the tracking problem; this allows us to track fast-moving objects in low-frame-rate videos. Finally, we propose a method for recognizing human actions from few examples. We use the bag-of-words action representation, assume that most of the classes have many examples, and construct Support Vector Machine models for each class. We then use the Support Vector Machines for classes with many examples to improve the decision function of the Support Vector Machine that was trained using few examples, via late fusion of weighted decision values.
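A small sketch of the grid-cell idea in the tracking part: a detection-to-detection matching cost combines an appearance term with how well the candidate displacement agrees with the average motion observed in the detection's grid cell. The field names, distance terms, and mixing weight below are assumptions for illustration; the dissertation's actual cost function includes further local scene constraints.

```python
import numpy as np

def matching_cost(det_a, det_b, cell_flow, alpha=0.5):
    """Cost of matching detection det_a in frame t to det_b in frame t+1.

    det_a, det_b : dicts with 'pos' (2,) and 'feat' (d,) entries (assumed layout)
    cell_flow    : (2,) average displacement of objects in det_a's grid cell
    alpha        : weight trading off appearance against local-motion agreement
    """
    appearance = np.linalg.norm(np.asarray(det_a['feat'], float) -
                                np.asarray(det_b['feat'], float))
    displacement = np.asarray(det_b['pos'], float) - np.asarray(det_a['pos'], float)
    # Penalize displacements that deviate from the motion of the local cell.
    motion = np.linalg.norm(displacement - np.asarray(cell_flow, float))
    return alpha * appearance + (1.0 - alpha) * motion
```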
- Date Issued
- 2012
- Identifier
- CFE0004627, ucf:49935
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004627