Current Search: Event Detection
- Title
- Learning Hierarchical Representations for Video Analysis Using Deep Learning.
- Creator
-
Yang, Yang, Shah, Mubarak, Sukthankar, Gita, Da Vitoria Lobo, Niels, Stanley, Kenneth, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
With the exponential growth of digital data, video content analysis (e.g., action and event recognition) has been drawing increasing attention from computer vision researchers. Effective modeling of the objects, scenes, and motions is critical for visual understanding. Recently there has been growing interest in bio-inspired deep learning models, which have shown impressive results in speech and object recognition. Deep learning models are formed by the composition of multiple non-linear transformations of the data, with the goal of yielding more abstract and ultimately more useful representations. The advantages of deep models are threefold: 1) they learn features directly from the raw signal, in contrast to hand-designed features; 2) the learning can be unsupervised, which is suitable for large datasets where labeling all the data is expensive and impractical; 3) they learn a hierarchy of features one level at a time, and this layerwise stacking of feature extraction often yields better representations.

However, few deep learning models have been proposed to solve problems in video analysis, especially for videos "in the wild". Most of them either deal with simple datasets or are limited to low-level local spatio-temporal feature descriptors for action recognition. Moreover, as the learning algorithms are unsupervised, the learned features preserve generative properties rather than the discriminative ones that are more desirable for classification tasks. In this context, the thesis makes two major contributions.

First, we propose several formulations and extensions of deep learning methods that learn hierarchical representations for three challenging video analysis tasks: complex event recognition, object detection in videos, and measuring action similarity. The proposed methods are extensively evaluated on challenging state-of-the-art datasets. Besides learning low-level local features, higher-level representations are further learned in the context of the applications: data-driven concept representations and sparse representations of events are learned for complex event recognition; representations of object body parts and structures are learned for object detection in videos; and relational motion features and similarity metrics between video pairs are learned simultaneously for action verification.

Second, in order to learn discriminative and compact features, we propose a new feature learning method using a deep neural network based on autoencoders. It differs from existing unsupervised feature learning methods in two ways: first, it optimizes both discriminative and generative properties of the features simultaneously, which gives our features better discriminative ability; second, our learned features are more compact, whereas unsupervised feature learning methods usually learn a redundant, over-complete set of features. Extensive experiments with quantitative and qualitative results on the tasks of human detection and action verification demonstrate the superiority of the proposed models. (A toy sketch of the joint discriminative-generative autoencoder objective follows this record.)
- Date Issued
- 2013
- Identifier
- CFE0004964, ucf:49593
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004964
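Sketch (editor's illustration, not part of the record): the second contribution above, an autoencoder whose features are optimized for both generative and discriminative properties, can be illustrated with a toy PyTorch model. The architecture, layer sizes, stand-in data, and the weight `alpha` balancing reconstruction against classification are illustrative assumptions, not the dissertation's actual network.

```python
# Toy sketch: an autoencoder whose code layer is trained with both a
# generative (reconstruction) and a discriminative (classification) loss.
# Architecture, sizes, data, and the trade-off weight are assumptions.
import torch
import torch.nn as nn

class DiscriminativeAutoencoder(nn.Module):
    def __init__(self, in_dim=256, code_dim=32, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, in_dim)        # generative path
        self.classifier = nn.Linear(code_dim, n_classes)  # discriminative path

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), self.classifier(code)

model = DiscriminativeAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
recon_loss, clf_loss = nn.MSELoss(), nn.CrossEntropyLoss()
alpha = 0.5  # assumed weight between generative and discriminative terms

x = torch.randn(64, 256)        # stand-in for low-level video features
y = torch.randint(0, 2, (64,))  # stand-in labels (e.g., same/different action)
for _ in range(100):
    opt.zero_grad()
    x_hat, logits = model(x)
    # joint objective: reconstruct the input AND separate the classes
    loss = recon_loss(x_hat, x) + alpha * clf_loss(logits, y)
    loss.backward()
    opt.step()
```

Training the code layer on both objectives at once is what distinguishes this from a purely unsupervised autoencoder, which would use the reconstruction term alone.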
- Title
- LEARNING, DETECTION, REPRESENTATION, INDEXING AND RETRIEVAL OF MULTI-AGENT EVENTS IN VIDEOS.
- Creator
-
Hakeem, Asaad, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
The world that we live in is a complex network of agents and their interactions, which are termed events. An instance of an event is composed of directly measurable low-level actions (which I term sub-events) having a temporal order. Agents can act independently (e.g., voting) as well as collectively (e.g., scoring a touchdown in a football game) to perform an event. With the dawn of the new millennium, low-level vision tasks such as segmentation, object classification, and tracking have become fairly robust, but a representational gap still exists between low-level measurements and high-level understanding of video sequences. This dissertation is an effort to bridge that gap: I propose novel learning, detection, representation, indexing, and retrieval approaches for multi-agent events in videos.

First, I apply statistical learning techniques to model multi-agent events, using training videos to estimate the conditional dependencies between sub-events. Given a video sequence, I track the people (head and hand regions) and objects using a Meanshift tracker. An underlying rule-based system detects the sub-events from the tracked trajectories of the people and objects, based on their relative motion. Next, an event model is constructed by estimating the sub-event dependencies, that is, how frequently sub-event B occurs given that sub-event A has occurred. The advantages of such an event model are twofold: I do not require prior knowledge of the number of agents involved in an event, and no assumptions are made about the length of an event.

Second, after learning the event models, I detect events in a novel video using graph clustering techniques. To that end, I construct a graph of temporally ordered sub-events occurring in the novel video and, using the learnt event model, estimate a weight matrix of conditional dependencies between those sub-events. Applying Normalized Cut (a graph clustering technique) to the estimated weight matrix then detects the events in the novel video. The principal assumption made in this work is that events are composed of highly correlated chains of sub-events with high conditional dependency (association) within a cluster and relatively low conditional dependency (disassociation) between clusters.

Third, in order to represent the detected events, I propose an extension of the CASE representation of natural languages. I extend CASE to represent the temporal structure between sub-events and, in order to capture both multi-agent and multi-threaded events, introduce a hierarchical CASE representation of events in terms of sub-events and case-lists. The essence of the proposition is that, based on the temporal relationships of the agent motions and a description of their state, it is possible to build a formal description of an event. Furthermore, recognizing the importance of representing the variations in the temporal order of sub-events that may occur in an event, I encode the temporal probabilities directly into the event representation. The proposed extended representation with probabilistic temporal encoding, termed P-CASE, provides a practical means of interaction between users and the computer. Using the P-CASE representation, I automatically encode the event ontology from training videos. This offers a significant advantage, since domain experts do not have to go through the tedious task of determining the structure of events by browsing all the videos.

Finally, I utilize the event representation for indexing and retrieval of events. Given the different instances of a particular event, I index the events using the P-CASE representation. Given a query in the P-CASE representation, event retrieval is then performed using a two-level search: at the first level, a maximum likelihood estimate of the query event against the different indexed event models yields the best-matching event model; at the second level, a matching score is computed for all the event instances belonging to that model, using a weighted Jaccard similarity measure. Extensive experimentation was conducted for the detection, representation, indexing, and retrieval of multi-agent events in videos from the meeting, surveillance, and railroad monitoring domains. To that end, the Semoran system was developed, which accepts user input in any of three forms for event retrieval: predefined queries in the P-CASE representation, custom queries in the P-CASE representation, or query by example video. The system then searches the entire database and returns the matched videos to the user. I used seven standard video datasets from the computer vision community, as well as my own videos, to test the robustness of the proposed methods. (A toy sketch of the sub-event conditional dependency matrix follows this record.)
- Date Issued
- 2007
- Identifier
- CFE0001620, ucf:47163
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001620
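Sketch (editor's illustration, not part of the record): the core of the event model above is a weight matrix of conditional dependencies, how frequently sub-event B occurs given that sub-event A has occurred. A minimal numpy sketch under the assumption that training videos have already been reduced to temporally ordered sub-event label sequences; the sub-event vocabulary and the "any later sub-event" counting rule are illustrative, not the dissertation's rule-based detector.

```python
# Toy sketch: estimate a conditional dependency (weight) matrix
# W[a, b] ~ P(sub-event b occurs later | sub-event a occurred),
# from temporally ordered sub-event sequences. The vocabulary and the
# counting rule are illustrative assumptions.
import numpy as np

SUB_EVENTS = ["approach", "pick_up", "hand_over", "leave"]  # assumed vocabulary
IDX = {s: i for i, s in enumerate(SUB_EVENTS)}

def dependency_matrix(sequences):
    n = len(SUB_EVENTS)
    counts = np.zeros((n, n))  # counts[a, b]: b observed after a
    occurs = np.zeros(n)       # total occurrences of each sub-event
    for seq in sequences:
        ids = [IDX[s] for s in seq]
        for t, a in enumerate(ids):
            occurs[a] += 1
            for b in ids[t + 1:]:  # every sub-event that follows a
                counts[a, b] += 1
    return counts / np.maximum(occurs[:, None], 1)

train = [
    ["approach", "pick_up", "hand_over", "leave"],
    ["approach", "pick_up", "leave"],
    ["approach", "hand_over", "leave"],
]
W = dependency_matrix(train)
print(np.round(W, 2))  # row: given sub-event, column: conditional follow-up
```

In the dissertation's pipeline, a matrix of this kind (estimated for the sub-events of a novel video) is what Normalized Cut partitions: chains of sub-events with high mutual dependency end up in the same cluster, i.e., the same detected event.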
- Title
- Recognition of Complex Events in Open-source Web-scale Videos: Features, Intermediate Representations and their Temporal Interactions.
- Creator
-
Bhattacharya, Subhabrata, Shah, Mubarak, Guha, Ratan, Laviola II, Joseph, Sukthankar, Rahul, Moore, Brian, University of Central Florida
- Abstract / Description
-
Recognition of complex events in consumer-uploaded Internet videos, captured under real-world settings, has emerged as a challenging area of research across both the computer vision and multimedia communities. In this dissertation, we present a systematic decomposition of complex events into hierarchical components, make an in-depth analysis of how existing research caters to the various levels of this hierarchy, and identify three key stages where we make novel contributions, keeping complex events in focus:

(a) Extraction of novel semi-global features: first, we introduce a Lie-algebra-based representation of the dominant camera motion present while capturing videos and show how this can be used as a complementary feature for video analysis; second, we propose compact clip-level descriptors of a video based on the covariance of appearance and motion features, which we further use in a sparse coding framework to recognize realistic actions and gestures.

(b) Construction of intermediate representations: we propose an efficient probabilistic representation built from low-level features computed from videos, based on Maximum Likelihood Estimates, which demonstrates state-of-the-art performance in large-scale visual concept detection.

(c) Modeling temporal interactions between intermediate concepts: using block Hankel matrices and harmonic analysis of slowly evolving Linear Dynamical Systems, we propose two new discriminative feature spaces for complex event recognition and demonstrate significantly improved recognition rates over previously proposed approaches.

(A toy sketch of the clip-level covariance descriptor follows this record.)
- Date Issued
- 2013
- Identifier
- CFE0004817, ucf:49724
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004817
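Sketch (editor's illustration, not part of the record): of the contributions listed in (a), the clip-level covariance descriptor is the simplest to sketch. Assuming a clip has already been reduced to one appearance/motion feature vector per frame, a compact fixed-length descriptor could be computed as below; the feature dimension and the log-Euclidean flattening used to vectorize the covariance matrix are assumptions, not necessarily the dissertation's exact construction.

```python
# Toy sketch: a compact clip-level descriptor as the covariance of
# per-frame features, flattened via the matrix logarithm so the
# symmetric-positive-definite matrix can feed a linear / sparse-coding
# pipeline. Feature dimension and logm flattening are assumptions.
import numpy as np
from scipy.linalg import logm

def covariance_descriptor(frame_feats, eps=1e-6):
    """frame_feats: (T, d) array, one appearance/motion feature per frame."""
    C = np.cov(frame_feats, rowvar=False)  # (d, d) covariance over the clip
    C += eps * np.eye(C.shape[0])          # regularize so C stays SPD
    L = logm(C).real                       # map SPD matrix to its tangent space
    iu = np.triu_indices_from(L)           # covariance is symmetric,
    return L[iu]                           # so keep d(d+1)/2 unique entries

clip = np.random.randn(120, 16)  # stand-in: 120 frames, 16-dim features
desc = covariance_descriptor(clip)
print(desc.shape)                # (136,) regardless of clip length
```

The appeal of the descriptor is visible in the last line: clips of any length map to the same fixed dimensionality, which is what makes downstream sparse coding straightforward.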
- Title
- Analysis of Behaviors in Crowd Videos.
- Creator
-
Mehran, Ramin, Shah, Mubarak, Sukthankar, Gita, Behal, Aman, Tappen, Marshall, Moore, Brian, University of Central Florida
- Abstract / Description
-
In this dissertation, we address the problem of discovery and representation of group activity of humans and objects in a variety of scenarios commonly encountered in vision applications. The overarching goal is to devise a discriminative representation of human motion in social settings which captures the wide variety of human activities observable in video sequences. Such motion emerges from the collective behavior of individuals and their interactions and is a significant source of information typically employed for applications such as event detection, behavior recognition, and activity recognition. We present new representations of human group motion for static cameras and propose algorithms for their application to a variety of problems.

We first propose a method to model and learn the scene activity of a crowd using the Social Force Model, introduced here to the computer vision community for the first time. We present a method to densely estimate the interaction forces between people in a crowd observed by a static camera. Latent Dirichlet Allocation (LDA) is used to learn a model of the normal activities over extended periods of time: randomly selected spatio-temporal volumes of interaction forces are used to learn the model of normal behavior of the scene, and the model encodes the latent topics of social interaction forces for normal behaviors. We classify a short video sequence of n frames as normal or abnormal using the learnt model. Once a sequence of frames is classified as abnormal, the regions of anomalies in the abnormal frames are localized using the magnitude of the interaction forces.

The representation and estimation framework proposed above, however, has a few limitations. Because it uses a global estimate of the interaction forces within the crowd, it is incapable of identifying different groups of objects based on motion or behavior in the scene. Although the algorithm can learn normal behavior and detect abnormality, it cannot capture the dynamics of different behaviors.

To overcome these limitations, we then propose a method based on the Lagrangian framework for fluid dynamics, introducing a streakline representation of flow. Streaklines are traced in a fluid flow by injecting colored material, such as smoke or dye, which is transported with the flow and used for visualization. In the context of computer vision, streaklines may be used in a similar way to transport information about a scene: they are obtained by repeatedly initializing a fixed grid of particles at each frame, then moving both current and past particles using optical flow. Streaklines are the locus of points that connect particles which originated from the same initial position. This approach is advantageous over the previous representations in two respects: first, its rich representation captures the dynamics of the crowd and the changes in space and time in the scene where optical flow alone is not enough; second, the model is capable of discovering groups of similar behavior within a crowd scene by performing motion segmentation. We propose a method to distinguish different group behaviors, such as divergent/convergent motion and lanes, using this framework. Finally, we introduce flow potentials as a discriminative feature to recognize crowd behaviors in a scene. Results of extensive experiments are presented for multiple real-life crowd sequences involving pedestrian and vehicular traffic.

The proposed method exploits optical flow as the low-level feature and performs integration and clustering to obtain coherent group motion patterns. However, we observe that in crowd video sequences, as well as in a variety of other vision applications, the co-occurrence and inter-relation of motion patterns are the main characteristics of group behaviors. In other words, the group behavior of objects is a mixture of individual actions or behaviors in a specific geometrical layout and temporal order. We therefore propose a new representation for group behaviors of humans based on the inter-relation of motion patterns in a scene. The representation is based on a bag of visual phrases of spatio-temporal visual words. We present a method to match the high-order spatial layout of visual words that preserves the geometry of the visual words under similarity transformations. To perform the experiments, we collected a dataset of group choreography performances from the YouTube website; the dataset currently contains four categories of group dances. (A toy sketch of the streakline construction follows this record.)
- Date Issued
- 2011
- Identifier
- CFE0004482, ucf:49317
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004482
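Sketch (editor's illustration, not part of the record): the streakline construction described above, inject a fresh grid of particles at every frame and advect all live particles with that frame's optical flow, can be written compactly. The sketch assumes flow fields are already available as (H, W, 2) arrays with (dx, dy) channel order and uses nearest-neighbour flow lookup; the grid spacing and these conventions are assumptions.

```python
# Toy sketch: streaklines for a static camera. At every frame a fresh grid
# of particles is injected, and all particles (current and past) are moved
# by that frame's optical flow. A streakline is the locus of particles that
# originated from the same grid position. Flow fields are assumed given as
# (H, W, 2) arrays with channels (dx, dy).
import numpy as np

def streaklines(flows, step=16):
    H, W, _ = flows[0].shape
    ys, xs = np.mgrid[0:H:step, 0:W:step]
    seeds = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)  # (P, 2) as (x, y)
    generations = []                      # one particle set per injection frame
    for flow in flows:
        generations.append(seeds.copy())  # inject a fresh grid of particles
        for pts in generations:           # advect every live particle
            xi = np.clip(pts[:, 0].astype(int), 0, W - 1)
            yi = np.clip(pts[:, 1].astype(int), 0, H - 1)
            pts += flow[yi, xi]           # nearest-neighbour flow lookup
    # streakline for seed p = positions of all particles injected at p
    return np.stack(generations, axis=1)  # (P, frames, xy)

T, H, W = 10, 240, 320
flows = [0.5 * np.random.randn(H, W, 2) for _ in range(T)]  # stand-in flow
lines = streaklines(flows)
print(lines.shape)  # (particles, frames, xy)
```

Because earlier generations keep being advected while new ones are injected, each streakline integrates motion over time, which is exactly the extra temporal information the abstract argues instantaneous optical flow lacks.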
- Title
- Robust Subspace Estimation Using Low-Rank Optimization. Theory and Applications in Scene Reconstruction, Video Denoising, and Activity Recognition.
- Creator
-
Oreifej, Omar, Shah, Mubarak, Da Vitoria Lobo, Niels, Stanley, Kenneth, Lin, Mingjie, Li, Xin, University of Central Florida
- Abstract / Description
-
In this dissertation, we discuss the problem of robust linear subspace estimation using low-rank optimization and propose three formulations of it. We demonstrate how these formulations can be used to solve fundamental computer vision problems and provide superior performance in terms of accuracy and running time. Consider a set of observations extracted from images (such as pixel gray values, local features, trajectories, etc.). If the assumption that these observations are drawn from a linear subspace (or can be linearly approximated) is valid, then the goal is to represent each observation as a linear combination of a compact basis while maintaining a minimal reconstruction error. One of the earliest, yet most popular, approaches to achieve this is Principal Component Analysis (PCA). However, PCA can only handle Gaussian noise and thus suffers when the observations are contaminated with gross and sparse outliers. To this end, in this dissertation we focus on estimating the subspace robustly using low-rank optimization, where the sparse outliers are detected and separated through the L1 norm. The robust estimation has a two-fold advantage: first, the obtained basis better represents the actual subspace because it does not include contributions from the outliers; second, the detected outliers are often themselves of specific interest in many applications, as we show throughout this thesis. We demonstrate four applications of low-rank optimization.

First, we consider the problem of reconstructing an underwater sequence by removing the turbulence caused by water waves. The main drawback of most previous attempts to tackle this problem is that they depend heavily on modeling the waves, which is in fact ill-posed, since the actual behavior of the waves and the imaging process are complicated and include several noise components; their results are therefore not satisfactory. In contrast, we propose a novel approach which outperforms the state of the art. The intuition behind our method is that in a sequence where the water is static, the frames would be linearly correlated; in the presence of water waves, we may therefore consider the frames as noisy observations drawn from the subspace of linearly correlated frames. However, the noise introduced by the water waves is not sparse and thus cannot be detected directly using low-rank optimization. We therefore propose a data-driven two-stage approach, where the first stage "sparsifies" the noise and the second stage detects it. The first stage leverages the temporal mean of the sequence to overcome the structured turbulence of the waves through an iterative registration algorithm; its result is a high-quality mean and a better-structured sequence, which however still contains unstructured sparse noise. The second stage then extracts the sparse errors from the sequence through rank minimization. Our method converges faster and drastically outperforms the state of the art on all testing sequences.

Second, we consider a closely related situation where an independently moving object is also present in the turbulent video. More precisely, we consider video sequences acquired in desert battlefields, where atmospheric turbulence is typically present in addition to independently moving targets. Typical approaches for turbulence mitigation follow averaging or de-warping techniques. Although these methods can reduce the turbulence, they distort the independently moving objects, which are often of great interest. We therefore address the problem of simultaneous turbulence mitigation and moving object detection, and propose a novel three-term low-rank matrix decomposition in which the turbulent sequence is decomposed into three components: the background, the turbulence, and the object. We reduce this extremely difficult problem to a minimization of the nuclear norm, the Frobenius norm, and the L1 norm. Our method is based on two observations: first, the turbulence causes dense, Gaussian-like noise and can therefore be captured by the Frobenius norm, while the moving objects are sparse and can be captured by the L1 norm; second, since the object's motion is linear and intrinsically different from the Gaussian-like turbulence, a Gaussian-based turbulence model can be employed to enforce an additional constraint on the search space of the minimization. We demonstrate the robustness of our approach on challenging sequences which are significantly distorted by atmospheric turbulence and include extremely small moving objects.

In addition to robustly detecting the subspace of the frames of a sequence, we consider using trajectories as observations in the low-rank optimization framework. In particular, in videos acquired by moving cameras, we track all the pixels in the video and use the tracks to estimate the camera motion subspace. This is particularly useful in activity recognition, which typically requires standard preprocessing steps such as motion compensation, moving object detection, and object tracking; errors from the motion compensation step propagate to the object detection stage, resulting in missed detections, which further complicate the tracking stage and produce cluttered and incorrect tracks. In contrast, we propose a novel approach which does not follow these standard steps and accordingly avoids the aforementioned difficulties. Our approach is based on Lagrangian particle trajectories, a set of dense trajectories obtained by advecting optical flow over time, thus capturing the ensemble motions of a scene. This is done on frames of unaligned video, and no object detection is required. In order to handle the moving camera, we decompose the trajectories into their camera-induced and object-induced components. Having obtained the relevant object motion trajectories, we compute a compact set of chaotic invariant features which capture the characteristics of the trajectories, and an SVM is employed to learn and recognize human actions from the computed motion features. We performed intensive experiments on multiple benchmark datasets and obtained promising results.

Finally, we consider a more challenging problem referred to as complex event recognition, where the activities of interest are complex and unconstrained. This problem typically poses significant challenges because it involves videos of highly variable content, noise, length, frame size, and so on. In this extremely challenging task, high-level features have recently shown a promising direction, as in [53, 129], where core low-level events referred to as concepts are annotated and modeled using a portion of the training data, and each event is then described by its content of these concepts. However, because of the complex nature of the videos, both the concept models and the corresponding high-level features are significantly noisy. To address this problem, we propose a novel low-rank formulation which combines the precisely annotated videos used to train the concepts with the rich high-level features. Our approach finds a new representation for each event which is not only low-rank but is also constrained to adhere to the concept annotation, thus suppressing the noise and maintaining a consistent occurrence of the concepts in each event. Extensive experiments on the large-scale real-world TRECVID Multimedia Event Detection 2011 and 2012 datasets demonstrate that our approach consistently improves the discriminativity of the high-level features by a significant margin. (A toy sketch of the three-term low-rank decomposition follows this record.)
- Date Issued
- 2013
- Identifier
- CFE0004732, ucf:49835
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004732
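Sketch (editor's illustration, not part of the record): the three-term decomposition above splits a frame matrix F (pixels x frames) into a low-rank background A, a sparse object O, and a dense Gaussian-like turbulence E, via nuclear, L1, and Frobenius norms. The toy solver below uses naive alternating proximal updates on a penalized objective; the penalty weights and this simple block-coordinate scheme are assumptions for illustration, not the dissertation's ALM algorithm or its additional turbulence-model constraint.

```python
# Toy sketch of a three-term decomposition F ~ A + O + E:
#   min ||A||_* + lam_o*||O||_1 + lam_e*||E||_F^2
#       + (mu/2)*||F - A - O - E||_F^2
# A: low-rank background, O: sparse moving object, E: dense turbulence.
# Solved by naive block-coordinate descent; weights are assumptions.
import numpy as np

def svt(X, tau):
    """Singular value thresholding: proximal operator of tau*||.||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def soft(X, tau):
    """Soft thresholding: proximal operator of tau*||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

def decompose(F, lam_o=0.05, lam_e=0.5, mu=1.0, iters=100):
    A = np.zeros_like(F); O = np.zeros_like(F); E = np.zeros_like(F)
    for _ in range(iters):
        A = svt(F - O - E, 1.0 / mu)             # low-rank update
        O = soft(F - A - E, lam_o / mu)          # sparse update
        E = (mu / (mu + 2.0 * lam_e)) * (F - A - O)  # closed-form dense update
    return A, O, E

rng = np.random.default_rng(0)
bg = np.outer(rng.random(400), np.ones(40))        # static background, rank 1
obj = np.zeros((400, 40)); obj[50:60, 20:25] = 4.0 # stand-in sparse object
turb = 0.1 * rng.standard_normal((400, 40))        # dense turbulence-like noise
A, O, E = decompose(bg + obj + turb)
# inspect the split: rank of recovered background, density of sparse part
print(np.linalg.matrix_rank(A, tol=1e-3), float((np.abs(O) > 1e-3).mean()))
```

Each update is the exact minimizer of the penalized objective in one block with the others fixed, which is why the three norms map so directly onto the three physical components the abstract describes.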