Current Search: recognition
- Title
- Recognition of Complex Events in Open-source Web-scale Videos: Features, Intermediate Representations and their Temporal Interactions.
- Creator
-
Bhattacharya, Subhabrata, Shah, Mubarak, Guha, Ratan, Laviola II, Joseph, Sukthankar, Rahul, Moore, Brian, University of Central Florida
- Abstract / Description
-
Recognition of complex events in consumer-uploaded Internet videos, captured under real-world settings, has emerged as a challenging area of research across both the computer vision and multimedia communities. In this dissertation, we present a systematic decomposition of complex events into hierarchical components, make an in-depth analysis of how existing research is being used to cater to various levels of this hierarchy, and identify three key stages where we make novel contributions, keeping complex events in focus. These are listed as follows: (a) Extraction of novel semi-global features -- firstly, we introduce a Lie-algebra-based representation of the dominant camera motion present while capturing videos and show how this can be used as a complementary feature for video analysis. Secondly, we propose compact clip-level descriptors of a video based on the covariance of appearance and motion features, which we further use in a sparse coding framework to recognize realistic actions and gestures. (b) Construction of intermediate representations -- we propose an efficient probabilistic representation computed from low-level video features, based on Maximum Likelihood Estimates, which demonstrates state-of-the-art performance in large-scale visual concept detection. And finally, (c) Modeling temporal interactions between intermediate concepts -- using block Hankel matrices and harmonic analysis of slowly evolving Linear Dynamical Systems, we propose two new discriminative feature spaces for complex event recognition and demonstrate significantly improved recognition rates over previously proposed approaches.
- Date Issued
- 2013
- Identifier
- CFE0004817, ucf:49724
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004817
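The clip-level covariance descriptor mentioned in the abstract above can be illustrated with a minimal sketch. It assumes per-frame appearance/motion feature vectors are already available; the function name, the regularization term, and the log-Euclidean flattening are illustrative choices for using covariance (SPD) matrices with Euclidean tools such as sparse coding, not details taken from the dissertation.

```python
import numpy as np
from scipy.linalg import logm

def covariance_clip_descriptor(features, eps=1e-6):
    """Covariance descriptor of a video clip.

    features : (num_frames, d) array of per-frame appearance/motion
               features (e.g., gradient and optical-flow statistics).
    Returns the upper triangle of the matrix logarithm of the d x d
    covariance, a vector that behaves well in linear/sparse-coding models.
    """
    X = np.asarray(features, dtype=float)
    C = np.cov(X, rowvar=False)            # d x d covariance over the clip
    C += eps * np.eye(C.shape[0])          # regularize to keep it positive definite
    L = logm(C)                            # log-Euclidean map of the SPD matrix
    iu = np.triu_indices(L.shape[0])
    return np.real(L[iu])                  # flatten the symmetric matrix to a vector

# Toy usage: 40 frames, each described by an 8-dimensional feature vector.
rng = np.random.default_rng(0)
clip = rng.normal(size=(40, 8))
desc = covariance_clip_descriptor(clip)
print(desc.shape)                          # (36,) = 8*(8+1)/2 upper-triangular entries
```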
- Title
- SketChart: A Pen-Based Tool for Chart Generation and Interaction.
- Creator
-
Vargas Gonzalez, Andres, Laviola II, Joseph, Foroosh, Hassan, Hua, Kien, University of Central Florida
- Abstract / Description
-
It has been shown that representing data with the right visualization increases the understanding of the qualitative and quantitative information encoded in documents. However, current tools for generating such visualizations involve the use of traditional WIMP techniques, which can make free interaction and direct manipulation of the content harder. In this thesis, we present a pen-based prototype for data visualization using 10 different types of bar-based charts. The prototype lets users sketch a chart and interact with the information once the drawing is identified. The prototype's user interface consists of an area to sketch and touch-based elements that are displayed depending on the context and nature of the outline. Brainstorming and live presentations can benefit from the prototype due to its ability to visualize and manipulate data in real time. We also perform a short, informal user study to measure the effectiveness of the tool in recognizing sketches and users' acceptance while interacting with the system. Results show SketChart's strengths and weaknesses and areas for improvement.
- Date Issued
- 2014
- Identifier
- CFE0005434, ucf:50405
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005434
- Title
- Robust Subspace Estimation Using Low-Rank Optimization. Theory and Applications in Scene Reconstruction, Video Denoising, and Activity Recognition.
- Creator
-
Oreifej, Omar, Shah, Mubarak, Da Vitoria Lobo, Niels, Stanley, Kenneth, Lin, Mingjie, Li, Xin, University of Central Florida
- Abstract / Description
-
In this dissertation, we discuss the problem of robust linear subspace estimation using low-rank optimization and propose three formulations of it. We demonstrate how these formulations can be used to solve fundamental computer vision problems and provide superior performance in terms of accuracy and running time. Consider a set of observations extracted from images (such as pixel gray values, local features, or trajectories). If the assumption that these observations are drawn from a linear subspace (or can be linearly approximated) is valid, then the goal is to represent each observation as a linear combination of a compact basis while maintaining a minimal reconstruction error. One of the earliest, yet most popular, approaches to achieve this is Principal Component Analysis (PCA). However, PCA can only handle Gaussian noise and thus suffers when the observations are contaminated with gross and sparse outliers. To this end, in this dissertation, we focus on estimating the subspace robustly using low-rank optimization, where the sparse outliers are detected and separated through the L1 norm. The robust estimation has a two-fold advantage: first, the obtained basis better represents the actual subspace because it does not include contributions from the outliers; second, the detected outliers are often of specific interest in many applications, as we show throughout this thesis. We demonstrate four different formulations and applications for low-rank optimization.

First, we consider the problem of reconstructing an underwater sequence by removing the turbulence caused by water waves. The main drawback of most previous attempts to tackle this problem is that they depend heavily on modelling the waves, which is in fact ill-posed since the actual behavior of the waves, along with the imaging process, is complicated and includes several noise components; therefore, their results are not satisfactory. In contrast, we propose a novel approach which outperforms the state of the art. The intuition behind our method is that in a sequence where the water is static, the frames would be linearly correlated. Therefore, in the presence of water waves, we may consider the frames as noisy observations drawn from the subspace of linearly correlated frames. However, the noise introduced by the water waves is not sparse and thus cannot be detected directly using low-rank optimization. Therefore, we propose a data-driven, two-stage approach, where the first stage "sparsifies" the noise and the second stage detects it. The first stage leverages the temporal mean of the sequence to overcome the structured turbulence of the waves through an iterative registration algorithm. The result of the first stage is a high-quality mean and a better-structured sequence; however, the sequence still contains unstructured sparse noise. Thus, we employ a second stage in which we extract the sparse errors from the sequence through rank minimization. Our method converges faster and drastically outperforms the state of the art on all testing sequences.

Second, we consider a closely related situation where an independently moving object is also present in the turbulent video. More precisely, we consider video sequences acquired in desert battlefields, where atmospheric turbulence is typically present in addition to independently moving targets. Typical approaches for turbulence mitigation follow averaging or de-warping techniques. Although these methods can reduce the turbulence, they distort the independently moving objects, which can often be of great interest. Therefore, we address the problem of simultaneous turbulence mitigation and moving object detection. We propose a novel three-term low-rank matrix decomposition approach in which we decompose the turbulent sequence into three components: the background, the turbulence, and the object. We simplify this extremely difficult problem into a minimization of the nuclear norm, the Frobenius norm, and the L1 norm. Our method is based on two observations: first, the turbulence causes dense and Gaussian noise and therefore can be captured by the Frobenius norm, while the moving objects are sparse and thus can be captured by the L1 norm; second, since the object's motion is linear and intrinsically different from the Gaussian-like turbulence, a Gaussian-based turbulence model can be employed to enforce an additional constraint on the search space of the minimization. We demonstrate the robustness of our approach on challenging sequences which are significantly distorted with atmospheric turbulence and include extremely tiny moving objects.

In addition to robustly detecting the subspace of the frames of a sequence, we consider using trajectories as observations in the low-rank optimization framework. In particular, in videos acquired by moving cameras, we track all the pixels in the video and use them to estimate the camera motion subspace. This is particularly useful in activity recognition, which typically requires standard preprocessing steps such as motion compensation, moving object detection, and object tracking. The errors from the motion compensation step propagate to the object detection stage, resulting in missed detections, which further complicate the tracking stage, resulting in cluttered and incorrect tracks. In contrast, we propose a novel approach which does not follow the standard steps and accordingly avoids the aforementioned difficulties. Our approach is based on Lagrangian particle trajectories, which are a set of dense trajectories obtained by advecting optical flow over time, thus capturing the ensemble motions of a scene. This is done in frames of unaligned video, and no object detection is required. In order to handle the moving camera, we decompose the trajectories into their camera-induced and object-induced components. Having obtained the relevant object motion trajectories, we compute a compact set of chaotic invariant features, which captures the characteristics of the trajectories. Consequently, an SVM is employed to learn and recognize the human actions using the computed motion features. We performed intensive experiments on multiple benchmark datasets and obtained promising results.

Finally, we consider a more challenging problem referred to as complex event recognition, where the activities of interest are complex and unconstrained. This problem typically poses significant challenges because it involves videos of highly variable content, noise, length, and frame size. In this extremely challenging task, high-level features have recently shown a promising direction, as in [53, 129], where core low-level events referred to as concepts are annotated and modeled using a portion of the training data, and each event is then described in terms of its content of these concepts. However, because of the complex nature of the videos, both the concept models and the corresponding high-level features are significantly noisy. In order to address this problem, we propose a novel low-rank formulation which combines the precisely annotated videos used to train the concepts with the rich high-level features. Our approach finds a new representation for each event which is not only low-rank but also constrained to adhere to the concept annotation, thus suppressing the noise and maintaining a consistent occurrence of the concepts in each event. Extensive experiments on the large-scale, real-world TRECVID Multimedia Event Detection 2011 and 2012 datasets demonstrate that our approach consistently improves the discriminative power of the high-level features by a significant margin.
- Date Issued
- 2013
- Identifier
- CFE0004732, ucf:49835
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004732
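The low-rank plus sparse separation that the record above builds on can be sketched with a generic Robust PCA (Principal Component Pursuit) solver: minimize the nuclear norm of L plus a weighted L1 norm of S subject to M = L + S. This is a simplified stand-in, not the dissertation's three-term or trajectory-based formulations; the inexact-ALM iteration and the default weights below follow common practice and should be read as assumptions.

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft-thresholding (proximal operator of the L1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular-value soft-thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Split M into a low-rank part L and a sparse part S (M ~ L + S)."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))                    # common PCP weight
    if mu is None:
        mu = 0.25 * m * n / (np.abs(M).sum() + 1e-12)     # common ALM penalty
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    norm_M = np.linalg.norm(M, 'fro')
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)       # update low-rank part
        S = shrink(M - L + Y / mu, lam / mu)              # update sparse outliers
        R = M - L - S                                     # constraint residual
        Y = Y + mu * R                                    # dual ascent step
        if np.linalg.norm(R, 'fro') <= tol * norm_M:
            break
    return L, S

# Toy usage: a rank-2 matrix corrupted by a few large, sparse errors.
rng = np.random.default_rng(1)
ground = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 40))
sparse = np.zeros_like(ground)
idx = rng.choice(ground.size, 50, replace=False)
sparse.flat[idx] = rng.normal(scale=10.0, size=50)
L, S = rpca(ground + sparse)
# L should be close to rank 2, and S should localize the injected errors.
print(np.linalg.matrix_rank(L, tol=1e-3), int((np.abs(S) > 1.0).sum()))
```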
- Title
- Facilitating Information Retrieval in Social Media User Interfaces.
- Creator
-
Costello, Anthony, Tang, Yubo, Fiore, Stephen, Goldiez, Brian, University of Central Florida
- Abstract / Description
-
As the amount of computer-mediated information (e.g., emails, documents, multimedia) we need to process grows, our need to rapidly sort, organize, and store electronic information likewise increases. In order to store information effectively, we must find ways to sort through it and organize it in a manner that facilitates efficient retrieval. The instantaneous and emergent nature of communications across networks like Twitter makes them suitable for discussing events (e.g., natural disasters) that are amorphous and prone to rapid changes. It can be difficult for an individual human to filter through and organize the large amounts of information that can pass through these types of social networks when events are unfolding rapidly. A common feature of social networks is the images (e.g., human faces, inanimate objects) that are often used by those who send messages across these networks. Humans have a particularly strong ability to recognize and differentiate between human faces, and this effect may also extend to recalling information associated with each face. This study investigated the difference between human face images, non-human face images, and alphanumeric labels as retrieval cues under different levels of Task Load. Participants were required to recall key pieces of event information as it emerged from a Twitter-style message feed during a simulated natural disaster. A counterbalanced within-subjects design was used for this experiment. Participants were exposed to low, medium, and high Task Load while responding to five different types of recall cues: (1) Nickname, (2) Non-Face, (3) Non-Face & Nickname, (4) Face, and (5) Face & Nickname. The task required participants to organize information regarding emergencies (e.g., car accidents) from a Twitter-style message feed. The messages reported various events such as fires occurring around a fictional city. Each message was associated with a different recall cue type, depending on the experimental condition. Following the task, participants were asked to recall the information associated with one of the cues they worked with during the task. Results indicate that under medium and high Task Load, both Non-Face and Face retrieval cues increased recall performance over Nickname alone, with Non-Face cues resulting in the highest mean recall scores. When comparing medium to high Task Load, Face & Nickname and Non-Face significantly outperformed the Face condition, and performance in the Non-Face & Nickname condition was significantly better than in Face & Nickname. No significant difference was found between Non-Face and Non-Face & Nickname. Subjective Task Load scores indicate that participants experienced lower mental workload when using Non-Face cues than when using Nickname or Face cues. Generally, these results indicate that under medium and high Task Load levels, images outperformed alphanumeric nicknames, Non-Face images outperformed Face images, and combining alphanumeric nicknames with images may offer a significant performance advantage only when the image is that of a face. Both theoretical and practical design implications are drawn from these findings.
- Date Issued
- 2014
- Identifier
- CFE0005318, ucf:50524
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005318
- Title
- PATTERNS OF MOTION: DISCOVERY AND GENERALIZED REPRESENTATION.
- Creator
-
Saleemi, Imran, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
In this dissertation, we address the problem of discovery and representation of motion patterns in a variety of scenarios commonly encountered in vision applications. The overarching goal is to devise a generic representation that captures any kind of object motion observable in video sequences. Such motion is a significant source of information typically employed for diverse applications such as tracking, anomaly detection, and action and event recognition. We present statistical frameworks for representation of the motion characteristics of objects, learned from tracks or optical flow, for static as well as moving cameras, and propose algorithms for their application to a variety of problems. The proposed motion pattern models and learning methods are general enough to be employed in a variety of problems, as we demonstrate experimentally.

We first propose a novel method to model and learn the scene activity observed by a static camera. The motion patterns of objects in the scene are modeled in the form of a multivariate non-parametric probability density function of spatiotemporal variables (object locations and the transition times between them). Kernel Density Estimation (KDE) is used to learn this model in a completely unsupervised fashion. Learning is accomplished by observing the trajectories of objects with a static camera over extended periods of time. The model encodes the probabilistic nature of the behavior of moving objects in the scene and is useful for activity analysis applications, such as persistent tracking and anomalous motion detection. In addition, the model also captures salient scene features, such as the areas of occlusion and most likely paths. Once the model is learned, we use a unified Markov Chain Monte-Carlo (MCMC) based framework for generating the most likely paths in the scene, improving foreground detection, persistently labelling objects during tracking, and deciding whether a given trajectory represents an anomaly to the observed motion patterns. Experiments with real-world videos are reported which validate the proposed approach.

The representation and estimation framework proposed above, however, has a few limitations. It uses a single global statistical distribution to represent all kinds of motion observed in a particular scene. It therefore does not find a separation between multiple semantically distinct motion patterns in the scene; instead, the learned model is a joint distribution over all possible patterns followed by objects. To overcome this limitation, we then propose a superior method for the discovery and statistical representation of motion patterns in a scene. The advantages of this approach over the first one are two-fold: first, this model is applicable to scenes of dense crowded motion where tracking may not be feasible, and second, it distinguishes between motion patterns that are distinct at a semantic level of abstraction. We propose a mixture model representation of salient patterns of optical flow and present an algorithm for learning these patterns from dense optical flow in a hierarchical, unsupervised fashion. Using low-level cues from noisy optical flow, K-means is employed to initialize a Gaussian mixture model for temporally segmented clips of video. The components of this mixture are then filtered, and instances of motion patterns are computed using a simple motion model by linking components across space and time. Motion patterns are then initialized, and the membership of instances in different motion patterns is established using the KL divergence between the mixture distributions of pattern instances. Finally, a pixel-level representation of motion patterns is proposed by deriving the conditional expectation of optical flow. Results of extensive experiments are presented for multiple surveillance sequences containing numerous patterns involving both pedestrian and vehicular traffic.

The proposed method exploits optical flow as the low-level feature and performs hierarchical clustering to obtain motion patterns, and we observe that optical flow is also an integral part of a variety of other vision applications, for example, as a feature-based representation of human actions. We therefore propose a new representation for articulated human actions using the motion patterns. The representation is based on hierarchical clustering of observed optical flow in a four-dimensional space of spatial location and motion flow. The automatically discovered motion patterns are the primitive actions, representative of the flow at salient regions of the human body, much like trajectories of body joints, which are notoriously difficult to obtain automatically. The proposed method works in a completely unsupervised fashion and, in sharp contrast to state-of-the-art representations like bags of video words, provides a truly semantically meaningful representation. Each primitive action depicts the most atomic sub-action, like the left arm moving upwards or the right leg moving downward and leftward, and is represented by a mixture of four-dimensional Gaussian distributions. A sequence of primitive actions is discovered in the test video and labelled by computing the KL divergence between mixtures. The entire video sequence containing the human action is thus reduced to a simple string, which is matched against similar strings of training videos to classify the action. The string matching is performed by global alignment, using the well-known Needleman-Wunsch algorithm. Experiments reported on multiple human action data sets confirm the validity, simplicity, and semantically meaningful nature of the proposed representation. Results obtained are encouraging and comparable to the state of the art.
- Date Issued
- 2011
- Identifier
- CFE0003646, ucf:48836
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003646
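The first contribution described in the abstract above, a KDE model over spatiotemporal transition variables, can be sketched in a few lines. The synthetic transitions, the (x, y, x', y', dt) encoding, and the fixed Gaussian bandwidth below are assumptions made purely for illustration; the dissertation's full pipeline (MCMC path generation, tracking integration, pattern discovery) is not reproduced here.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic training data: observed transitions (x, y, x', y', dt) harvested from tracks,
# with a dominant left-to-right flow across a 100 x 100 scene.
rng = np.random.default_rng(2)
n = 500
start = rng.uniform(0, 100, size=(n, 2))
step = rng.normal(loc=(5.0, 0.0), scale=1.0, size=(n, 2))
dt = np.abs(rng.normal(loc=1.0, scale=0.1, size=(n, 1)))
transitions = np.hstack([start, start + step, dt])

# Non-parametric density over the spatiotemporal transition variables.
kde = KernelDensity(kernel='gaussian', bandwidth=2.0).fit(transitions)

def anomaly_score(transition):
    """Negative log-likelihood of a single (x, y, x', y', dt) transition."""
    return -kde.score_samples(np.asarray(transition, dtype=float).reshape(1, -1))[0]

# A transition that follows the learned flow vs. one that moves against it.
typical  = [50.0, 50.0, 55.0, 50.0, 1.0]
contrary = [50.0, 50.0, 45.0, 50.0, 1.0]
print(anomaly_score(typical) < anomaly_score(contrary))   # expected: True
```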