Current Search: Da Vitoria Lobo, Niels (x)
View All Items
- Title
- DEPTH FROM DEFOCUSED MOTION.
- Creator
-
Myles, Zarina, da Vitoria Lobo, Niels, University of Central Florida
- Abstract / Description
-
Motion in depth and/or zooming causes defocus blur. This work presents a solution to the problem of using defocus blur and optical flow information to compute depth at points that defocus when they move.We first formulate a novel algorithm which recovers defocus blur and affine parameters simultaneously. Next we formulate a novel relationship (the blur-depth relationship) between defocus blur, relative object depth and three parameters based on camera motion and intrinsic camera parameters.We...
Show moreMotion in depth and/or zooming causes defocus blur. This work presents a solution to the problem of using defocus blur and optical flow information to compute depth at points that defocus when they move.We first formulate a novel algorithm which recovers defocus blur and affine parameters simultaneously. Next we formulate a novel relationship (the blur-depth relationship) between defocus blur, relative object depth and three parameters based on camera motion and intrinsic camera parameters.We can handle the situation where a single image has points which have defocused, got sharper or are focally unperturbed. Moreover, our formulation is valid regardless of whether the defocus is due to the image plane being in front of or behind the point of sharp focus.The blur-depth relationship requires a sequence of at least three images taken with the camera moving either towards or away from the object. It can be used to obtain an initial estimate of relative depth using one of several non-linear methods. We demonstrate a solution based on the Extended Kalman Filter in which the measurement equation is the blur-depth relationship.The estimate of relative depth is then used to compute an initial estimate of camera motion parameters. In order to refine depth values, the values of relative depth and camera motion are then input into a second Extended Kalman Filter in which the measurement equations are the discrete motion equations. This set of cascaded Kalman filters can be employed iteratively over a longer sequence of images in order to further refine depth.We conduct several experiments on real scenery in order to demonstrate the range of object shapes that the algorithm can handle. We show that fairly good estimates of depth can be obtained with just three images.
Show less - Date Issued
- 2004
- Identifier
- CFE0000135, ucf:46179
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000135
- Title
- MARKERLESS TRACKING USING POLAR CORRELATION OF CAMERA OPTICAL FLOW.
- Creator
-
Gupta, Prince, da Vitoria Lobo, Niels, University of Central Florida
- Abstract / Description
-
We present a novel, real-time, markerless vision-based tracking system, employing a rigid orthogonal configuration of two pairs of opposing cameras. Our system uses optical flow over sparse features to overcome the limitation of vision-based systems that require markers or a pre-loaded model of the physical environment. We show how opposing cameras enable cancellation of common components of optical flow leading to an efficient tracking algorithm that captures five degrees of freedom...
Show moreWe present a novel, real-time, markerless vision-based tracking system, employing a rigid orthogonal configuration of two pairs of opposing cameras. Our system uses optical flow over sparse features to overcome the limitation of vision-based systems that require markers or a pre-loaded model of the physical environment. We show how opposing cameras enable cancellation of common components of optical flow leading to an efficient tracking algorithm that captures five degrees of freedom including direction of translation and angular velocity. Experiments comparing our device with an electromagnetic tracker show that its average tracking accuracy is 80% over 185 frames, and it is able to track large range motions even in outdoor settings. We also present how opposing cameras in vision-based inside-looking-out systems can be used for gesture recognition. To demonstrate our approach, we discuss three different algorithms for recovering motion parameters at different levels of complete recovery. We show how optical flow in opposing cameras can be used to recover motion parameters of the multi-camera rig. Experimental results show gesture recognition accuracy of 88.0%, 90.7% and 86.7% for our three techniques, respectively, across a set of 15 gestures.
Show less - Date Issued
- 2010
- Identifier
- CFE0003163, ucf:48611
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003163
- Title
- Visual-Textual Video Synopsis Generation.
- Creator
-
Sharghi Karganroodi, Aidean, Shah, Mubarak, Da Vitoria Lobo, Niels, Rahnavard, Nazanin, Atia, George, University of Central Florida
- Abstract / Description
-
In this dissertation we tackle the problem of automatic video summarization. Automatic summarization techniques enable faster browsing and indexing of large video databases. However, due to the inherent subjectivity of the task, no single video summarizer fits all users unless it adapts to individual user's needs. To address this issue, we introduce a fresh view on the task called "Query-focused'' extractive video summarization. We develop a supervised model that takes as input a video and...
Show moreIn this dissertation we tackle the problem of automatic video summarization. Automatic summarization techniques enable faster browsing and indexing of large video databases. However, due to the inherent subjectivity of the task, no single video summarizer fits all users unless it adapts to individual user's needs. To address this issue, we introduce a fresh view on the task called "Query-focused'' extractive video summarization. We develop a supervised model that takes as input a video and user's preference in form of a query, and creates a summary video by selecting key shots from the original video. We model the problem as subset selection via determinantal point process (DPP), a stochastic point process that assigns a probability value to each subset of any given set. Next, we develop a second model that exploits capabilities of memory networks in the framework and concomitantly reduces the level of supervision required to train the model. To automatically evaluate system summaries, we contend that a good metric for video summarization should focus on the semantic information that humans can perceive rather than the visual features or temporal overlaps. To this end, we collect dense per-video-shot concept annotations, compile a new dataset, and suggest an efficient evaluation method defined upon the concept annotations. To enable better summarization of videos, we improve the sequential DPP in two folds. In terms of learning, we propose a large-margin algorithm to address the exposure bias that is common in many sequence to sequence learning methods. In terms of modeling, we integrate a new probabilistic distribution into SeqDPP, the resulting model accepts user input about the expected length of the summary. We conclude this dissertation by developing a framework to generate textual synopsis for a video, thus, enabling users to quickly browse a large video database without watching the videos.
Show less - Date Issued
- 2019
- Identifier
- CFE0007862, ucf:52756
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007862
- Title
- Gradient based MRF learning for image restoration and segmentation.
- Creator
-
Samuel, Kegan, Tappen, Marshall, Da Vitoria Lobo, Niels, Foroosh, Hassan, Li, Xin, University of Central Florida
- Abstract / Description
-
The undirected graphical model or Markov Random Field (MRF) is one of the more popular models used in computer vision and is the type of model with which this work is concerned. Models based on these methods have proven to be particularly useful in low-level vision systems and have led to state-of-the-art results for MRF-based systems. The research presented will describe a new discriminative training algorithm and its implementation.The MRF model will be trained by optimizing its parameters...
Show moreThe undirected graphical model or Markov Random Field (MRF) is one of the more popular models used in computer vision and is the type of model with which this work is concerned. Models based on these methods have proven to be particularly useful in low-level vision systems and have led to state-of-the-art results for MRF-based systems. The research presented will describe a new discriminative training algorithm and its implementation.The MRF model will be trained by optimizing its parameters so that the minimum energy solution of the model is as similar as possible to the ground-truth. While previous work has relied on time-consuming iterative approximations or stochastic approximations, this work will demonstrate how implicit differentiation can be used to analytically differentiate the overall training loss with respect to the MRF parameters. This framework leads to an efficient, flexible learning algorithm that can be applied to a number of different models.The effectiveness of the proposed learning method will then be demonstrated by learning the parameters of two related models applied to the task of denoising images. The experimental results will demonstrate that the proposed learning algorithm is comparable and, at times, better than previous training methods applied to the same tasks.A new segmentation model will also be introduced and trained using the proposed learning method. The proposed segmentation model is based on an energy minimization framework that is novel in how it incorporates priors on the size of the segments in a way that is straightforward to implement. While other methods, such as normalized cuts, tend to produce segmentations of similar sizes, this method is able to overcome that problem and produce more realistic segmentations.
Show less - Date Issued
- 2012
- Identifier
- CFE0004595, ucf:49207
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004595
- Title
- Learning Hierarchical Representations for Video Analysis Using Deep Learning.
- Creator
-
Yang, Yang, Shah, Mubarak, Sukthankar, Gita, Da Vitoria Lobo, Niels, Stanley, Kenneth, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
With the exponential growth of the digital data, video content analysis (e.g., action, event recognition) has been drawing increasing attention from computer vision researchers. Effective modeling of the objects, scenes, and motions is critical for visual understanding. Recently there has been a growing interest in the bio-inspired deep learning models, which has shown impressive results in speech and object recognition. The deep learning models are formed by the composition of multiple non...
Show moreWith the exponential growth of the digital data, video content analysis (e.g., action, event recognition) has been drawing increasing attention from computer vision researchers. Effective modeling of the objects, scenes, and motions is critical for visual understanding. Recently there has been a growing interest in the bio-inspired deep learning models, which has shown impressive results in speech and object recognition. The deep learning models are formed by the composition of multiple non-linear transformations of the data, with the goal of yielding more abstract and ultimately more useful representations. The advantages of the deep models are three fold: 1) They learn the features directly from the raw signal in contrast to the hand-designed features. 2) The learning can be unsupervised, which is suitable for large data where labeling all the data is expensive and unpractical. 3) They learn a hierarchy of features one level at a time and the layerwise stacking of feature extraction, this often yields better representations.However, not many deep learning models have been proposed to solve the problems in video analysis, especially videos ``in a wild''. Most of them are either dealing with simple datasets, or limited to the low-level local spatial-temporal feature descriptors for action recognition. Moreover, as the learning algorithms are unsupervised, the learned features preserve generative properties rather than the discriminative ones which are more favorable in the classification tasks. In this context, the thesis makes two major contributions.First, we propose several formulations and extensions of deep learning methods which learn hierarchical representations for three challenging video analysis tasks, including complex event recognition, object detection in videos and measuring action similarity. The proposed methods are extensively demonstrated for each work on the state-of-the-art challenging datasets. Besides learning the low-level local features, higher level representations are further designed to be learned in the context of applications. The data-driven concept representations and sparse representation of the events are learned for complex event recognition; the representations for object body parts and structures are learned for object detection in videos; and the relational motion features and similarity metrics between video pairs are learned simultaneously for action verification.Second, in order to learn discriminative and compact features, we propose a new feature learning method using a deep neural network based on auto encoders. It differs from the existing unsupervised feature learning methods in two ways: first it optimizes both discriminative and generative properties of the features simultaneously, which gives our features a better discriminative ability. Second, our learned features are more compact, while the unsupervised feature learning methods usually learn a redundant set of over-complete features. Extensive experiments with quantitative and qualitative results on the tasks of human detection and action verification demonstrate the superiority of our proposed models.
Show less - Date Issued
- 2013
- Identifier
- CFE0004964, ucf:49593
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004964
- Title
- Prototype Development in General Purpose Representation and Association Machine Using Communication Theory.
- Creator
-
Li, Huihui, Wei, Lei, Rahnavard, Nazanin, Vosoughi, Azadeh, Da Vitoria Lobo, Niels, Wang, Wei, University of Central Florida
- Abstract / Description
-
Biological system study has been an intense research area in neuroscience and cognitive science for decades of years. Biological human brain is created as an intelligent system that integrates various types of sensor information and processes them intelligently. Neurons, as activated brain cells help the brain to make instant and rough decisions. From the 1950s, researchers start attempting to understand the strategies the biological system employs, then eventually translate them into machine...
Show moreBiological system study has been an intense research area in neuroscience and cognitive science for decades of years. Biological human brain is created as an intelligent system that integrates various types of sensor information and processes them intelligently. Neurons, as activated brain cells help the brain to make instant and rough decisions. From the 1950s, researchers start attempting to understand the strategies the biological system employs, then eventually translate them into machine-based algorithms. Modern computers have been developed to meet our need to handle computational tasks which our brains are not capable of performing with precision and speed. While in these existing man-made intelligent systems, most of them are designed for specific purposes. The modern computers solve sophistic problems based on fixed representation and association formats, instead of employing versatile approaches to explore the unsolved problems.Because of the above limitations of the conventional machines, General Purpose Representation and Association Machine (GPRAM) System is proposed to focus on using a versatile approach with hierarchical representation and association structures to do a quick and rough assessment on multitasks. Through lessons learned from neuroscience, error control coding and digital communications, a prototype of GPRAM system by employing (7,4) Hamming codes and short Low-Density Parity Check (LDPC) codes is implemented. Types of learning processes are presented, which prove the capability of GPRAM for handling multitasks.Furthermore, a study of low resolution simple patterns and face images recognition using an Image Processing Unit (IPU) structure for GPRAM system is presented. IPU structure consists of a randomly constructed LDPC code, an iterative decoder, a switch and scaling, and decision devices. All the input images have been severely degraded to mimic human Visual Information Variability (VIV) experienced in human visual system. The numerical results show that 1) IPU can reliably recognize simple pattern images in different shapes and sizes; 2) IPU demonstrates an excellent multi-class recognition performance for the face images with high degradation. Our results are comparable to popular machine learning recognition methods towards images without any quality degradation; 3) A bunch of methods have been discussed for improving IPU recognition performance, e.g. designing various detection and power scaling methods, constructing specific LDPC codes with large minimum girth, etc.Finally, novel methods to optimize M-ary PSK, M-ary DPSK, and dual-ring QAM signaling with non-equal symbol probabilities over AWGN channels are presented. In digital communication systems, MPSK, MDPSK, and dual-ring QAM signaling with equiprobable symbols have been well analyzed and widely used in practice. Inspired by bio-systems, we suggest investigating signaling with non-equiprobable symbol probabilities, since in bio-systems it is highly-unlikely to follow the ideal setting and uniform construction of single type of system. The results show that the optimizing system has lower error probabilities than conventional systems and the improvements are dramatic. Even though the communication systems are used as the testing environment, clearly, our final goal is to extend current communication theory to accommodate or better understand bio-neural information processing systems.
Show less - Date Issued
- 2017
- Identifier
- CFE0006758, ucf:51846
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006758
- Title
- Holistic Representations for Activities and Crowd Behaviors.
- Creator
-
Solmaz, Berkan, Shah, Mubarak, Da Vitoria Lobo, Niels, Jha, Sumit, Ilie, Marcel, Moore, Brian, University of Central Florida
- Abstract / Description
-
In this dissertation, we address the problem of analyzing the activities of people in a variety of scenarios, this is commonly encountered in vision applications. The overarching goal is to devise new representations for the activities, in settings where individuals or a number of people may take a part in specific activities. Different types of activities can be performed by either an individual at the fine level or by several people constituting a crowd at the coarse level. We take into...
Show moreIn this dissertation, we address the problem of analyzing the activities of people in a variety of scenarios, this is commonly encountered in vision applications. The overarching goal is to devise new representations for the activities, in settings where individuals or a number of people may take a part in specific activities. Different types of activities can be performed by either an individual at the fine level or by several people constituting a crowd at the coarse level. We take into account the domain specific information for modeling these activities. The summary of the proposed solutions is presented in the following.The holistic description of videos is appealing for visual detection and classification tasks for several reasons including capturing the spatial relations between the scene components, simplicity, and performance [1, 2, 3]. First, we present a holistic (global) frequency spectrum based descriptor for representing the atomic actions performed by individuals such as: bench pressing, diving, hand waving, boxing, playing guitar, mixing, jumping, horse riding, hula hooping etc. We model and learn these individual actions for classifying complex user uploaded videos. Our method bypasses the detection of interest points, the extraction of local video descriptors and the quantization of local descriptors into a code book; it represents each video sequence as a single feature vector. This holistic feature vector is computed by applying a bank of 3-D spatio-temporal filters on the frequency spectrum of a video sequence; hence it integrates the information about the motion and scene structure. We tested our approach on two of the most challenging datasets, UCF50 [4] and HMDB51 [5], and obtained promising results which demonstrates the robustness and the discriminative power of our holistic video descriptor for classifying videos of various realistic actions.In the above approach, a holistic feature vector of a video clip is acquired by dividing the video into spatio-temporal blocks then concatenating the features of the individual blocks together. However, such a holistic representation blindly incorporates all the video regions regardless of their contribution in classification. Next, we present an approach which improves the performance of the holistic descriptors for activity recognition. In our novel method, we improve the holistic descriptors by discovering the discriminative video blocks. We measure the discriminativity of a block by examining its response to a pre-learned support vector machine model. In particular, a block is considered discriminative if it responds positively for positive training samples, and negatively for negative training samples. We pose the problem of finding the optimal blocks as a problem of selecting a sparse set of blocks, which maximizes the total classifier discriminativity. Through a detailed set of experiments on benchmark datasets [6, 7, 8, 9, 5, 10], we show that our method discovers the useful regions in the videos and eliminates the ones which are confusing for classification, which results in significant performance improvement over the state-of-the-art.In contrast to the scenes where an individual performs a primitive action, there may be scenes with several people, where crowd behaviors may take place. For these types of scenes the traditional approaches for recognition will not work due to severe occlusion and computational requirements. The number of videos is limited and the scenes are complicated, hence learning these behaviors is not feasible. For this problem, we present a novel approach, based on the optical flow in a video sequence, for identifying five specific and common crowd behaviors in visual scenes. In the algorithm, the scene is overlaid by a grid of particles, initializing a dynamical system which is derived from the optical flow. Numerical integration of the optical flow provides particle trajectories that represent the motion in the scene. Linearization of the dynamical system allows a simple and practical analysis and classification of the behavior through the Jacobian matrix. Essentially, the eigenvalues of this matrix are used to determine the dynamic stability of points in the flow and each type of stability corresponds to one of the five crowd behaviors. The identified crowd behaviors are (1) bottlenecks: where many pedestrians/vehicles from various points in the scene are entering through one narrow passage, (2) fountainheads: where many pedestrians/vehicles are emerging from a narrow passage only to separate in many directions, (3) lanes: where many pedestrians/vehicles are moving at the same speeds in the same direction, (4) arches or rings: where the collective motion is curved or circular, and (5) blocking: where there is a opposing motion and desired movement of groups of pedestrians is somehow prohibited. The implementation requires identifying a region of interest in the scene, and checking the eigenvalues of the Jacobian matrix in that region to determine the type of flow, that corresponds to various well-defined crowd behaviors. The eigenvalues are only considered in these regions of interest, consistent with the linear approximation and the implied behaviors. Since changes in eigenvalues can mean changes in stability, corresponding to changes in behavior, we can repeat the algorithm over clips of long video sequences to locate changes in behavior. This method was tested on over real videos representing crowd and traffic scenes.
Show less - Date Issued
- 2013
- Identifier
- CFE0004941, ucf:49638
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004941
- Title
- Exploring 3D User Interface Technologies for Improving the Gaming Experience.
- Creator
-
Kulshreshth, Arun, Laviola II, Joseph, Hughes, Charles, Da Vitoria Lobo, Niels, Masuch, Maic, University of Central Florida
- Abstract / Description
-
3D user interface technologies have the potential to make games more immersive (&) engaging and thus potentially provide a better user experience to gamers. Although 3D user interface technologies are available for games, it is still unclear how their usage affects game play and if there are any user performance benefits. A systematic study of these technologies in game environments is required to understand how game play is affected and how we can optimize the usage in order to achieve...
Show more3D user interface technologies have the potential to make games more immersive (&) engaging and thus potentially provide a better user experience to gamers. Although 3D user interface technologies are available for games, it is still unclear how their usage affects game play and if there are any user performance benefits. A systematic study of these technologies in game environments is required to understand how game play is affected and how we can optimize the usage in order to achieve better game play experience.This dissertation seeks to improve the gaming experience by exploring several 3DUI technologies. In this work, we focused on stereoscopic 3D viewing (to improve viewing experience) coupled with motion based control, head tracking (to make games more engaging), and faster gesture based menu selection (to reduce cognitive burden associated with menu interaction while playing). We first studied each of these technologies in isolation to understand their benefits for games. We present the results of our experiments to evaluate benefits of stereoscopic 3D (when coupled with motion based control) and head tracking in games. We discuss the reasons behind these findings and provide recommendations for game designers who want to make use of these technologies to enhance gaming experiences. We also present the results of our experiments with finger-based menu selection techniques with an aim to find out the fastest technique. Based on these findings, we custom designed an air-combat game prototype which simultaneously uses stereoscopic 3D, head tracking, and finger-count shortcuts to prove that these technologies could be useful for games if the game is designed with these technologies in mind. Additionally, to enhance depth discrimination and minimize visual discomfort, the game dynamically optimizes stereoscopic 3D parameters (convergence and separation) based on the user's look direction. We conducted a within subjects experiment where we examined performance data and self-reported data on users perception of the game. Our results indicate that participants performed significantly better when all the 3DUI technologies (stereoscopic 3D, head-tracking and finger-count gestures) were available simultaneously with head tracking as a dominant factor. We explore the individual contribution of each of these technologies to the overall gaming experience and discuss the reasons behind our findings.Our experiments indicate that 3D user interface technologies could make gaming experience better if used effectively. The games must be designed to make use of the 3D user interface technologies available in order to provide a better gaming experience to the user. We explored a few technologies as part of this work and obtained some design guidelines for future game designers. We hope that our work will serve as the framework for the future explorations of making games better using 3D user interface technologies.
Show less - Date Issued
- 2015
- Identifier
- CFE0005643, ucf:50190
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005643
- Title
- Visual Analysis of Extremely Dense Crowded Scenes.
- Creator
-
Idrees, Haroon, Shah, Mubarak, Da Vitoria Lobo, Niels, Stanley, Kenneth, Atia, George, Saleh, Bahaa, University of Central Florida
- Abstract / Description
-
Visual analysis of dense crowds is particularly challenging due to large number of individuals, occlusions, clutter, and fewer pixels per person which rarely occur in ordinary surveillance scenarios. This dissertation aims to address these challenges in images and videos of extremely dense crowds containing hundreds to thousands of humans. The goal is to tackle the fundamental problems of counting, detecting and tracking people in such images and videos using visual and contextual cues that...
Show moreVisual analysis of dense crowds is particularly challenging due to large number of individuals, occlusions, clutter, and fewer pixels per person which rarely occur in ordinary surveillance scenarios. This dissertation aims to address these challenges in images and videos of extremely dense crowds containing hundreds to thousands of humans. The goal is to tackle the fundamental problems of counting, detecting and tracking people in such images and videos using visual and contextual cues that are automatically derived from the crowded scenes.For counting in an image of extremely dense crowd, we propose to leverage multiple sources of information to compute an estimate of the number of individuals present in the image. Our approach relies on sources such as low confidence head detections, repetition of texture elements (using SIFT), and frequency-domain analysis to estimate counts, along with confidence associated with observing individuals, in an image region. Furthermore, we employ a global consistency constraint on counts using Markov Random Field which caters for disparity in counts in local neighborhoods and across scales. We tested this approach on crowd images with the head counts ranging from 94 to 4543 and obtained encouraging results. Through this approach, we are able to count people in images of high-density crowds unlike previous methods which are only applicable to videos of low to medium density crowded scenes. However, the counting procedure just outputs a single number for a large patch or an entire image. With just the counts, it becomes difficult to measure the counting error for a query image with unknown number of people. For this, we propose to localize humans by finding repetitive patterns in the crowd image. Starting with detections from an underlying head detector, we correlate them within the image after their selection through several criteria: in a pre-defined grid, locally, or at multiple scales by automatically finding the patches that are most representative of recurring patterns in the crowd image. Finally, the set of generated hypotheses is selected using binary integer quadratic programming with Special Ordered Set (SOS) Type 1 constraints.Human Detection is another important problem in the analysis of crowded scenes where the goal is to place a bounding box on visible parts of individuals. Primarily applicable to images depicting medium to high density crowds containing several hundred humans, it is a crucial pre-requisite for many other visual tasks, such as tracking, action recognition or detection of anomalous behaviors, exhibited by individuals in a dense crowd. For detecting humans, we explore context in dense crowds in the form of locally-consistent scale prior which captures the similarity in scale in local neighborhoods with smooth variation over the image. Using the scale and confidence of detections obtained from an underlying human detector, we infer scale and confidence priors using Markov Random Field. In an iterative mechanism, the confidences of detections are modified to reflect consistency with the inferred priors, and the priors are updated based on the new detections. The final set of detections obtained are then reasoned for occlusion using Binary Integer Programming where overlaps and relations between parts of individuals are encoded as linear constraints. Both human detection and occlusion reasoning in this approach are solved with local neighbor-dependent constraints, thereby respecting the inter-dependence between individuals characteristic to dense crowd analysis. In addition, we propose a mechanism to detect different combinations of body parts without requiring annotations for individual combinations.Once human detection and localization is performed, we then use it for tracking people in dense crowds. Similar to the use of context as scale prior for human detection, we exploit it in the form of motion concurrence for tracking individuals in dense crowds. The proposed method for tracking provides an alternative and complementary approach to methods that require modeling of crowd flow. Simultaneously, it is less likely to fail in the case of dynamic crowd flows and anomalies by minimally relying on previous frames. The approach begins with the automatic identification of prominent individuals from the crowd that are easy to track. Then, we use Neighborhood Motion Concurrence to model the behavior of individuals in a dense crowd, this predicts the position of an individual based on the motion of its neighbors. When the individual moves with the crowd flow, we use Neighborhood Motion Concurrence to predict motion while leveraging five-frame instantaneous flow in case of dynamically changing flow and anomalies. All these aspects are then embedded in a framework which imposes hierarchy on the order in which positions of individuals are updated. The results are reported on eight sequences of medium to high density crowds and our approach performs on par with existing approaches without learning or modeling patterns of crowd flow.We experimentally demonstrate the efficacy and reliability of our algorithms by quantifying the performance of counting, localization, as well as human detection and tracking on new and challenging datasets containing hundreds to thousands of humans in a given scene.
Show less - Date Issued
- 2014
- Identifier
- CFE0005508, ucf:50367
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005508
- Title
- Taming Wild Faces: Web-Scale, Open-Universe Face Identification in Still and Video Imagery.
- Creator
-
Ortiz, Enrique, Shah, Mubarak, Sukthankar, Rahul, Da Vitoria Lobo, Niels, Wang, Jun, Li, Xin, University of Central Florida
- Abstract / Description
-
With the increasing pervasiveness of digital cameras, the Internet, and social networking, there is a growing need to catalog and analyze large collections of photos and videos. In this dissertation, we explore unconstrained still-image and video-based face recognition in real-world scenarios, e.g. social photo sharing and movie trailers, where people of interest are recognized and all others are ignored. In such a scenario, we must obtain high precision in recognizing the known identities,...
Show moreWith the increasing pervasiveness of digital cameras, the Internet, and social networking, there is a growing need to catalog and analyze large collections of photos and videos. In this dissertation, we explore unconstrained still-image and video-based face recognition in real-world scenarios, e.g. social photo sharing and movie trailers, where people of interest are recognized and all others are ignored. In such a scenario, we must obtain high precision in recognizing the known identities, while accurately rejecting those of no interest.Recent advancements in face recognition research has seen Sparse Representation-based Classification (SRC) advance to the forefront of competing methods. However, its drawbacks, slow speed and sensitivity to variations in pose, illumination, and occlusion, have hindered its wide-spread applicability. The contributions of this dissertation are three-fold: 1. For still-image data, we propose a novel Linearly Approximated Sparse Representation-based Classification (LASRC) algorithm that uses linear regression to perform sample selection for l1-minimization, thus harnessing the speed of least-squares and the robustness of SRC. On our large dataset collected from Facebook, LASRC performs equally to standard SRC with a speedup of 100-250x.2. For video, applying the popular l1-minimization for face recognition on a frame-by-frame basis is prohibitively expensive computationally, so we propose a new algorithm Mean Sequence SRC (MSSRC) that performs video face recognition using a joint optimization leveraging all of the available video data and employing the knowledge that the face track frames belong to the same individual. Employing MSSRC results in a speedup of 5x on average over SRC on a frame-by-frame basis.3. Finally, we make the observation that MSSRC sometimes assigns inconsistent identities to the same individual in a scene that could be corrected based on their visual similarity. Therefore, we construct a probabilistic affinity graph combining appearance and co-occurrence similarities to model the relationship between face tracks in a video. Using this relationship graph, we employ random walk analysis to propagate strong class predictions among similar face tracks, while dampening weak predictions. Our method results in a performance gain of 15.8% in average precision over using MSSRC alone.
Show less - Date Issued
- 2014
- Identifier
- CFE0005536, ucf:50313
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005536
- Title
- Robust Subspace Estimation Using Low-Rank Optimization. Theory and Applications in Scene Reconstruction, Video Denoising, and Activity Recognition.
- Creator
-
Oreifej, Omar, Shah, Mubarak, Da Vitoria Lobo, Niels, Stanley, Kenneth, Lin, Mingjie, Li, Xin, University of Central Florida
- Abstract / Description
-
In this dissertation, we discuss the problem of robust linear subspace estimation using low-rank optimization and propose three formulations of it. We demonstrate how these formulations can be used to solve fundamental computer vision problems, and provide superior performance in terms of accuracy and running time.Consider a set of observations extracted from images (such as pixel gray values, local features, trajectories...etc). If the assumption that these observations are drawn from a...
Show moreIn this dissertation, we discuss the problem of robust linear subspace estimation using low-rank optimization and propose three formulations of it. We demonstrate how these formulations can be used to solve fundamental computer vision problems, and provide superior performance in terms of accuracy and running time.Consider a set of observations extracted from images (such as pixel gray values, local features, trajectories...etc). If the assumption that these observations are drawn from a liner subspace (or can be linearly approximated) is valid, then the goal is to represent each observation as a linear combination of a compact basis, while maintaining a minimal reconstruction error. One of the earliest, yet most popular, approaches to achieve that is Principal Component Analysis (PCA). However, PCA can only handle Gaussian noise, and thus suffers when the observations are contaminated with gross and sparse outliers. To this end, in this dissertation, we focus on estimating the subspace robustly using low-rank optimization, where the sparse outliers are detected and separated through the `1 norm. The robust estimation has a two-fold advantage: First, the obtained basis better represents the actual subspace because it does not include contributions from the outliers. Second, the detected outliers are often of a specific interest in many applications, as we will show throughout this thesis. We demonstrate four different formulations and applications for low-rank optimization. First, we consider the problem of reconstructing an underwater sequence by removing the turbulence caused by the water waves. The main drawback of most previous attempts to tackle this problem is that they heavily depend on modelling the waves, which in fact is ill-posed since the actual behavior of the waves along with the imaging process are complicated and include several noise components; therefore, their results are not satisfactory. In contrast, we propose a novel approach which outperforms the state-of-the-art. The intuition behind our method is that in a sequence where the water is static, the frames would be linearly correlated. Therefore, in the presence of water waves, we may consider the frames as noisy observations drawn from a the subspace of linearly correlated frames. However, the noise introduced by the water waves is not sparse, and thus cannot directly be detected using low-rank optimization. Therefore, we propose a data-driven two-stage approach, where the first stage (")sparsifies(") the noise, and the second stage detects it. The first stage leverages the temporal mean of the sequence to overcome the structured turbulence of the waves through an iterative registration algorithm. The result of the first stage is a high quality mean and a better structured sequence; however, the sequence still contains unstructured sparse noise. Thus, we employ a second stage at which we extract the sparse errors from the sequence through rank minimization. Our method converges faster, and drastically outperforms state of the art on all testing sequences. Secondly, we consider a closely related situation where an independently moving object is also present in the turbulent video. More precisely, we consider video sequences acquired in a desert battlefields, where atmospheric turbulence is typically present, in addition to independently moving targets. Typical approaches for turbulence mitigation follow averaging or de-warping techniques. Although these methods can reduce the turbulence, they distort the independently moving objects which can often be of great interest. Therefore, we address the problem of simultaneous turbulence mitigation and moving object detection. We propose a novel three-term low-rank matrix decomposition approach in which we decompose the turbulence sequence into three components: the background, the turbulence, and the object. We simplify this extremely difficult problem into a minimization of nuclear norm, Frobenius norm, and L1 norm. Our method is based on two observations: First, the turbulence causes dense and Gaussian noise, and therefore can be captured by Frobenius norm, while the moving objects are sparse and thus can be captured by L1 norm. Second, since the object's motion is linear and intrinsically different than the Gaussian-like turbulence, a Gaussian-based turbulence model can be employed to enforce an additional constraint on the search space of the minimization. We demonstrate the robustness of our approach on challenging sequences which are significantly distorted with atmospheric turbulence and include extremely tiny moving objects. In addition to robustly detecting the subspace of the frames of a sequence, we consider using trajectories as observations in the low-rank optimization framework. In particular, in videos acquired by moving cameras, we track all the pixels in the video and use that to estimate the camera motion subspace. This is particularly useful in activity recognition, which typically requires standard preprocessing steps such as motion compensation, moving object detection, and object tracking. The errors from the motion compensation step propagate to the object detection stage, resulting in miss-detections, which further complicates the tracking stage, resulting in cluttered and incorrect tracks. In contrast, we propose a novel approach which does not follow the standard steps, and accordingly avoids the aforementioned difficulties. Our approach is based on Lagrangian particle trajectories which are a set of dense trajectories obtained by advecting optical flow over time, thus capturing the ensemble motions of a scene. This is done in frames of unaligned video, and no object detection is required. In order to handle the moving camera, we decompose the trajectories into their camera-induced and object-induced components. Having obtained the relevant object motion trajectories, we compute a compact set of chaotic invariant features, which captures the characteristics of the trajectories. Consequently, a SVM is employed to learn and recognize the human actions using the computed motion features. We performed intensive experiments on multiple benchmark datasets, and obtained promising results.Finally, we consider a more challenging problem referred to as complex event recognition, where the activities of interest are complex and unconstrained. This problem typically pose significant challenges because it involves videos of highly variable content, noise, length, frame size ... etc. In this extremely challenging task, high-level features have recently shown a promising direction as in [53, 129], where core low-level events referred to as concepts are annotated and modeled using a portion of the training data, then each event is described using its content of these concepts. However, because of the complex nature of the videos, both the concept models and the corresponding high-level features are significantly noisy. In order to address this problem, we propose a novel low-rank formulation, which combines the precisely annotated videos used to train the concepts, with the rich high-level features. Our approach finds a new representation for each event, which is not only low-rank, but also constrained to adhere to the concept annotation, thus suppressing the noise, and maintaining a consistent occurrence of the concepts in each event. Extensive experiments on large scale real world dataset TRECVID Multimedia Event Detection 2011 and 2012 demonstrate that our approach consistently improves the discriminativity of the high-level features by a significant margin.
Show less - Date Issued
- 2013
- Identifier
- CFE0004732, ucf:49835
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004732
- Title
- Understanding images and videos using context.
- Creator
-
Vaca Castano, Gonzalo, Da Vitoria Lobo, Niels, Shah, Mubarak, Mikhael, Wasfy, Jones, W Linwood, Wiegand, Rudolf, University of Central Florida
- Abstract / Description
-
In computer vision, context refers to any information that may influence how visual media are understood.(&)nbsp; Traditionally, researchers have studied the influence of several sources of context in relation to the object detection problem in images. In this dissertation, we present a multifaceted review of the problem of context.(&)nbsp; Context is analyzed as a source of improvement in the object detection problem, not only in images but also in videos. In the case of images, we also...
Show moreIn computer vision, context refers to any information that may influence how visual media are understood.(&)nbsp; Traditionally, researchers have studied the influence of several sources of context in relation to the object detection problem in images. In this dissertation, we present a multifaceted review of the problem of context.(&)nbsp; Context is analyzed as a source of improvement in the object detection problem, not only in images but also in videos. In the case of images, we also investigate the influence of the semantic context, determined by objects, relationships, locations, and global composition, to achieve a general understanding of the image content as a whole. In our research, we also attempt to solve the related problem of finding the context associated with visual media. Given a set of visual elements (images), we want to extract the context that can be commonly associated with these images in order to remove ambiguity. The first part of this dissertation concentrates on achieving image understanding using semantic context.(&)nbsp; In spite of the recent success in tasks such as image classi?cation, object detection, image segmentation, and the progress on scene understanding, researchers still lack clarity about computer comprehension of the content of the image as a whole. Hence, we propose a Top-Down Visual Tree (TDVT) image representation that allows the encoding of the content of the image as a hierarchy of objects capturing their importance, co-occurrences, and type of relations. A novel Top-Down Tree LSTM network is presented to learn about the image composition from the training images and their TDVT representations. Given a test image, our algorithm detects objects and determine the hierarchical structure that they form, encoded as a TDVT representation of the image.A single image could have multiple interpretations that may lead to ambiguity about the intentionality of an image.(&)nbsp; What if instead of having only a single image to be interpreted, we have multiple images that represent the same topic. The second part of this dissertation covers how to extract the context information shared by multiple images. We present a method to determine the topic that these images represent. We accomplish this task by transferring tags from an image retrieval database, and by performing operations in the textual space of these tags. As an application, we also present a new image retrieval method that uses multiple images as input. Unlike earlier works that focus either on using just a single query image or using multiple query images with views of the same instance, the new image search paradigm retrieves images based on the underlying concepts that the input images represent.Finally, in the third part of this dissertation, we analyze the influence of context in videos. In this case, the temporal context is utilized to improve scene identification and object detection. We focus on egocentric videos, where agents require some time to change from one location to another. Therefore, we propose a Conditional Random Field (CRF) formulation, which penalizes short-term changes of the scene identity to improve the scene identity accuracy.(&)nbsp; We also show how to improve the object detection outcome by re-scoring the results based on the scene identity of the tested frame. We present a Support Vector Regression (SVR) formulation in the case that explicit knowledge of the scene identity is available during training time. In the case that explicit scene labeling is not available, we propose an LSTM formulation that considers the general appearance of the frame to re-score the object detectors.
Show less - Date Issued
- 2017
- Identifier
- CFE0006922, ucf:51703
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006922