Current Search: Shah, Mubarak
- Title
- VIDEO CONTENT EXTRACTION: SCENE SEGMENTATION, LINKING AND ATTENTION DETECTION.
- Creator
- Zhai, Yun, Shah, Mubarak, University of Central Florida
- Abstract / Description
- In this fast-paced digital age, a vast number of videos are produced every day, such as movies, TV programs, personal home videos, surveillance video, etc. This places a high demand for effective video data analysis and management techniques. In this dissertation, we have developed new techniques for segmentation, linking and understanding of video scenes. First, we have developed a video scene segmentation framework that segments the video content into story units. Then, a linking method is designed to find the semantic correlation between video scenes/stories. Finally, to better understand the video content, we have developed a spatiotemporal attention detection model for videos. Our general framework for temporal scene segmentation, which is applicable to several video domains, is formulated in a statistical fashion and uses the Markov chain Monte Carlo (MCMC) technique to determine the boundaries between video scenes. In this approach, a set of arbitrary scene boundaries is initialized at random locations and is then automatically updated using two types of updates: diffusions and jumps. The posterior probability of the target distribution of the number of scenes and their corresponding boundary locations is computed based on the model priors and the data likelihood. Model parameter updates are controlled by the MCMC hypothesis ratio test, and samples are collected to generate the final scene boundaries. The major contribution of the proposed framework is two-fold: (1) it is able to find weak boundaries as well as strong boundaries, i.e., it does not rely on a fixed threshold; (2) it can be applied to different video domains. We have tested the proposed method on two video domains: home videos and feature films. On both of these domains we have obtained very accurate results, achieving an average of 86% precision and 92% recall for home video segmentation, and 83% precision and 83% recall for feature films. The video scene segmentation process divides videos into meaningful units. These segments (or stories) can be further organized into clusters based on their content similarities. In the second part of this dissertation, we have developed a novel concept tracking method, which links news stories that focus on the same topic across multiple sources. The semantic linkage between the news stories is reflected in the combination of both their visual content and speech content. Visually, each news story is represented by a set of key frames, which may or may not contain human faces. The facial key frames are linked based on the analysis of the extended facial regions, and the non-facial key frames are correlated using global matching. The textual similarity of the stories is expressed in terms of the normalized textual similarity between the keywords in the speech content of the stories. The developed framework has also been applied to the task of story ranking, which computes the interestingness of the stories. The proposed semantic linking framework and the story ranking method have both been tested on a set of 60 hours of open-benchmark video data (CNN and ABC news) from the TRECVID 2003 evaluation forum organized by NIST. Above 90% system precision has been achieved for the story linking task. The combination of both visual and speech cues has boosted the un-normalized recall by 15%. We have developed PEGASUS, a content-based video retrieval system with fast speech and visual feature indexing and search.
The system is available on the web: http://pegasus.cs.ucf.edu:8080/index.jsp. Given a video sequence, one important task is to understand what is present or what is happening in its content. To achieve this goal, target objects or activities need to be detected, localized and recognized in the spatial and/or temporal domain. In the last portion of this dissertation, we present a visual attention detection method, which automatically generates the spatiotemporal saliency maps of input video sequences. The saliency map is later used in the detection of interesting objects and activities in videos by significantly narrowing the search range. Our spatiotemporal visual attention model generates the saliency maps based on both the spatial and temporal signals in the video sequences. In the temporal attention model, motion contrast is computed based on the planar motions (homographies) between images, which are estimated by applying RANSAC on point correspondences in the scene. To compensate for the non-uniformity of the spatial distribution of interest points, the spanning areas of motion segments are incorporated in the motion contrast computation. In the spatial attention model, we have developed a fast method for computing pixel-level saliency maps using color histograms of images. Finally, a dynamic fusion technique is applied to combine the temporal and spatial saliency maps, where temporal attention dominates the spatial model when large motion contrast exists, and vice versa. The proposed spatiotemporal attention framework has been applied extensively to multiple video sequences to highlight interesting objects and motions present in the sequences. We have achieved a user satisfaction rate of 82% for point-level attention detection and over 92% for object-level attention detection.
- Date Issued
- 2006
- Identifier
- CFE0001216, ucf:46944
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001216
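A minimal, hedged sketch of the MCMC-style boundary sampling described in the abstract above (diffusion and jump updates accepted by a ratio test). The shot features, the simple prior, the Gaussian-style likelihood, and the proposal scheme below are illustrative assumptions, not the dissertation's implementation:

```python
import numpy as np

def log_posterior(boundaries, shot_features, sigma=1.0):
    """Toy log-posterior: a prior penalizing many scenes plus a likelihood
    that rewards shots staying close to their segment mean.

    shot_features: (N, D) array of per-shot features (assumed given).
    boundaries: sorted list of segment start indices, always containing 0.
    """
    log_p = -len(boundaries)                      # simple prior on scene count
    segments = np.split(np.arange(len(shot_features)), boundaries[1:])
    for seg in segments:
        if len(seg) < 2:
            continue
        center = shot_features[seg].mean(axis=0)
        log_p -= np.sum((shot_features[seg] - center) ** 2) / (2 * sigma ** 2)
    return log_p

def mcmc_segment(shot_features, n_iter=5000, rng=np.random.default_rng(0)):
    n = len(shot_features)
    boundaries = [0, n // 2]                      # arbitrary initialization
    best = list(boundaries)
    cur_lp = log_posterior(boundaries, shot_features)
    for _ in range(n_iter):
        prop = list(boundaries)
        move = rng.choice(["diffuse", "add", "remove"])
        if move == "diffuse" and len(prop) > 1:   # shift an existing boundary
            i = rng.integers(1, len(prop))
            prop[i] = int(np.clip(prop[i] + rng.integers(-3, 4), 1, n - 1))
        elif move == "add":                       # jump: add a boundary
            prop.append(int(rng.integers(1, n)))
        elif move == "remove" and len(prop) > 1:  # jump: drop a boundary
            prop.pop(rng.integers(1, len(prop)))
        prop = sorted(set(prop))
        new_lp = log_posterior(prop, shot_features)
        if np.log(rng.random()) < new_lp - cur_lp:  # Metropolis ratio test
            boundaries, cur_lp = prop, new_lp
            if cur_lp > log_posterior(best, shot_features):
                best = list(boundaries)
    return best                                   # collected scene boundaries
```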
- Title
- SCENE MONITORING WITH A FOREST OF COOPERATIVE SENSORS.
- Creator
- Javed, Omar, Shah, Mubarak, University of Central Florida
- Abstract / Description
- In this dissertation, we present vision-based scene interpretation methods for real-time monitoring of people and vehicles within a busy environment, using a forest of co-operative electro-optical (EO) sensors. We have developed novel video understanding algorithms with learning capability to detect and categorize people and vehicles, track them within a camera, and hand off this information across multiple networked cameras for multi-camera tracking. The ability to learn prevents the need for extensive manual intervention, site models and camera calibration, and provides adaptability to changing environmental conditions. For object detection and categorization in the video stream, a two-step detection procedure is used. First, regions of interest are determined using a novel hierarchical background subtraction algorithm that uses color and gradient information for interest region detection. Second, objects are located and classified from within these regions using a weakly supervised learning mechanism based on co-training that employs motion and appearance features. The main contribution of this approach is that it is an online procedure in which separate views (features) of the data are used for co-training, while the combined view (all features) is used to make classification decisions in a single boosted framework. The advantage of this approach is that it requires only a few initial training samples and can automatically adjust its parameters online to improve detection and classification performance. Once objects are detected and classified, they are tracked in individual cameras. Single-camera tracking is performed using a voting-based approach that utilizes color and shape cues to establish correspondence in individual cameras. The tracker has the capability to handle multiple occluded objects. Next, the objects are tracked across a forest of cameras with non-overlapping views. This is a hard problem for two reasons. First, the observations of an object are often widely separated in time and space when viewed from non-overlapping cameras. Second, the appearance of an object in one camera view might be very different from its appearance in another camera view due to differences in illumination, pose and camera properties. To deal with the first problem, the system learns the inter-camera relationships to constrain track correspondences. These relationships are learned in the form of a multivariate probability density of space-time variables (object entry and exit locations, velocities, and inter-camera transition times) using Parzen windows. To handle the appearance change of an object as it moves from one camera to another, we show that all color transfer functions from a given camera to another camera lie in a low-dimensional subspace. The tracking algorithm learns this subspace by using probabilistic principal component analysis and uses it for appearance matching. The proposed system learns the camera topology and the subspace of inter-camera color transfer functions during a training phase. Once the training is complete, correspondences are assigned using a maximum a posteriori (MAP) estimation framework that uses both location and appearance cues. Extensive experiments and deployment of this system in realistic scenarios have demonstrated the robustness of the proposed methods. The proposed system was able to detect and classify targets, and seamlessly tracked them across multiple cameras.
It also generated a summary, in terms of key frames and a textual description of trajectories, for a monitoring officer's final analysis and response decision. This level of interpretation was the goal of our research effort, and we believe that it is a significant step forward in the development of intelligent systems that can deal with the complexities of real-world scenarios.
- Date Issued
- 2005
- Identifier
- CFE0000497, ucf:46362
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000497
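A hedged sketch of one piece of the pipeline described above: modeling the space-time relationship between two non-overlapping cameras with a Parzen-window (kernel) density over entry/exit locations and transition times, then scoring a candidate hand-off. The feature layout and bandwidths are illustrative assumptions:

```python
import numpy as np

class ParzenTransitionModel:
    """Kernel density over (exit_x, exit_y, entry_x, entry_y, transit_time)."""

    def __init__(self, bandwidths):
        self.h = np.asarray(bandwidths, dtype=float)  # per-dimension bandwidth
        self.samples = []                             # training observations

    def add_training_pair(self, exit_xy, entry_xy, transit_time):
        # one observed hand-off between the two cameras during training
        self.samples.append(np.r_[exit_xy, entry_xy, transit_time])

    def density(self, exit_xy, entry_xy, transit_time):
        """Parzen estimate with a product Gaussian kernel."""
        x = np.r_[exit_xy, entry_xy, transit_time]
        X = np.asarray(self.samples)
        z = (x - X) / self.h                          # (N, 5) standardized diffs
        kernels = np.exp(-0.5 * np.sum(z ** 2, axis=1))
        norm = np.prod(self.h) * (2 * np.pi) ** (len(self.h) / 2)
        return kernels.sum() / (len(X) * norm)

# usage: train on observed hand-offs, then score a candidate correspondence
model = ParzenTransitionModel(bandwidths=[15, 15, 15, 15, 2.0])
model.add_training_pair((310, 240), (20, 260), 8.5)
model.add_training_pair((305, 250), (25, 255), 9.0)
p = model.density((312, 238), (22, 262), 8.8)  # high => plausible same object
```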
- Title
- IMAGE BASED VIEW SYNTHESIS.
- Creator
- Xiao, Jiangjian, Shah, Mubarak, University of Central Florida
- Abstract / Description
- This dissertation deals with the image-based approach to synthesizing a virtual scene using sparse images or a video sequence, without the use of 3D models. In our scenario, a real dynamic or static scene is captured by a set of un-calibrated images from different viewpoints. After automatically recovering the geometric transformations between these images, a series of photo-realistic virtual views can be rendered and a virtual environment covered by several static cameras can be synthesized. This image-based approach has applications in object recognition, object transfer, video synthesis and video compression. In this dissertation, I have contributed to several sub-problems related to image-based view synthesis. Before image-based view synthesis can be performed, images need to be segmented into individual objects. Assuming that a scene can approximately be described by multiple planar regions, I have developed a robust and novel approach to automatically extract a set of affine or projective transformations induced by these regions, correctly detect the occlusion pixels over multiple consecutive frames, and accurately segment the scene into several motion layers. First, a number of seed regions are determined using correspondences in two frames, and the seed regions are expanded and outliers are rejected by employing the graph cuts method integrated with a level set representation. Next, these initial regions are merged into several initial layers according to motion similarity. Third, the occlusion order constraints on multiple frames are explored, which guarantee that the occlusion area increases with the temporal order in a short period and effectively maintain segmentation consistency over multiple consecutive frames. Then the correct layer segmentation is obtained by using a graph cuts algorithm, and the occlusions between the overlapping layers are explicitly determined. Several experimental results are presented to show that our approach is effective and robust. Recovering the geometric transformations among images of a scene is a prerequisite step for image-based view synthesis. I have developed a wide baseline matching algorithm to identify the correspondences between two un-calibrated images, and to further determine the geometric relationship between the images, such as the epipolar geometry or a projective transformation. In our approach, a set of salient features, edge-corners, is detected to provide robust and consistent matching primitives. Then, based on the Singular Value Decomposition (SVD) of an affine matrix, we effectively quantize the search space into two independent subspaces for the rotation angle and the scaling factor, and we use a two-stage affine matching algorithm to obtain robust matches between the two frames. The experimental results on a number of wide baseline images strongly demonstrate that our matching method outperforms state-of-the-art algorithms even under significant camera motion, illumination variation, occlusion, and self-similarity. Given the wide baseline matches among images, I have developed a novel method for dynamic view morphing. Dynamic view morphing deals with scenes containing moving objects in the presence of camera motion. The objects can be rigid or non-rigid, and each of them can move in any orientation or direction. The proposed method can generate a series of continuous and physically accurate intermediate views from only two reference images, without any knowledge about the 3D structure.
The procedure consists of three steps: segmentation, morphing and post-warping. Given a boundary connection constraint, the source and target scenes are segmented into several layers for morphing. Based on the decomposition of the affine transformation between corresponding points, we uniquely determine a physically correct path for post-warping using the least-distortion method. I have successfully generalized the dynamic scene synthesis problem from a simple scene with only rotation to a dynamic scene containing non-rigid objects. My method can handle dynamic rigid or non-rigid objects, including complicated objects such as humans. Finally, I have also developed a novel algorithm for tri-view morphing. This is an efficient image-based method to navigate a scene based on only three wide-baseline un-calibrated images, without the explicit use of a 3D model. After automatically recovering corresponding points between each pair of images using our wide baseline matching method, an accurate trifocal plane is extracted from the trifocal tensor implied by these three images. Next, employing a trinocular-stereo algorithm and a barycentric blending technique, we generate an arbitrary novel view to navigate the scene in a 2D space. Furthermore, after self-calibration of the cameras, a 3D model can also be correctly augmented into the virtual environment synthesized by the tri-view morphing algorithm. We have applied our view morphing framework to several interesting applications: 4D video synthesis, automatic target recognition, and multi-view morphing.
- Date Issued
- 2004
- Identifier
- CFE0000218, ucf:46276
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000218
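A small illustrative sketch of the SVD idea mentioned in the abstract above: decomposing the 2x2 linear part of an affine transform into a rotation and scaling, which is what lets the rotation angle and scaling factor be searched independently. The decomposition below is the standard one, offered only as a sketch of how the dissertation uses it, not as its matching algorithm:

```python
import numpy as np

def decompose_affine_2x2(A):
    """Split the 2x2 linear part of an affine map into rotation and scale.

    A = U @ diag(s) @ Vt (SVD); R = U @ Vt is the closest rotation, and the
    singular values give the scaling along the principal axes.
    """
    U, s, Vt = np.linalg.svd(A)
    R = U @ Vt
    if np.linalg.det(R) < 0:              # keep a proper rotation (det = +1)
        U[:, -1] *= -1
        s[-1] *= -1
        R = U @ Vt
    angle = np.arctan2(R[1, 0], R[0, 0])  # rotation angle in radians
    return angle, s                       # s: scale factors along the axes

# usage: a transform that rotates by ~30 degrees and scales by ~1.5
theta, scale = np.radians(30), 1.5
A = scale * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
angle, scales = decompose_affine_2x2(A)
print(np.degrees(angle), scales)          # ~30.0, ~[1.5, 1.5]
```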
- Title
- MULTI-VIEW APPROACHES TO TRACKING, 3D RECONSTRUCTION AND OBJECT CLASS DETECTION.
- Creator
- Khan, Saad, Shah, Mubarak, University of Central Florida
- Abstract / Description
- Multi-camera systems are becoming ubiquitous and have found application in a variety of domains, including surveillance, immersive visualization, sports entertainment and movie special effects, amongst others. From a computer vision perspective, the challenging task is how to most efficiently fuse information from multiple views in the absence of detailed calibration information and with a minimum of human intervention. This thesis presents a new approach to fuse foreground likelihood information from multiple views onto a reference view without explicit processing in 3D space, thereby circumventing the need for complete calibration. Our approach uses a homographic occupancy constraint (HOC), which states that if a foreground pixel has a piercing point that is occupied by a foreground object, then the pixel warps to foreground regions in every view under homographies induced by the reference plane, in effect using cameras as occupancy detectors. Using the HOC we are able to resolve occlusions and robustly determine ground plane localizations of the people in the scene. To find tracks, we obtain ground localizations over a window of frames and stack them, creating a space-time volume. Regions belonging to the same person form contiguous spatio-temporal tracks that are clustered using a graph cuts segmentation approach. Second, we demonstrate that the HOC is equivalent to performing visual hull intersection in the image plane, resulting in a cross-sectional slice of the object. The process is extended to multiple planes parallel to the reference plane in the framework of plane-to-plane homologies. Slices from multiple planes are accumulated and the 3D structure of the object is segmented out. Unlike other visual hull based approaches that use 3D constructs like visual cones, voxels or polygonal meshes requiring calibrated views, ours is purely image-based and uses only 2D constructs, i.e., planar homographies between views. This feature also renders it conducive to graphics hardware acceleration. The current GPU implementation of our approach is capable of fusing 60 views (480x720 pixels) at the rate of 50 slices/second. We then present an extension of this approach to reconstructing non-rigid articulated objects from monocular video sequences. The basic premise is that due to motion of the object, scene occupancies are blurred out with non-occupancies in a manner analogous to motion-blurred imagery. Using the HOC and a novel construct, the temporal occupancy point (TOP), we are able to fuse multiple views of non-rigid objects obtained from a monocular video sequence. The result is a set of blurred scene occupancy images in the corresponding views, where the value at each pixel corresponds to the fraction of the total time duration that the pixel observed an occupied scene location. We then use a motion de-blurring approach to de-blur the occupancy images and obtain the 3D structure of the non-rigid object. In the final part of this thesis, we present an object class detection method employing 3D models of rigid objects constructed using the above 3D reconstruction approach. Instead of using a complicated mechanism for relating multiple 2D training views, our approach establishes spatial connections between these views by mapping them directly to the surface of a 3D model. To generalize the model for object class detection, features from supplemental views (obtained from Google Image search) are also considered.
Given a 2D test image, correspondences between the 3D feature model and the testing view are identified by matching the detected features. Based on the 3D locations of the corresponding features, several hypotheses of viewing planes can be made. The one with the highest confidence is then used to detect the object using feature location matching. Performance of the proposed method has been evaluated by using the PASCAL VOC challenge dataset and promising results are demonstrated.
- Date Issued
- 2008
- Identifier
- CFE0002073, ucf:47593
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002073
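A hedged sketch of the homographic occupancy constraint described above: per-view foreground likelihood maps are warped onto a reference view under plane-induced homographies and fused, so that only ground locations that look like foreground in every view keep a high occupancy score. The homographies and foreground maps are assumed to be given, and OpenCV is used only for the warps; this is an illustration, not the thesis's full system:

```python
import numpy as np
import cv2

def fuse_foreground_likelihoods(fg_maps, homographies, ref_shape):
    """Fuse foreground likelihoods from several views onto a reference plane.

    fg_maps:      list of float foreground-likelihood images, one per view.
    homographies: list of 3x3 homographies mapping each view to the reference
                  view, induced by the ground (reference) plane.
    ref_shape:    (height, width) of the reference view.
    """
    h, w = ref_shape
    fused = np.ones((h, w), dtype=np.float32)
    for fg, H in zip(fg_maps, homographies):
        warped = cv2.warpPerspective(fg.astype(np.float32), H, (w, h))
        # multiplicative fusion: a ground location must appear as foreground
        # in every view to keep a high occupancy score
        fused *= warped
    return fused

# usage sketch: threshold the fused map to localize people on the ground plane
# occupancy = fuse_foreground_likelihoods(fg_maps, Hs, (480, 640)) > 0.5
```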
- Title
- SEMANTIC VIDEO RETRIEVAL USING HIGH LEVEL CONTEXT.
- Creator
- Aytar, Yusuf, Shah, Mubarak, University of Central Florida
- Abstract / Description
- Video retrieval, searching and retrieving videos relevant to a user-defined query, is one of the most popular topics in both real-life applications and multimedia research. This thesis employs concepts from Natural Language Understanding in solving the video retrieval problem. Our main contribution is the utilization of semantic word similarity measures for video retrieval, through the trained concept detectors and the visual co-occurrence relations between such concepts. We propose two methods for content-based retrieval of videos: (1) a method for retrieving a new concept (a concept that is not known to the system and for which no annotation is available) using semantic word similarity and visual co-occurrence, which is an unsupervised method; and (2) a method for retrieval of videos based on their relevance to a user-defined text query, using semantic word similarity and the visual content of videos. For evaluation purposes, we mainly used the automatic search and high-level feature extraction test sets of the TRECVID'06 and TRECVID'07 benchmarks. These two data sets consist of 250 hours of multilingual news video captured from American, Arabic, German and Chinese TV channels. Although our method for retrieving a new concept is unsupervised, it outperforms the trained concept detectors (which are supervised) on 7 out of 20 test concepts, and overall it performs very close to the trained detectors. On the other hand, our visual-content-based semantic retrieval method performs more than 100% better than the text-based retrieval method. This shows that significantly good retrieval results can be obtained using visual content alone.
- Date Issued
- 2008
- Identifier
- CFE0002158, ucf:47521
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002158
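A hedged sketch of the unsupervised new-concept retrieval idea described above: each shot is scored for a concept with no trained detector by combining the outputs of detectors for known concepts, weighted by semantic word similarity. The similarity function, the toy similarity table, and the linear weighting are illustrative stand-ins, not the thesis's exact formulation:

```python
import numpy as np

def score_new_concept(new_concept, known_concepts, detector_scores, word_sim):
    """Rank shots for a concept that has no trained detector.

    known_concepts:  list of concept names with trained detectors.
    detector_scores: (num_shots, num_known) array of detector outputs in [0, 1].
    word_sim:        callable (a, b) -> semantic similarity in [0, 1]
                     (e.g., a WordNet- or corpus-based measure).
    """
    weights = np.array([word_sim(new_concept, c) for c in known_concepts])
    if weights.sum() == 0:
        return np.zeros(detector_scores.shape[0])
    weights = weights / weights.sum()      # normalize detector contributions
    return detector_scores @ weights       # per-shot relevance score

# usage sketch with a toy similarity table (hypothetical values)
sim_table = {("boat", "water"): 0.7, ("boat", "road"): 0.1, ("boat", "sky"): 0.2}
word_sim = lambda a, b: sim_table.get((a, b), 0.0)
scores = score_new_concept("boat",
                           ["water", "road", "sky"],
                           np.array([[0.9, 0.1, 0.4],
                                     [0.2, 0.8, 0.3]]),
                           word_sim)
ranking = np.argsort(-scores)              # shots most relevant to "boat" first
```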
- Title
- TAMING CROWDED VISUAL SCENES.
- Creator
- Ali, Saad, Shah, Mubarak, University of Central Florida
- Abstract / Description
- Computer vision algorithms have played a pivotal role in commercial video surveillance systems for a number of years. However, a common weakness among these systems is their inability to handle crowded scenes. In this thesis, we have developed algorithms that overcome some of the challenges encountered in videos of crowded environments such as sporting events, religious festivals, parades, concerts, train stations, airports, and malls. We adopt a top-down approach by first performing a global-level analysis that locates dynamically distinct crowd regions within the video. This knowledge is then employed in the detection of abnormal behaviors and the tracking of individual targets within crowds. In addition, the thesis explores the utility of contextual information necessary for persistent tracking and re-acquisition of objects in crowded scenes. For the global-level analysis, a framework based on Lagrangian Particle Dynamics is proposed to segment the scene into dynamically distinct crowd regions or groupings. For this purpose, the spatial extent of the video is treated as a phase space of a time-dependent dynamical system in which transport from one region of the phase space to another is controlled by the optical flow. Next, a grid of particles is advected forward in time through the phase space using numerical integration to generate a "flow map". The flow map relates the initial positions of particles to their final positions. The spatial gradients of the flow map are used to compute a Cauchy-Green deformation tensor that quantifies the amount by which neighboring particles diverge over the length of the integration. The maximum eigenvalue of the tensor is used to construct a forward Finite Time Lyapunov Exponent (FTLE) field that reveals the attracting Lagrangian Coherent Structures (LCS). The same process is repeated by advecting the particles backward in time to obtain a backward FTLE field that reveals the repelling LCS. The attracting and repelling LCS are the time-dependent invariant manifolds of the phase space and correspond to the boundaries between dynamically distinct crowd flows. The forward and backward FTLE fields are combined to obtain one scalar field that is segmented using a watershed segmentation algorithm to obtain the labeling of distinct crowd-flow segments. Abnormal behaviors within the crowd are then localized by detecting changes in the number of crowd-flow segments over time. Next, the global-level knowledge of the scene generated by the crowd-flow segmentation is used as an auxiliary source of information for tracking an individual target within a crowd. This is achieved by developing a scene structure-based force model. This force model captures the notion that an individual, when moving in a particular scene, is subjected to global and local forces that are functions of the layout of that scene and the locomotive behavior of other individuals in his or her vicinity. The key ingredients of the force model are three floor fields that are inspired by research in the field of evacuation dynamics: namely, the Static Floor Field (SFF), the Dynamic Floor Field (DFF), and the Boundary Floor Field (BFF). These fields determine the probability of moving from one location to the next by converting the long-range forces into local forces. The SFF specifies regions of the scene that are attractive in nature, such as an exit location.
The DFF, which is based on the idea of active walker models, corresponds to the virtual traces created by the movements of nearby individuals in the scene. The BFF specifies influences exhibited by the barriers within the scene, such as walls and no-entry areas. By combining influence from all three fields with the available appearance information, we are able to track individuals in high-density crowds. The results are reported on real-world sequences of marathons and railway stations that contain thousands of people. A comparative analysis with respect to an appearance-based mean shift tracker is also conducted by generating the ground truth. The result of this analysis demonstrates the benefit of using floor fields in crowded scenes. The occurrence of occlusion is very frequent in crowded scenes due to the high number of interacting objects. To overcome this challenge, we propose an algorithm that augments a generic tracking algorithm to perform persistent tracking in crowded environments. The algorithm exploits contextual knowledge, which is divided into two categories: motion context (MC) and appearance context (AC). The MC is a collection of trajectories that are representative of the motion of the occluded or unobserved object. These trajectories belong to other moving individuals in a given environment. The MC is constructed using a clustering scheme based on the Lyapunov Characteristic Exponent (LCE), which measures the mean exponential rate of convergence or divergence of nearby trajectories in a given state space. Next, the MC is used to predict the location of the occluded or unobserved object in a regression framework. It is important to note that the LCE is used for measuring divergence between a pair of particles, while the FTLE field is obtained by computing the LCE for a grid of particles. The appearance context (AC) of a target object consists of its own appearance history and the appearance information of the other objects that are occluded. The intent is to make the appearance descriptor of the target object more discriminative with respect to other unobserved objects, thereby reducing the possible confusion between the unobserved objects upon re-acquisition. This is achieved by learning the distribution of the intra-class variation of each occluded object using all of its previous observations. In addition, a distribution of inter-class variation for each target-unobservable object pair is constructed. Finally, the re-acquisition decision is made using both the MC and the AC.
- Date Issued
- 2008
- Identifier
- CFE0002135, ucf:47507
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002135
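A hedged sketch of the global-level analysis described above: a particle grid is advected through an optical-flow field, the spatial gradients of the resulting flow map give the Cauchy-Green deformation tensor, and its largest eigenvalue yields an FTLE field whose ridges correspond to the LCS. The nearest-neighbor flow sampling and simple Euler integration are simplifying assumptions:

```python
import numpy as np

def advect_particles(flow_fn, shape, n_steps=20, dt=1.0):
    """Integrate a grid of particles through a time-dependent flow field.

    flow_fn(t) must return (u, v), each of shape `shape`, in pixels/frame.
    Returns the flow map: final (x, y) positions of particles started on a grid.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    for step in range(n_steps):
        u, v = flow_fn(step)
        xi = np.clip(xs.round().astype(int), 0, w - 1)  # nearest-neighbor sample
        yi = np.clip(ys.round().astype(int), 0, h - 1)
        xs = xs + dt * u[yi, xi]
        ys = ys + dt * v[yi, xi]
    return xs, ys

def ftle_field(xs, ys, T):
    """Forward FTLE from the spatial gradients of the flow map over time T."""
    dx_dy, dx_dx = np.gradient(xs)     # np.gradient returns (d/d row, d/d col)
    dy_dy, dy_dx = np.gradient(ys)
    ftle = np.zeros_like(xs)
    for i in range(xs.shape[0]):
        for j in range(xs.shape[1]):
            J = np.array([[dx_dx[i, j], dx_dy[i, j]],
                          [dy_dx[i, j], dy_dy[i, j]]])  # d(final)/d(initial)
            C = J.T @ J                                 # Cauchy-Green tensor
            lam_max = np.linalg.eigvalsh(C)[-1]
            ftle[i, j] = np.log(max(lam_max, 1e-12)) / (2.0 * abs(T))
    return ftle    # ridges of this field correspond to the LCS boundaries
```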
- Title
- MULTI-VIEW GEOMETRIC CONSTRAINTS FOR HUMAN ACTION RECOGNITION AND TRACKING.
- Creator
- Gritai, Alexei, Shah, Mubarak, University of Central Florida
- Abstract / Description
- Human actions are the essence of a human life and a natural product of the human mind. Analysis of human activities by a machine has attracted the attention of many researchers. This analysis is very important in a variety of domains, including surveillance, video retrieval, human-computer interaction, athlete performance investigation, etc. This dissertation makes three major contributions to the automatic analysis of human actions. First, we conjecture that the relationship between the body joints of two actors in the same posture can be described by a 3D rigid transformation. This transformation simultaneously captures different poses and various sizes and proportions. As a consequence of this conjecture, we show that there exists a fundamental matrix between the imaged positions of the body joints of two actors, if they are in the same posture. Second, we propose a novel projection model for cameras moving at a constant velocity in 3D space, Galilean cameras, derive the Galilean fundamental matrix, and apply it to human action recognition. Third, we propose a novel use of the invariant ratio of areas under an affine transformation, together with the epipolar geometry between two cameras, for 2D model-based tracking of human body joints. In the first part of the thesis, we propose an approach to match human actions using semantic correspondences between human bodies. These correspondences are used to provide geometric constraints between multiple anatomical landmarks (e.g., hands, shoulders, and feet) to match actions observed from different viewpoints and performed at different rates by actors of differing anthropometric proportions. The fact that the human body has approximately consistent anthropometric proportions allows for innovative use of the machinery of epipolar geometry to provide constraints for analyzing actions performed by people of different anthropometric sizes, while ensuring that changes in viewpoint do not affect matching. A novel measure, in terms of the rank of a matrix constructed only from image measurements of the locations of anatomical landmarks, is proposed to ensure that similar actions are accurately recognized. Finally, we describe how dynamic time warping can be used in conjunction with the proposed measure to match actions in the presence of nonlinear time warps. We demonstrate the versatility of our algorithm in a number of challenging sequences and applications, including action synchronization, odd one out, following the leader, analyzing periodicity, etc. Next, we extend the conventional model of image projection to video captured by a camera moving at constant velocity. We term such a moving camera a Galilean camera. To that end, we derive the spacetime projection and develop the corresponding epipolar geometry between two Galilean cameras. Both perspective imaging and linear pushbroom imaging form specializations of the proposed model, and we show how six different "fundamental" matrices, including the classic fundamental matrix, the Linear Pushbroom (LP) fundamental matrix, and a fundamental matrix relating Epipolar Plane Images (EPIs), are related and can be directly recovered from a Galilean fundamental matrix. We provide linear algorithms for estimating the parameters of the mapping between videos in the case of planar scenes. For applying the fundamental matrix between Galilean cameras to human action recognition, we propose a measure that has two important properties.
The first property makes it possible to recognize similar actions if their execution rates are linearly related. The second property allows recognizing actions in video captured by Galilean cameras. Thus, the proposed algorithm guarantees that actions can be correctly matched despite changes in view, execution rate, and anthropometric proportions of the actor, and even if the camera moves with constant velocity. Finally, we also propose a novel 2D model-based approach for tracking human body parts during articulated motion. The human body is modeled as a 2D stick figure of thirteen body joints, and an action is considered as a sequence of these stick figures. Given the locations of these joints in every frame of a model video and in the first frame of a test video, the joint locations are automatically estimated throughout the test video using two geometric constraints. First, the invariance of the ratio of areas under an affine transformation is used for initial estimation of the joint locations in the test video. Second, the epipolar geometry between the two cameras is used to refine these estimates. Using these estimated joint locations, the tracking algorithm determines the exact location of each landmark in the test video using the foreground silhouettes. The novelty of the proposed approach lies in the geometric formulation of human action models, the combination of the two geometric constraints for body joint prediction, and the handling of deviations in the anthropometry of individuals, viewpoints, execution rate, and style of performing the action. The proposed approach does not require extensive training and can easily adapt to a wide variety of articulated actions.
- Date Issued
- 2007
- Identifier
- CFE0001692, ucf:47199
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001692
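A hedged sketch of the epipolar conjecture in the abstract above: if two actors are in the same posture, their corresponding body-joint locations in two views should satisfy a single fundamental matrix, so the residual of a fitted F can act as a simplified posture-matching score. The 8-point fit and symmetric epipolar distance below are standard tools used as an illustration, not the dissertation's rank-based measure:

```python
import numpy as np
import cv2

def posture_match_score(joints_a, joints_b):
    """Score whether two actors are in the same posture across two views.

    joints_a, joints_b: (N, 2) arrays of corresponding body-joint image
    locations (e.g., N = 13 landmarks). If the posture is the same, the
    joints should satisfy one epipolar geometry, so the symmetric epipolar
    distance under the estimated fundamental matrix stays small.
    """
    pts_a = joints_a.astype(np.float64)
    pts_b = joints_b.astype(np.float64)
    F, _ = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_8POINT)
    if F is None:
        return np.inf
    ones = np.ones((len(pts_a), 1))
    xa = np.hstack([pts_a, ones])            # homogeneous coordinates, view 1
    xb = np.hstack([pts_b, ones])            # homogeneous coordinates, view 2
    la = xb @ F                              # epipolar lines in image 1
    lb = xa @ F.T                            # epipolar lines in image 2
    num = np.abs(np.sum(xa * la, axis=1))    # |x_b^T F x_a| per joint
    d = num / np.sqrt(la[:, 0] ** 2 + la[:, 1] ** 2) \
      + num / np.sqrt(lb[:, 0] ** 2 + lb[:, 1] ** 2)
    return d.mean()                          # small => consistent posture
```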
- Title
- LEARNING, DETECTION, REPRESENTATION, INDEXING AND RETRIEVAL OF MULTI-AGENT EVENTS IN VIDEOS.
- Creator
- Hakeem, Asaad, Shah, Mubarak, University of Central Florida
- Abstract / Description
- The world that we live in is a complex network of agents and their interactions, which are termed events. An instance of an event is composed of directly measurable low-level actions (which I term sub-events) having a temporal order. Also, the agents can act independently (e.g., voting) as well as collectively (e.g., scoring a touchdown in a football game) to perform an event. With the dawn of the new millennium, low-level vision tasks such as segmentation, object classification, and tracking have become fairly robust. But a representational gap still exists between low-level measurements and high-level understanding of video sequences. This dissertation is an effort to bridge that gap, and I propose novel learning, detection, representation, indexing and retrieval approaches for multi-agent events in videos. In order to achieve the goal of high-level understanding of videos, firstly, I apply statistical learning techniques to model multi-agent events. For that purpose, I use the training videos to model the events by estimating the conditional dependencies between sub-events. Thus, given a video sequence, I track the people (heads and hand regions) and objects using a Meanshift tracker. An underlying rule-based system detects the sub-events using the tracked trajectories of the people and objects, based on their relative motion. Next, an event model is constructed by estimating the sub-event dependencies, that is, how frequently sub-event B occurs given that sub-event A has occurred. The advantages of such an event model are two-fold. First, I do not require prior knowledge of the number of agents involved in an event. Second, no assumptions are made about the length of an event. Secondly, after learning the event models, I detect events in a novel video by using graph clustering techniques. To that end, I construct a graph of temporally ordered sub-events occurring in the novel video. Next, using the learnt event model, I estimate a weight matrix of conditional dependencies between sub-events in the novel video. Further application of Normalized Cut (a graph clustering technique) to the estimated weight matrix facilitates the detection of events in the novel video. The principal assumption made in this work is that the events are composed of highly correlated chains of sub-events that have high conditional dependency (association) within the cluster and relatively low conditional dependency (disassociation) between clusters. Thirdly, in order to represent the detected events, I propose an extension of the CASE representation of natural languages. I extend CASE to allow the representation of temporal structure between sub-events. Also, in order to capture both multi-agent and multi-threaded events, I introduce a hierarchical CASE representation of events in terms of sub-events and case-lists. The essence of the proposition is that, based on the temporal relationships of the agent motions and a description of its state, it is possible to build a formal description of an event. Furthermore, I recognize the importance of representing the variations in the temporal order of sub-events that may occur in an event, and encode the temporal probabilities directly into my event representation. The proposed extended representation with probabilistic temporal encoding is termed P-CASE, which allows a plausible means of interface between users and the computer. Using the P-CASE representation, I automatically encode the event ontology from training videos.
This offers a significant advantage, since the domain experts do not have to go through the tedious task of determining the structure of events by browsing all the videos. Finally, I utilize the event representation for indexing and retrieval of events. Given the different instances of a particular event, I index the events using the P-CASE representation. Next, given a query in the P-CASE representation, event retrieval is performed using a two-level search. At the first level, a maximum likelihood estimate of the query event with the different indexed event models is computed. This provides the maximum matching event. At the second level, a matching score is obtained for all the event instances belonging to the maximum matched event model, using a weighted Jaccard similarity measure. Extensive experimentation was conducted for the detection, representation, indexing and retrieval of multiple agent events in videos of the meeting, surveillance, and railroad monitoring domains. To that end, the Semoran system was developed that takes in user inputs in any of the three forms for event retrieval: using predefined queries in P-CASE representation, using custom queries in P-CASE representation, or query by example video. The system then searches the entire database and returns the matched videos to the user. I used seven standard video datasets from the computer vision community as well as my own videos for testing the robustness of the proposed methods.
- Date Issued
- 2007
- Identifier
- CFE0001620, ucf:47163
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001620
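A hedged sketch of the detection step described above: conditional dependencies between temporally ordered sub-events are estimated from training sequences, and highly associated sub-events in a novel video are grouped with a normalized-cut style spectral clustering. The counting scheme, the toy vocabulary, and the use of scikit-learn's SpectralClustering are simplifying assumptions:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def dependency_matrix(sequences, vocab):
    """Estimate P(B follows A) from training sequences of sub-event labels."""
    idx = {s: i for i, s in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    occur = np.zeros(len(vocab))
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):        # temporally ordered pairs
            counts[idx[a], idx[b]] += 1
            occur[idx[a]] += 1
    return counts / np.maximum(occur[:, None], 1)

def detect_event_groups(novel_sequence, model, vocab, n_events=2):
    """Group the sub-events of a novel video into events via spectral clustering."""
    idx = {s: i for i, s in enumerate(vocab)}
    n = len(novel_sequence)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            a, b = novel_sequence[i], novel_sequence[j]
            W[i, j] = W[j, i] = model[idx[a], idx[b]]  # learned dependency
    W += 1e-6                                          # keep the graph connected
    labels = SpectralClustering(n_clusters=n_events,
                                affinity="precomputed",
                                random_state=0).fit_predict(W)
    return labels                                      # cluster id per sub-event

# usage sketch with hypothetical sub-event labels
vocab = ["approach", "pickup", "leave", "enter_car", "drive_off"]
train = [["approach", "pickup", "leave"], ["enter_car", "drive_off"]]
model = dependency_matrix(train, vocab)
events = detect_event_groups(["approach", "pickup", "leave",
                              "enter_car", "drive_off"], model, vocab)
```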
- Title
- KNOWLEDGE BASED MEASUREMENT OF ENHANCING BRAIN TISSUE IN ANISOTROPIC MR IMAGERY.
- Creator
- Leach, Eric, Shah, Mubarak, University of Central Florida
- Abstract / Description
- Medical image analysis has emerged as an important field in the computer vision community. In this thesis, two important issues in medical imaging are addressed, and a solution for each is derived and synergistically combined into one coherent system. Firstly, a novel approach is proposed for High Resolution Volume (HRV) construction by combining different frequency components at multiple levels, which are separated by using a multi-resolution pyramid structure. Current clinical imaging protocols make use of multiple orthogonal low-resolution scans to measure the size of the tumor. The highly anisotropic data result in difficulty and even errors in tumor assessment. In previous approaches, simple interpolation has been used to construct HRVs from multiple low-resolution volumes (LRVs), which fails when large inter-plane spacing is present. In our approach, Laplacian pyramids containing band-pass contents are first computed from the registered LRVs. The Laplacian images are expanded along their low-resolution axes separately and then fused at each level. A Gaussian pyramid is recovered from the fused Laplacian pyramid, where the volume at the bottom level of the Gaussian pyramid is the constructed HRV. The effectiveness of the proposed approach is validated by using simulated images. The method has also been applied to real clinical data and promising experimental results are demonstrated. Secondly, a new knowledge-based framework to automatically quantify the volume of enhancing tissue in brain MR images is proposed. Our approach provides an objective and consistent way to evaluate disease progression and assess the treatment plan. In our approach, enhanced regions are first located by computing the difference between the aligned pre- and post-contrast T1 MR images. Since some normal tissues may also become enhanced by the administration of Gd-DTPA, using the intensity difference alone may not distinguish normal tissue from tumor. Thus, we propose a new knowledge-based method employing knowledge of anatomical structures from a probabilistic brain atlas and the prior distribution of brain tumors to identify the real enhancing tissue. Our approach has two main advantages: (i) the results are invariant to image contrast changes, due to the use of the probabilistic knowledge-based framework, and (ii) using segmented regions instead of independent pixels makes the approach much less sensitive to small registration errors and image noise. The obtained results are compared to the ground truth for validation, and it is shown that the proposed method can achieve accurate and consistent measurements.
- Date Issued
- 2007
- Identifier
- CFE0001803, ucf:47378
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001803
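A hedged 2D sketch of the pyramid-based fusion idea described above: Laplacian pyramids of two registered low-resolution inputs are built, band-pass levels are fused (here by keeping the stronger-magnitude coefficient per pixel), and the fused pyramid is collapsed. The dissertation works with anisotropic 3D volumes expanded along their low-resolution axes; this simplified version fuses two same-size 2D images:

```python
import numpy as np
import cv2

def laplacian_pyramid(img, levels=4):
    gp = [img.astype(np.float32)]
    for _ in range(levels):
        gp.append(cv2.pyrDown(gp[-1]))
    lp = []
    for i in range(levels):
        up = cv2.pyrUp(gp[i + 1], dstsize=(gp[i].shape[1], gp[i].shape[0]))
        lp.append(gp[i] - up)              # band-pass detail at this level
    lp.append(gp[-1])                      # low-pass residual at the top
    return lp

def fuse_pyramids(lp_a, lp_b):
    fused = []
    for a, b in zip(lp_a, lp_b):
        pick_a = np.abs(a) >= np.abs(b)    # keep the stronger detail coefficient
        fused.append(np.where(pick_a, a, b))
    return fused

def collapse(lp):
    img = lp[-1]
    for level in reversed(lp[:-1]):
        img = cv2.pyrUp(img, dstsize=(level.shape[1], level.shape[0])) + level
    return img

# usage sketch:
# fused = collapse(fuse_pyramids(laplacian_pyramid(a), laplacian_pyramid(b)))
```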
- Title
- OBJECT ASSOCIATION ACROSS MULTIPLE MOVING CAMERAS IN PLANAR SCENES.
- Creator
- Sheikh, Yaser, Shah, Mubarak, University of Central Florida
- Abstract / Description
- In this dissertation, we address the problem of object detection and object association across multiple cameras over large areas that are well modeled by planes. We present a unifying probabilistic framework that captures the underlying geometry of planar scenes, and present algorithms to estimate geometric relationships between different cameras, which are subsequently used for co-operative association of objects. We first present a local object detection scheme that has three fundamental innovations over existing approaches. First, the model of the intensities of image pixels as independent random variables is challenged, and it is asserted that useful correlation exists in the intensities of spatially proximal pixels. This correlation is exploited to sustain high levels of detection accuracy in the presence of dynamic scene behavior, nominal misalignments and motion due to parallax. By using a non-parametric density estimation method over a joint domain-range representation of image pixels, complex dependencies between the domain (location) and range (color) are directly modeled. We present a model of the background as a single probability density. Second, temporal persistence is introduced as a detection criterion. Unlike previous approaches to object detection that detect objects by building adaptive models of the background, the foreground is modeled to augment the detection of objects (without explicit tracking), since objects detected in the preceding frame contain substantial evidence for detection in the current frame. Finally, the background and foreground models are used competitively in a MAP-MRF decision framework, stressing spatial context as a condition of detecting interesting objects, and the posterior function is maximized efficiently by finding the minimum cut of a capacitated graph. Experimental validation of the method is performed and presented on a diverse set of data. We then address the problem of associating objects across multiple cameras in planar scenes. Since cameras may be moving, there is a possibility of both spatial and temporal non-overlap in the fields of view of the cameras. We first address the case where spatial and temporal overlap can be assumed. Since the cameras are moving and often widely separated, direct appearance-based or proximity-based constraints cannot be used. Instead, we exploit geometric constraints on the relationship between the motion of each object across cameras to test multiple correspondence hypotheses, without assuming any prior calibration information. Here, there are three contributions. First, we present a statistically and geometrically meaningful means of evaluating a hypothesized correspondence between multiple objects in multiple cameras. Second, since multiple cameras exist, ensuring coherency in association, i.e., that transitive closure is maintained between more than two cameras, is an essential requirement. To ensure such coherency, we pose the problem of object association across cameras as a k-dimensional matching and use an approximation to find the association. We show that, under appropriate conditions, re-entering objects can also be re-associated with their original labels. Third, we show that as a result of associating objects across the cameras, a concurrent visualization of multiple aerial video streams is possible. Results are shown on a number of real and controlled scenarios with multiple objects observed by multiple cameras, validating our qualitative models.
Finally, we present a unifying framework for object association across multiple cameras and for estimating inter-camera homographies between (spatially and temporally) overlapping and non-overlapping cameras, whether they are moving or non-moving. By making use of explicit polynomial models for the kinematics of objects, we present algorithms to estimate inter-frame homographies. Under an appropriate measurement noise model, an EM algorithm is applied for the maximum likelihood estimation of the inter-camera homographies and kinematic parameters. Rather than fit curves locally (in each camera) and match them across views, we present an approach that simultaneously refines the estimates of inter-camera homographies and curve coefficients globally. We demonstrate the efficacy of the approach on a number of real sequences taken from aerial cameras, and report quantitative performance during simulations.
- Date Issued
- 2006
- Identifier
- CFE0001045, ucf:46797
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001045
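A hedged sketch of the joint domain-range background model described above: the background is represented as a single kernel density over (x, y, r, g, b) samples, so a new pixel is scored against background samples from its spatial neighborhood as well as its own location. The bandwidths and decision threshold are illustrative assumptions, and the temporal-persistence and MAP-MRF stages of the dissertation are omitted:

```python
import numpy as np

def background_likelihood(pixel_xyrgb, bg_samples, bandwidth):
    """KDE over the joint domain (x, y) and range (r, g, b).

    pixel_xyrgb: length-5 vector for the pixel being tested.
    bg_samples:  (N, 5) background samples gathered from previous frames
                 (each sample keeps its own image location).
    bandwidth:   length-5 vector of kernel bandwidths, e.g. a few pixels
                 spatially and ~10-20 intensity levels in color.
    """
    x = np.asarray(pixel_xyrgb, dtype=float)
    h = np.asarray(bandwidth, dtype=float)
    z = (x - bg_samples) / h
    k = np.exp(-0.5 * np.sum(z ** 2, axis=1))        # Gaussian product kernel
    norm = np.prod(h) * (2 * np.pi) ** 2.5
    return k.sum() / (len(bg_samples) * norm)

# usage sketch: mark a pixel as foreground when its background density is low
# is_fg = background_likelihood([x, y, r, g, b], bg_samples, [3, 3, 15, 15, 15]) < 1e-6
```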
- Title
- MULTIZOOM ACTIVITY RECOGNITION USING MACHINE LEARNING.
- Creator
- Smith, Raymond, Shah, Mubarak, University of Central Florida
- Abstract / Description
- In this thesis, we present a system for the detection of events in video. First, a multiview approach to automatically detect and track heads and hands in a scene is described. Then, by making use of epipolar, spatial, trajectory, and appearance constraints, objects are labeled consistently across cameras (zooms). Finally, we demonstrate a new machine learning paradigm, TemporalBoost, that can recognize events in video. One aspect of any machine learning algorithm is the feature set used. The approach taken here is to build a large set of activity features, though TemporalBoost itself is able to work with any feature set that other boosting algorithms use. We also show how multiple levels of zoom can cooperate to solve problems related to activity recognition.
- Date Issued
- 2005
- Identifier
- CFE0000865, ucf:46658
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000865
- Title
- VISUAL INSPECTION OF RAILROAD TRACKS.
- Creator
- Babenko, Pavel, Shah, Mubarak, University of Central Florida
- Abstract / Description
- In this dissertation, we have developed computer vision methods for the measurement of rail gauge and for reliable identification and localization of structural defects in railroad tracks. The rail gauge is the distance between the innermost sides of the two parallel steel rails. We have developed two methods for the evaluation of rail gauge. These methods were designed for different hardware setups: the first method works with two pairs of unaligned video cameras, while the second method works with depth maps generated by paired laser range scanners. We have also developed a method for the detection of rail defects such as damaged or missing rail fasteners, tie clips, and bolts, based on correlation and MACH filters. Lastly, to make our algorithms perform in real time, we have developed a GPU-based library for parallel computation of the above algorithms. Rail gauge is the most important measurement for track maintenance, because deviations in gauge indicate where potential defects may exist. We have developed a vision-based method for rail gauge estimation from a pair of industrial laser range scanners. In this approach, we start by building a 3D panorama of the rail out of a stack of input scans. After the panorama is built, we apply FIR circular filtering and Gaussian smoothing to the panorama buffer to suppress the noise component. In the next step, we attempt to segment the rail heads in the panorama buffer. We employ a method that detects railroad crossings or forks in the panorama buffer. If they are not present, we find the rail edge using a robust line fit. If they are present, we use an alternative approach: we predict the rail edge positions using a Kalman filter. In the next step, which is common to both conditions, we find the adjusted positions of the rail edges using additional clustering in the vicinity of the edge. We approximate the rail head surface by a third-degree polynomial and then fit two planar surfaces to find the exact position of the rail edge. Lastly, using the rail edge information, we calculate the rail gauge and smooth it with a 1D Gaussian filter. We have also developed a vision-based method to estimate the rail gauge from a pair of unaligned, high-shutter-speed, calibrated cameras. In this approach, the first step is to accurately detect the rail in each of the two non-overlapping synchronous images from the two cameras installed on the data collection cart by building an edge map, fitting lines to the edge map using the Hough transform, and detecting persistent edge lines using a history buffer. After the railroad track parts are detected, we segment out the rails to find the rail edges and calculate the rail gauge. We have demonstrated how to apply computer vision methods (correlation filters and MACH filters in particular) to find, in real time, different types of railroad elements with fixed or similar appearance, such as railroad clips, bolts, and rail plates. Template-based approaches for object detection (correlation filters) directly compare grayscale image data to a predefined model or template. The drawback of correlation filters has always been that they are neither scale nor rotation invariant; thus, many different filters are needed if either the scale or the rotation changes. The application of many filters cannot be done in real time. We have succeeded in overcoming this difficulty by using the parallel computation technology that is widely available in the GPUs of most advanced graphics cards.
We have developed a library, MinGPU, which facilitates the use of GPUs for Computer Vision, and have also developed a MinGPU-based library of several Computer Vision methods, which includes, among others, an implementation of correlation filters on the GPU. We have achieved a true positive rate of 0.98 for fastener detection using our GPU implementation of MACH filters. Besides correlation filters, MinGPU includes implementations of Lucas-Kanade Optical Flow, image homographies, edge detectors and discrete filters, image pyramids, morphology operations, and some graphics primitives. We have shown that the MinGPU implementation of homographies speeds up execution approximately 600 times versus a C implementation and 8000 times versus a Matlab implementation. MinGPU is built upon a reusable core and thus is an easily expandable library. With the help of MinGPU, our algorithms run in real time.
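The core operation the GPU correlation-filter code accelerates is frequency-domain correlation of an image with a filter template. Below is a plain NumPy reference sketch of that operation (not the MinGPU implementation); the image and template sizes are arbitrary examples.

```python
import numpy as np

def correlate_fft(image, filt):
    """Circular cross-correlation of `image` with template `filt` via the FFT.
    The peak of the correlation surface marks the most likely location of the
    railroad element (fastener, clip, bolt) the filter was built for."""
    F_img = np.fft.fft2(image)
    F_filt = np.fft.fft2(filt, s=image.shape)          # zero-pad filter to image size
    return np.real(np.fft.ifft2(F_img * np.conj(F_filt)))

# usage sketch: locate the strongest response
img = np.random.rand(480, 640).astype(np.float32)      # placeholder frame
tpl = np.random.rand(64, 64).astype(np.float32)        # placeholder filter
peak = np.unravel_index(np.argmax(correlate_fft(img, tpl)), img.shape)
```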
- Date Issued
- 2009
- Identifier
- CFE0002895, ucf:48038
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002895
- Title
- MODELING SCENES AND HUMAN ACTIVITIES IN VIDEOS.
- Creator
-
Basharat, Arslan, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
In this dissertation, we address the problem of understanding human activities in videos by developing a two-pronged approach: coarse-level modeling of scene activities and fine-level modeling of individual activities. At the coarse level, where the resolution of the video is low, we rely on person tracks. At the fine level, richer features are available to identify different parts of the human body; therefore, we rely on body joint tracks. There are three main goals of this dissertation: (1) identify unusual activities at the coarse level, (2) recognize different activities at the fine level, and (3) predict the behavior for synthesizing and tracking activities at the fine level. The first goal is addressed by modeling activities at the coarse level through two novel and complementary approaches. The first approach learns the behavior of individuals by capturing the patterns of motion and size of objects in a compact model. The probability density function (pdf) at each pixel is modeled as a multivariate Gaussian Mixture Model (GMM), which is learned using unsupervised expectation maximization (EM). In contrast, the second approach learns the interaction of object pairs concurrently present in the scene. This can be useful in detecting more complex activities than those modeled by the first approach. We use a 14-dimensional Kernel Density Estimation (KDE) that captures the motion and size of concurrently tracked objects. The proposed models have been successfully used to automatically detect activities like unusual person drop-off and pickup, jaywalking, etc. The second and third goals of modeling human activities at the fine level are addressed by employing concepts from the theory of chaos and non-linear dynamical systems. We show that the proposed model is useful for recognition and prediction of the underlying dynamics of human activities. We treat the trajectories of human body joints as the observed time series generated from an underlying dynamical system. The observed data is used to reconstruct a phase (or state) space of appropriate dimension by employing the delay-embedding technique. This transformation is performed without assuming an exact model of the underlying dynamics and provides a characteristic representation that proves to be vital for recognition and prediction tasks. For recognition, properties of the phase space are captured in terms of dynamical and metric invariants, which include the Lyapunov exponent, correlation integral, and correlation dimension. A composite feature vector containing these invariants represents the action and is used for classification. For prediction, kernel regression is used in the phase space to compute predictions with a specified initial condition. This approach has the advantage of modeling dynamics without making any assumptions about the exact form (polynomial, radial basis, etc.) of the mapping function. We demonstrate the utility of these predictions for human activity synthesis and tracking.
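A minimal sketch of the fine-level modeling step described above: reconstruct a phase space from one coordinate of a joint trajectory by delay embedding, then predict the next value with Nadaraya-Watson kernel regression in that space. The embedding dimension, delay and bandwidth below are illustrative choices, not the values used in the thesis.

```python
import numpy as np

def delay_embed(x, dim=3, tau=2):
    """Build phase-space vectors [x(t), x(t+tau), ..., x(t+(dim-1)*tau)] from a 1D series."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

def kernel_predict(x, dim=3, tau=2, h=0.5):
    """Predict x[t+1] from the current phase-space point via kernel regression
    over all past (state, next-value) pairs."""
    X = delay_embed(x, dim, tau)
    states, targets = X[:-1], x[(dim - 1) * tau + 1 :]   # past states and their successors
    q = X[-1]                                             # current state
    w = np.exp(-np.sum((states - q) ** 2, axis=1) / (2 * h ** 2))
    return np.sum(w * targets) / (np.sum(w) + 1e-12)
```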
- Date Issued
- 2009
- Identifier
- CFE0002897, ucf:48042
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002897
- Title
- AUDIO AND VIDEO TEMPO ANALYSIS FOR DANCE DETECTION.
- Creator
-
Faircloth, Ryan, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
The amount of multimedia in existence has become so extensive that the organization of this data cannot be performed manually. Systems designed to manage such quantities need superior methods of understanding the information contained in the data. Aspects of Computer Vision deal with such problems for the understanding of image and video content. Additionally, large ontologies such as LSCOM are collections of feasible high-level concepts that are of interest to identify within multimedia content. While ontologies often include the activity of dance, it has received virtually no coverage in the Computer Vision literature in terms of actual detection. We demonstrate that training-based approaches are challenged by dance because the activity is defined by an unlimited set of movements, and therefore unreasonable amounts of training data would be required to recognize even a small portion of the immense possibilities for dance. In this thesis we present a non-training, tempo-based approach to dance detection which yields very good results when compared to another method with state-of-the-art performance for other common activities; the testing dataset contains videos acquired mostly through YouTube. The algorithm is based on one-dimensional analysis in which we perform visual beat detection through the computation of optical flow. Next we obtain a set of tempo hypotheses, and the final stage of our method tracks visual beats through a video sequence in order to determine the most likely tempo for the object motion. In this thesis we not only demonstrate the utility of visual beats for visual tempo detection but also demonstrate their existence in most of the common activities considered by state-of-the-art methods.
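A sketch of the one-dimensional tempo analysis idea: given a per-frame motion signal (for example, the mean optical-flow magnitude per frame, which is assumed to be precomputed here), visual beats appear as periodic peaks, and a dominant tempo can be read off the autocorrelation. The signal construction and the BPM search range are assumptions for illustration, not the thesis's exact hypothesis-tracking procedure.

```python
import numpy as np

def tempo_bpm(motion, fps=30.0, bpm_range=(40, 200)):
    """Estimate the dominant tempo of a 1D per-frame motion signal, in beats per minute."""
    x = motion - motion.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1 :]   # autocorrelation at lags >= 0
    lo = int(round(fps * 60.0 / bpm_range[1]))            # smallest lag of interest
    hi = int(round(fps * 60.0 / bpm_range[0]))            # largest lag of interest
    lag = lo + int(np.argmax(ac[lo : hi + 1]))            # strongest periodicity in range
    return 60.0 * fps / lag
```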
- Date Issued
- 2008
- Identifier
- CFE0002194, ucf:47900
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002194
- Title
- Video categorization using semantics and semiotics.
- Creator
-
Rasheed, Zeeshan, Shah, Mubarak, Engineering and Computer Science
- Abstract / Description
-
University of Central Florida College of Engineering Thesis; There is a great need to automatically segment, categorize, and annotate video data, and to develop efficient tools for browsing and searching. We believe that the categorization of videos can be achieved by exploring the concepts and meanings of the videos. This task requires bridging the gap between low-level content and high-level concepts (or semantics). Once a relationship is established between the low-level computable features of the video and its semantics, the user would be able to navigate through videos using concepts and ideas (for example, a user could extract only those scenes in an action film that actually contain fights) rather than sequentially browsing the whole video. However, this relationship must follow the norms of human perception and abide by the rules that are most often followed by the creators (directors) of these videos. These rules are called film grammar in video production literature. Like any natural language, this grammar has several dialects, but it has been acknowledged to be universal. Therefore, the knowledge of film grammar can be exploited effectively for the understanding of films. To interpret an idea using the grammar, we need to first understand the symbols, as in natural languages, and second, understand the rules of combination of these symbols to represent concepts. In order to develop algorithms that exploit this film grammar, it is necessary to relate the symbols of the grammar to computable video features.
- Date Issued
- 2003
- Identifier
- CFR0001717, ucf:52920
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFR0001717
- Title
- Edge Contours.
- Creator
-
Williams, Donna J., Shah, Mubarak A., Arts and Sciences
- Abstract / Description
-
University of Central Florida College of Arts and Sciences Thesis; The accuracy with which a computer vision system is able to identify objects in an image is heavily dependent upon the accuracy of the low-level processes that identify which points lie on the edges of an object. In order to remove noise and fine texture from an image, it is usually smoothed before edge detection is performed. This smoothing causes edges to be displaced from their actual location in the image. Knowledge about the changes that occur with different degrees of smoothing (scales) and the physical conditions that cause these changes is essential to proper interpretation of the results obtained. In this work the amount of delocalization and the magnitude of the response to the Normalized Gradient of Gaussian operator are analyzed as a function of σ, the standard deviation of the Gaussian. As a result of this analysis it was determined that edge points can be characterized by slope, contrast, and proximity to other edges. The analysis is also used to define how large the neighborhood of an edge point must be to guarantee that it contains the delocalized edge point at another scale when σ is known. Given this theoretical background, an algorithm was developed to obtain sequential lists of edge points. It uses multiple scales in order to combine the superior localization and detection of weak edges possible at smaller scales with the noise suppression of the larger scales. The edge contours obtained with this method are significantly better than those achieved with a single scale. A second algorithm was developed to allow sets of edge contour points to be represented as active contours so that interaction with a higher-level process is possible. This higher-level process could do such things as determine where corners or discontinuities could appear. The algorithm developed here allows hard constraints and represents a significant improvement in speed over previous algorithms allowing hard constraints, being linear rather than cubic.
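A generic sketch of the σ-normalized Gradient of Gaussian response discussed above: smooth at scale σ, take the gradient magnitude, and multiply by σ so responses are comparable across scales. This is a standard construction for illustration, not the thesis code; the example scales are arbitrary.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def normalized_gog(image, sigma):
    """sigma-normalized gradient-of-Gaussian magnitude of a 2D image."""
    # derivative along x (axis 1) with Gaussian smoothing along y (axis 0), and vice versa
    gx = gaussian_filter1d(gaussian_filter1d(image, sigma, axis=0), sigma, axis=1, order=1)
    gy = gaussian_filter1d(gaussian_filter1d(image, sigma, axis=1), sigma, axis=0, order=1)
    return sigma * np.hypot(gx, gy)

# responses at several scales: coarser scales suppress noise but delocalize edges
# responses = {s: normalized_gog(img, s) for s in (1.0, 2.0, 4.0)}
```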
- Date Issued
- 1989
- Identifier
- CFR0000160, ucf:52912
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFR0000160
- Title
- OBJECT TRACKING AND ACTIVITY RECOGNITION IN VIDEO ACQUIRED USING MOBILE CAMERAS.
- Creator
-
Yilmaz, Alper, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Due to the increasing demand for deployable surveillance systems in recent years, object tracking and activity recognition are receiving considerable attention in the research community. This thesis contributes to both the tracking and the activity recognition components of a surveillance system. In particular, for the tracking component, we propose two different approaches for tracking objects in video acquired by mobile cameras, each of which uses a different object shape representation. The first approach tracks the centroids of the objects in Forward Looking Infrared (FLIR) imagery and is suitable for tracking objects that appear small in airborne video. The second approach tracks the complete contours of the objects, and is suitable for higher-level vision problems, such as activity recognition, identification and classification. Using the contours tracked by the contour tracker, we propose a novel representation, called the action sketch, for recognizing human activities. Object Tracking in Airborne Imagery: Objects in images obtained from an airborne vehicle generally appear small and can be represented by geometric shapes such as a circle or rectangle. After detecting the object position in the first frame, the proposed object tracker models the intensity and the local standard deviation of the object region defined by the shape model. It then tracks the objects by computing the mean-shift vector that minimizes the distance between the kernel distribution for the hypothesized object and its prior. In cases where the ego-motion of the sensor causes the object to move more than the operational limits of the tracking module, a multi-resolution global motion compensation using the Gabor responses of consecutive frames is performed. The experiments performed on the AMCOM FLIR data set show the robustness of the proposed method, which combines automatic model update and global motion compensation into one framework. Contour Tracker: Contour tracking is performed by evolving an initial contour toward the correct object boundaries based on discriminant analysis, which is formulated as a variational calculus problem. Once the contour is initialized, the method generates an online shape model for the object along with color and texture priors for both the object and the background regions. A priori texture and color PDFs of the regions are then fused based on the discrimination properties of the features between the object and the background models. The models are then used to compute the posterior contour likelihood, and the evolution is obtained by the Maximum a Posteriori estimation process, which updates the contour in the gradient ascent direction of the proposed energy functional. During occlusion, the online shape model is used to complete the missing object region. The proposed energy functional unifies commonly used boundary-based and region-based contour approaches into a single framework through a support region defined around the hypothesized object contour. We tested the robustness of the proposed contour tracker using several real sequences and have verified qualitatively that the contours of the objects are perfectly tracked. Behavior Analysis: We propose a novel approach to represent human actions by modeling the dynamics (motion) and the structure (shape) of the objects in video. Both the motion and the shape are modeled using a compact representation, which is called the ``action sketch''.
An action sketch is a view-invariant representation obtained by analyzing important changes that occur during the motion of the objects. When an actor performs an action in 3D, the points on the actor generate space-time trajectories in four dimensions $(x,y,z,t)$. Projection of the world to the imaging coordinates converts the space-time trajectories into spatio-temporal trajectories in three dimensions $(x,y,t)$. A set of spatio-temporal trajectories constitutes a 3D volume, which we call an ``action volume''. This volume ...
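A minimal sketch of the trajectory projection described in this abstract: 3D joint trajectories $(x,y,z,t)$ are mapped by a pinhole camera to spatio-temporal trajectories $(x,y,t)$; stacking the projected trajectories of all joints yields the action volume. The camera matrix here is an arbitrary illustrative input, not part of the thesis.

```python
import numpy as np

def project_trajectory(traj_xyzt, P):
    """traj_xyzt: (N, 4) array of (x, y, z, t) points for one body joint.
    P: 3x4 camera projection matrix, e.g. K [R | t].
    Returns an (N, 3) array of (u, v, t) image-plane trajectory points."""
    xyz1 = np.hstack([traj_xyzt[:, :3], np.ones((len(traj_xyzt), 1))])  # homogeneous 3D points
    uvw = xyz1 @ P.T                                                    # project to the image
    uv = uvw[:, :2] / uvw[:, 2:3]                                       # perspective divide
    return np.hstack([uv, traj_xyzt[:, 3:4]])                           # keep the time coordinate

# applying project_trajectory to every tracked joint of the actor gives the set of
# (x, y, t) trajectories whose union forms the action volume.
```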
- Date Issued
- 2004
- Identifier
- CFE0000101, ucf:52858
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000101
- Title
- LEARNING SEMANTIC FEATURES FOR VISUAL RECOGNITION.
- Creator
-
Liu, Jingen, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Visual recognition (e.g., object, scene and action recognition) is an active area of research in computer vision due to its increasing number of real-world applications such as video (image) indexing and search, intelligent surveillance, human-machine interaction, robot navigation, etc. Effective modeling of the objects, scenes and actions is critical for visual recognition. Recently, the bag of visual words (BoVW) representation, in which image patches or video cuboids are quantized into visual words (i.e., mid-level features) based on their appearance similarity using clustering, has been widely and successfully explored. The advantages of this representation are: no explicit detection or tracking of objects or object parts is required; the representation is somewhat tolerant to within-class deformations; and it is efficient for matching. However, the performance of the BoVW is sensitive to the size of the visual vocabulary. Therefore, computationally expensive cross-validation is needed to find the appropriate quantization granularity. This limitation is partially due to the fact that the visual words are not semantically meaningful, which limits the effectiveness and compactness of the representation. To overcome these shortcomings, in this thesis we present a principled approach to learn a semantic vocabulary (i.e., high-level features) from a large number of visual words (mid-level features). In this context, the thesis makes two major contributions. First, we have developed an algorithm to discover a compact yet discriminative semantic vocabulary. This vocabulary is obtained by grouping the visual words, based on their distribution in videos (images), into visual-word clusters. The mutual information (MI) between the clusters and the videos (images) depicts the discriminative power of the semantic vocabulary, while the MI between visual words and visual-word clusters measures the compactness of the vocabulary. We apply the information bottleneck (IB) algorithm to find the optimal number of visual-word clusters by finding a good tradeoff between compactness and discriminative power. We tested our proposed approach on the state-of-the-art KTH dataset, and obtained an average accuracy of 94.2%. However, this approach performs one-sided clustering, because only the visual words are clustered, regardless of which video they appear in. In order to leverage the co-occurrence of visual words and images, we have developed a co-clustering algorithm to simultaneously group the visual words and images. We tested our approach on the publicly available fifteen-scene dataset and obtained about a 4% increase in average accuracy compared to the one-sided clustering approaches. Second, instead of grouping the mid-level features, we first embed the features into a low-dimensional semantic space by manifold learning, and then perform the clustering. We apply Diffusion Maps (DM) to capture the local geometric structure of the mid-level feature space. The DM embedding is able to preserve the explicitly defined diffusion distance, which reflects the semantic similarity between any two features. Furthermore, the DM provides multi-scale analysis capability by adjusting the time steps in the Markov transition matrix. The experiments on the KTH dataset show that DM can perform much better (about 3% to 6% improvement in average accuracy) than other manifold learning approaches and the IB method. The above methods use only a single type of feature.
In order to combine multiple heterogeneous features for visual recognition, we further propose the Fiedler Embedding to capture the complicated semantic relationships between all entities (i.e., videos, images, and heterogeneous features). The discovered relationships are then employed to further increase the recognition rate. We tested our approach on the Weizmann dataset, and achieved about 17%-21% improvement in average accuracy.
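A minimal diffusion-maps sketch for embedding mid-level features into a low-dimensional semantic space, as discussed in this abstract. The kernel width `eps`, the diffusion time `t`, and the output dimensionality are illustrative choices, not the thesis's settings.

```python
import numpy as np

def diffusion_maps(X, n_dims=2, eps=1.0, t=1):
    """X: (n, d) mid-level feature vectors. Returns (n, n_dims) diffusion coordinates."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / eps)                          # Gaussian affinity between features
    d = W.sum(axis=1)
    A = W / np.sqrt(np.outer(d, d))                # symmetric normalization of the Markov matrix
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]                 # eigenvalues in descending order
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]               # right eigenvectors of the random-walk matrix
    # skip the trivial constant eigenvector (eigenvalue 1); scale by eigenvalue^t
    return (vals[1 : n_dims + 1] ** t) * psi[:, 1 : n_dims + 1]
```

Increasing `t` corresponds to running the Markov random walk for more steps, which is the multi-scale knob mentioned above: coarser structure dominates as `t` grows.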
- Date Issued
- 2009
- Identifier
- CFE0002936, ucf:47961
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002936
- Title
- A STATISTICAL APPROACH TO VIEW SYNTHESIS.
- Creator
-
Berkowitz, Phillip, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
View synthesis is the challenging problem of predicting a new view or pose of an object given an exemplar view or set of views. This thesis presents a novel approach to the view synthesis problem. The proposed method uses global features rather than local geometry to achieve an effect similar to that of the well-known view morphing method. While previous approaches to the view synthesis problem have shown impressive results, they are highly dependent on being able to solve for the epipolar geometry and therefore require a very precise correspondence between reference images. In cases where this is not possible, such as noisy data, low-contrast data, or long-wave infrared data, an alternative approach is desirable. Here two problems will be considered. The proposed view synthesis method will be used to synthesize new views given a set of reference views. Additionally, the algorithm will be extended to synthesize new lighting conditions and thermal signatures. Finally, the algorithm will be applied toward enhancing the ATR (automatic target recognition) problem by creating additional training data to increase the likelihood of detection and classification.
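A hedged sketch of synthesis from global features: learn a PCA (eigen-image) basis from training views, project the two reference views onto it, and synthesize intermediate views by interpolating the projection coefficients. This is a simplified stand-in, under assumed names and parameters, meant only to illustrate the "global features instead of local geometry" idea; it is not the statistical method developed in the thesis.

```python
import numpy as np

def fit_basis(views, k=20):
    """views: (n, h*w) flattened, aligned training images. Returns (mean, top-k basis)."""
    mean = views.mean(axis=0)
    U, S, Vt = np.linalg.svd(views - mean, full_matrices=False)
    return mean, Vt[:k]

def synth_view(view_a, view_b, alpha, mean, basis):
    """Blend two reference views in coefficient space (alpha in [0, 1])."""
    ca = basis @ (view_a - mean)                 # global-feature coefficients of view A
    cb = basis @ (view_b - mean)                 # global-feature coefficients of view B
    c = (1.0 - alpha) * ca + alpha * cb          # interpolate in the global feature space
    return mean + basis.T @ c                    # reconstruct the synthesized view
```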
- Date Issued
- 2009
- Identifier
- CFE0002684, ucf:48214
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002684
- Title
- SPATIO-TEMPORAL MAXIMUM AVERAGE CORRELATION HEIGHT TEMPLATES IN ACTION RECOGNITION AND VIDEO SUMMARIZATION.
- Creator
-
Rodriguez, Mikel, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Action recognition represents one of the most difficult problems in computer vision, given that it embodies the combination of several uncertain attributes, such as the subtle variability associated with individual human behavior and the challenges that come with viewpoint variations, scale changes and different temporal extents. Nevertheless, action recognition solutions are critical in a great number of domains, such as video surveillance, assisted living environments, video search, interfaces, and virtual reality. In this dissertation, we investigate template-based action recognition algorithms that can incorporate the information contained in a set of training examples, and we explore how these algorithms perform in action recognition and video summarization. First, we introduce a template-based method for recognizing human actions called Action MACH. Our approach is based on a Maximum Average Correlation Height (MACH) filter. MACH is capable of capturing intra-class variability by synthesizing a single Action MACH filter for a given action class. We generalize the traditional MACH filter to video (3D spatiotemporal volumes) and vector-valued data. By analyzing the response of the filter in the frequency domain, we avoid the high computational cost commonly incurred in template-based approaches. Vector-valued data is analyzed using the Clifford Fourier transform, a generalization of the Fourier transform intended for both scalar and vector-valued data. Next, we address three seldom explored challenges in template-based action recognition. The first is the recognition and localization of human actions in aerial videos obtained from unmanned aerial vehicles (UAVs), a new medium which presents unique challenges due to the small number of pixels per human, pose variations, and the moving camera. The second issue we address is the incorporation of multiple positive and negative examples of a target action class when generating an action template. We address this issue by employing the Fukunaga-Koontz Transform as a means of generating a single quadratic template which, unlike traditional temporal templates (which rely on positive examples alone), effectively captures the variability associated with an action class by including both positive and negative examples in the template training process. Third, we explore the problem of generating video summaries that include specific actions of interest as opposed to all moving objects. In doing so, we explore the role of action templates in video summarization in an effort to provide a means of generating a compact video representation based on a set of activities of interest. We introduce an approach in which a user specifies the activities that interest him and the video is automatically condensed to a short clip which captures the most relevant events based on the user's preference. We follow the output summary video format of non-chronological video synopsis approaches, in which different events which occur at different times may be displayed concurrently, even though they never occur simultaneously in the original video. However, instead of assuming that all moving objects are interesting, priority is given to specific activities of interest which pertain to a user's query. This provides an efficient means of browsing through large collections of video for events of interest.
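A simplified 2D sketch of MACH-style filter synthesis in the frequency domain. The Action MACH filter above operates on 3D spatiotemporal volumes (and on vector-valued data via the Clifford Fourier transform); this scalar, 2D version only illustrates the basic construction, with the noise and similarity terms folded into a single regularizer `alpha`, which is an assumption of this sketch.

```python
import numpy as np

def mach_filter(train_images, alpha=1e-3):
    """train_images: (n, h, w) positive examples of one class.
    Returns a frequency-domain filter that emphasizes the class-average spectrum
    while de-emphasizing frequencies with high average power (intra-class variation)."""
    F = np.fft.fft2(train_images, axes=(-2, -1))
    m = F.mean(axis=0)                           # mean training spectrum
    D = np.mean(np.abs(F) ** 2, axis=0)          # average power spectral density
    return m / (D + alpha)

# detection then reduces to a frequency-domain correlation of a test frame with the
# filter, e.g. np.real(np.fft.ifft2(np.fft.fft2(frame) * np.conj(H))), and the peak
# of the response surface localizes the action/object.
```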
- Date Issued
- 2010
- Identifier
- CFE0003313, ucf:48507
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003313