- Title
- Complex Affect Recognition in the Wild.
- Creator
-
Nojavanasghari, Behnaz, Hughes, Charles, Morency, Louis-Philippe, Sukthankar, Gita, Foroosh, Hassan, University of Central Florida
- Abstract / Description
-
Artificial social intelligence is a step towards human-like human-computer interaction. One important milestone towards building socially intelligent systems is enabling computers with the ability to process and interpret the social signals of humans in the real world. Social signals include a wide range of emotional responses, from a simple smile to expressions of complex affects. This dissertation revolves around computational models for social signal processing in the wild, using multimodal signals with an emphasis on the visual modality. We primarily focus on complex affect recognition with a strong interest in curiosity. In this dissertation, we first present our collected dataset, EmoReact. We provide detailed multimodal behavior analysis across audio-visual signals and present unimodal and multimodal classification models for affect recognition. Second, we present a deep multimodal fusion algorithm to fuse information from visual, acoustic and verbal channels to achieve a unified classification result. Third, we present a novel system to synthesize, recognize and localize facial occlusions. The proposed framework is based on a three-stage process: 1) synthesis of naturalistic occluded faces, which include hand-over-face occlusions as well as other common occlusions such as hair bangs, scarves, hats, etc.; 2) recognition of occluded faces and differentiation between hand-over-face and other types of facial occlusions; 3) localization of facial occlusions and identification of the occluded facial regions. The region of facial occlusion plays an important role in recognizing affect, and a shift in location can result in a very different interpretation; e.g., hand over chin can indicate contemplation, while hand over eyes may show frustration or sadness. Finally, we show the importance of considering facial occlusion type and region in affect recognition through achieving promising results in our experiments.
- Date Issued
- 2017
- Identifier
- CFE0007291, ucf:52163
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007291
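The fusion step described in the abstract above combines visual, acoustic and verbal channels into one classification result. As a much simpler illustration of the idea, this sketch performs late fusion by weighted averaging of per-modality class probabilities; the dissertation's deep fusion model is learned rather than a fixed average, and all scores below are invented.

```python
import numpy as np

def late_fusion(modality_probs, weights=None):
    """Fuse per-modality class-probability vectors by weighted averaging.

    modality_probs: list of 1-D arrays, one per modality, each a probability
    distribution over the same set of affect classes.
    """
    probs = np.stack(modality_probs)          # shape: (n_modalities, n_classes)
    if weights is None:
        weights = np.ones(len(modality_probs))
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                  # normalize modality weights
    fused = weights @ probs                   # weighted average per class
    return int(np.argmax(fused)), fused

# Hypothetical scores over three affect classes from three modalities.
visual   = np.array([0.6, 0.3, 0.1])
acoustic = np.array([0.2, 0.5, 0.3])
verbal   = np.array([0.5, 0.4, 0.1])
label, fused = late_fusion([visual, acoustic, verbal])
```

Per-modality weights could instead be set from validation accuracy of each unimodal classifier.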
- Title
- LEARNING SEMANTIC FEATURES FOR VISUAL RECOGNITION.
- Creator
-
Liu, Jingen, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Visual recognition (e.g., object, scene and action recognition) is an active area of research in computer vision due to its increasing number of real-world applications, such as video (image) indexing and search, intelligent surveillance, human-machine interaction, robot navigation, etc. Effective modeling of the objects, scenes and actions is critical for visual recognition. Recently, the bag of visual words (BoVW) representation, in which image patches or video cuboids are quantized into visual words (i.e., mid-level features) based on their appearance similarity using clustering, has been widely and successfully explored. The advantages of this representation are: no explicit detection of objects or object parts and their tracking are required; the representation is somewhat tolerant to within-class deformations; and it is efficient for matching. However, the performance of the BoVW is sensitive to the size of the visual vocabulary. Therefore, computationally expensive cross-validation is needed to find the appropriate quantization granularity. This limitation is partially due to the fact that the visual words are not semantically meaningful, which limits the effectiveness and compactness of the representation. To overcome these shortcomings, in this thesis we present a principled approach to learn a semantic vocabulary (i.e., high-level features) from a large amount of visual words (mid-level features). In this context, the thesis makes two major contributions. First, we have developed an algorithm to discover a compact yet discriminative semantic vocabulary. This vocabulary is obtained by grouping the visual words, based on their distribution in videos (images), into visual-word clusters. The mutual information (MI) between the clusters and the videos (images) depicts the discriminative power of the semantic vocabulary, while the MI between visual words and visual-word clusters measures the compactness of the vocabulary.
We apply the information bottleneck (IB) algorithm to find the optimal number of visual-word clusters by finding a good tradeoff between compactness and discriminative power. We tested our proposed approach on the state-of-the-art KTH dataset and obtained an average accuracy of 94.2%. However, this approach performs one-sided clustering, because only the visual words are clustered, regardless of which video they appear in. In order to leverage the co-occurrence of visual words and images, we have developed a co-clustering algorithm to simultaneously group the visual words and images. We tested our approach on the publicly available fifteen-scene dataset and obtained about a 4% increase in average accuracy compared to one-sided clustering approaches. Second, instead of grouping the mid-level features, we first embed the features into a low-dimensional semantic space by manifold learning, and then perform the clustering. We apply Diffusion Maps (DM) to capture the local geometric structure of the mid-level feature space. The DM embedding is able to preserve the explicitly defined diffusion distance, which reflects the semantic similarity between any two features. Furthermore, DM provides multi-scale analysis capability by adjusting the time steps in the Markov transition matrix. Experiments on the KTH dataset show that DM can perform much better (about 3% to 6% improvement in average accuracy) than other manifold learning approaches and the IB method. The above methods use only a single type of feature. In order to combine multiple heterogeneous features for visual recognition, we further propose the Fiedler Embedding to capture the complicated semantic relationships between all entities (i.e., videos, images, heterogeneous features). The discovered relationships are then employed to further increase the recognition rate. We tested our approach on the Weizmann dataset and achieved about 17% to 21% improvements in average accuracy.
- Date Issued
- 2009
- Identifier
- CFE0002936, ucf:47961
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002936
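The compactness/discriminativeness tradeoff in the abstract above is stated in terms of mutual information between visual-word clusters and videos. A minimal sketch of the MI computation itself (the co-occurrence counts are invented, and the IB optimization over cluster counts is not shown):

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in bits from a joint count/probability table over (x, y)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                 # normalize counts to p(x, y)
    px = joint.sum(axis=1, keepdims=True)       # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)       # marginal p(y)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

# Hypothetical co-occurrence counts of two visual-word clusters (rows)
# over two video classes (columns).
informative   = np.array([[8, 2], [2, 8]])  # clusters track classes: high MI
uninformative = np.array([[5, 5], [5, 5]])  # clusters independent: zero MI
hi = mutual_information(informative)
lo = mutual_information(uninformative)
```

A discriminative vocabulary maximizes MI between clusters and classes while a compact one minimizes MI between words and clusters; IB trades the two off.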
- Title
- ACTION RECOGNITION USING PARTICLE FLOW FIELDS.
- Creator
-
Reddy, Kishore, Shah, Mubarak, Sukthankar, Gita, Wei, Lei, Moore, Brian, University of Central Florida
- Abstract / Description
-
In recent years, research in human action recognition has advanced on multiple fronts to address various types of actions, including simple, isolated actions in staged data (e.g., the KTH dataset), complex actions (e.g., the Hollywood dataset), and naturally occurring actions in surveillance videos (e.g., the VIRAT dataset). Several techniques, including those based on gradients, flow, and interest points, have been developed for their recognition. Most perform very well on standard action recognition datasets, but fail to produce similar results on more complex, large-scale datasets. Action recognition on large categories of unconstrained videos taken from the web is a very challenging problem compared to datasets like KTH (six actions), IXMAS (thirteen actions), and Weizmann (ten actions). Challenges such as camera motion, different viewpoints, huge inter-class variations, cluttered backgrounds, occlusions, bad illumination conditions, and the poor quality of web videos cause the majority of state-of-the-art action recognition approaches to fail. An increasing number of categories and the inclusion of actions with high confusion also increase the difficulty of the problem. The approach taken to solve this action recognition problem depends primarily on the dataset and the possibility of detecting and tracking the object of interest. In this dissertation, a new method for video representation is proposed, and three new approaches to perform action recognition in different scenarios using varying prerequisites are presented. The prerequisites have decreasing levels of difficulty to obtain: 1) the scenario requires human detection and tracking to perform action recognition; 2) the scenario requires background and foreground separation to perform action recognition; and 3) no pre-processing is required for action recognition. First, we propose a new video representation using optical flow and particle advection.
The proposed "Particle Flow Field" (PFF) representation has been used to generate motion descriptors and tested in a Bag of Video Words (BoVW) framework on the KTH dataset. We show that particle flow fields have better performance than other low-level video representations, such as 2D gradients, 3D gradients, and optical flow. Second, we analyze the performance of the state-of-the-art technique based on the histogram of oriented 3D gradients in spatio-temporal volumes, where human detection and tracking are required. We use the proposed particle flow field and show superior results compared to the histogram of oriented 3D gradients in spatio-temporal volumes. The proposed method, when used for human action recognition, needs only human detection and does not necessarily require human tracking and figure-centric bounding boxes. It has been tested on the KTH (six actions), Weizmann (ten actions), and IXMAS (thirteen actions, four different views) action recognition datasets. Third, we propose using the scene context information obtained from moving and stationary pixels in the key frames, in conjunction with motion descriptors obtained using a Bag of Words framework, to solve the action recognition problem on a large (50 actions) dataset with videos from the web. We perform a combination of early and late fusion on multiple features to handle the huge number of categories. We demonstrate that scene context is a very important feature for performing action recognition on huge datasets. The proposed method needs separation of moving and stationary pixels, and does not require any kind of video stabilization, person detection, or tracking and pruning of features. Our approach obtains good performance on a huge number of action categories. It has been tested on the UCF50 dataset with 50 action categories, which is an extension of the UCF YouTube Action (UCF11) dataset containing 11 action categories.
We also tested our approach on the KTH and HMDB51 datasets for comparison. Finally, we focus on solving practical problems in representing actions by bags of spatio-temporal features (i.e., cuboids), which have proven valuable for action recognition in recent literature. We observed that the visual-vocabulary-based (bag of video words) method suffers from many drawbacks in practice: (i) it requires an intensive training stage to obtain good performance; (ii) it is sensitive to the vocabulary size; (iii) it is unable to cope with incremental recognition problems; (iv) it is unable to recognize simultaneous multiple actions; and (v) it is unable to perform recognition frame by frame. In order to overcome these drawbacks, we propose a framework to index large-scale motion features using a Sphere/Rectangle-tree (SR-tree) for incremental action detection and recognition. The recognition comprises the following two steps: 1) recognizing the local features by non-parametric nearest neighbor (NN), and 2) using a simple voting strategy to label the action. It can also provide localization of the action. Since it does not require feature quantization, it can efficiently grow the feature tree by adding features from new training actions or categories. Our method provides an effective way to perform practical incremental action recognition. Furthermore, it can handle large-scale datasets because the SR-tree is a disk-based data structure. We tested our approach on two publicly available datasets, the KTH dataset and the IXMAS multi-view dataset, and achieved promising results.
- Date Issued
- 2012
- Identifier
- CFE0004626, ucf:49923
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004626
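The "optical flow and particle advection" idea in the abstract above amounts to carrying a set of particles through per-frame dense flow fields and using their trajectories as the representation. A toy sketch, assuming a precomputed flow array and nearest-pixel sampling (real implementations typically interpolate the flow bilinearly):

```python
import numpy as np

def advect_particles(flows, particles):
    """Advect particles through a sequence of dense optical-flow fields.

    flows: (T, H, W, 2) array of per-pixel (dx, dy) displacements per frame.
    particles: (N, 2) array of initial (x, y) positions.
    Returns trajectories of shape (T + 1, N, 2).
    """
    T, H, W, _ = flows.shape
    traj = [particles.astype(float).copy()]
    for t in range(T):
        pos = traj[-1]
        # Sample the flow at the nearest pixel, clamped to the frame.
        xi = np.clip(np.round(pos[:, 0]).astype(int), 0, W - 1)
        yi = np.clip(np.round(pos[:, 1]).astype(int), 0, H - 1)
        step = flows[t, yi, xi]               # (N, 2) displacements
        traj.append(pos + step)
    return np.stack(traj)

# A constant rightward flow of 1 px/frame over a 4x4 grid, for 3 frames.
flows = np.zeros((3, 4, 4, 2))
flows[..., 0] = 1.0
traj = advect_particles(flows, np.array([[0.0, 0.0], [1.0, 2.0]]))
```

Descriptors for the BoVW framework would then be built from the per-particle displacement sequences rather than from raw flow.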
- Title
- A study of holistic strategies for the recognition of characters in natural scene images.
- Creator
-
Ali, Muhammad, Foroosh, Hassan, Hughes, Charles, Sukthankar, Gita, Wiegand, Rudolf, Yun, Hae-Bum, University of Central Florida
- Abstract / Description
-
Recognition and understanding of text in scene images is an important and challenging task. The importance can be seen in the context of tasks such as assisted navigation for the blind and providing directions to driverless cars, e.g., the Google car. Other applications include automated document archival services, mining text from images, and so on. The challenge comes from a variety of factors, like variable typefaces, uncontrolled imaging conditions, and various sources of noise corrupting the captured images. In this work, we study and address the fundamental problem of recognition of characters extracted from natural scene images, and contribute three holistic strategies to deal with this challenging task. Scene text recognition (STR) has been a known problem in the computer vision and pattern recognition community for over two decades, and is still an active area of research owing to the fact that recognition performance still has a lot of room for improvement. Recognition of characters lies at the heart of STR and is a crucial component of a reliable STR system. Most current methods rely heavily on the discriminative power of local features, such as histograms of oriented gradients (HoG), the scale-invariant feature transform (SIFT), shape contexts (SC), geometric blur (GB), etc. One of the problems with such methods is that the local features are rasterized in an ad hoc manner to get a single vector for subsequent use in recognition. This rearrangement of features clearly perturbs the spatial correlations that may carry crucial information vis-à-vis recognition. Moreover, such approaches, in general, do not take into account the rotational invariance property, which often leads to failed recognition in cases where characters in scene images do not occur in an upright position.
To eliminate this local feature dependency and the associated problems, we propose the following three holistic solutions. The first one is based on modelling character images of a class as a 3-mode tensor and then factoring it into a set of rank-1 matrices and the associated mixing coefficients. Each set of rank-1 matrices spans the solution subspace of a specific image class and enables us to capture the required holistic signature for each character class, along with the mixing coefficients associated with each character image. During recognition, we project each test image onto the candidate subspaces to derive its mixing coefficients, which are eventually used for final classification. The second approach we study in this work lets us form a novel holistic feature for character recognition based on the active contour model, also known as snakes. Our feature vector is based on two variables, direction and distance, cumulatively traversed by each point as the initial circular contour evolves under the force field induced by the character image. The initial contour design, in conjunction with a cross-correlation-based similarity metric, enables us to account for rotational variance in the character image. Our third approach is based on modelling a 3-mode tensor via rotation of a single image. This is different from our tensor-based approach described above in that we form the tensor using a single image instead of collecting a specific number of samples of a particular class. In this case, to generate a 3D image cube, we rotate an image through a predefined range of angles. This enables us to explicitly capture rotational variance and leads to better performance than various local approaches. Finally, as an application, we use our holistic model to recognize word images extracted from natural scenes. Here we first use our novel word segmentation method, based on image seam analysis, to split a scene word into individual character images.
We then apply our holistic model to recognize individual letters and use a spell-checker module to get the final word prediction. Throughout our work, we employ popular scene text datasets, like Chars74K-Font, Chars74K-Image, SVT, and ICDAR03, which include synthetic and natural image sets, to test the performance of our strategies. We compare the results of our recognition models with several baseline methods and show comparable or better performance than several local feature-based methods, thus justifying the importance of holistic strategies.
- Date Issued
- 2016
- Identifier
- CFE0006247, ucf:51076
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006247
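The first holistic strategy above classifies a test image by projecting it onto per-class subspaces and comparing the resulting mixing coefficients. As a rough stand-in, this sketch uses an ordinary SVD subspace per class in place of the dissertation's rank-1 tensor factors, and classifies by reconstruction error; the data, image size, and rank are all invented.

```python
import numpy as np

def class_subspace(images, rank):
    """Orthonormal basis for a class from vectorized training images
    (plain SVD here, standing in for the rank-1 tensor factors)."""
    X = np.stack([im.ravel() for im in images])      # (n_samples, n_pixels)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:rank]                                 # (rank, n_pixels)

def classify(image, subspaces):
    """Project onto each candidate subspace; pick the class whose
    projection reconstructs the image best."""
    x = image.ravel()
    errors = []
    for B in subspaces:
        coeffs = B @ x                   # "mixing coefficients" in this subspace
        errors.append(np.linalg.norm(x - B.T @ coeffs))
    return int(np.argmin(errors))

rng = np.random.default_rng(0)
proto_a, proto_b = rng.random((2, 8, 8))             # two invented class prototypes
class_a = [proto_a + 0.05 * rng.random((8, 8)) for _ in range(5)]
class_b = [proto_b + 0.05 * rng.random((8, 8)) for _ in range(5)]
subspaces = [class_subspace(class_a, 3), class_subspace(class_b, 3)]
pred = classify(proto_a + 0.05 * rng.random((8, 8)), subspaces)
```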
- Title
- RECOGNIZING TEAMWORK ACTIVITY IN OBSERVATIONS OF EMBODIED AGENTS.
- Creator
-
Luotsinen, Linus, Boloni, Lotzi, University of Central Florida
- Abstract / Description
-
This thesis presents contributions to the theory and practice of team activity recognition. A particular focus of our work was to improve our ability to collect and label representative samples, thus making team activity recognition more efficient. A second focus of our work is improving the robustness of the recognition process in the presence of noisy and distorted data. The main contributions of this thesis are as follows. We developed a software tool, the Teamwork Scenario Editor (TSE), for the acquisition, segmentation and labeling of teamwork data. Using the TSE, we acquired a corpus of labeled team actions from both synthetic and real-world sources. We developed an approach through which representations of idealized team actions can be acquired in the form of Hidden Markov Models, which are trained using a small set of representative examples segmented and labeled with the TSE. We developed a set of team-oriented feature functions, which extract discrete features from the high-dimensional continuous data. The features were chosen such that they mimic the features used by humans when recognizing teamwork actions. We developed a technique to recognize the likely roles played by agents in teams even before the team action is recognized. Through experimental studies we show that the feature functions and the role recognition module significantly increase the recognition accuracy, while allowing arbitrarily shuffled inputs and noisy data.
- Date Issued
- 2007
- Identifier
- CFE0001876, ucf:47409
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001876
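At test time, HMM-based team action recognition as described above reduces to scoring an observed sequence of discrete features under each action's model and picking the most likely one. A sketch using the scaled forward algorithm and two invented 2-state, 2-symbol action models:

```python
import numpy as np

def log_likelihood(obs, start, trans, emit):
    """Scaled forward-algorithm log-likelihood of a discrete observation
    sequence under one HMM (start: (S,), trans: (S, S), emit: (S, V))."""
    alpha = start * emit[:, obs[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()                      # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll

# Two hypothetical team-action models sharing a left-to-right structure.
start = np.array([1.0, 0.0])
trans = np.array([[0.8, 0.2], [0.0, 1.0]])
advance = np.array([[0.9, 0.1], [0.1, 0.9]])  # emits symbol 0 early, 1 late
retreat = np.array([[0.1, 0.9], [0.9, 0.1]])  # the reverse

obs = [0, 0, 1, 1]                            # invented discrete feature sequence
scores = {name: log_likelihood(obs, start, trans, e)
          for name, e in [("advance", advance), ("retreat", retreat)]}
best = max(scores, key=scores.get)
```

In the thesis the discrete symbols would come from the team-oriented feature functions rather than being hand-set.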
- Title
- Thinking Fast and Missing the Opportunity: An investigation into cognitive processing style and opportunity recognition.
- Creator
-
Letwin, Chaim, Ford, Cameron, Folger, Robert, Schminke, Marshall, Ciuchta, Michael, University of Central Florida
- Abstract / Description
-
Research on opportunity recognition and entrepreneurial cognition suggests that entrepreneurs are likely to use and potentially benefit from heuristics (Baron, 1998, 2004; Busenitz & Barney, 1997). Some heuristics, particularly well-refined and accurate prototypes, may be valuable to entrepreneurs in recognizing opportunities (Baron, 2004). I seek, however, to consider how other types of heuristics that lead to irrational, biased, and inaccurate judgments (e.g., the betrayal heuristic) relate to opportunity recognition (Baron, 2004; Kahneman & Lovallo, 1993). I specifically consider the underlying causal process through which the use of these types of heuristics diminishes the ability to recognize opportunities. I posit that these heuristics reduce the ability to recognize opportunities by causing entrepreneurs to consider less information regarding potential opportunities. Further, I propose two individual differences that allow certain entrepreneurs to mitigate the negative effect that these bias-causing heuristics have on entrepreneurs' ability to form the belief that they have recognized an opportunity. I test my theory with two experimental designs that use a product from a technology transfer office that has been licensed by entrepreneurs and applied to a real-world market. This allows me to isolate the underlying variables of interest and to affix my theorizing to a well-documented phenomenon (the licensing and application of tech-transfer technology/products by entrepreneurs) (Gregoire & Shepherd, 2012; Mowery, 2004; Shane, 2001). Results show that some heuristics may cause individuals to consider less information about an opportunity, which reduces their likelihood of forming an opportunity recognition belief. Post hoc analyses suggest that this indirect effect may be conditional on how reflective an individual is, and that entrepreneurs may be more reflective than non-entrepreneurs.
The major contribution of this dissertation is to examine the theoretical underpinnings as to why certain types of heuristics inhibit entrepreneurs from forming the belief that they have recognized an opportunity. Specifically, I suggest and show that bias-causing heuristics reduce the amount of information that entrepreneurs consider about an opportunity and, as such, inhibit opportunity recognition beliefs. Second, I provide some support for the notion that reflective individuals are more likely to form the belief that they have recognized an opportunity because they consider more information about the opportunity when they initially rely on a bias-causing heuristic. Lastly, this dissertation provides initial support for the notion that entrepreneurs may be more reflective than non-entrepreneurs. Overall, I hope to point out that although a heuristic-dependent processing style has been shown to be beneficial with regard to opportunity recognition (Baron, 2004), the failure to consider the downside of certain heuristics and the benefits related to overcoming these heuristics may limit our understanding of the opportunity recognition process.
- Date Issued
- 2015
- Identifier
- CFE0005648, ucf:50163
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005648
- Title
- ON ADVANCED TEMPLATE-BASED INTERPRETATION AS APPLIED TO INTENTION RECOGNITION IN A STRATEGIC ENVIRONMENT.
- Creator
-
Akridge, Cameron, Gonzalez, Avelino, University of Central Florida
- Abstract / Description
-
An area of study that has received much attention over the past few decades is simulations involving threat assessment in military scenarios. Recently, much research has emerged concerning the recognition of troop movements and formations in non-combat simulations. Additionally, there have been efforts towards the detection and assessment of various types of malicious intentions. One such work, by Akridge, addressed the issue of Strategic Intention Recognition, but fell short in detecting tactics that could not be recognized without somehow manipulating the environment. Therefore, the aim of this thesis is to address the problem of recognizing an opponent's intent in a strategic environment where the system can think ahead in time to see the agent's plan. To approach the problem, a structured form of knowledge called Template-Based Interpretation is borrowed from the work of others and enhanced to reason in a temporally dynamic simulation.
- Date Issued
- 2007
- Identifier
- CFE0001517, ucf:47146
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001517
- Title
- EFFICIENT ALGORITHMS FOR CORRELATION PATTERN RECOGNITION.
- Creator
-
Ragothaman, Pradeep, Mikhael, Wasfy, University of Central Florida
- Abstract / Description
-
The mathematical operation of correlation is a very simple concept, yet it has a very rich history of application in a variety of engineering fields. It is essentially nothing but a technique to measure if, and to what degree, two signals match each other. Since this is a very basic and universal task in a wide variety of fields such as signal processing, communications, computer vision, etc., it has been an important tool. The field of pattern recognition often deals with the task of analyzing signals, or useful information extracted from signals, and classifying them into classes. Very often, these classes are predetermined, and examples (templates) are available for comparison. This task naturally lends itself to the application of correlation as a tool to accomplish this goal. Thus the field of Correlation Pattern Recognition has developed over the past few decades as an important area of research. From the signal processing point of view, correlation is nothing but a filtering operation. Thus there has been a great deal of work in using concepts from filter theory to develop correlation filters for pattern recognition. While considerable work has been done to develop linear correlation filters over the years, especially in the field of Automatic Target Recognition, a lot of attention has recently been paid to the development of Quadratic Correlation Filters (QCFs). QCFs offer the advantages of linear filters while optimizing a bank of them simultaneously to offer much improved performance. This dissertation develops efficient QCFs that offer significant savings in storage requirements and computational complexity over existing designs. Firstly, an adaptive algorithm is presented that is able to modify the QCF coefficients as new data is observed. Secondly, a transform-domain implementation of the QCF is presented that has the benefits of lower computational complexity and storage requirements while retaining excellent recognition accuracy.
Finally, a two-dimensional QCF is presented that holds the potential for further savings in storage and computation. The techniques are developed based on the recently proposed Rayleigh Quotient Quadratic Correlation Filter (RQQCF), and simulation results are provided on synthetic and real datasets.
- Date Issued
- 2007
- Identifier
- CFE0001974, ucf:47429
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001974
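A quadratic correlation filter scores a vectorized patch x by the quadratic form x^T Q x. As a crude illustration only, this sketch builds Q as the difference of the two class autocorrelation matrices, so that target-like patches score high and clutter scores low; the RQQCF in the dissertation instead optimizes a Rayleigh quotient, and all data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical vectorized training patches: targets share a template,
# clutter is unstructured noise.
template = rng.standard_normal(16)
targets = np.stack([template + 0.3 * rng.standard_normal(16) for _ in range(40)])
clutter = np.stack([rng.standard_normal(16) for _ in range(40)])

# Crude quadratic filter: difference of class autocorrelation matrices.
Q = targets.T @ targets / len(targets) - clutter.T @ clutter / len(clutter)

def qcf_score(x, Q):
    """Quadratic correlation score x^T Q x: high for target-like patches."""
    return float(x @ Q @ x)

target_score = qcf_score(template + 0.3 * rng.standard_normal(16), Q)
clutter_score = qcf_score(rng.standard_normal(16), Q)
```

Thresholding the score separates target from clutter; the transform-domain and 2-D variants in the dissertation reduce the cost of evaluating this form at every image location.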
- Title
- MULTIZOOM ACTIVITY RECOGNITION USING MACHINE LEARNING.
- Creator
-
Smith, Raymond, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
In this thesis we present a system for the detection of events in video. First, a multiview approach to automatically detect and track heads and hands in a scene is described. Then, by making use of epipolar, spatial, trajectory, and appearance constraints, objects are labeled consistently across cameras (zooms). Finally, we demonstrate a new machine learning paradigm, TemporalBoost, that can recognize events in video. One aspect of any machine learning algorithm is the feature set used. The approach taken here is to build a large set of activity features, though TemporalBoost itself is able to work with any feature set that other boosting algorithms use. We also show how multiple levels of zoom can cooperate to solve problems related to activity recognition.
- Date Issued
- 2005
- Identifier
- CFE0000865, ucf:46658
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000865
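The abstract above notes that TemporalBoost works with any feature set that other boosting algorithms use. As a stand-in for the boosting side only, here is a plain AdaBoost loop over threshold stumps on an invented 1-D activity feature; the temporal reasoning that distinguishes TemporalBoost is not shown.

```python
import numpy as np

def adaboost_stumps(X, y, rounds=10):
    """Plain AdaBoost with threshold stumps over features; labels in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                    # example weights
    ensemble = []
    for _ in range(rounds):
        best = None
        for j in range(d):                     # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(sign * (X[:, j] - thr) >= 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard the log
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(sign * (X[:, j] - thr) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)         # reweight toward mistakes
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(s * (X[:, j] - t) >= 0, 1, -1)
                for a, j, t, s in ensemble)
    return np.where(score >= 0, 1, -1)

# Hypothetical 1-D activity feature separating two event classes.
X = np.array([[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
model = adaboost_stumps(X, y, rounds=3)
acc = (predict(model, X) == y).mean()
```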
- Title
- SPEAKER IDENTIFICATION BASED ON DISCRIMINATIVE VECTOR QUANTIZATION AND DATA FUSION.
- Creator
-
Zhou, Guangyu, Mikhael, Wasfy, University of Central Florida
- Abstract / Description
-
Speaker Identification (SI) approaches based on discriminative Vector Quantization (VQ) and data fusion techniques are presented in this dissertation. The SI approaches based on Discriminative VQ (DVQ) proposed in this dissertation are the DVQ for SI (DVQSI), the DVQSI with unique speech feature vector space segmentation for each speaker pair (DVQSI-U), and the Adaptive DVQSI (ADVQSI) methods. The difference between the probability distributions of the speech feature vector sets from various speakers (or speaker groups) is called the interspeaker variation between speakers (or speaker groups). The interspeaker variation is the measure of template differences between speakers (or speaker groups). All DVQ-based techniques presented in this contribution take advantage of the interspeaker variation, which is not exploited in previously proposed techniques that employ traditional VQ for SI (VQSI). All DVQ-based techniques have two modes, the training mode and the testing mode. In the training mode, the speech feature vector space is first divided into a number of subspaces based on the interspeaker variations. Then, a discriminative weight is calculated for each subspace of each speaker or speaker pair in the SI group based on the interspeaker variation. Subspaces with higher interspeaker variation are assigned larger discriminative weights and thus play a more important role in SI than those with lower interspeaker variation. In the testing mode, discriminatively weighted average VQ distortions, instead of equally weighted average VQ distortions, are used to make the SI decision. The DVQ-based techniques lead to higher SI accuracies than VQSI. DVQSI and DVQSI-U consider the interspeaker variation for each speaker pair in the SI group. In DVQSI, the speech feature vector space segmentations for all speaker pairs are exactly the same, whereas in DVQSI-U each speaker pair is treated individually in the speech feature vector space segmentation.
In both DVQSI and DVQSI-U, the discriminative weights for each speaker pair are calculated by trial and error. The SI accuracies of DVQSI-U are higher than those of DVQSI, at the price of a much higher computational burden. ADVQSI explores the interspeaker variation between each speaker and all speakers in the SI group. In contrast with DVQSI and DVQSI-U, in ADVQSI the feature vector space segmentation is performed for each speaker instead of each speaker pair, based on the interspeaker variation between each speaker and all the speakers in the SI group. Also, adaptive techniques are used in the discriminative weight computation for each speaker in ADVQSI. The SI accuracies of ADVQSI and DVQSI-U are comparable; however, the computational complexity of ADVQSI is much lower than that of DVQSI-U. Additionally, a novel algorithm to convert the raw distortion outputs of template-based SI classifiers into compatible probability measures is proposed in this dissertation. After this conversion, data fusion techniques at the measurement level can be applied to SI. In the proposed technique, stochastic models of the distortion outputs are estimated. Then, the posterior probabilities of the unknown utterance belonging to each speaker are calculated, and compatible probability measures are assigned based on these posterior probabilities. The proposed technique leads to better SI performance at the measurement level than existing approaches.
- Date Issued
- 2005
- Identifier
- CFE0000720, ucf:46621
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000720
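The weighted-distortion decision rule described in the abstract above can be sketched roughly as follows. This is a minimal illustration only: the function names, the dictionary-based data layout, and the Euclidean nearest-codeword distortion are assumptions for the sketch, not details taken from the dissertation.

```python
import numpy as np

def vq_distortion(features, codebook):
    # Mean distance from each feature vector to its nearest codeword.
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify_speaker(features, subspace_ids, codebooks, weights):
    """Discriminatively weighted average VQ decision (sketch).

    features     : (N, D) test feature vectors
    subspace_ids : (N,) index of the subspace each vector falls into
    codebooks    : {speaker: {subspace: (K, D) codebook}}
    weights      : {speaker: {subspace: discriminative weight}}
    """
    scores = {}
    for spk, books in codebooks.items():
        total, wsum = 0.0, 0.0
        for sub, book in books.items():
            mask = subspace_ids == sub
            if not mask.any():
                continue
            w = weights[spk][sub]
            total += w * vq_distortion(features[mask], book)
            wsum += w
        # Lowest weighted-average distortion wins.
        scores[spk] = total / wsum
    return min(scores, key=scores.get)
```

Setting every weight to the same value recovers the plain VQSI decision; the DVQ variants differ in how the subspaces and weights are derived.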
- Title
- SELF DESIGNING PATTERN RECOGNITION SYSTEM EMPLOYING MULTISTAGE CLASSIFICATION.
- Creator
-
ABDELWAHAB, MANAL MAHMOUD, Mikhael, Wasfy, University of Central Florida
- Abstract / Description
-
Recently, pattern recognition/classification has received considerable attention in diverse engineering fields such as biomedical imaging, speaker identification, and fingerprint recognition. In most of these applications, it is desirable to maintain the classification accuracy in the presence of corrupted and/or incomplete data. The quality of a given classification technique is measured by the computational complexity, the execution time of the algorithms, and the number of patterns that can be classified correctly despite any distortion. Some classification techniques introduced in the literature are described in Chapter 1. In this dissertation, a pattern recognition approach is proposed that can be designed to learn evolutionarily by developing the features and selecting the criteria best suited for the recognition problem under consideration. Chapter 2 presents some of the features used in developing the set of criteria employed by the system to recognize different types of signals, along with some of the preprocessing techniques used by the system. The system operates in two modes, namely, the learning (training) mode and the running mode. In the learning mode, the original and preprocessed signals are projected into different transform domains. The technique automatically tests many criteria over the range of parameters for each criterion. A large number of criteria are developed from the features extracted from these domains, and the optimum set of criteria, satisfying specific conditions, is selected. This set of criteria is employed by the system to recognize the original or noisy signals in the running mode.
The modes of operation and the classification structures employed by the system are described in detail in Chapter 3. The proposed pattern recognition system is capable of recognizing an enormously large number of patterns by virtue of the fact that it analyzes the signal in different domains and explores the distinguishing characteristics in each of these domains. In other words, this approach uses available information and extracts more characteristics from the signals, for classification purposes, by projecting the signal into different domains. Experimental results are given in Chapter 4 showing the effect of using mathematical transforms in conjunction with preprocessing techniques on the classification accuracy. A comparison between some of the classification approaches, in terms of classification rate in the presence of distortion, is also given. A sample of experimental implementations is presented in Chapters 5 and 6 to illustrate the performance of the proposed pattern recognition system. Preliminary results confirm the superior performance of the proposed technique relative to the single-transform neural network and multi-input neural network approaches for image classification in the presence of additive noise.
- Date Issued
- 2004
- Identifier
- CFE0000020, ucf:46077
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000020
- Title
- DETECTING CURVED OBJECTS AGAINST CLUTTERED BACKGROUNDS.
- Creator
-
Prokaj, Jan, Lobo, Niels, University of Central Florida
- Abstract / Description
-
Detecting curved objects against cluttered backgrounds is a hard problem in computer vision. We present new low-level and mid-level features to function in these environments. The low-level features are fast to compute, because they employ an integral image approach, which makes them especially useful in real-time applications. The mid-level features are built from low-level features, and are optimized for curved object detection. The usefulness of these features is tested by designing an object detection algorithm using these features. Object detection is accomplished by transforming the mid-level features into weak classifiers, which then produce a strong classifier using AdaBoost. The resulting strong classifier is then tested on the problem of detecting heads with shoulders. On a database of over 500 images of people, cropped to contain head and shoulders, and with a diverse set of backgrounds, the detection rate is 90% while the false positive rate on a database of 500 negative images is less than 2%.
- Date Issued
- 2008
- Identifier
- CFE0002102, ucf:47535
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002102
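The weak-to-strong combination step mentioned in the abstract above (AdaBoost over feature-derived weak classifiers) can be sketched with textbook discrete AdaBoost. The feature-based weak learners here are stand-ins; the actual low-level and mid-level features of the thesis are not reproduced.

```python
import numpy as np

def adaboost(X, y, weak_learners, rounds):
    """Discrete AdaBoost (sketch): each round picks the weak classifier
    with the lowest weighted error and weights it by its accuracy.
    y is in {-1, +1}; each weak learner maps X -> {-1, +1} predictions."""
    n = len(y)
    w = np.full(n, 1.0 / n)          # sample weights, start uniform
    ensemble = []
    for _ in range(rounds):
        errs = [(w * (h(X) != y)).sum() for h in weak_learners]
        best = int(np.argmin(errs))
        err = max(errs[best], 1e-10)  # guard against log(0)
        if err >= 0.5:                # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)
        ensemble.append((alpha, weak_learners[best]))
        # Re-weight: misclassified samples gain weight for the next round.
        w *= np.exp(-alpha * y * weak_learners[best](X))
        w /= w.sum()
    return ensemble

def predict(ensemble, X):
    votes = sum(a * h(X) for a, h in ensemble)
    return np.sign(votes)
```

In the thesis's setting, each weak learner would be a thresholded mid-level curve feature rather than the simple stumps used here.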
- Title
- Different Facial Recognition Techniques in Transform Domains.
- Creator
-
Al Obaidi, Taif, Mikhael, Wasfy, Atia, George, Jones, W Linwood, Myers, Brent, Moslehy, Faissal, University of Central Florida
- Abstract / Description
-
The human face is frequently used as the biometric signal presented to a machine for identification purposes. Several challenges are encountered while designing face identification systems. The challenges are either caused by the process of capturing the face image itself, or occur while processing the face poses. Since the captured image contains more than just the face, the data dimensionality increases, which degrades the performance of the recognition system. Face Recognition (FR) has been a major signal processing topic of interest in the last few decades. Common applications of FR include forensics, access authorization to facilities, and unlocking of smartphones. The three factors governing the performance of an FR system are the storage requirements, the computational complexity, and the recognition accuracy. The typical FR system consists of the following main modules in each of the Training and Testing phases: Preprocessing, Feature Extraction, and Classification. The ORL, YALE, FERET, FEI, Cropped AR, and Georgia Tech datasets are used to evaluate the performance of the proposed systems. The proposed systems are categorized into Single-Transform and Two-Transform systems. In the first category, the features are extracted from a single domain, that of the Two-Dimensional Discrete Cosine Transform (2D DCT). In the latter category, the Two-Dimensional Discrete Wavelet Transform (2D DWT) coefficients are combined with those of the 2D DCT to form one feature vector. The feature vectors are either used directly or further processed to obtain the persons' final models. Principal Component Analysis (PCA), Sparse Representation, and Vector Quantization (VQ) are employed as a second step in the Feature Extraction module. Additionally, a technique is proposed in which the feature vector is composed of appropriately selected 2D DCT and 2D DWT coefficients based on a residual minimization algorithm.
- Date Issued
- 2018
- Identifier
- CFE0007146, ucf:52295
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007146
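The single-transform feature extraction described in the abstract above (a 2D DCT followed by coefficient selection) can be sketched as below. The orthonormal DCT-II construction is standard; the function names and the choice of keeping the top-left low-frequency block are illustrative assumptions, not the dissertation's actual selection scheme.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis: C @ x computes the 1-D DCT of x.
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C *= np.sqrt(2.0 / n)
    C[0] /= np.sqrt(2.0)
    return C

def dct2_features(img, keep=8):
    """2D-DCT feature vector (sketch): transform the image and flatten
    the top-left keep x keep block of low-frequency coefficients."""
    C = dct_matrix(img.shape[0])
    R = dct_matrix(img.shape[1])
    coeffs = C @ img @ R.T        # separable 2-D DCT
    return coeffs[:keep, :keep].ravel()
```

A two-transform system along the abstract's lines would concatenate such a vector with 2D DWT coefficients before the PCA/VQ stage.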
- Title
- STUDY OF HUMAN ACTIVITY IN VIDEO DATA WITH AN EMPHASIS ON VIEW-INVARIANCE.
- Creator
-
Ashraf, Nazim, Foroosh, Hassan, Hughes, Charles, Tappen, Marshall, Moshell, Jack, University of Central Florida
- Abstract / Description
-
The perception and understanding of human motion and action is an important area of research in computer vision that plays a crucial role in various applications such as surveillance, HCI, and ergonomics. In this thesis, we focus on the recognition of actions in the case of varying viewpoints and different and unknown camera intrinsic parameters. The challenges to be addressed include perspective distortions, differences in viewpoints, anthropometric variations, and the large degrees of freedom of articulated bodies. In addition, we are interested in methods that require little or no training. Current solutions to action recognition usually assume that a huge dataset of actions is available so that a classifier can be trained. However, this means that in order to define a new action, the user has to record a number of videos from different viewpoints with varying camera intrinsic parameters and then retrain the classifier, which is not very practical from a development point of view. We propose algorithms that overcome these challenges and require just a few instances of the action from any viewpoint with any intrinsic camera parameters. Our first algorithm is based on the rank constraint on the family of planar homographies associated with triplets of body points. We represent an action as a sequence of poses, and decompose each pose into triplets. Therefore, the pose transition is broken down into a set of movements of body point planes. In this way, we transform the non-rigid motion of the body points into a rigid motion of body point planes. We use the fact that the family of homographies associated with two identical poses would have rank 4 to gauge the similarity of the pose between two subjects, observed by different perspective cameras and from different viewpoints. This method requires only one instance of the action. We then show that it is possible to extend the concept of triplets to line segments.
In particular, we establish that if we look at the movement of line segments instead of triplets, we have more redundancy in the data, leading to better results. We demonstrate this concept on "fundamental ratios." We decompose a human body pose into line segments instead of triplets and look at the set of movements of line segments. This method needs only three instances of the action. If a larger dataset is available, we can also apply weighting on line segments for better accuracy. The last method is based on the concept of "projective depth." Given a plane, we can find the relative depth of a point with respect to the given plane. We propose three different ways of using projective depth: (i) triplets - the three points of a triplet along with the epipole define a plane, and the movement of points relative to these body planes can be used to recognize actions; (ii) ground plane - if we are able to extract the ground plane, we can find the projective depth of the body points with respect to it, so the problem of action recognition translates to curve matching; and (iii) mirror person - we can use the mirror view of the person to extract mirror-symmetric planes. This method also needs only one instance of the action. Extensive experiments are reported on testing view invariance, robustness to noisy localization and occlusions of body points, and action recognition. The experimental results are very promising and demonstrate the efficiency of our proposed invariants.
- Date Issued
- 2012
- Identifier
- CFE0004352, ucf:49449
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004352
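The rank-4 constraint on a family of homographies described in the abstract above can be checked numerically with an SVD. This is only a sketch of the measurement step: how the per-triplet homographies are estimated from body points is not shown, and the function name and the 5th-to-1st singular-value ratio as a "distance from rank 4" are illustrative assumptions.

```python
import numpy as np

def rank4_score(homographies, eps=1e-12):
    """Rank-constraint test (sketch): flatten each 3x3 homography into a
    row of a matrix, then measure how close that matrix is to rank 4 via
    the ratio of the 5th singular value to the 1st (0 = exactly rank 4).
    Lower scores indicate more similar poses under the rank constraint."""
    M = np.stack([H.ravel() / np.linalg.norm(H) for H in homographies])
    s = np.linalg.svd(M, compute_uv=False)
    if len(s) < 5:
        return 0.0                  # fewer than 5 homographies: trivially rank <= 4
    return s[4] / max(s[0], eps)
```

Thresholding this score would then gauge whether two observed poses are the same up to viewpoint and camera intrinsics.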
- Title
- FACIAL EMOTION RECOGNITION IN CHILDREN WITH ASPERGER'S DISORDER AND IN CHILDREN WITH SOCIAL PHOBIA.
- Creator
-
Wong, Nina, Beidel, Deborah, University of Central Florida
- Abstract / Description
-
Recognizing emotion from facial expressions is an essential skill for effective social functioning and establishing interpersonal relationships. Asperger's Disorder (AD) and Social Phobia (SP) are two clinical populations showing impairment in social skill and perhaps emotion recognition. Objectives: The primary objectives were to determine the uniqueness of facial emotion recognition abilities in children with AD and SP relative to typically developing children (TD) and to examine the role of expression intensity in determining recognition of facial affect. Method: Fifty-seven children (19 AD, 17 SP, and 21 TD) aged 7-13 years participated in the study. Reaction times and accuracy were measured as children identified neutral faces and faces displaying anger, disgust, fear, happiness, and sadness at two different intensity levels. Results: Mixed-model ANOVAs with group and emotion type revealed that all children responded faster and more accurately to expressions of happiness, but there were no other group differences. Additional analyses indicated that the intensity of the displayed emotion influenced facial affect detection ability for several basic emotions (happiness, fear, and anger). Across groups, there was no pattern of specific misidentification of emotion (e.g., children did not consistently misidentify one emotion, such as disgust, as a different emotion, such as anger). Finally, facial affect recognition abilities were not associated with behavioral ratings of overall anxiety or social skills effectiveness in structured role-play interactions. Conclusions: Distinct facial affect recognition deficits in the clinical groups emerge when the intensity of the emotion expression is considered. Implications for using behavioral assessments to delineate the relationship between facial affect recognition abilities and social functioning among clinical populations are discussed.
- Date Issued
- 2010
- Identifier
- CFE0003053, ucf:48336
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003053
- Title
- SPATIO-TEMPORAL MAXIMUM AVERAGE CORRELATION HEIGHT TEMPLATES IN ACTION RECOGNITION AND VIDEO SUMMARIZATION.
- Creator
-
Rodriguez, Mikel, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Action recognition represents one of the most difficult problems in computer vision, given that it embodies the combination of several uncertain attributes, such as the subtle variability associated with individual human behavior and the challenges that come with viewpoint variations, scale changes, and different temporal extents. Nevertheless, action recognition solutions are critical in a great number of domains, such as video surveillance, assisted living environments, video search, interfaces, and virtual reality. In this dissertation, we investigate template-based action recognition algorithms that can incorporate the information contained in a set of training examples, and we explore how these algorithms perform in action recognition and video summarization. First, we introduce a template-based method for recognizing human actions called Action MACH. Our approach is based on a Maximum Average Correlation Height (MACH) filter. MACH is capable of capturing intra-class variability by synthesizing a single Action MACH filter for a given action class. We generalize the traditional MACH filter to video (a 3D spatiotemporal volume) and to vector-valued data. By analyzing the response of the filter in the frequency domain, we avoid the high computational cost commonly incurred in template-based approaches. Vector-valued data is analyzed using the Clifford Fourier transform, a generalization of the Fourier transform intended for both scalar and vector-valued data. Next, we address three seldom-explored challenges in template-based action recognition. The first is the recognition and localization of human actions in aerial videos obtained from unmanned aerial vehicles (UAVs), a new medium which presents unique challenges due to the small number of pixels per human, pose, and the moving camera. The second issue we address is the incorporation of multiple positive and negative examples of a target action class when generating an action template.
We address this issue by employing the Fukunaga-Koontz Transform as a means of generating a single quadratic template which, unlike traditional temporal templates (which rely on positive examples alone), effectively captures the variability associated with an action class by including both positive and negative examples in the template training process. Third, we explore the problem of generating video summaries that include specific actions of interest as opposed to all moving objects. In doing so, we explore the role of action templates in video summarization in an effort to provide a means of generating a compact video representation based on a set of activities of interest. We introduce an approach in which a user specifies the activities of interest and the video is automatically condensed to a short clip which captures the most relevant events based on the user's preference. We follow the output summary video format of non-chronological video synopsis approaches, in which different events which occur at different times may be displayed concurrently, even though they never occur simultaneously in the original video. However, instead of assuming that all moving objects are interesting, priority is given to specific activities of interest which pertain to a user's query. This provides an efficient means of browsing through large collections of video for events of interest.
- Date Issued
- 2010
- Identifier
- CFE0003313, ucf:48507
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003313
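The frequency-domain evaluation mentioned in the abstract above (computing the filter response without sliding a template spatially) can be sketched in 2-D as below. Only the correlation step is shown; the MACH filter synthesis itself (averaging training spectra to capture intra-class variability) and its 3-D spatiotemporal generalization are not reproduced, and the function name is an assumption.

```python
import numpy as np

def correlate_fft(frame, template):
    """Frequency-domain (circular) correlation sketch: the response plane
    is IFFT(F(frame) * conj(F(template))), the kind of step a MACH-style
    filter applies instead of sliding the template across the frame."""
    F = np.fft.fft2(frame)
    T = np.fft.fft2(template, s=frame.shape)  # zero-pad template to frame size
    return np.real(np.fft.ifft2(F * np.conj(T)))
```

The location of the peak in the response plane gives the detected position of the pattern; for video, the same idea extends to 3-D FFTs over spatiotemporal volumes.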
- Title
- Using Freebase, an Automatically Generated Dictionary, and a Classifier to Identify a Person's Profession in Tweets.
- Creator
-
Hall, Abraham, Gomez, Fernando, Dechev, Damian, Tappen, Marshall, University of Central Florida
- Abstract / Description
-
Algorithms for classifying pre-tagged person entities in tweets into one of eight profession categories are presented. A classifier using a semi-supervised learning algorithm that takes into consideration the local context surrounding the entity in the tweet, hashtag information, and topic signature scores is described. In addition to the classifier, this research investigates two dictionaries containing the professions of persons. These two dictionaries are used in their own classification algorithms, which are independent of the classifier. The method for creating the first dictionary dynamically from the web, and the algorithm that accesses this dictionary to classify a person into one of the eight profession categories, are explained next. The second dictionary is Freebase, an openly available online database that is maintained by its online community. The algorithm that uses Freebase for classifying a person into one of the eight professions is described. The results show that classifications made using the automatically constructed dictionary, Freebase, or the classifier are all moderately successful, and that classifications made with the automatically constructed person dictionary are slightly more accurate than classifications made using Freebase. Various hybrid methods, combining the classifier and the two dictionaries, are also explained. The results of these hybrid methods show significant improvement over any of the individual methods.
- Date Issued
- 2013
- Identifier
- CFE0004858, ucf:49715
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004858
- Title
- ADAPTIVE INTELLIGENT USER INTERFACES WITH EMOTION RECOGNITION.
- Creator
-
NASOZ, FATMA, Lisetti, Christine L., University of Central Florida
- Abstract / Description
-
The focus of this dissertation is on creating Adaptive Intelligent User Interfaces to facilitate enhanced natural communication during Human-Computer Interaction by recognizing users' affective states (i.e., emotions experienced by the users) and responding to those emotions by adapting to the current situation via an affective user model created for each user. Controlled experiments were designed and conducted in a laboratory environment and in a Virtual Reality environment to collect physiological data signals from participants experiencing specific emotions. Algorithms (k-Nearest Neighbor [KNN], Discriminant Function Analysis [DFA], Marquardt-Backpropagation [MBP], and Resilient Backpropagation [RBP]) were implemented to analyze the collected data signals and to find unique physiological patterns of emotions. An Emotion Elicitation with Movie Clips experiment was conducted to elicit sadness, anger, surprise, fear, frustration, and amusement from participants. Overall, the three algorithms KNN, DFA, and MBP could recognize emotions with 72.3%, 75.0%, and 84.1% accuracy, respectively. A Driving Simulator experiment was conducted to elicit driving-related emotions and states (panic/fear, frustration/anger, and boredom/sleepiness). The KNN, MBP, and RBP algorithms were used to classify the physiological signals by corresponding emotions. Overall, KNN could classify these three emotions with 66.3%, MBP with 76.7%, and RBP with 91.9% accuracy. Adaptation of the interface was designed to provide multi-modal feedback to the users about their current affective state and to respond to users' negative emotional states in order to decrease the possible negative impacts of those emotions.
A Bayesian Belief Network formalization was employed to develop the User Model, enabling the intelligent system to appropriately adapt to the current context and situation by considering user-dependent factors such as personality traits and preferences.
- Date Issued
- 2004
- Identifier
- CFE0000126, ucf:46201
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000126
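The simplest of the classifiers listed in the abstract above, k-Nearest Neighbor, can be sketched as follows. The feature vectors, labels, and Euclidean distance metric here are illustrative assumptions; the dissertation's actual physiological features and the DFA/MBP/RBP models are not reproduced.

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, sample, k=3):
    """k-nearest-neighbor sketch: label a physiological feature vector
    by majority vote among its k closest training samples."""
    d = np.linalg.norm(train_X - sample, axis=1)   # distance to every training sample
    nearest = np.argsort(d)[:k]                    # indices of the k closest
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]
```

In the experiments described above, each training sample would be a vector of physiological measurements labeled with the emotion elicited when it was recorded.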
- Title
- OPTICAL CHARACTER RECOGNITION: A STATISTICAL MODEL OF MULTI-ENGINE OPTICAL CHARACTER RECOGNITION SYSTEMS.
- Creator
-
McDonald, Mercedes Terre, Richie, Samuel M., University of Central Florida
- Abstract / Description
-
This thesis is a benchmark performed on three commercial Optical Character Recognition (OCR) engines. The purpose of this benchmark is to characterize the performance of the OCR engines with emphasis on the correlation of errors between each engine. The benchmarks are performed to evaluate the effect of a multi-OCR system employing a voting scheme to increase overall recognition accuracy. This is desirable since current OCR systems are still unable to recognize characters with 100% accuracy. The existing error rates of OCR engines pose a major problem for applications where a single error can possibly affect significant outcomes, such as in legal applications. The results obtained from this benchmark are the primary determining factor in the decision of whether to implement a voting scheme. The experiment performed displayed a very high accuracy rate for each of these commercial OCR engines. The average accuracy rate found for each engine was near 99.5%, based on a document of fewer than 6,000 words. While these error rates are very low, the goal is 100% accuracy in legal applications. Based on the work in this thesis, it has been determined that a simple voting scheme will help to improve the accuracy rate.
- Date Issued
- 2004
- Identifier
- CFE0000123, ucf:46188
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000123
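The simple voting scheme discussed in the abstract above can be sketched as a per-word majority vote across engine outputs. This is a minimal illustration under the assumption that the engines' outputs are already aligned word by word; the function name and tie-breaking rule are assumptions, not the thesis's actual scheme.

```python
from collections import Counter

def vote_ocr(readings):
    """Majority-vote sketch over per-word outputs from several OCR
    engines: for each word position, keep the reading most engines
    agree on (ties fall to the earliest engine holding the top count)."""
    result = []
    for words in zip(*readings):
        counts = Counter(words)
        top = counts.most_common(1)[0][1]
        # First engine's word among those with the top count, for stable ties.
        result.append(next(w for w in words if counts[w] == top))
    return result
```

With three engines each near 99.5% accuracy and weakly correlated errors, the vote corrects any position where only one engine misreads, which is how the scheme pushes combined accuracy above any single engine's.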
- Title
- Holistic Representations for Activities and Crowd Behaviors.
- Creator
-
Solmaz, Berkan, Shah, Mubarak, Da Vitoria Lobo, Niels, Jha, Sumit, Ilie, Marcel, Moore, Brian, University of Central Florida
- Abstract / Description
-
In this dissertation, we address the problem of analyzing the activities of people in a variety of scenarios, this is commonly encountered in vision applications. The overarching goal is to devise new representations for the activities, in settings where individuals or a number of people may take a part in specific activities. Different types of activities can be performed by either an individual at the fine level or by several people constituting a crowd at the coarse level. We take into...
Show moreIn this dissertation, we address the problem of analyzing the activities of people in a variety of scenarios, this is commonly encountered in vision applications. The overarching goal is to devise new representations for the activities, in settings where individuals or a number of people may take a part in specific activities. Different types of activities can be performed by either an individual at the fine level or by several people constituting a crowd at the coarse level. We take into account the domain specific information for modeling these activities. The summary of the proposed solutions is presented in the following.The holistic description of videos is appealing for visual detection and classification tasks for several reasons including capturing the spatial relations between the scene components, simplicity, and performance [1, 2, 3]. First, we present a holistic (global) frequency spectrum based descriptor for representing the atomic actions performed by individuals such as: bench pressing, diving, hand waving, boxing, playing guitar, mixing, jumping, horse riding, hula hooping etc. We model and learn these individual actions for classifying complex user uploaded videos. Our method bypasses the detection of interest points, the extraction of local video descriptors and the quantization of local descriptors into a code book; it represents each video sequence as a single feature vector. This holistic feature vector is computed by applying a bank of 3-D spatio-temporal filters on the frequency spectrum of a video sequence; hence it integrates the information about the motion and scene structure. 
We tested our approach on two of the most challenging datasets, UCF50 [4] and HMDB51 [5], and obtained promising results that demonstrate the robustness and discriminative power of our holistic video descriptor for classifying videos of various realistic actions.

In the above approach, the holistic feature vector of a video clip is obtained by dividing the video into spatio-temporal blocks and then concatenating the features of the individual blocks. However, such a holistic representation blindly incorporates all video regions regardless of their contribution to classification. Next, we present an approach that improves the performance of holistic descriptors for activity recognition by discovering the discriminative video blocks. We measure the discriminativity of a block by examining its response to a pre-learned support vector machine model: a block is considered discriminative if it responds positively to positive training samples and negatively to negative training samples. We pose the problem of finding the optimal blocks as one of selecting a sparse set of blocks that maximizes the total classifier discriminativity. Through a detailed set of experiments on benchmark datasets [6, 7, 8, 9, 5, 10], we show that our method discovers the useful regions in the videos and eliminates those that are confusing for classification, resulting in a significant performance improvement over the state of the art.

In contrast to scenes where an individual performs a primitive action, there may be scenes with several people in which crowd behaviors take place. For these types of scenes, traditional recognition approaches do not work, due to severe occlusion and computational requirements; moreover, the number of such videos is limited and the scenes are complicated, so learning these behaviors directly is not feasible.
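The block-discriminativity criterion above can be sketched as follows. The shapes, the sign-agreement score, and the greedy top-k selection are illustrative assumptions standing in for the sparse optimization; the pre-learned linear SVM is given by a weight vector `w` and bias `b`.

```python
import numpy as np

def block_discriminativity(block_feats, labels, w, b):
    """Score each spatio-temporal block by how consistently a
    pre-learned linear SVM (w, b) responds: positive on positive
    training samples, negative on negatives.

    block_feats: (n_samples, n_blocks, d) per-block features.
    labels:      (n_samples,) in {+1, -1}.
    """
    # Per-block SVM response for every training sample.
    responses = block_feats @ w + b            # (n_samples, n_blocks)
    # Agreement is positive when sign(response) matches the label,
    # i.e. the block is discriminative rather than confusing.
    agreement = responses * labels[:, None]
    return agreement.mean(axis=0)              # per-block score

def select_blocks(scores, k):
    """Keep a sparse set: the k highest-scoring blocks (a greedy
    stand-in for the sparse selection in the dissertation)."""
    return np.argsort(scores)[::-1][:k]
```

Blocks whose responses disagree with the labels receive low scores and are dropped, which is how confusing regions are eliminated from the holistic descriptor.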
For this problem, we present a novel approach, based on the optical flow in a video sequence, for identifying five specific and common crowd behaviors in visual scenes. In the algorithm, the scene is overlaid with a grid of particles, initializing a dynamical system derived from the optical flow. Numerical integration of the optical flow provides particle trajectories that represent the motion in the scene. Linearization of the dynamical system allows a simple and practical analysis and classification of the behavior through the Jacobian matrix: the eigenvalues of this matrix determine the dynamic stability of points in the flow, and each type of stability corresponds to one of the five crowd behaviors. The identified crowd behaviors are (1) bottlenecks, where many pedestrians/vehicles from various points in the scene enter through one narrow passage; (2) fountainheads, where many pedestrians/vehicles emerge from a narrow passage only to separate in many directions; (3) lanes, where many pedestrians/vehicles move at the same speed in the same direction; (4) arches or rings, where the collective motion is curved or circular; and (5) blocking, where there is opposing motion and the desired movement of groups of pedestrians is somehow prohibited. The implementation requires identifying a region of interest in the scene and checking the eigenvalues of the Jacobian matrix in that region to determine the type of flow, which corresponds to one of the well-defined crowd behaviors. The eigenvalues are considered only in these regions of interest, consistent with the linear approximation and the implied behaviors. Since changes in eigenvalues can indicate changes in stability, corresponding to changes in behavior, we can repeat the algorithm over clips of long video sequences to locate changes in behavior. This method was tested on real videos representing crowd and traffic scenes.
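The eigenvalue-based classification above follows the standard phase-plane taxonomy for a linearized flow du/dt = J u. A minimal sketch, where the mapping from stability types to the five behavior labels is an illustrative approximation of the dissertation's rule:

```python
import numpy as np

def classify_flow(J, tol=1e-6):
    """Classify the local linearized flow du/dt = J u from the
    eigenvalues of the 2x2 Jacobian J, evaluated in a region of
    interest. Label assignments are an illustrative approximation:

      complex eigenvalues      -> rotating flow   -> "ring"
      real, opposite signs     -> saddle          -> "blocking"
      real, both negative      -> stable node     -> "bottleneck"
      real, both positive      -> unstable node   -> "fountainhead"
      near-zero eigenvalues    -> uniform motion  -> "lane"
    """
    eig = np.linalg.eigvals(J)
    re, im = eig.real, eig.imag
    if np.all(np.abs(im) > tol):
        return "ring"            # curved / circular collective motion
    if re[0] * re[1] < -tol:
        return "blocking"        # opposing flows meet at a saddle
    if np.all(re < -tol):
        return "bottleneck"      # trajectories converge to a passage
    if np.all(re > tol):
        return "fountainhead"    # trajectories diverge from a passage
    return "lane"                # parallel motion, no local stability
```

Re-running this check over successive clips of a long sequence detects eigenvalue changes, and hence behavior changes, over time.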
- Date Issued
- 2013
- Identifier
- CFE0004941, ucf:49638
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004941