Current Search: recognition
- Title
- MULTI-VIEW GEOMETRIC CONSTRAINTS FOR HUMAN ACTION RECOGNITION AND TRACKING.
- Creator
-
GRITAI, ALEXEI, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Human actions are the essence of human life and a natural product of the human mind. Analysis of human activities by a machine has attracted the attention of many researchers. This analysis is very important in a variety of domains including surveillance, video retrieval, human-computer interaction, and athlete performance investigation. This dissertation makes three major contributions to the automatic analysis of human actions. First, we conjecture that the relationship between the body joints of two actors in the same posture can be described by a 3D rigid transformation. This transformation simultaneously captures different poses and various sizes and proportions. As a consequence of this conjecture, we show that there exists a fundamental matrix between the imaged positions of the body joints of two actors if they are in the same posture. Second, we propose a novel projection model for cameras moving at constant velocity in 3D space, termed Galilean cameras, derive the Galilean fundamental matrix, and apply it to human action recognition. Third, we propose a novel use of the invariance of the ratio of areas under an affine transformation, together with the epipolar geometry between two cameras, for 2D model-based tracking of human body joints.
In the first part of the thesis, we propose an approach to match human actions using semantic correspondences between human bodies. These correspondences are used to provide geometric constraints between multiple anatomical landmarks (e.g., hands, shoulders, and feet) to match actions observed from different viewpoints and performed at different rates by actors of differing anthropometric proportions. The fact that the human body has approximately fixed anthropometric proportions allows for an innovative use of the machinery of epipolar geometry to provide constraints for analyzing actions performed by people of different anthropometric sizes, while ensuring that changes in viewpoint do not affect matching. A novel measure, based on the rank of a matrix constructed only from image measurements of the locations of anatomical landmarks, is proposed to ensure that similar actions are accurately recognized. Finally, we describe how dynamic time warping can be used in conjunction with the proposed measure to match actions in the presence of nonlinear time warps. We demonstrate the versatility of our algorithm in a number of challenging sequences and applications including action synchronization, odd one out, following the leader, and periodicity analysis.
Next, we extend the conventional model of image projection to video captured by a camera moving at constant velocity. We term such a moving camera a Galilean camera. To that end, we derive the spacetime projection and develop the corresponding epipolar geometry between two Galilean cameras. Both perspective imaging and linear pushbroom imaging are specializations of the proposed model, and we show how six different "fundamental" matrices, including the classic fundamental matrix, the Linear Pushbroom (LP) fundamental matrix, and a fundamental matrix relating Epipolar Plane Images (EPIs), are related and can be directly recovered from a Galilean fundamental matrix. We provide linear algorithms for estimating the parameters of the mapping between videos in the case of planar scenes. To apply the fundamental matrix between Galilean cameras to human action recognition, we propose a measure with two important properties. The first property makes it possible to recognize similar actions when their execution rates are linearly related. The second property allows recognition of actions in video captured by Galilean cameras. Thus, the proposed algorithm guarantees that actions can be correctly matched despite changes in view, execution rate, and anthropometric proportions of the actor, even if the camera moves with constant velocity.
Finally, we propose a novel 2D model-based approach for tracking human body parts during articulated motion. The human body is modeled as a 2D stick figure of thirteen body joints, and an action is considered as a sequence of these stick figures. Given the locations of these joints in every frame of a model video and in the first frame of a test video, the joint locations are automatically estimated throughout the test video using two geometric constraints. First, the invariance of the ratio of areas under an affine transformation is used for initial estimation of the joint locations in the test video. Second, the epipolar geometry between the two cameras is used to refine these estimates. Using these estimated joint locations, the tracking algorithm determines the exact location of each landmark in the test video using the foreground silhouettes. The novelty of the proposed approach lies in the geometric formulation of human action models, the combination of the two geometric constraints for body joint prediction, and the handling of deviations in anthropometry of individuals, viewpoint, execution rate, and style of performing the action. The proposed approach does not require extensive training and can easily adapt to a wide variety of articulated actions.
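As a minimal sketch of how such a rank-based epipolar measure can be computed (an illustration in Python/NumPy, not the dissertation's exact formulation), one can stack the corresponding imaged joint positions of the two actors into the standard 8-point constraint matrix and use its normalized smallest singular value as a "same posture" score; rank deficiency indicates that a fundamental matrix relating the two point sets exists:

```python
import numpy as np

def epipolar_consistency(joints_a, joints_b):
    """joints_a, joints_b: (N, 2) arrays of corresponding landmark
    positions (N >= 9) for two actors/views. Builds one row of the
    classic epipolar constraint per correspondence; if a fundamental
    matrix relating the point sets exists, the matrix is rank-deficient
    and its smallest singular value is near zero."""
    xa, ya = joints_a[:, 0], joints_a[:, 1]
    xb, yb = joints_b[:, 0], joints_b[:, 1]
    A = np.column_stack([xb * xa, xb * ya, xb,
                         yb * xa, yb * ya, yb,
                         xa, ya, np.ones(len(xa))])
    s = np.linalg.svd(A, compute_uv=False)
    return s[-1] / s[0]   # small ratio suggests the same posture
```

With thirteen body joints per actor, the matrix has thirteen rows, and the score can be compared across time steps (e.g., inside a dynamic time warping alignment) to match whole actions.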
- Date Issued
- 2007
- Identifier
- CFE0001692, ucf:47199
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001692
- Title
- Spatial and Temporal Modeling for Human Activity Recognition from Multimodal Sequential Data.
- Creator
-
Ye, Jun, Hua, Kien, Foroosh, Hassan, Zou, Changchun, Karwowski, Waldemar, University of Central Florida
- Abstract / Description
-
Human Activity Recognition (HAR) has been an intense research area for more than a decade. Different sensors, ranging from 2D and 3D cameras to accelerometers, gyroscopes, and magnetometers, have been employed to generate multimodal signals to detect various human activities. With the advancement of sensing technology and the popularity of mobile devices, depth cameras and wearable devices, such as the Microsoft Kinect and smart wristbands, open an unprecedented opportunity to solve the challenging HAR problem by learning expressive representations from multimodal signals recording huge amounts of daily activities that comprise a rich set of categories.
Although competitive performance has been reported, existing methods focus on the statistical or spatial representation of the human activity sequence, while the internal temporal dynamics of the sequence are not sufficiently exploited. As a result, they often face the challenge of recognizing visually similar activities composed of dynamic patterns in different temporal order. In addition, many model-driven methods based on sophisticated features and carefully designed classifiers are computationally demanding and unable to scale to large datasets. In this dissertation, we propose to address these challenges from three different perspectives: 3D spatial relationship modeling, dynamic temporal quantization, and temporal order encoding.
We propose a novel octree-based algorithm for computing the 3D spatial relationships between objects from a 3D point cloud captured by a Kinect sensor. A set of 26 3D spatial directions is defined to describe the spatial relationship of an object with respect to a reference object. These 3D directions are implemented as a set of spatial operators, such as "AboveSouthEast" and "BelowNorthWest," of an event query language to query human activities in an indoor environment; for example, "A person walks in the hallway from north to south." The performance is quantitatively evaluated on a public RGBD object dataset and qualitatively investigated in a live video computing platform.
In order to address the challenge of temporal modeling in human action recognition, we introduce dynamic temporal quantization, a clustering-like algorithm that quantizes human action sequences of varied lengths into fixed-size quantized vectors. A two-step optimization algorithm is proposed to jointly optimize the quantization of the original sequence. In the aggregation step, frames falling into the same segment are aggregated by max-pooling, producing the quantized representation of the segment. During the assignment step, the frame-segment assignment is updated according to dynamic time warping, while the temporal order of the entire sequence is preserved. The proposed technique is evaluated on three public 3D human action datasets and achieves state-of-the-art performance.
Finally, we propose a novel temporal order encoding approach that models the temporal dynamics of sequential data for human activity recognition. The algorithm encodes the temporal order of the latent patterns extracted by subspace projection and generates a highly compact First-Take-All (FTA) feature vector representing the entire sequence. An optimization algorithm is further introduced to learn optimized projections that increase the discriminative power of the FTA feature.
The compactness of the FTA feature makes it extremely efficient for human activity recognition with nearest neighbor search based on Hamming distance. Experimental results on two public human activity datasets demonstrate the advantages of the FTA feature over state-of-the-art methods in both accuracy and efficiency.
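A minimal sketch of the First-Take-All idea, assuming for illustration a pairwise encoding in which each bit records which of two projections peaks first in time (the dissertation's exact encoding and learned projections are not reproduced here):

```python
import numpy as np

def fta_encode(sequence, projections):
    """sequence: (T, D) per-frame features; projections: (K, D) directions.
    Each bit records which of two projections attains its maximum response
    earlier, so the code captures temporal order rather than magnitude."""
    resp = sequence @ projections.T            # (T, K) responses over time
    t_peak = resp.argmax(axis=0)               # frame where each projection peaks
    K = len(projections)
    return np.array([int(t_peak[i] < t_peak[j])
                     for i in range(K) for j in range(i + 1, K)], dtype=np.uint8)

def hamming(code_a, code_b):
    # Nearest-neighbor search over such binary codes needs only bit comparisons.
    return int(np.count_nonzero(code_a != code_b))
```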
- Date Issued
- 2016
- Identifier
- CFE0006516, ucf:51367
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006516
- Title
- AUTONOMOUS REPAIR OF OPTICAL CHARACTER RECOGNITION DATA THROUGH SIMPLE VOTING AND MULTI-DIMENSIONAL INDEXING TECHNIQUES.
- Creator
-
Sprague, Christopher, Weeks, Arthur, University of Central Florida
- Abstract / Description
-
The three major optical character recognition (OCR) engines in use today (ExperVision, Scansoft OCR, and Abbyy OCR) are all capable of recognizing text at near-perfect percentages. The remaining errors, however, have proven very difficult to identify within a single engine. Recent research has shown that the errors of the three engines have very little correlation, and thus, when used in conjunction, the engines may be useful for increasing the accuracy of the final result. This document discusses the implementation and results of a simple voting system designed to prove this hypothesis and show a statistical improvement in overall accuracy. Additional aspects of implementing an improved OCR scheme, such as dealing with multiple-engine data output alignment and recognizing application-specific solutions, are also addressed in this research. Although voting systems are currently in use by many major OCR engine developers, this research focuses on the addition of a collaborative system that is able to utilize the various positive aspects of multiple engines while also addressing the immediate need for practical industry applications such as litigation and forms processing. Doculex™, a major developer and leader in the document imaging industry, provided the funding for this research.
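The simple voting scheme itself can be sketched directly; a minimal Python illustration assuming the three engines' outputs have already been aligned character by character (alignment is the harder problem, treated in a later record in this listing):

```python
from collections import Counter

def vote_character(c1, c2, c3):
    """Majority vote over three engines' aligned character outputs,
    falling back to the first engine when all three disagree."""
    char, count = Counter([c1, c2, c3]).most_common(1)[0]
    return char if count >= 2 else c1

def vote_text(t1, t2, t3):
    # Assumes equal-length, pre-aligned outputs.
    return "".join(vote_character(a, b, c) for a, b, c in zip(t1, t2, t3))

print(vote_text("recogbition", "recoqnition", "recognition"))  # recognition
```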
- Date Issued
- 2005
- Identifier
- CFE0000380, ucf:46337
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000380
- Title
- AUTOMATED REGRESSION TESTING APPROACH TO EXPANSION AND REFINEMENT OF SPEECH RECOGNITION GRAMMARS.
- Creator
-
Dookhoo, Raul, DeMara, Ronald, University of Central Florida
- Abstract / Description
-
This thesis describes an approach to automated regression testing for speech recognition grammars. A prototype Audio Regression Tester called ART has been developed using Microsoft's Speech API and C#. ART allows a user to perform any of three tasks: automatically generate a new XML-based grammar file from standardized SQL database entries, record and cross-reference audio files for use by an underlying speech recognition engine, and perform regression tests with the aid of an oracle grammar. ART takes as input a wave sound file containing speech and a newly created XML grammar file. It then executes two tests simultaneously: one with the wave file and the new grammar file, and the other with the wave file and the oracle grammar. The comparison of the two test results determines whether the test was successful. This allows rapid, exhaustive evaluation of additions to grammar files to guarantee forward progress as the complexity of the voice domain grows. The data used in this research to derive results were taken from the LifeLike project; however, the capabilities of ART extend beyond LifeLike. The results gathered have shown that using a person's recorded voice for regression testing is as effective as having the person do live testing. A cost-benefit analysis, using two published equations, one for Cost and the other for Benefit, was also performed to determine whether automated regression testing is really more effective than manual testing. Cost captures the salaries of the engineers who perform regression testing tasks, and Benefit captures revenue gains or losses related to changes in product release time. ART had a higher benefit of $21,461.08 when compared to manual regression testing, which had a benefit of $21,393.99. Coupled with its excellent error detection rates, ART has proven to be very efficient and cost-effective in speech grammar creation and refinement.
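The core regression check can be stated in a few lines. The sketch below is schematic Python (ART itself is C# on Microsoft's Speech API); `recognize` is a hypothetical stand-in for a speech recognition call, not an actual API function:

```python
def regression_test(recognize, wave_file, new_grammar, oracle_grammar):
    """Run the same recorded utterance against the new grammar and the
    oracle grammar; the test passes iff both recognitions agree."""
    new_result = recognize(wave_file, new_grammar)
    oracle_result = recognize(wave_file, oracle_grammar)
    return new_result == oracle_result
```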
- Date Issued
- 2008
- Identifier
- CFE0002437, ucf:47703
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002437
- Title
- ORTHOGRAPHIC SIMILARITY AND FALSE RECOGNITION FOR UNFAMILIAR WORDS.
- Creator
-
Perrotte, Jeffrey, Parra-Tatge, Marisol, University of Central Florida
- Abstract / Description
-
There is evidence of false recognition (FR) driven by orthographic similarities within languages (Lambert, Chang, & Lin, 2001; Raser, 1972) and some evidence that FR crosses languages (Parra, 2013). No study has investigated whether FR based on orthographic similarities occurs for unknown words in an unknown language. This study aimed to answer this question. It further explored whether FR based on orthographic similarities is more likely in a known (English) than in an unknown (Spanish) language. Forty-six English monolinguals participated. They studied 50 English and 50 Spanish words during a study phase. A recognition test was given immediately after the study phase. It consisted of 40 Spanish and 40 English words and included list words (i.e., words presented at study); homographs (i.e., words not presented at study but orthographically similar to words presented at study); and unrelated words (i.e., words neither presented at study nor orthographically similar to words presented at study). The LSD post-hoc test showed significant results supporting the hypothesis that false recognition based on orthographic similarities occurs for words in a known language (English) and in an unknown language (Spanish). Further evidence from the LSD post-hoc test supported the hypothesis that false recognition based on orthographic similarities is more likely to occur in a known language than in an unknown language. These results provide evidence that both meaning and orthographic form are used when information is encoded, thereby influencing recognition decisions, and they emphasize the significance of orthography when information is encoded and retrieved. Keywords: false recognition, orthography, semantic, orthographic distinctiveness, semantic distinctiveness, English monolingual
- Date Issued
- 2015
- Identifier
- CFH0004906, ucf:45501
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFH0004906
- Title
- VECTORPAD: A TOOL FOR VISUALIZING VECTOR OPERATIONS.
- Creator
-
Bott, Jared, LaViola, Joseph, University of Central Florida
- Abstract / Description
-
Visualization of three-dimensional vector operations can be very helpful in understanding vector mathematics. However, creating these visualizations using traditional WIMP interfaces can be a troublesome exercise. In this thesis, we present VectorPad, a pen-based application for three-dimensional vector mathematics visualization. VectorPad allows users to define vectors and perform mathematical operations upon them through the recognition of handwritten mathematics. The VectorPad user interface consists of a sketching area, where the user can write vector definitions and other mathematics, and a 3D graph for visualization. After recognition, vectors are visualized dynamically on the graph, which can be manipulated by the user. A variety of mathematical operations can be performed, such as addition, subtraction, scalar multiplication, and cross product. Animations show how operations work on the vectors. We also performed a short, informal user study evaluating the user interface and visualizations of VectorPad. VectorPad's visualizations were generally well liked; results from the study show a need to provide a more comprehensive set of visualization tools as well as refinements to some of the animations.
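For reference, the operations VectorPad animates are the elementary vector operations; a plain NumPy rendering (my illustration, unrelated to VectorPad's pen-based interface):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(u + v)           # addition
print(u - v)           # subtraction
print(2.5 * u)         # scalar multiplication
print(np.cross(u, v))  # cross product, perpendicular to both u and v
```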
- Date Issued
- 2009
- Identifier
- CFE0002827, ucf:48087
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002827
- Title
- Fast Compressed Automatic Target Recognition for a Compressive Infrared Imager.
- Creator
-
Millikan, Brian, Foroosh, Hassan, Rahnavard, Nazanin, Muise, Robert, Atia, George, Mahalanobis, Abhijit, Sun, Qiyu, University of Central Florida
- Abstract / Description
-
Many military systems utilize infrared sensors which allow an operator to see targets at night. Several of these are either mid-wave or long-wave high-resolution infrared sensors, which are expensive to manufacture. But compressive sensing, which has primarily been demonstrated in medical applications, can be used to minimize the number of measurements needed to represent a high-resolution image. Using these techniques, a relatively low-cost mid-wave infrared sensor can be realized which has a high effective resolution. In traditional military infrared sensing applications, like targeting systems, automatic target recognition algorithms are employed to locate and identify targets of interest to reduce the burden on the operator. The resolution of the sensor can increase the accuracy and operational range of a targeting system. When using a compressive sensing infrared sensor, traditional decompression techniques can be applied to form a spatial-domain infrared image, but most are iterative and not ideal for real-time environments. A more efficient method is to adapt the target recognition algorithms to operate directly on the compressed samples. In this work, we present a target recognition algorithm which utilizes a compressed target detection method to identify potential target areas, and then a specialized target recognition technique that operates directly on the same compressed samples. We demonstrate our method on the U.S. Army Night Vision and Electronic Sensors Directorate ATR Algorithm Development Image Database, which has been made available by the Sensing Information Analysis Center.
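The "recognize directly on compressed samples" step can be illustrated with the standard smashed-filter idea: compress each template with the same measurement matrix and correlate in the compressed domain, skipping reconstruction. This is a sketch under the assumption of a random Gaussian measurement matrix; the dissertation's detector is more specialized:

```python
import numpy as np

rng = np.random.default_rng(0)

def smashed_filter_detect(y, templates, phi):
    """y: compressed measurement (m,); templates: list of flattened
    spatial-domain templates (n,); phi: (m, n) measurement matrix.
    Picks the template whose compressed version best correlates with y."""
    scores = [float(y @ (phi @ t)) / (np.linalg.norm(phi @ t) + 1e-12)
              for t in templates]
    return int(np.argmax(scores))

# Toy usage: 32x32 targets, 200 random measurements.
n, m = 32 * 32, 200
phi = rng.standard_normal((m, n)) / np.sqrt(m)
templates = [rng.standard_normal(n) for _ in range(3)]
y = phi @ templates[1]                       # measurement of target class 1
print(smashed_filter_detect(y, templates, phi))  # -> 1
```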
- Date Issued
- 2018
- Identifier
- CFE0007408, ucf:52739
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007408
- Title
- The Perceptual and Decisional Basis of Emotion Identification in Creative Writing.
- Creator
-
Williams, Sarah, Bohil, Corey, Hancock, Peter, Smither, Janan, Johnson, Dan, University of Central Florida
- Abstract / Description
-
The goal of this research was to assess the ability of readers to determine the emotion of a passage of text, be it fictional or non-fictional. The research includes examining how genre (fiction and non-fiction) and emotion (positive emotion, such as happiness, and negative emotion, such as anger) interact to form a reading experience. Reading is an activity that many, if not most, humans undertake in either a professional or leisure capacity. Researchers are thus interested in the effect reading has on the individual, particularly with regard to empathy. Some researchers believe reading fosters empathy; others think empathy might already be present in those who enjoy reading. A greater understanding of this dispute could be provided by general recognition theory (GRT). GRT allows researchers to investigate how stimulus dimensions interact in an observer's mind: on a perceptual or a decisional level. In the context of reading, this allows researchers to look at how emotion is tied in with (or inseparable from) genre, or whether the ability to determine the emotion of a passage is independent of the genre of the passage. In the reported studies, participants read passages and responded to questions on the passages and their content. Empathy scores significantly predicted discriminability of passage categories, as did reported hours spent reading per week. Non-fiction passages were easier to identify than fiction, and positive emotion classification was associated with non-fiction classification.
- Date Issued
- 2019
- Identifier
- CFE0007877, ucf:52760
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007877
- Title
- Action Recognition, Temporal Localization and Detection in Trimmed and Untrimmed Video.
- Creator
-
Hou, Rui, Shah, Mubarak, Mahalanobis, Abhijit, Hua, Kien, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
Automatic understanding of videos is one of the most active areas of computer vision research. It has applications in video surveillance, human-computer interaction, video sports analysis, virtual and augmented reality, video retrieval, etc. In this dissertation, we address four important tasks in video understanding, namely action recognition, temporal action localization, spatio-temporal action detection, and video object/action segmentation. This dissertation contributes to the above tasks as follows. First, for video action recognition, we propose a category-level feature learning method. Our proposed method automatically identifies pairs of categories using a criterion of mutual pairwise proximity in the (kernelized) feature space and a category-level similarity matrix where each entry corresponds to the one-vs-one SVM margin for a pair of categories. Second, for temporal action localization, we propose to exploit the temporal structure of actions by modeling an action as a sequence of sub-actions, and we present a computationally efficient approach. Third, we propose a 3D Tube Convolutional Neural Network (TCNN) based pipeline for action detection. The proposed architecture is a unified deep network that is able to recognize and localize actions based on 3D convolution features; it generalizes the popular Faster R-CNN framework from images to videos. Last, an end-to-end encoder-decoder based 3D convolutional neural network pipeline is proposed, which is able to segment the foreground objects from the background; the action label can then be obtained by passing the foreground object into an action classifier. Extensive experiments on several video datasets demonstrate the superior performance of the proposed approach for video understanding compared to the state of the art.
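The building block behind such tube-based pipelines is the 3D convolution, which slides filters over time as well as space so that the learned features encode motion along with appearance. A minimal PyTorch illustration (assuming PyTorch, which the abstract does not specify):

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, RGB, frames, height, width)
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=1)
features = conv3d(clip)                   # filters span 3 frames x 3x3 pixels
print(features.shape)                     # torch.Size([1, 64, 16, 112, 112])
```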
- Date Issued
- 2019
- Identifier
- CFE0007655, ucf:52502
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007655
- Title
- PHONEME-BASED VIDEO INDEXING USING PHONETIC DISPARITY SEARCH.
- Creator
-
Leon-Barth, Carlos, DeMara, Ronald, University of Central Florida
- Abstract / Description
-
This dissertation presents and evaluates an approach to the video indexing problem, investigating a categorization method that transcribes audio content through Automatic Speech Recognition (ASR) combined with Dynamic Contextualization (DC), Phonetic Disparity Search (PDS), and Metaphone indexation. The suggested approach applies genome pattern matching algorithms with computational summarization to build a database infrastructure that provides an indexed summary of the original audio content. PDS complements the contextual phoneme indexing approach by optimizing topic search performance and accuracy in large video content structures. A prototype was established to translate news broadcast video into text and phonemes automatically by using ASR utterance conversions. Each extracted phonetic utterance was then categorized, converted to Metaphones, and stored in a repository with contextual topical information attached and indexed for posterior search analysis. Following the original design strategy, a custom parallel interface was built to measure the capabilities of dissimilar phonetic queries and provide an interface for result analysis. The postulated solution provides evidence of superior topic matching when compared to traditional word and phoneme search methods. Experimental results demonstrate that PDS can be 3.7% better than the equivalent phoneme query, while Metaphone search proved to be 154.6% better than the equivalent phoneme search and 68.1% better than the equivalent word search.
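A minimal sketch of Metaphone indexation, assuming the third-party `jellyfish` library for the Metaphone encoding (the prototype's own pipeline and its PDS matching are not reproduced here): transcript words are keyed by their phonetic code, so a misrecognized but similar-sounding query still finds the right entries.

```python
import jellyfish
from collections import defaultdict

index = defaultdict(list)

def index_transcript(doc_id, transcript):
    for position, word in enumerate(transcript.split()):
        index[jellyfish.metaphone(word)].append((doc_id, position))

def phonetic_search(query_word):
    # A similar-sounding variant maps to the same Metaphone key.
    return index[jellyfish.metaphone(query_word)]

index_transcript("news_001", "the hurricane made landfall near Tampa")
print(phonetic_search("hurricain"))   # still hits "hurricane"
```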
- Date Issued
- 2010
- Identifier
- CFE0003480, ucf:48979
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003480
- Title
- Weakly Labeled Action Recognition and Detection.
- Creator
-
Sultani, Waqas, Shah, Mubarak, Bagci, Ulas, Qi, GuoJun, Yun, Hae-Bum, University of Central Florida
- Abstract / Description
-
Research in human action recognition strives to develop increasingly generalized methods that are robust to intra-class variability and inter-class ambiguity. Recent years have seen tremendous strides in improving recognition accuracy on ever larger and more complex benchmark datasets, comprising realistic ("in the wild") videos. Unfortunately, the all-encompassing, dense, global representations that bring about such improvements often benefit from inherent characteristics, specific to datasets and classes, that do not necessarily reflect knowledge about the entity to be recognized. This results in specific models that perform well within datasets but generalize poorly. Furthermore, training supervised action recognition and detection methods requires many precise spatio-temporal manual annotations to achieve good recognition and detection accuracy. For instance, current deep learning architectures require millions of accurately annotated videos to learn robust action classifiers. However, these annotations are quite difficult to achieve.
In the first part of this dissertation, we explore the reasons for poor classifier performance when tested on novel datasets, and quantify the effect of scene backgrounds on action representations and recognition. We attempt to address the problem of recognizing human actions while training and testing on distinct datasets, when test videos are neither labeled nor available during training. In this scenario, learning a joint vocabulary or applying domain transfer techniques is not possible. We perform different types of partitioning of the GIST feature space for several datasets and compute measures of background scene complexity, as well as the extent to which scenes are helpful in action classification. We then propose a new process to obtain a measure of confidence in each pixel of the video being a foreground region, using motion, appearance, and saliency together in a 3D Markov Random Field (MRF) based framework. We also propose multiple ways to exploit the foreground confidence: to improve the bag-of-words vocabulary, the histogram representation of a video, and a novel histogram-decomposition-based representation and kernel.
The above-mentioned work provides the probability of each pixel belonging to the actor; however, it does not give the precise spatio-temporal location of the actor. Furthermore, the above framework would require precise spatio-temporal manual annotations to train an action detector, and manual annotations in videos are laborious, require several annotators, and contain human biases. Therefore, in the second part of this dissertation, we propose a weakly labeled approach to automatically obtain spatio-temporal annotations of actors in action videos. We first obtain a large number of action proposals in each video. To capture a few of the most representative action proposals in each video and avoid processing thousands of them, we rank them using optical flow and saliency in a 3D-MRF based framework and select a few proposals using a MAP-based proposal subset selection method. We demonstrate that this ranking preserves the high-quality action proposals. Several such proposals are generated for each video of the same action. Our next challenge is to iteratively select one proposal from each video so that all proposals are globally consistent. We formulate this as a Generalized Maximum Clique Graph problem (GMCP) using shape, global, and fine-grained similarity of proposals across the videos. The output of our method is the most action-representative proposal from each video; multiple instances of the same action in a video can also be annotated. Moreover, action detection experiments using annotations obtained by our method and several baselines demonstrate the superiority of our approach.
The above-mentioned annotation method uses multiple videos of the same action. Therefore, in the third part of this dissertation, we tackle the problem of spatio-temporal action localization in a video, without assuming the availability of multiple videos or any prior annotations. The action is localized by employing images downloaded from the Internet using the action label. Given web images, we first dampen image noise using a random walk and avoid distracting backgrounds within images using image action proposals. Then, given a video, we generate multiple spatio-temporal action proposals. We suppress camera- and background-generated proposals by exploiting optical flow gradients within proposals. To obtain the most action-representative proposals, we propose to reconstruct action proposals in the video by leveraging the action proposals in images. Moreover, we preserve the temporal smoothness of the video and reconstruct all proposal bounding boxes jointly, using constraints that push the coefficients for each bounding box toward a common consensus, thus enforcing coefficient similarity across multiple frames. We solve this optimization problem using a variant of the two-metric projection algorithm. Finally, the video proposal that has the lowest reconstruction cost and is motion salient is used to localize the action. Our method is applicable not only to trimmed videos but also to action localization in untrimmed videos, which is a very challenging problem.
Finally, in the last part of this dissertation, we propose a novel approach to generate a few properly ranked action proposals from a large number of noisy proposals. The proposed approach begins by dividing each proposal into sub-proposals. We assume that the quality of a proposal remains the same within each sub-proposal. We then employ a graph optimization method to recombine the sub-proposals of all action proposals in a single video in order to optimally build new action proposals and rank them by the combined node and edge scores. For an untrimmed video, we first divide the video into shots and then build the above-mentioned graph within each shot. Our method generates a few ranked proposals that can be better than all of the existing underlying proposals. Our experimental results validated that properly ranked action proposals can significantly boost action detection results.
Our extensive experimental results on different challenging and realistic action datasets, comparisons with several competitive baselines, and detailed analysis of each step of the proposed methods validate the proposed ideas and frameworks.
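A minimal sketch of the sub-proposal recombination idea, with hypothetical `node_score` and `edge_score` callables standing in for the paper's learned scores; the graph optimization is reduced here to a best-chain dynamic program over temporal slots, where the edge score would reward spatial overlap between consecutive sub-proposals:

```python
def best_chain(slots, node_score, edge_score):
    """slots: list of lists; slots[t] holds the sub-proposals (cut from all
    original proposals) for temporal slot t. Returns (total score, chain)."""
    prev = [(node_score(s), [s]) for s in slots[0]]
    for slot in slots[1:]:
        cur = []
        for s in slot:
            # Best predecessor chain for this sub-proposal.
            score, path = max(prev,
                              key=lambda sp: sp[0] + edge_score(sp[1][-1], s))
            cur.append((score + edge_score(path[-1], s) + node_score(s),
                        path + [s]))
        prev = cur
    return max(prev, key=lambda sp: sp[0])
```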
- Date Issued
- 2017
- Identifier
- CFE0006801, ucf:51809
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006801
- Title
- Exploring sparsity, self-similarity, and low rank approximation in action recognition, motion retrieval, and action spotting.
- Creator
-
Sun, Chuan, Foroosh, Hassan, Hughes, Charles, Tappen, Marshall, Sukthankar, Rahul, Moshell, Jack, University of Central Florida
- Abstract / Description
-
This thesis consists of four major parts. In the first part (Chapters 1-2), we introduce the overview, motivation, and contributions of our work, and extensively survey the current literature on six related topics. In the second part (Chapters 3-7), we explore the concept of "self-similarity" in two challenging scenarios, namely action recognition and motion retrieval. We build three-dimensional volume representations for both scenarios and devise effective techniques that can produce compact representations encoding the internal dynamics of the data. In the third part (Chapter 8), we explore the challenging action spotting problem and propose a feature-independent unsupervised framework that is effective in spotting actions under various real situations, even under heavily perturbed conditions. The final part (Chapter 9) is dedicated to conclusions and future work.
For action recognition, we introduce a generic method that does not depend on one particular type of input feature vector. We make three main contributions: (i) we introduce the concept of the Joint Self-Similarity Volume (Joint SSV) for modeling dynamical systems, and show that by using a new optimized rank-1 tensor approximation of the Joint SSV one can obtain compact low-dimensional descriptors that very accurately preserve the dynamics of the original system, e.g. an action video sequence; (ii) the descriptor vectors derived from the optimized rank-1 approximation make it possible to recognize actions without explicitly aligning action sequences of varying execution speed or different frame rates; (iii) the method is generic and can be applied using different low-level features, such as silhouettes, histograms of oriented gradients (HOG), etc., and hence does not necessarily require explicit tracking of features in the space-time volume. Our experimental results on five public datasets demonstrate that our method produces very good results and outperforms many baseline methods.
For action recognition in incomplete videos, we determine whether incomplete videos, which are often discarded, carry useful information for action recognition, and if so, how one can represent such a mixed collection of video data (complete versus incomplete, and labeled versus unlabeled) in a unified manner. We propose a novel framework to handle incomplete videos in action classification, and make three main contributions: (i) we cast the action classification problem for a mixture of complete and incomplete data as a semi-supervised learning problem over labeled and unlabeled data; (ii) we introduce a two-step approach to convert the input mixed data into a uniform compact representation; (iii) exhaustively scrutinizing 280 configurations, we experimentally show on our two created benchmarks that, even when the videos are extremely sparse and incomplete, it is still possible to recover useful information from them and classify unknown actions using a graph-based semi-supervised learning framework.
For motion retrieval, we present a framework that allows for flexible and efficient retrieval of motion capture data in huge databases. The method first converts an action sequence into a self-similarity matrix (SSM), which is based on the notion of self-similarity. This conversion of the motion sequences into compact and low-rank subspace representations greatly reduces the spatiotemporal dimensionality of the sequences.
The SSMs are then used to construct order-3 tensors, and we propose a low-rank decomposition scheme that allows for converting the motion sequence volumes into compact lower-dimensional representations without losing the nonlinear dynamics of the motion manifold. Thus, unlike existing linear dimensionality reduction methods that distort the motion manifold and lose very critical and discriminative components, the proposed method performs well even when inter-class differences are small or intra-class differences are large. In addition, the method allows for efficient retrieval and does not require time-alignment of the motion sequences. We evaluate the performance of our retrieval framework on the CMU mocap dataset under two experimental settings, both demonstrating very good retrieval rates.
For action spotting, our framework does not depend on any specific feature (e.g. HOG/HOF, STIP, silhouette, bag-of-words, etc.), and requires no human localization, segmentation, or framewise tracking. This is achieved by treating the problem holistically as that of extracting the internal dynamics of video cuboids by modeling them in their natural form as multilinear tensors. To extract their internal dynamics, we devised a novel Two-Phase Decomposition (TP-Decomp) of a tensor that generates very compact and discriminative representations that are robust to even heavily perturbed data. Technically, a Rank-based Tensor Core Pyramid (Rank-TCP) descriptor is generated by combining multiple tensor cores under multiple ranks, allowing us to represent video cuboids in a hierarchical tensor pyramid. The problem then reduces to a template matching problem, which is solved efficiently by using two boosting strategies: (i) to reduce the search space, we filter the dense trajectory cloud extracted from the target video; (ii) to boost the matching speed, we perform matching in an iterative coarse-to-fine manner. Experiments on 5 benchmarks show that our method outperforms the current state-of-the-art under various challenging conditions. We also created a challenging dataset called Heavily Perturbed Video Arrays (HPVA) to validate the robustness of our framework under heavily perturbed situations.
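The self-similarity construction itself is compact. The sketch below computes an SSM from per-frame features and a rank-1 approximation via SVD as a two-dimensional stand-in for the thesis's rank-1 tensor approximation of the Joint SSV (an illustration, not the thesis's exact algorithm):

```python
import numpy as np

def self_similarity_matrix(frames):
    """frames: (T, D) per-frame feature vectors. SSM[i, j] is the distance
    between frames i and j; its pattern encodes the sequence's dynamics
    and is largely stable under viewpoint change."""
    diff = frames[:, None, :] - frames[None, :, :]
    return np.linalg.norm(diff, axis=2)

def rank1_signature(ssm):
    # Dominant singular vector: a compact descriptor of the SSM's structure.
    u, s, vt = np.linalg.svd(ssm)
    return s[0], u[:, 0]
```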
- Date Issued
- 2014
- Identifier
- CFE0005554, ucf:50290
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005554
- Title
- GEOMETRIC INVARIANCE IN THE ANALYSIS OF HUMAN MOTION IN VIDEO DATA.
- Creator
-
Shen, Yuping, Foroosh, Hassan, University of Central Florida
- Abstract / Description
-
Human motion analysis is one of the major problems in computer vision research. It deals with the study of the motion of the human body in video data from different aspects, ranging from the tracking of body parts and reconstruction of 3D human body configuration, to higher-level interpretation of human actions and activities in image sequences. When human motion is observed through a video camera, it is perspectively distorted and may appear totally different from different viewpoints. Therefore it is highly challenging to establish correct relationships between human motions across video sequences with different camera settings. In this work, we investigate geometric invariance in the motion of the human body, which is critical to accurately understanding human motion in video data regardless of variations in camera parameters and viewpoints. In human action analysis, the representation of human action is a very important issue, and it usually determines the nature of the solutions, including their limits in resolving the problem. Unlike existing research that studies human motion as a whole 2D/3D object or a sequence of postures, we study human motion as a sequence of body pose transitions. We further decompose a human body pose into a number of body point triplets, and break down a pose transition into the transitions of a set of body point triplets. In this way the study of the complex non-rigid motion of the human body is reduced to that of the motion of rigid body point triplets, i.e. a collection of planes in motion. As a result, projective geometry and linear algebra can be applied to explore the geometric invariance in human motion. Based on this formulation, we have discovered the fundamental ratio invariant and the eigenvalue equality invariant in human motion. We also propose solutions based on these geometric invariants to the problems of view-invariant recognition of human postures and actions, as well as analysis of human motion styles. These invariants and their applicability have been validated by experimental results, supporting their effectiveness in understanding human motion under various camera parameters and viewpoints.
- Date Issued
- 2009
- Identifier
- CFE0002945, ucf:47970
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002945
- Title
- TEXT-IMAGE RESTORATION AND TEXT ALIGNMENT FOR MULTI-ENGINE OPTICAL CHARACTER RECOGNITION SYSTEMS.
- Creator
-
Kozlovski, Nikolai, Weeks, Arthur, University of Central Florida
- Abstract / Description
-
Previous research showed that combining the results of three different optical character recognition (OCR) engines (ExperVision® OCR, Scansoft OCR, and Abbyy® OCR) using voting algorithms yields a higher accuracy rate than any of the engines individually. While a voting algorithm has been realized, several aspects of automating and improving the accuracy rate needed further research. This thesis focuses on morphological image preprocessing and morphological restoration of the text images before they are passed to the OCR engines. This method is similar to the one used in restoring partial fingerprints. A series of morphological dilating and eroding filters of various mask shapes and sizes was applied to text of different font sizes and types with various noises added. These images were then processed by the OCR engines, and based on these results successful combinations of text, noise, and filters were chosen. The thesis also deals with the problem of text alignment. Each OCR engine has its own way of dealing with noise and corrupted characters; as a result, the output texts of the OCR engines have different lengths and numbers of words. This, in turn, makes it impossible to use spaces as delimiters to separate the words for processing by the voting part of the system. Text alignment determines, using various techniques, what is an extra word, what is supposed to be two or more words instead of one, which words are missing in one document compared to the other, etc. The alignment algorithm is made up of a series of shifts in the two texts to determine which parts are similar and which are not. Since errors made by OCR engines are due to visual misrecognition, in addition to simple character comparison (equal or not), a technique was developed that allows comparison of characters based on how they look.
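Word-level alignment of two engines' outputs can be sketched with Python's standard difflib (an illustration of the alignment problem only; the thesis's algorithm additionally uses visual similarity between characters):

```python
import difflib

def align_words(words_a, words_b):
    """Align two OCR word streams of unequal length; yields pairs with
    None marking insertions/deletions, so downstream voting can line
    words up instead of relying on spaces as delimiters."""
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    pairs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            pairs += list(zip(words_a[i1:i2], words_b[j1:j2]))
        else:
            for k in range(max(i2 - i1, j2 - j1)):
                wa = words_a[i1 + k] if i1 + k < i2 else None
                wb = words_b[j1 + k] if j1 + k < j2 else None
                pairs.append((wa, wb))
    return pairs

print(align_words("the quick brown fox".split(),
                  "the qu1ck brown very fox".split()))
```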
- Date Issued
- 2006
- Identifier
- CFE0001060, ucf:46799
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0001060
- Title
- RECOGNIZING PAIN USING NOVEL SIMULATION TECHNOLOGY.
- Creator
-
Grace, Justin C, Allred, Kelly, University of Central Florida
- Abstract / Description
-
Effective pain management and time to treatment are essential in patient care. Despite scientific evidence supporting the need to treat pain and an emphasis on addressing pain as a priority, pain management continues to be an unresolved issue. As members of the health care team, nurses are integral to optimal pain management. Currently, nursing schools have limited innovative or alternative methods for teaching pain assessment and management. Simulation in nursing education provides a unique opportunity to expose students to realistic patient situations and allow them to learn and make mistakes without causing harm. However, modern low- and high-fidelity simulation technology is unable to display emotion, pain, or any facial expression. This limits training and education for conditions that may partially rely on the identification of symptoms based on alterations in facial appearance, such as pain or stroke. This research explored student nurses' perception of new technology that displayed computer-generated faces, each expressing varying degrees of physical expressions of pain. A total of 15 nursing students participated in the study. Students were asked to interpret the level of pain in four sequential faces using a numeric rating scale of 0-10, with 0 indicating no pain and 10 the most severe pain possible. After scoring the faces, students were asked to answer four open-ended questions addressing the technology. Results of the study indicate a majority of nursing students believe the technology should be implemented into the nursing curriculum and that interacting with the projected faces was more beneficial than traditional teaching methods. Eventually, the potential for increased identification of conditions requiring observation of subtle facial changes will be explored.
- Date Issued
- 2016
- Identifier
- CFH2000002, ucf:45569
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFH2000002
- Title
- INVESTIGATION OF DAMAGE DETECTION METHODOLOGIES FOR STRUCTURAL HEALTH MONITORING.
- Creator
-
Gul, Mustafa, Catbas, F. Necati, University of Central Florida
- Abstract / Description
-
Structural Health Monitoring (SHM) is employed to track and evaluate damage and deterioration during regular operation as well as after extreme events for aerospace, mechanical and civil structures. A complete SHM system incorporates performance metrics, sensing, signal processing, data analysis, transmission and management for decision-making purposes. Damage detection in the context of SHM can be successful by employing a collection of robust and practical damage detection methodologies that can be used to identify, locate and quantify damage or, in general terms, changes in observable behavior. In this study, different damage detection methods are investigated for global condition assessment of structures. First, different parametric and non-parametric approaches are re-visited and further improved for damage detection using vibration data. Modal flexibility, modal curvature and un-scaled flexibility based on the dynamic properties that are obtained using Complex Mode Indicator Function (CMIF) are used as parametric damage features. Second, statistical pattern recognition approaches using time series modeling in conjunction with outlier detection are investigated as a non-parametric damage detection technique. Third, a novel methodology using ARX models (Auto-Regressive models with eXogenous output) is proposed for damage identification. By using this new methodology, it is shown that damage can be detected, located and quantified without the need of external loading information. Next, laboratory studies are conducted on different test structures with a number of different damage scenarios for the evaluation of the techniques in a comparative fashion. Finally, application of the methodologies to real life data is also presented along with the capabilities and limitations of each approach in light of analysis results of the laboratory and real life data.
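A minimal sketch of the time-series idea, assuming per-channel vibration signals as NumPy arrays: fit an autoregressive model by least squares (the exogenous terms are dropped here, consistent with the abstract's point that external loading information may be unavailable) and use the coefficient shift between baseline and current data as a damage feature.

```python
import numpy as np

def fit_ar(y, order=4):
    """Least-squares AR fit to a 1D signal y: y[t] ~ sum_k a_k * y[t-k]."""
    rows = [y[i - order:i][::-1] for i in range(order, len(y))]
    coeffs, *_ = np.linalg.lstsq(np.array(rows), y[order:], rcond=None)
    return coeffs

def damage_feature(y_baseline, y_current, order=4):
    # A large shift in model coefficients suggests a change in condition.
    return np.linalg.norm(fit_ar(y_baseline, order) - fit_ar(y_current, order))
```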
- Date Issued
- 2009
- Identifier
- CFE0002830, ucf:48069
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002830
- Title
- OBJECT TRACKING AND ACTIVITY RECOGNITION IN VIDEO ACQUIRED USING MOBILE CAMERAS.
- Creator
-
Yilmaz, Alper, Shah, Mubarak, University of Central Florida
- Abstract / Description
-
Show moreDue to increasing demand on deployable surveillance systems in recent years, object tracking and activity recognition are receiving considerable attention in the research community. This thesis contributes to both the tracking and the activity recognition components of a surveillance system. In particular, for the tracking component, we propose two different approaches for tracking objects in video acquired by mobile cameras, each of which uses a different object shape representation. The first approach tracks the centroids of the objects in Forward Looking Infrared Imagery (FLIR) and is suitable for tracking objects that appear small in airborne video. The second approach tracks the complete contours of the objects, and is suitable for higher level vision problems, such as activity recognition, identification and classification. Using the contours tracked by the contour tracker, we propose a novel representation, called the action sketch, for recognizing human activities.Object Tracking in Airborne Imagery: Images obtained from an airborne vehicle generally appear small and can be represented by geometric shapes such as circle or rectangle. After detecting the object position in the first frame, the proposed object tracker models the intensity and the local standard deviation of the object region defined by the shape model. It then tracks the objects by computing the mean-shift vector that minimizes the distance between the kernel distribution for the hypothesized object and its prior. In cases when the ego-motion of the sensor causes the object to move more than the operational limits of the tracking module, a multi-resolution global motion compensation using the Gabor responses of consecutive frames is performed. The experiments performed on the AMCOM FLIR data set show the robustness of the proposed method, which combines automatic model update and global motion compensation into one framework.Contour Tracker: Contour tracking is performed by evolving an initial contour toward the correct object boundaries based on discriminant analysis, which is formulated as a variational calculus problem. Once the contour is initialized, the method generates an online shape model for the object along with the color and the texture priors for both the object and the background regions. A priori texture and color PDFs of the regions are then fused based on the discrimination properties of the features between the object and the background models. The models are then used to compute the posteriori contour likelihood and the evolution is obtained by the Maximum a Posteriori Estimation process, which updates the contour in the gradient ascent direction of the proposed energy functional. During occlusion, the online shape model is used to complete the missing object region. The proposed energy functional unifies commonly used boundary and region based contour approaches into a single framework through a support region defined around the hypothesized object contour. We tested the robustness of the proposed contour tracker using several real sequences and have verified qualitatively that the contours of the objects are perfectly tracked.Behavior Analysis: We propose a novel approach to represent human actions by modeling the dynamics (motion) and the structure (shape) of the objects in video. Both the motion and the shape are modeled using a compact representation, which is called the ``action sketch''. 
Contour Tracker: Contour tracking is performed by evolving an initial contour toward the correct object boundaries based on discriminant analysis, formulated as a variational calculus problem. Once the contour is initialized, the method generates an online shape model for the object, along with color and texture priors for both the object and the background regions. The a priori texture and color PDFs of the two regions are then fused according to how well each feature discriminates between the object and background models. These models are used to compute the a posteriori contour likelihood, and the evolution is obtained by a maximum a posteriori estimation process that updates the contour in the gradient-ascent direction of the proposed energy functional. During occlusion, the online shape model is used to complete the missing object region. The proposed energy functional unifies commonly used boundary-based and region-based contour approaches in a single framework through a support region defined around the hypothesized object contour. We tested the robustness of the proposed contour tracker on several real sequences and verified qualitatively that the object contours are tracked accurately.

Behavior Analysis: We propose a novel approach to representing human actions by modeling the dynamics (motion) and the structure (shape) of the objects in video. Both the motion and the shape are modeled using a compact representation called the ``action sketch''. An action sketch is a view-invariant representation obtained by analyzing important changes that occur during the motion of the objects. When an actor performs an action in 3D, the points on the actor generate space-time trajectories in four dimensions $(x,y,z,t)$. Projection of the world onto the imaging coordinates converts these space-time trajectories into spatio-temporal trajectories in three dimensions $(x,y,t)$. A set of spatio-temporal trajectories constitutes a 3D volume, which we call an ``action volume''. This volume ...
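The action volume itself is straightforward to picture in code. The following is a minimal sketch under the assumption that per-frame binary silhouette masks of the actor are already available (e.g., from the contour tracker above); it stacks the masks into a $(y,x,t)$ volume and traces one spatio-temporal trajectory, here simply the silhouette centroid, through it. Both function names are hypothetical.

    import numpy as np

    def build_action_volume(masks):
        # masks: list of equally sized HxW boolean arrays, one per frame.
        # Stacking along a new last axis gives an (H, W, T) action volume.
        return np.stack(masks, axis=-1)

    def centroid_trajectory(volume):
        # One illustrative spatio-temporal trajectory (x, y, t) through the
        # volume: the centroid of the silhouette in each frame.
        traj = []
        for t in range(volume.shape[-1]):
            ys, xs = np.nonzero(volume[..., t])
            if xs.size:
                traj.append((xs.mean(), ys.mean(), t))
        return np.array(traj)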
- Date Issued
- 2004
- Identifier
- CFE0000101, ucf:52858
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000101
- Title
- A Study of Localization and Latency Reduction for Action Recognition.
- Creator
-
Masood, Syed, Tappen, Marshall, Foroosh, Hassan, Stanley, Kenneth, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
The success of recognizing periodic actions in single-person, simple-background datasets such as Weizmann and KTH has created a need for more complex datasets to push the performance of action recognition systems. In this work, we create a new synthetic action dataset and use it to highlight weaknesses in current recognition systems. Experiments show that introducing background complexity to action video sequences causes a significant degradation in recognition performance. Moreover, this degradation cannot be fixed by fine-tuning system parameters or by selecting better feature points. Instead, we show that the problem lies in the spatio-temporal cuboid volume extracted from the interest point locations. Having identified the problem, we show how improved results can be achieved by simple modifications to the cuboids.

For the above method, however, one requires near-perfect localization of the action within a video sequence. To achieve this objective, we present a two-stage weakly supervised probabilistic model for simultaneous localization and recognition of actions in videos. Unlike previous approaches, our method (1) eliminates the need for manual annotations in the training procedure and (2) does not require any human detection or tracking in the classification stage. The first stage of our framework is a probabilistic action localization model that extracts the most promising sub-windows in a video sequence where an action can take place. We use a non-linear classifier in the second stage for the final classification task. We show the effectiveness of the proposed model on two well-known real-world datasets: UCF Sports and UCF11.

Another application of the weakly supervised probabilistic model proposed above is in the gaming environment. An important aspect of designing interactive, action-based interfaces is reliably recognizing actions with minimal latency. High latency causes the system's feedback to lag behind the user and thus significantly degrades the interactivity of the user experience. With slight modifications to the weakly supervised probabilistic model proposed for action localization, we show how it can be used to reduce latency when recognizing actions in Human-Computer Interaction (HCI) environments. This latency-aware learning formulation trains a logistic regression-based classifier that automatically determines distinctive canonical poses from the data and uses these to robustly recognize actions in the presence of ambiguous poses. We introduce a novel (publicly released) dataset for the purpose of our experiments. Comparisons of our method against both a Bag-of-Words and a Conditional Random Field (CRF) classifier show improved recognition performance for both pre-segmented and online classification tasks.
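The latency-reduction idea, firing as soon as a single frame shows a sufficiently distinctive canonical pose instead of waiting for the whole sequence, can be sketched as follows. This is a hypothetical per-frame multinomial logistic regression with a confidence gate; the descriptor, the threshold value, and the offline training that selects canonical poses are all assumptions and not the thesis's exact formulation.

    import numpy as np

    def softmax(z):
        z = z - z.max()                  # numerical stability
        e = np.exp(z)
        return e / e.sum()

    class OnlinePoseClassifier:
        # Per-frame classifier that emits a label as soon as one pose is
        # confident enough, rather than after the full sequence ends.
        def __init__(self, W, b, threshold=0.9):
            self.W, self.b = W, b        # weights learned offline
            self.threshold = threshold   # confidence required to fire

        def update(self, pose_descriptor):
            probs = softmax(self.W @ pose_descriptor + self.b)
            label = int(np.argmax(probs))
            if probs[label] >= self.threshold:
                return label             # recognized with low latency
            return None                  # ambiguous pose: keep waiting

Feeding each incoming frame's pose descriptor to update() gives an online decision; lowering the threshold trades accuracy for latency, which is exactly the trade-off a latency-aware training formulation is designed to manage.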
- Date Issued
- 2012
- Identifier
- CFE0004575, ucf:49210
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004575
- Title
- Human Action Localization and Recognition in Unconstrained Videos.
- Creator
-
Boyraz, Hakan, Tappen, Marshall, Foroosh, Hassan, Lin, Mingjie, Zhang, Shaojie, Sukthankar, Rahul, University of Central Florida
- Abstract / Description
-
As imaging systems become ubiquitous, the ability to recognize human actions is becoming increasingly important. Just as in the object detection and recognition literature, action recognition can be roughly divided into classification tasks, where the goal is to classify a video according to the action depicted in it, and detection tasks, where the goal is to detect and localize a human performing a particular action. A growing literature demonstrates the benefits of localizing discriminative sub-regions of images and videos when performing recognition tasks. In this thesis, we address both the action detection and recognition problems. Action detection in video is particularly difficult because actions must not only be recognized correctly but also localized in the 3D spatio-temporal volume. We introduce a technique that transforms the 3D localization problem into a series of 2D detection tasks. This is accomplished by dividing the video into overlapping segments and then representing each segment with a 2D video projection. The advantage of the 2D projection is that it makes it convenient to apply the best techniques from object detection to the action detection problem. We also introduce a novel, straightforward method for searching the 2D projections to localize actions, termed Two-Point Subwindow Search (TPSS). Finally, we show how to connect the local detections in time using a chaining algorithm to identify the entire extent of the action. Our experiments show that video projection outperforms the latest results on action detection in a direct comparison.

Second, we present a probabilistic model that learns to identify discriminative regions in videos from weakly supervised data, where each video clip is assigned only a label describing what action is present. While our first system requires every action to be manually outlined in every frame of the video, this second system requires only a single high-level tag per video. From this data, the system is able to identify discriminative regions that correspond well to the regions containing the actual actions. Our experiments on the MSR Action Dataset II and the UCF Sports Dataset show that the localizations produced by this weakly supervised system are comparable in quality to those produced by systems that require each frame to be manually annotated. The system is able to detect actions in both (1) non-temporally segmented action videos and (2) recognition tasks where a single label is assigned to the clip. We also demonstrate the action recognition performance of our method on two complex datasets, i.e., HMDB and UCF101.

Third, we extend our weakly supervised framework by replacing the recognition stage with a two-stage neural network and applying dropout to prevent overfitting of the parameters on the training data. Dropout was recently introduced to prevent overfitting of the parameters in deep neural networks and has been applied successfully to object recognition. To our knowledge, this is the first system to use dropout for action recognition. We demonstrate that using dropout improves action recognition accuracies on the HMDB and UCF101 datasets.
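The 3D-to-2D reduction in the first contribution is easy to illustrate: slice the video into overlapping temporal segments and collapse each segment to a single 2D image, on which any 2D detector can run. In the sketch below, the per-pixel maximum of absolute frame differences is an assumed stand-in for the thesis's actual projection, and the segment length and stride are illustrative.

    import numpy as np

    def segment_projections(video, seg_len=15, stride=5):
        # video: (T, H, W) grayscale array. Each overlapping temporal
        # segment is collapsed into one 2D motion-energy image.
        diffs = np.abs(np.diff(video.astype(np.float32), axis=0))
        projections = []
        for start in range(0, diffs.shape[0] - seg_len + 1, stride):
            proj = diffs[start:start + seg_len].max(axis=0)  # per-pixel max
            projections.append((start, proj))
        return projections

A 2D detector (the subwindow search in the thesis) is then run on each projection, and the per-segment hits are chained over time to recover the full temporal extent of the action.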
- Date Issued
- 2013
- Identifier
- CFE0004977, ucf:49562
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0004977
- Title
- FEATURE PRUNING FOR ACTION RECOGNITION IN COMPLEX ENVIRONMENT.
- Creator
-
Nagaraja, Adarsh, Tappen, Marshall, University of Central Florida
- Abstract / Description
-
A significant number of action recognition research efforts use spatio-temporal interest point detectors for feature extraction. Although the extracted features provide useful information for recognizing actions, a significant fraction of them contain irrelevant motion and background clutter. In many cases, the extracted features are included as-is in the classification pipeline, and sophisticated noise removal techniques are subsequently used to alleviate their effect on classification. We introduce a new action database, created from the Weizmann database, that reveals a significant weakness in systems based on popular cuboid descriptors. Experiments show that introducing complex backgrounds, stationary or dynamic, into the video causes a significant degradation in recognition performance. Moreover, this degradation cannot be fixed by fine-tuning the system or selecting better interest points. Instead, we show that the problem lies at the descriptor level and must be addressed by modifying the descriptors.
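Since the identified weakness sits at the descriptor level, it helps to recall what a cuboid descriptor is at its simplest: a fixed-size spatio-temporal patch cut around each detected interest point and flattened into a vector. The sketch below is a minimal, hypothetical version; practical cuboid descriptors typically apply gradient or optical-flow transformations and dimensionality reduction before classification, and the patch extents here are illustrative.

    import numpy as np

    def extract_cuboids(video, points, half=(4, 8, 8)):
        # video: (T, H, W) array; points: iterable of (t, y, x) interest
        # points; half = (ht, hy, hx) half-extents of each cuboid.
        ht, hy, hx = half
        T, H, W = video.shape
        descs = []
        for t, y, x in points:
            if ht <= t < T - ht and hy <= y < H - hy and hx <= x < W - hx:
                cub = video[t - ht:t + ht + 1,
                            y - hy:y + hy + 1,
                            x - hx:x + hx + 1]
                descs.append(cub.ravel().astype(np.float32))
        return np.array(descs)   # one flattened descriptor per valid point

Because the cuboid blindly includes whatever falls inside its extent, background clutter enters the descriptor directly, which is the failure mode the experiments above expose.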
- Date Issued
- 2011
- Identifier
- CFE0003882, ucf:48721
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003882