Weakly Labeled Action Recognition and Detection
- Date Issued:
- 2017
- Abstract/Description:
- Research in human action recognition strives to develop increasingly generalized methods that are robust to intra-class variability and inter-class ambiguity. Recent years have seen tremendous strides in improving recognition accuracy on ever larger and more complex benchmark datasets comprising realistic "in the wild" action videos. Unfortunately, the all-encompassing, dense, global representations that bring about such improvements often benefit from inherent characteristics, specific to datasets and classes, that do not necessarily reflect knowledge about the entity to be recognized. This results in specific models that perform well within datasets but generalize poorly. Furthermore, training supervised action recognition and detection methods requires several precise spatio-temporal manual annotations to achieve good recognition and detection accuracy. For instance, current deep learning architectures require millions of accurately annotated videos to learn robust action classifiers. However, these annotations are quite difficult to obtain.

  In the first part of this dissertation, we explore the reasons for poor classifier performance when tested on novel datasets, and quantify the effect of scene backgrounds on action representations and recognition. We attempt to address the problem of recognizing human actions while training and testing on distinct datasets when test videos are neither labeled nor available during training. In this scenario, learning a joint vocabulary and domain transfer techniques are not applicable. We perform different types of partitioning of the GIST feature space for several datasets and compute measures of background scene complexity, as well as of the extent to which scenes are helpful in action classification. We then propose a new process to obtain a measure of confidence in each pixel of the video being a foreground region, using motion, appearance, and saliency together in a 3D Markov Random Field (MRF) based framework. We also propose multiple ways to exploit the foreground confidence: to improve the bag-of-words vocabulary, the histogram representation of a video, and a novel histogram decomposition based representation and kernel.

  The above-mentioned work provides the probability of each pixel belonging to the actor; however, it does not give the precise spatio-temporal location of the actor. Furthermore, the above framework would require precise spatio-temporal manual annotations to train an action detector. However, manual annotations in videos are laborious, require several annotators, and contain human biases. Therefore, in the second part of this dissertation, we propose a weakly labeled approach to automatically obtain spatio-temporal annotations of actors in action videos. We first obtain a large number of action proposals in each video. To capture the few most representative action proposals in each video and avoid processing thousands of them, we rank them using optical flow and saliency in a 3D-MRF based framework and select a few proposals using a MAP-based proposal subset selection method. We demonstrate that this ranking preserves the high-quality action proposals. Several such proposals are generated for each video of the same action. Our next challenge is to iteratively select one proposal from each video so that all proposals are globally consistent. We formulate this as a Generalized Maximum Clique Graph problem (GMCP) using shape, global, and fine-grained similarity of proposals across the videos. The output of our method is the most action-representative proposals from each video. Our method can also annotate multiple instances of the same action in a video. Moreover, action detection experiments using annotations obtained by our method and several baselines demonstrate the superiority of our approach.

  The above-mentioned annotation method uses multiple videos of the same action. Therefore, in the third part of this dissertation, we tackle the problem of spatio-temporal action localization in a video without assuming the availability of multiple videos or any prior annotations. The action is localized by employing images downloaded from the Internet using the action label. Given web images, we first dampen image noise using a random walk and avoid distracting backgrounds within images using image action proposals. Then, given a video, we generate multiple spatio-temporal action proposals. We suppress camera- and background-generated proposals by exploiting optical flow gradients within proposals. To obtain the most action-representative proposals, we propose to reconstruct action proposals in the video by leveraging the action proposals in images. Moreover, we preserve the temporal smoothness of the video and reconstruct all proposal bounding boxes jointly using constraints that push the coefficients for each bounding box toward a common consensus, thus enforcing coefficient similarity across multiple frames. We solve this optimization problem using a variant of the two-metric projection algorithm. Finally, the video proposal that has the lowest reconstruction cost and is motion salient is used to localize the action. Our method is applicable not only to trimmed videos but also to action localization in untrimmed videos, which is a very challenging problem.

  Finally, in the last part of this dissertation, we propose a novel approach to generate a few properly ranked action proposals from a large number of noisy proposals. The proposed approach begins by dividing each proposal into sub-proposals. We assume that the quality of a proposal remains the same within each sub-proposal. We then employ a graph optimization method to recombine the sub-proposals of all action proposals in a single video in order to optimally build new action proposals and rank them by the combined node and edge scores. For an untrimmed video, we first divide the video into shots and then build the above-mentioned graph within each shot. Our method generates a few ranked proposals that can be better than all the existing underlying proposals. Our experimental results validate that properly ranked action proposals can significantly boost action detection results.

  Our extensive experimental results on different challenging and realistic action datasets, comparisons with several competitive baselines, and detailed analysis of each step of the proposed methods validate the proposed ideas and frameworks.
Title: | Weakly Labeled Action Recognition and Detection. |
---|---|---|
Name(s): |
Sultani, Waqas (Author); Shah, Mubarak (Committee Chair); Bagci, Ulas (Committee Member); Qi, GuoJun (Committee Member); Yun, Hae-Bum (Committee Member); University of Central Florida (Degree Grantor) |
|
Type of Resource: | text | |
Date Issued: | 2017 | |
Publisher: | University of Central Florida | |
Language(s): | English | |
Identifier: | CFE0006801 (IID), ucf:51809 (fedora) | |
Note(s): |
2017-08-01; Ph.D.; Engineering and Computer Science, Computer Science; Doctoral; This record was generated from author-submitted information. |
|
Subject(s): | Weakly Labeled -- Human Action Recognition and Detection in videos | |
Persistent Link to This Record: | http://purl.flvc.org/ucf/fd/CFE0006801 | |
Restrictions on Access: | public 2017-08-15 | |
Host Institution: | UCF |