Active Learning with Unreliable Annotations
Title: | Active Learning with Unreliable Annotations | |
---|---|---|
Name(s): | Zhao, Liyue, Author; Sukthankar, Gita, Committee Chair; Tappen, Marshall, Committee Member; Georgiopoulos, Michael, Committee Member; Sukthankar, Rahul, Committee Member; University of Central Florida, Degree Grantor | |
Type of Resource: | text | |
Date Issued: | 2013 | |
Publisher: | University of Central Florida | |
Language(s): | English | |
Abstract/Description: | With the proliferation of social media, gathering data has become cheaper and easier than before. However, this data cannot be used for supervised machine learning without labels. Asking experts to annotate sufficient data for training is both expensive and time-consuming. Current techniques offer two ways to reduce the cost and provide sufficient labels: crowdsourcing and active learning. Crowdsourcing, which outsources tasks to a distributed group of people, can provide a large quantity of labels, but controlling label quality is hard. Active learning, which asks experts to annotate only a subset of the most informative or uncertain data, is very sensitive to annotation errors. Although these two techniques can be used independently of one another, in combination they can complement each other's weaknesses. In this thesis, I investigate the development of active learning Support Vector Machines (SVMs) and extend this model to sequential data. I then discuss the difficulty of combining active learning and crowdsourcing: active learning is very sensitive to the low-quality annotations that are unavoidable in labels collected through crowdsourcing. I propose three possible strategies: incremental relabeling, importance-weighted label prediction, and active Bayesian Networks. The incremental relabeling strategy devotes more annotations to uncertain samples, in contrast to majority voting, which allocates the same number of labels to every sample. Importance-weighted label prediction employs an ensemble of classifiers to guide label requests from a pool of unlabeled training data. An active learning version of Bayesian Networks models the difficulty of samples and the expertise of workers simultaneously, in order to weight workers' labels during the active learning process. All three strategies apply different techniques with the same goal: identifying the best way to apply an active learning model to crowdsourced data of mixed label quality. The active Bayesian Networks model, which is the core element of this thesis, provides the additional benefit of estimating the expertise of workers during the training phase. As an example application, I also demonstrate the utility of crowdsourcing for human activity recognition problems. | |
Identifier: | CFE0004965 (IID), ucf:49579 (fedora) | |
Note(s): | 2013-08-01; Ph.D.; Engineering and Computer Science, Computer Science; Doctoral; This record was generated from author-submitted information. | |
Subject(s): | Active Learning -- Crowdsourcing -- Annotation noise | |
Persistent Link to This Record: | http://purl.flvc.org/ucf/fd/CFE0004965 | |
Restrictions on Access: | public 2013-08-15 | |
Host Institution: | UCF | |
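
The record above is purely bibliographic, but the abstract sketches two concrete mechanisms: margin-based uncertainty sampling with an SVM and incremental relabeling of crowdsourced annotations. As a rough illustration only, the snippet below shows how those two ideas might fit together on synthetic data with a simulated unreliable worker; the scikit-learn classifier, the label-flip probability, and the label budgets are all assumptions for this sketch, not the thesis's actual implementation.

```python
# Illustrative sketch (not from the thesis): uncertainty sampling with an SVM,
# combined with a toy incremental-relabeling step that spends extra crowd
# labels only on the samples the current model is least certain about.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical pool of unlabeled 2-D points with stand-in ground truth.
X_pool = rng.normal(size=(500, 2))
true_labels = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

# Small random seed set, labeled up front.
seed_idx = rng.choice(len(X_pool), size=20, replace=False)
labeled_idx = list(seed_idx)

def noisy_crowd_label(i, flip_prob=0.2):
    """Simulate one unreliable worker: flip the true label with some probability."""
    y = true_labels[i]
    return 1 - y if rng.random() < flip_prob else y

def majority_vote(i, n_workers):
    """Aggregate several noisy labels for sample i by majority vote."""
    votes = [noisy_crowd_label(i) for _ in range(n_workers)]
    return int(np.mean(votes) >= 0.5)

# Fixed-budget baseline: every seed sample gets the same three crowd labels.
labels = {i: majority_vote(i, 3) for i in labeled_idx}

for round_ in range(10):
    clf = SVC(kernel="linear").fit(X_pool[labeled_idx],
                                   [labels[i] for i in labeled_idx])

    # Uncertainty sampling: distance to the decision boundary; smaller = less certain.
    margins = np.abs(clf.decision_function(X_pool))
    unlabeled = [i for i in range(len(X_pool)) if i not in labels]
    query = min(unlabeled, key=lambda i: margins[i])

    # Incremental relabeling in spirit: the uncertain query gets a larger
    # redundant label budget (5) than the fixed-budget baseline (3).
    labels[query] = majority_vote(query, 5)
    labeled_idx.append(query)

print("labeled samples:", len(labeled_idx))
```

The contrast the abstract draws is visible in the loop: plain majority voting spends the same number of labels on every sample, whereas the relabeling step concentrates a larger budget on the points the current SVM is least certain about.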