a priori synthetic sampling for increasing classification sensitivity in imbalanced data sets

Date Issued:
2015
Abstract/Description:
Building accurate classifiers for predicting group membership is difficult when data is skewed or imbalanced, which is typical of real-world data sets. As a result, the classifier tends to be biased towards the over-represented group. This is known as the class imbalance problem, and it induces bias in the classifier, particularly when the imbalance is high. Class-imbalanced data usually suffers from intrinsic data properties beyond imbalance alone. The problem is intensified at larger levels of imbalance, which are most commonly found in observational studies. Extreme cases of class imbalance are found in many domains, including fraud detection, mammography of cancer, and post-term births. These rare events are usually the most costly or carry the highest level of risk and are therefore of most interest. To combat class imbalance, the machine learning community has relied upon embedded, data preprocessing, and ensemble learning approaches. Exploratory research has linked several factors that perpetuate the issue of misclassification in class-imbalanced data. However, the relationship between the learner and imbalanced data remains poorly understood across the competing approaches. Current data preprocessing approaches are appealing because they divide the problem space in two, which allows for simpler models. However, most of these approaches have little theoretical basis, although in some cases there is empirical evidence supporting the improvement. The main goal of this research is to introduce newly proposed a priori based re-sampling methods that improve concept learning within class-imbalanced data. The results in this work highlight the robustness of these techniques' performance on publicly available data sets from different domains containing various levels of imbalance. The theoretical and empirical reasons are explored and discussed.
Title: a priori synthetic sampling for increasing classification sensitivity in imbalanced data sets.
Name(s): Rivera, William, Author
Xanthopoulos, Petros, Committee Chair
Wiegand, Rudolf, Committee Member
Karwowski, Waldemar, Committee Member
Kincaid, John, Committee Member
University of Central Florida, Degree Grantor
Type of Resource: text
Date Issued: 2015
Publisher: University of Central Florida
Language(s): English
Identifier: CFE0006169 (IID), ucf:51129 (fedora)
Note(s): 2016-05-01
Ph.D.
Engineering and Computer Science, Dean's Office GRDST
Doctoral
This record was generated from author submitted information.
Subject(s): Safe Level OUPS -- OUPS -- Class Imbalance -- Classification
Persistent Link to This Record: http://purl.flvc.org/ucf/fd/CFE0006169
Restrictions on Access: public 2016-05-15
Host Institution: UCF
