- Title
- Data Mining Models for Tackling High Dimensional Datasets and Outliers.
- Creator
-
Panagopoulos, Orestis, Xanthopoulos, Petros, Rabelo, Luis, Zheng, Qipeng, Dechev, Damian, University of Central Florida
- Abstract / Description
-
High dimensional data and the presence of outliers in data each pose a serious challenge in supervised learning. Datasets with a significantly larger number of features compared to samples arise in various areas, including business analytics and biomedical applications. Such datasets pose a serious challenge to standard statistical methods and render many existing classification techniques impractical. The generalization ability of many classification algorithms is compromised due to the so-called curse of dimensionality. A new binary classification method called constrained subspace classifier (CSC) is proposed for such high dimensional datasets. CSC improves on an earlier proposed classification method called local subspace classifier (LSC) by accounting for the relative angle between subspaces while approximating the classes with individual subspaces. CSC is formulated as an optimization problem and can be solved by an efficient alternating optimization technique. Classification performance is tested on publicly available datasets. The improvement in classification accuracy over LSC shows the importance of considering the relative angle between the subspaces while approximating the classes. Additionally, CSC appears to be a robust classifier compared to traditional two-step methods that perform feature selection and classification in two distinct steps. Outliers can be present in real world datasets due to noise or measurement errors. The presence of outliers can affect the training phase of machine learning algorithms, leading to over-fitting which results in poor generalization ability. A new regression method called relaxed support vector regression (RSVR) is proposed for such datasets. RSVR is based on the concept of constraint relaxation, which leads to increased robustness in datasets with outliers. RSVR is formulated using both linear and quadratic loss functions. Numerical experiments on benchmark datasets and computational comparisons with other popular regression methods depict the behavior of our proposed method. RSVR achieves better overall performance than support vector regression (SVR) in measures such as RMSE and $R^2_{adj}$ while being on par with other state-of-the-art regression methods such as robust regression (RR). Additionally, RSVR provides robustness for higher dimensional datasets, which is a limitation of RR, the robust equivalent of ordinary least squares regression. Moreover, RSVR can be used on datasets that contain varying levels of noise. Lastly, we present a new novelty detection model called relaxed one-class support vector machines (ROSVMs) that deals with the problem of one-class classification in the presence of outliers.
- Date Issued
- 2016
- Identifier
- CFE0006698, ucf:51920
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006698
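To make the subspace idea in the entry above concrete, here is a minimal, hedged sketch of classification by class-specific subspaces in the spirit of the local subspace classifier (LSC) baseline; it omits the relative-angle constraint that defines the authors' CSC, and all data and dimensions are invented for illustration.

```python
# Minimal sketch of subspace-based classification in the spirit of LSC,
# the baseline that CSC improves on (this is NOT the authors' CSC; the
# relative-angle constraint is omitted). Assumes numeric features.
import numpy as np

def fit_class_subspace(X, k):
    """Return the mean and top-k right singular vectors of a class sample."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]            # k basis vectors spanning the class subspace

def reconstruction_error(x, mu, basis):
    """Distance from x to its projection onto the class subspace."""
    d = x - mu
    return np.linalg.norm(d - basis.T @ (basis @ d))

def predict(x, models):
    """Assign x to the class whose subspace reconstructs it best."""
    return min(models, key=lambda c: reconstruction_error(x, *models[c]))

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(20, 50))      # class 0: 20 samples, 50 features
X1 = rng.normal(1.0, 1.0, size=(20, 50))      # class 1
models = {0: fit_class_subspace(X0, k=5), 1: fit_class_subspace(X1, k=5)}
print(predict(rng.normal(1.0, 1.0, size=50), models))   # expected: 1
```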
- Title
- AN ARCHITECTURE FOR HIGH-PERFORMANCE PRIVACY-PRESERVING AND DISTRIBUTED DATA MINING.
- Creator
-
Secretan, James, Georgiopoulos, Michael, University of Central Florida
- Abstract / Description
-
This dissertation discusses the development of an architecture and associated techniques to support Privacy Preserving and Distributed Data Mining. The field of Distributed Data Mining (DDM) attempts to solve the challenges inherent in coordinating data mining tasks with databases that are geographically distributed, through the application of parallel algorithms and grid computing concepts. The closely related field of Privacy Preserving Data Mining (PPDM) adds the dimension of privacy to the problem, trying to find ways that organizations can collaborate to mine their databases collectively, while at the same time preserving the privacy of their records. Developing data mining algorithms for DDM and PPDM environments can be difficult and there is little software to support it. In addition, because these tasks can be computationally demanding, taking hours or even days to complete data mining tasks, organizations should be able to take advantage of high-performance and parallel computing to accelerate these tasks. Unfortunately, there is no such framework that is able to provide all of these services easily for a developer. In this dissertation such a framework is developed to support the creation and execution of DDM and PPDM applications, called APHID (Architecture for Private, High-performance Integrated Data mining). The architecture allows users to flexibly and seamlessly integrate cluster and grid resources into their DDM and PPDM applications. The architecture is scalable, and is split into highly de-coupled services to ensure flexibility and extensibility. This dissertation first develops a comprehensive example algorithm, a privacy-preserving Probabilistic Neural Network (PNN), which serves as a basis for analysis of the difficulties of DDM/PPDM development. The privacy-preserving PNN is the first such PNN in the literature, and provides not only a practical algorithm ready for use in privacy-preserving applications, but also a template for other data intensive algorithms, and a starting point for analyzing APHID's architectural needs. After analyzing the difficulties in the PNN algorithm's development, as well as the shortcomings of researched systems, this dissertation presents the first concrete programming model joining high performance computing resources with a privacy preserving data mining process. Unlike many of the existing PPDM development models, the platform of services is language independent, allowing layers and algorithms to be implemented in popular languages (Java, C++, Python, etc.). An implementation of a PPDM algorithm is developed in Java utilizing the new framework. Performance results are presented, showing that APHID can enable highly simplified PPDM development while speeding up resource intensive parts of the algorithm.
- Date Issued
- 2009
- Identifier
- CFE0002853, ucf:48076
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002853
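As background for the entry above, the following is a hedged sketch of a plain probabilistic neural network (Parzen-window classifier); the privacy-preserving, distributed protocol developed in the dissertation is not reproduced here, and the data are synthetic.

```python
# A compact, standard probabilistic neural network (Parzen-window classifier).
# This is the plain PNN only; the privacy-preserving, distributed protocol
# described in the entry above is not reproduced.
import numpy as np

def pnn_predict(X_train, y_train, x, sigma=0.5):
    """Classify x by the class whose Gaussian-kernel density at x is largest."""
    scores = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        sq_dist = np.sum((Xc - x) ** 2, axis=1)
        scores[c] = np.mean(np.exp(-sq_dist / (2.0 * sigma ** 2)))
    return max(scores, key=scores.get)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(3, 1, (30, 4))])
y = np.array([0] * 30 + [1] * 30)
print(pnn_predict(X, y, np.full(4, 3.0)))   # expected: 1
```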
- Title
- DETECTING MALICIOUS SOFTWARE BY DYNAMIC EXECUTION.
- Creator
-
Dai, Jianyong, Guha, Ratan, University of Central Florida
- Abstract / Description
-
The traditional way to detect malicious software is based on signature matching. However, signature matching only detects known malicious software. In order to detect unknown malicious software, it is necessary to analyze the software for its impact on the system when the software is executed. In one approach, the software code can be statically analyzed for any malicious patterns. Another approach is to execute the program and determine the nature of the program dynamically. Since the execution of malicious code may have a negative impact on the system, the code must be executed in a controlled environment. For that purpose, we have developed a sandbox to protect the system. Potential malicious behavior is intercepted by hooking Win32 system calls. Using the developed sandbox, we detect unknown viruses using dynamic instruction sequence mining techniques. By collecting runtime instruction sequences in basic blocks, we extract instruction sequence patterns based on instruction associations. We build classification models with these patterns. By applying this classification model, we predict the nature of an unknown program. We compare our approach with several other approaches such as simple heuristics, N-gram, and static instruction sequences. We have also developed a method to identify a family of malicious software utilizing the system call trace. We construct a structural system call diagram from captured dynamic system call traces. We generate a smart system call signature using a profile hidden Markov model (PHMM) based on modularized system call blocks. The smart system call signature weakly identifies a family of malicious software.
- Date Issued
- 2009
- Identifier
- CFE0002798, ucf:48141
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002798
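The entry above builds classifiers from runtime instruction sequences; the sketch below illustrates the general idea of n-gram features over instruction traces feeding a classifier. The traces, labels, and n-gram settings are invented for illustration and are not the dissertation's actual feature construction or PHMM signatures.

```python
# Illustrative sketch of instruction-sequence mining: runtime traces are turned
# into n-gram features and fed to a classifier. Traces and labels below are
# invented for illustration; the dissertation's block-level association mining
# and PHMM signatures are not reproduced.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

traces = [
    "push mov call pop ret",          # benign-looking trace (hypothetical)
    "mov mov xor jmp call ret",
    "xor xor jmp jmp call int",       # malicious-looking trace (hypothetical)
    "int xor jmp int call jmp",
]
labels = [0, 0, 1, 1]                 # 0 = benign, 1 = malicious

vec = CountVectorizer(ngram_range=(2, 3), token_pattern=r"\S+")
X = vec.fit_transform(traces)
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
print(clf.predict(vec.transform(["xor jmp int call jmp ret"])))
```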
- Title
- Learning Collective Behavior in Multi-relational Networks.
- Creator
-
Wang, Xi, Sukthankar, Gita, Tappen, Marshall, Georgiopoulos, Michael, Hu, Haiyan, Anagnostopoulos, Georgios, University of Central Florida
- Abstract / Description
-
With the rapid expansion of the Internet and WWW, the problem of analyzing social media data has received an increasing amount of attention in the past decade. The boom in social media platforms offers many possibilities to study human collective behavior and interactions on an unprecedented scale. In the past, much work has been done on the problem of learning from networked data with homogeneous topologies, where instances are explicitly or implicitly inter-connected by a single type of relationship. In contrast to traditional content-only classification methods, relational learning succeeds in improving classification performance by leveraging the correlation of the labels between linked instances. However, networked data extracted from social media, web pages, and bibliographic databases can contain entities of multiple classes linked for various causal reasons, hence treating all links in a homogeneous way can limit the performance of relational classifiers. Learning the collective behavior and interactions in heterogeneous networks becomes much more complex. The contributions of this dissertation include 1) two classification frameworks for identifying human collective behavior in multi-relational social networks; 2) unsupervised and supervised learning models for relationship prediction in multi-relational collaborative networks. Our methods improve the performance of homogeneous predictive models by differentiating heterogeneous relations and capturing the prominent interaction patterns underlying the network structure. The work has been evaluated in various real-world social networks. We believe that this study will be useful for analyzing human collective behavior and interactions, specifically in the scenario when the heterogeneous relationships in the network arise from various causal reasons.
- Date Issued
- 2014
- Identifier
- CFE0005439, ucf:50376
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005439
- Title
- MODIFICATIONS TO THE FUZZY-ARTMAP ALGORITHM FOR DISTRIBUTED LEARNING IN LARGE DATA SETS.
- Creator
-
Castro, Jose R, Georgiopoulos, Michael, University of Central Florida
- Abstract / Description
-
The Fuzzy-ARTMAP (FAM) algorithm is one of the premier neural network architectures for classification problems. FAM can learn online and is usually faster than other neural network approaches. Nevertheless, the learning time of FAM can slow down considerably when the size of the training set increases into the hundreds of thousands. We apply data partitioning and network partitioning to the FAM algorithm in sequential and parallel settings to achieve better convergence time and to efficiently train with large databases (hundreds of thousands of patterns). Our parallelization is implemented on a Beowulf cluster of workstations. Two data partitioning approaches and two network partitioning approaches are developed. Extensive testing of all the approaches is done on three large datasets (half a million data points). One of them is the Forest Covertype database from Blackard and the other two are artificially generated Gaussian data with different percentages of overlap between classes. Speedups in the data partitioning approach reached the order of the hundreds without having to invest in parallel computation. Speedups on the network partitioning approach are close to linear on a cluster of workstations. Both methods allowed us to reduce the computation time of training the neural network on large databases from days to minutes. We prove formally that the workload balance of our network partitioning approaches will never be worse than an acceptable bound, and also demonstrate the correctness of these parallelization variants of FAM.
- Date Issued
- 2004
- Identifier
- CFE0000065, ucf:46092
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000065
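The data-partitioning idea described above can be illustrated generically: split the training set, fit one model per partition in parallel, and combine predictions by voting. The sketch below uses decision trees and local processes as stand-ins; it is not an implementation of Fuzzy-ARTMAP or of the Beowulf-cluster parallelization.

```python
# Generic data-partitioning sketch: each partition is trained independently
# (here in separate processes standing in for cluster nodes) and the partial
# models vote at prediction time. This illustrates the partitioning idea only;
# it does not implement Fuzzy-ARTMAP itself.
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_partition(args):
    X_part, y_part = args
    return DecisionTreeClassifier(random_state=0).fit(X_part, y_part)

def majority_vote(models, X):
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

if __name__ == "__main__":
    X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
    parts = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
    with ProcessPoolExecutor(max_workers=4) as pool:
        models = list(pool.map(fit_partition, parts))
    print((majority_vote(models, X[:100]) == y[:100]).mean())
```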
- Title
- STABLE OPTICAL FREQUENCY COMB GENERATION AND APPLICATIONS IN ARBITRARY WAVEFORM GENERATION, SIGNAL PROCESSING AND OPTICAL DATA MINING.
- Creator
-
Ozharar, Sarper, Delfyett, Peter, University of Central Florida
- Abstract / Description
-
This thesis focuses on the generation and applications of stable optical frequency combs. Optical frequency combs are defined as equally spaced optical frequencies with a fixed phase relation among themselves. The conventional source of optical frequency combs is the optical spectrum of modelocked lasers. In this work, we investigated alternative methods for optical comb generation, such as dual sine wave phase modulation, which is more practical and cost effective compared to modelocked lasers stabilized to a reference. Incorporating these comb lines, we have generated tunable RF tones using the serrodyne technique. The tuning range was ±1 MHz, limited by the electronic waveform generator, and the RF carrier frequency was limited by the bandwidth of the photodetector. Similarly, using parabolic phase modulation together with time division multiplexing, RF chirp extension has been realized. Another application of the optical frequency combs studied in this thesis is real-time data mining in a bit stream. A novel optoelectronic logic gate has been developed for this application and used to detect an 8 bit long target pattern. Another approach, based on orthogonal Hadamard codes, has also been proposed and explained in detail. Novel intracavity modulation schemes have also been investigated and applied to various applications such as a) improving rational harmonic modelocking for repetition rate multiplication and pulse to pulse amplitude equalization, b) frequency skewed pulse generation for ranging, and c) intracavity active phase modulation in amplitude modulated modelocked lasers for supermode noise spur suppression and integrated jitter reduction. The thesis concludes with comments on future work and next steps to improve some of the results presented in this work.
- Date Issued
- 2008
- Identifier
- CFE0002388, ucf:47744
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002388
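The comb-generation idea above (dual sine wave phase modulation) can be checked numerically: the spectrum of a phase-modulated carrier consists of discrete lines at integer combinations of the two modulation frequencies. The sketch below uses illustrative parameters that are not taken from the thesis.

```python
# Numerical illustration of comb generation by dual sine wave phase modulation:
# the spectrum of exp(j*phi(t)) with phi(t) a sum of two sinusoids consists of
# discrete lines at integer combinations of the two modulation frequencies.
# All parameters are illustrative, not taken from the thesis.
import numpy as np

fs = 200e9                       # sampling rate (Hz)
t = np.arange(0, 1e-6, 1.0 / fs) # 1 microsecond of signal
f1, f2 = 10e9, 1e9               # modulation frequencies (Hz)
m1, m2 = 2.0, 1.0                # modulation indices
field = np.exp(1j * (m1 * np.sin(2 * np.pi * f1 * t) +
                     m2 * np.sin(2 * np.pi * f2 * t)))

spectrum = np.abs(np.fft.fftshift(np.fft.fft(field))) / len(t)
freqs = np.fft.fftshift(np.fft.fftfreq(len(t), d=1.0 / fs))
strong = freqs[spectrum > 0.01]  # comb lines well above the numerical floor
print(np.round(strong / 1e9, 2)) # lines at combinations of 1 and 10 GHz
```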
- Title
- SESSION-BASED INTRUSION DETECTION SYSTEM TO MAP ANOMALOUS NETWORK TRAFFIC.
- Creator
-
Caulkins, Bruce, Wang, Morgan, University of Central Florida
- Abstract / Description
-
Computer crime is a large problem (CSI, 2004; Kabay, 2001a; Kabay, 2001b). Security managers have a variety of tools at their disposal to combat computer crime: firewalls, Intrusion Detection Systems (IDSs), encryption, authentication, and other hardware and software solutions. Many IDS variants exist which allow security managers and engineers to identify attack network packets primarily through the use of signature detection; i.e., the IDS recognizes attack packets due to their well-known "fingerprints" or signatures as those packets cross the network's gateway threshold. On the other hand, anomaly-based ID systems determine what is normal traffic within a network and report abnormal traffic behavior. This paper will describe a methodology towards developing a more robust Intrusion Detection System through the use of data-mining techniques and anomaly detection. These data-mining techniques will dynamically model what a normal network should look like and reduce the false positive and false negative alarm rates in the process. We will use classification-tree techniques to accurately predict probable attack sessions. Overall, our goal is to model network traffic into network sessions and identify those network sessions that have a high probability of being an attack and can be labeled as a "suspect session." Subsequently, we will use these techniques in concert with signature detection methods, drawing on known signatures and patterns, in order to present a better model for detection and protection of networks and systems.
- Date Issued
- 2005
- Identifier
- CFE0000906, ucf:46762
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000906
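A hedged sketch of the classification-tree approach described above, applied to synthetic session-level features; the feature names, thresholds, and labeling rule are invented for illustration and are not the dissertation's actual session attributes.

```python
# Hedged sketch of classification-tree modeling over session-level features.
# The feature names and synthetic data are invented for illustration; they are
# not the dissertation's actual session attributes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.exponential(30, n),        # session duration (s)
    rng.poisson(20, n),            # packets per session
    rng.integers(1, 10, n),        # distinct destination ports
])
# Label a session "suspect" when it touches many ports with few packets.
y = ((X[:, 2] > 6) & (X[:, 1] < 15)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["duration", "packets", "ports"]))
```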
- Title
- STUDENT PERCEPTION OF GENERAL EDUCATION PROGRAM COURSES.
- Creator
-
Pepe, Julie, Witta, Lea, University of Central Florida
- Abstract / Description
-
The purposes of this study were to: (a) determine, for General Education Program (GEP) courses, what individual items on the student form are predictive of the overall instructor rating value; (b) investigate the relationship of instructional mode, class size, GEP foundational area, and GEP theme with the overall instructor rating value; (c) examine what teacher/course qualities are related to a high (Excellent) overall evaluation or a low (Poor) overall evaluation value. The data set used for analysis contained sixteen student response scores (Q1-Q16), response number, class size, term, foundational area (communication, cultural/historical, mathematics, social, or science), GEP theme (yes/no), instructional mode (face-to-face or other), and percent responding (calculated value). All identifying information such as department, course, section, and instructor was removed from the analysis file. The final data set contained 23 variables, 8,065 course sections, and 294,692 student responses. All individual items on the student evaluation form were related to the overall evaluation item score, measured using Spearman's correlation coefficients. None of the examined course variables were selected as significant when the individual form items were included in the modeling process. This indicated students employed a consistent approach to the evaluation process regardless of large or small classes, face-to-face or other instructional modes, foundational area, or percent responding differences. Data mining modeling techniques were used to understand the relationship of individual item responses and additional course information variables to the overall score. Items one to fifteen (Q1 to Q15), class size, instructional mode, foundational area, and GEP theme were the independent variables used to find splits to create homogeneous groups in relation to the overall evaluation score. The model results are presented in terms of if-then rules for "Excellent" or "Poor" overall evaluation scores. The top three rules for "Excellent" or "Poor" based their classifications on some combination of the following items: communication of ideas and information; facilitation of learning; respect and concern for students; instructor's overall organization of the course; instructor's interest in your learning; instructor's assessment of your progress in the course; and stimulation of interest in the course. The proportion of student responses conforming to the top three rules for "Excellent" or "Poor" overall evaluation ranged from 0.89 to 0.60. These findings suggest that students reward, with higher evaluation scores, instructors whom they perceive as organized and who strive to clearly communicate course content. These characteristics can be improved through mentoring or professional development workshops for instructors. Additionally, instructors of GEP courses need to be informed that students connect respect and concern, and an interest in student learning, with the overall score they give the instructor.
- Date Issued
- 2010
- Identifier
- CFE0003289, ucf:48519
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0003289
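A minimal sketch of the Spearman rank-correlation check mentioned in the entry above, relating one form item to the overall instructor rating; the scores are invented, not survey data.

```python
# Minimal sketch of the Spearman rank-correlation check between one form item
# and the overall instructor rating (values below are invented, not survey data).
from scipy.stats import spearmanr

q1_scores      = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]   # e.g., "communication of ideas"
overall_scores = [5, 4, 5, 3, 5, 2, 4, 4, 3, 4]   # overall instructor rating
rho, p_value = spearmanr(q1_scores, overall_scores)
print(rho, p_value)
```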
- Title
- Student Community Detection and Recommendation of Customized Paths to Reinforce Academic Success.
- Creator
-
Shao, Yuan, Jha, Sumit Kumar, Zhang, Wei, Zhang, Shaojie, University of Central Florida
- Abstract / Description
-
Educational Data Mining (EDM) is a research area that analyzes educational data and extracts interesting and unique information to address education issues. EDM implements computational methods to explore data for the purpose of studying questions related to educational achievements. A common task in an educational environment is the grouping of students and the identification of communities that have common features. Then, these communities of students may be studied by a course developer to build a personalized learning system, promote effective group learning, provide adaptive content, etc. The objective of this thesis is to find an approach to detect student communities and analyze students who do well academically with particular sequences of classes in each community. Then, we compute one or more sequences of courses that a student in a community may pursue to improve their chances of achieving good academic performance.
- Date Issued
- 2019
- Identifier
- CFE0007529, ucf:52623
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007529
- Title
- ESTIMATION OF HYBRID MODELS FOR REAL-TIME CRASH RISK ASSESSMENT ON FREEWAYS.
- Creator
-
Pande, Anurag, Abdel-Aty, Mohamed, University of Central Florida
- Abstract / Description
-
Relevance of reactive traffic management strategies such as freeway incident detection has been diminishing with advancements in mobile phone usage and video surveillance technology. On the other hand, the capacity to collect, store, and analyze traffic data from underground loop detectors has witnessed enormous growth in the recent past. These two facts together provide us with the motivation as well as the means to shift the focus of freeway traffic management toward proactive strategies that would involve anticipating incidents such as crashes. The primary element of a proactive traffic management strategy would be model(s) that can separate 'crash prone' conditions from 'normal' traffic conditions in real-time. The aim in this research is to establish relationship(s) between historical crashes of specific types and corresponding loop detector data, which may be used as the basis for classifying real-time traffic conditions into 'normal' or 'crash prone' in the future. In this regard, traffic data in this study were also collected for cases which did not lead to crashes (non-crash cases) so that the problem may be set up as a binary classification. A thorough review of the literature suggested that existing real-time crash 'prediction' models (classification or otherwise) are generic in nature, i.e., a single model has been used to identify all crashes (such as rear-end, sideswipe, or angle), even though traffic conditions preceding crashes are known to differ by type of crash. Moreover, a generic model would yield no information about the collision most likely to occur. To be able to analyze different groups of crashes independently, a large database of crashes reported during the 5-year period from 1999 through 2003 on the Interstate-4 corridor in Orlando was collected. The 36.25-mile instrumented corridor is equipped with 69 dual loop detector stations in each direction (eastbound and westbound) located approximately every ½ mile. These stations report speed, volume, and occupancy data every 30 seconds from the three through lanes of the corridor. Geometric design parameters for the freeway were also collected and collated with historical crash and corresponding loop detector data. The first group of crashes to be analyzed were the rear-end crashes, which account for about 51% of the total crashes. Based on preliminary explorations of average traffic speeds, rear-end crashes were grouped into two mutually exclusive groups: first, those occurring under extended congestion (referred to as regime 1 traffic conditions) and, second, those which occurred with relatively free-flow conditions (referred to as regime 2 traffic conditions) prevailing 5-10 minutes before the crash. Simple rules to separate these two groups of rear-end crashes were formulated based on the classification tree methodology. It was found that the first group of rear-end crashes can be attributed to parameters measurable through loop detectors, such as the coefficient of variation in speed and average occupancy at stations in the vicinity of the crash location. For the second group of rear-end crashes (referred to as regime 2), traffic parameters such as average speed and occupancy at stations downstream of the crash location were significant, along with off-line factors such as the time of day and the presence of an on-ramp in the downstream direction. It was found that regime 1 traffic conditions make up only about 6% of the traffic conditions on the freeway.
Almost half of the rear-end crashes occurred under regime 1 conditions even with such little exposure. This observation led to the conclusion that freeway locations operating under regime 1 traffic may be flagged for (rear-end) crashes without any further investigation. MLP (multilayer perceptron) and NRBF (normalized radial basis function) neural network architectures were explored to identify regime 2 rear-end crashes. The performance of individual neural network models was improved by hybridizing their outputs. Individual and hybrid PNN (probabilistic neural network) models were also explored, along with matched case-control logistic regression. The stepwise selection procedure yielded the matched logistic regression model indicating the difference between average speeds upstream and downstream as significant. Even though the model provided good interpretation, its classification accuracy over the validation dataset was far inferior to that of the hybrid MLP/NRBF and PNN models. Hybrid neural network models, along with the classification tree model (developed to identify the traffic regimes), were able to identify about 60% of the regime 2 rear-end crashes in addition to all regime 1 rear-end crashes with a reasonable number of positive decisions (warnings). This translates into identification of more than ¾ (77%) of all rear-end crashes. Classification models were then developed for the next most frequent type, i.e., lane-change related crashes. Based on preliminary analysis, it was concluded that location-specific characteristics, such as the presence of ramps, mile-post location, etc., were not significantly associated with these crashes. The average difference between occupancies of adjacent lanes and average speeds upstream and downstream of the crash location were found to be significant. The significant variables were then used as inputs to MLP- and NRBF-based classifiers. The best models in each category were hybridized by averaging their respective outputs. The hybrid model significantly improved on the crash identification achieved through individual models, and 57% of the crashes in the validation dataset could be identified with 30% warnings. Although the hybrid models in this research were developed with corresponding data for rear-end and lane-change related crashes only, it was observed that about 60% of the historical single-vehicle crashes (other than rollovers) could also be identified using these models. The majority of the identified single-vehicle crashes, according to the crash reports, were caused by evasive actions by the drivers in order to avoid another vehicle in front or in the other lane. Vehicle rollover crashes were found to be associated with speeding and the curvature of the freeway section; the established relationship, however, was not sufficient to identify the occurrence of these crashes in real-time. Based on the results from the modeling procedure, a framework for parallel real-time application of these two sets of models (rear-end and lane-change) in the form of a system was proposed. To identify rear-end crashes, the data are first subjected to classification tree based rules to identify traffic regimes. If traffic patterns belong to regime 1, a rear-end crash warning is issued for the location. If the patterns are identified to be regime 2, then they are subjected to the hybrid MLP/NRBF model employing traffic data from five surrounding traffic stations.
If the model identifies the patterns as crash prone, then the location may be flagged for a rear-end crash; otherwise, a final check for a regime 2 rear-end crash is applied to the data through the hybrid PNN model. If data from five stations are not available due to intermittent loop failures, the system is provided with the flexibility to switch to models with more tolerant data requirements (i.e., models using traffic data from only one station or three stations). To assess the risk of a lane-change related crash, if all three lanes at the immediate upstream station are functioning, the hybrid of the two best individual neural network models (NRBF with three hidden neurons and MLP with four hidden neurons) is applied to the input data. A warning for a lane-change related crash may be issued based on its output. The proposed strategy is demonstrated over a complete day of loop data in a virtual real-time application. It was shown that the system of models may be used to continuously assess and update the risk for rear-end and lane-change related crashes. The system developed in this research should be perceived as the primary component of a proactive traffic management strategy. The output of the system, along with the knowledge of variables critically associated with specific types of crashes identified in this research, can be used to formulate ways of avoiding impending crashes. However, specific crash prevention strategies, e.g., variable speed limits and warnings to commuters, demand separate attention and should be addressed through thorough future research.
- Date Issued
- 2005
- Identifier
- CFE0000842, ucf:46659
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000842
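The hybridization step described in the entry above (averaging the outputs of two classifiers) can be sketched generically as below; an RBF-kernel SVM stands in for the NRBF network, and the data are synthetic rather than the I-4 loop detector features.

```python
# Generic sketch of "hybridizing" two classifiers by averaging their predicted
# crash probabilities, in the spirit of the hybrid MLP/NRBF models described
# above (an RBF-kernel SVM stands in for the NRBF network; data are synthetic).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=0).fit(X_tr, y_tr)
rbf = SVC(kernel="rbf", probability=True, random_state=0).fit(X_tr, y_tr)

p_hybrid = (mlp.predict_proba(X_te)[:, 1] + rbf.predict_proba(X_te)[:, 1]) / 2
warnings = p_hybrid > 0.5          # flag "crash-prone" patterns
print((warnings == y_te.astype(bool)).mean())
```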
- Title
- Predicting Students' Academic Performance with Decision Tree and Neural Network.
- Creator
-
Feng, Junshuai, Jha, Sumit Kumar, Zhang, Wei, Zhang, Shaojie, University of Central Florida
- Abstract / Description
-
Educational Data Mining (EDM) is a developing research field that involves many techniques to explore data relating to educational background. EDM can analyze and resolve educational data with computational methods to address educational questions. Similar to EDM, neural networks have been utilized in widespread and successful data mining applications. In this paper, synthetic datasets are employed since this paper aims to explore methodologies such as decision tree classifiers and neural networks to predict student performance in the context of EDM. Firstly, it introduces EDM and some related work that has been accomplished previously in this field, along with their datasets and computational results. Then, it demonstrates how the synthetic student dataset is generated, analyzes some input attributes from the dataset such as gender and high school GPA, and delivers some visualization results to determine which classification approaches are the most efficient. After testing the data with decision tree classifier and neural network methodologies, it concludes with the effectiveness of both approaches in terms of model evaluation performance, as well as a discussion of some of the most promising future work of this research.
- Date Issued
- 2019
- Identifier
- CFE0007455, ucf:52680
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007455
- Title
- Integrated Data Fusion and Mining (IDFM) Technique for Monitoring Water Quality in Large and Small Lakes.
- Creator
-
Vannah, Benjamin, Chang, Ni-bin, Wanielista, Martin, Wang, Dingbao, University of Central Florida
- Abstract / Description
-
Monitoring water quality on a near-real-time basis to address water resources management and public health concerns in coupled natural systems and the built environment is by no means an easy task. Furthermore, this emerging societal challenge will continue to grow, due to the ever-increasing anthropogenic impacts upon surface waters. For example, urban growth and agricultural operations have led to an influx of nutrients into surface waters stimulating harmful algal bloom formation, and stormwater runoff from urban areas contributes to the accumulation of total organic carbon (TOC) in surface waters. TOC in surface waters is a known precursor of disinfection byproducts in drinking water treatment, and microcystin is a potent hepatotoxin produced by the bacteria Microcystis, which can form expansive algal blooms in eutrophied lakes. Due to the ecological impacts and human health hazards posed by TOC and microcystin, it is imperative that municipal decision makers and water treatment plant operators are equipped with a rapid and economical means to track and measure these substances. Remote sensing is an emergent solution for monitoring and measuring changes to the earth's environment. This technology allows large regions anywhere on the globe to be observed on a frequent basis. This study demonstrates the prototype of a near-real-time early warning system using Integrated Data Fusion and Mining (IDFM) techniques with the aid of both multispectral (Landsat and MODIS) and hyperspectral (MERIS) satellite sensors to determine spatiotemporal distributions of TOC and microcystin. Landsat satellite imagery has high spatial resolution, but its application suffers from a long overpass interval of 16 days. On the other hand, free coarse resolution sensors with daily revisit times, such as MODIS, are incapable of providing detailed water quality information because of low spatial resolution. This issue can be resolved by using data or sensor fusion techniques, an instrumental part of IDFM, in which the high spatial resolution of Landsat and the high temporal resolution of MODIS imagery are fused and analyzed by a suite of regression models to optimally produce synthetic images with both high spatial and temporal resolutions. The same techniques are applied to the hyperspectral sensor MERIS with the aid of the MODIS ocean color bands to generate fused images with enhanced spatial, temporal, and spectral properties. The performance of the data mining models derived using fused hyperspectral and fused multispectral data is quantified using four statistical indices. The second task compared traditional two-band models against more powerful data mining models for TOC and microcystin prediction. The use of IDFM is illustrated for monitoring microcystin concentrations in Lake Erie (large lake), and it is applied for TOC monitoring in Harsha Lake (small lake). Analysis confirmed that data mining methods surpassed two-band models at accurately estimating TOC and microcystin concentrations in lakes, and the more detailed spectral reflectance data offered by hyperspectral sensors produced a noticeable increase in accuracy for the retrieval of water quality parameters.
- Date Issued
- 2013
- Identifier
- CFE0005066, ucf:49979
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005066
- Title
- HIGH PERFORMANCE DATA MINING TECHNIQUES FOR INTRUSION DETECTION.
- Creator
-
Siddiqui, Muazzam Ahmed, Lee, Joohan, University of Central Florida
- Abstract / Description
-
The rapid growth of computers transformed the way in which information and data were stored. With this new paradigm of data access comes the threat of this information being exposed to unauthorized and unintended users. Many systems have been developed which scrutinize the data for a deviation from the normal behavior of a user or system, or search for a known signature within the data. These systems are termed Intrusion Detection Systems (IDS). These systems employ different techniques varying from statistical methods to machine learning algorithms. Intrusion detection systems use audit data generated by operating systems, application software, or network devices. These sources produce huge datasets with tens of millions of records in them. To analyze this data, data mining is used, which is a process to dig useful patterns out of a large bulk of information. A major obstacle in the process is that the traditional data mining and learning algorithms are overwhelmed by the bulk volume and complexity of available data. This makes these algorithms impractical for time-critical tasks like intrusion detection because of the large execution time. Our approach to this issue makes use of high performance data mining techniques to expedite the process by exploiting the parallelism in the existing data mining algorithms and the underlying hardware. We will show how high performance and parallel computing can be used to scale the data mining algorithms to handle large datasets, allowing the data mining component to search a much larger set of patterns and models than traditional computational platforms and algorithms would allow. We develop parallel data mining algorithms by parallelizing existing machine learning techniques using cluster computing. These algorithms include parallel backpropagation and parallel fuzzy ARTMAP neural networks. We evaluate the performance of the developed models in terms of speedup over traditional algorithms, prediction rate, and false alarm rate. Our results showed that the traditional backpropagation and fuzzy ARTMAP algorithms can benefit from high performance computing techniques, which makes them well suited for time-critical tasks like intrusion detection.
- Date Issued
- 2004
- Identifier
- CFE0000056, ucf:46142
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000056
- Title
- ANALYZING THE COMMUNITY STRUCTURE OF WEB-LIKE NETWORKS: MODELS AND ALGORITHMS.
- Creator
-
Cami, Aurel, Deo, Narsingh, University of Central Florida
- Abstract / Description
-
This dissertation investigates the community structure of web-like networks (i.e., large, random, real-life networks such as the World Wide Web and the Internet). Recently, it has been shown that many such networks have a locally dense and globally sparse structure with certain small, dense subgraphs occurring much more frequently than they do in the classical Erdös-Rényi random graphs. This peculiarity, which is commonly referred to as community structure, has been observed in seemingly unrelated networks such as the Web, email networks, citation networks, biological networks, etc. The pervasiveness of this phenomenon has led many researchers to believe that such cohesive groups of nodes might represent meaningful entities. For example, in the Web such tightly-knit groups of nodes might represent pages with a common topic, geographical location, etc., while in neural networks they might represent evolved computational units. The notion of community has emerged in an effort to formalize the empirical observation of the locally dense, globally sparse structure of web-like networks. In the broadest sense, a community in a web-like network is defined as a group of nodes that induces a dense subgraph which is sparsely linked with the rest of the network. Due to a wide array of envisioned applications, ranging from crawlers and search engines to network security and network compression, there has recently been widespread interest in finding efficient community-mining algorithms. In this dissertation, the community structure of web-like networks is investigated by a combination of analytical and computational techniques. First, we consider the problem of modeling web-like networks. In recent years, many new random graph models have been proposed to account for some recently discovered properties of web-like networks that distinguish them from the classical random graphs. The vast majority of these random graph models take into account only the addition of new nodes and edges. Yet, several empirical observations indicate that deletion of nodes and edges occurs frequently in web-like networks. Inspired by such observations, we propose and analyze two dynamic random graph models that combine node and edge addition with a uniform and a preferential deletion of nodes, respectively. In both cases, we find that the random graphs generated by such models follow power-law degree distributions (in agreement with the degree distribution of many web-like networks). Second, we analyze the expected density of certain small subgraphs, such as defensive alliances on three and four nodes, in various random graph models. Our findings show that while in the binomial random graph the expected density of such subgraphs is very close to zero, in some dynamic random graph models it is much larger. These findings converge with our results obtained by computing the number of communities in some Web crawls. Next, we investigate the computational complexity of the community-mining problem under various definitions of community. Assuming the definition of a community as a global defensive alliance or a global offensive alliance, we prove, using transformations from the dominating set problem, that finding optimal communities is an NP-complete problem. These and other similar complexity results, coupled with the fact that many web-like networks are huge, indicate that it is unlikely that fast, exact sequential algorithms for mining communities may be found.
To handle this difficulty, we adopt an algorithmic definition of community and a simpler version of the community-mining problem, namely: find the largest community to which a given set of seed nodes belongs. We propose several greedy algorithms for this problem. The first proposed algorithm starts out with a set of seed nodes (the initial community) and then repeatedly selects nodes from the community's neighborhood and pulls them into the community. In each step, the algorithm uses the clustering coefficient, a parameter that measures the fraction of the neighbors of a node that are themselves neighbors, to decide which nodes from the neighborhood should be pulled into the community. This algorithm has a time complexity governed by the number of nodes visited by the algorithm and the maximum degree encountered. Thus, assuming a power-law degree distribution, this algorithm is expected to run in near-linear time. The proposed algorithm achieved good accuracy when tested on some real and computer-generated networks: the fraction of community nodes classified correctly is generally above 80% and often above 90%. A second algorithm based on a generalized clustering coefficient, where not only the first neighborhood is taken into account but also the second, the third, etc., is also proposed. This algorithm achieves better accuracy than the first one but also runs more slowly. Finally, a randomized version of the second algorithm, which improves the time complexity without significantly affecting the accuracy, is proposed. The main target application of the proposed algorithms is focused crawling: the selective search for web pages that are relevant to a pre-defined topic.
- Date Issued
- 2005
- Identifier
- CFE0000900, ucf:46726
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0000900
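The sketch below is one plausible reading of the first greedy algorithm described in the entry above: grow a seed community by repeatedly pulling in the neighborhood node with the highest clustering coefficient. The stopping rule and example graph are illustrative assumptions, not the dissertation's exact procedure.

```python
# One plausible reading of the first greedy algorithm described above: grow a
# seed community by repeatedly pulling in the neighborhood node with the highest
# clustering coefficient. Stopping rule and graph are illustrative only.
import networkx as nx

def grow_community(G, seeds, max_size=6):
    community = set(seeds)
    cc = nx.clustering(G)                     # clustering coefficient per node
    while len(community) < max_size:
        frontier = {v for u in community for v in G[u]} - community
        if not frontier:
            break
        best = max(frontier, key=cc.get)      # pull in the "most clustered" node
        if cc[best] == 0.0:
            break
        community.add(best)
    return community

G = nx.connected_caveman_graph(4, 6)          # 4 dense cliques, weakly linked
print(sorted(grow_community(G, seeds=[0])))   # grows toward node 0's clique
```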
- Title
- Investigating and Facilitating the Transferability of Safety Performance Functions.
- Creator
-
Farid, Ahmed Tarek Ahmed, Abdel-Aty, Mohamed, Lee, JaeYoung, Eluru, Naveen, University of Central Florida
- Abstract / Description
-
Safety performance functions (SPFs) are essential analytical tools in the road safety field. SPFs are statistical regression models used to predict crash counts by roadway facility type, crash type, and severity. The national Highway Safety Manual (HSM) is a generic guidebook used for road safety evaluation and enhancement. In it, default SPFs, developed using negative binomial (NB) regression, are provided for multiple facility types and crash categories. Roadway agencies, whether public or private, may opt not to invest their resources in data collection and processing to develop their own localized SPFs. Instead, the agencies may adopt the HSM's. However, the HSM's SPFs may not necessarily be applicable to all conditions. Hence, this research is focused on SPF transferability, specifically for rural divided multilane highway segments. The use of Bayesian informative priors to aid in the transferability of NB SPFs, developed for Florida, to California's conditions and vice versa is investigated. It is demonstrated that informative priors facilitate SPF transferability. Furthermore, NB SPFs are developed for Florida, Ohio, Illinois, Minnesota, California, Washington, and North Carolina, in order to evaluate the transferability of each state's SPFs to the other states' conditions. The results indicate that Ohio, Illinois, Minnesota, and California have SPFs that are transferable to the conditions of each of the four states. Also, two methods are proposed for calibrating transferred SPFs to the destinations' conditions and are shown to outperform the SPF calibration methods in the road safety literature. Finally, a variety of modeling frameworks are proposed for developing and transferring SPFs of the seven aforementioned states to each state's data. No single model exhibits the best fit when transferred in all cases. However, the Tobit model, the NB model, and a hybrid model that coalesces the results of both perform the best in a substantial number of the transferred SPFs.
- Date Issued
- 2018
- Identifier
- CFE0007000, ucf:52054
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007000
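A hedged sketch of fitting a negative binomial SPF of the common functional form used in the road safety literature, on synthetic segment data; the covariates, coefficients, and fixed dispersion parameter are illustrative assumptions, not values from this research.

```python
# Hedged sketch of a negative binomial SPF of the common form
#   E[crashes] = exp(b0 + b1*ln(AADT) + b2*ln(segment length)),
# fitted with statsmodels on synthetic data (the dispersion parameter alpha is
# fixed here for simplicity; real SPF work estimates it from the data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
aadt = rng.uniform(5_000, 60_000, n)          # annual average daily traffic
length = rng.uniform(0.2, 3.0, n)             # segment length (miles)
mu = np.exp(-6.0 + 0.8 * np.log(aadt) + 1.0 * np.log(length))
crashes = rng.poisson(mu)                     # synthetic crash counts

X = sm.add_constant(np.column_stack([np.log(aadt), np.log(length)]))
spf = sm.GLM(crashes, X, family=sm.families.NegativeBinomial(alpha=0.5)).fit()
print(spf.params)                             # b0, b1, b2 estimates
```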
- Title
- SCALABLE AND EFFICIENT OUTLIER DETECTION IN LARGE DISTRIBUTED DATA SETS WITH MIXED-TYPE ATTRIBUTES.
- Creator
-
Koufakou, Anna, Georgiopoulos, Michael, University of Central Florida
- Abstract / Description
-
An important problem that appears often when analyzing data involves identifying irregular or abnormal data points called outliers. This problem broadly arises under two scenarios: when outliers are to be removed from the data before analysis, and when useful information or knowledge can be extracted from the outliers themselves. Outlier detection in the context of the second scenario is a research field that has attracted significant attention in a broad range of useful applications. For example, in credit card transaction data, outliers might indicate potential fraud; in network traffic data, outliers might represent potential intrusion attempts. The basis of deciding if a data point is an outlier is often some measure or notion of dissimilarity between the data point under consideration and the rest. Traditional outlier detection methods assume numerical or ordinal data, and compute pair-wise distances between data points. However, the notion of distance or similarity for categorical data is more difficult to define. Moreover, the size of currently available data sets dictates the need for fast and scalable outlier detection methods, thus precluding distance computations. Additionally, these methods must be applicable to data which might be distributed among different locations. In this work, we propose novel strategies to efficiently deal with large distributed data containing mixed-type attributes. Specifically, we first propose a fast and scalable algorithm for categorical data (AVF), and its parallel version based on MapReduce (MR-AVF). We extend AVF and introduce a fast outlier detection algorithm for large distributed data with mixed-type attributes (ODMAD). Finally, we modify ODMAD in order to deal with very high-dimensional categorical data. Experiments with large real-world and synthetic data show that the proposed methods exhibit large performance gains and high scalability compared to the state-of-the-art, while achieving similar detection accuracy.
- Date Issued
- 2009
- Identifier
- CFE0002734, ucf:48161
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002734
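The AVF algorithm named above admits a very small sketch: score each record by the mean frequency of its attribute values and flag the lowest-scoring records as outliers. The toy records below are invented.

```python
# Minimal sketch of the Attribute Value Frequency (AVF) score for categorical
# data: a record's score is the mean frequency of its attribute values, and the
# lowest-scoring records are flagged as outliers. Data below are invented.
from collections import Counter

records = [
    ("red",  "small", "round"),
    ("red",  "small", "round"),
    ("red",  "large", "round"),
    ("blue", "small", "square"),   # rare values -> low AVF score
]

counts = [Counter(col) for col in zip(*records)]   # value frequencies per attribute
def avf(rec):
    return sum(counts[i][v] for i, v in enumerate(rec)) / len(rec)

scores = sorted(((avf(r), r) for r in records))
print(scores[0])   # the most outlying record (lowest AVF score)
```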
- Title
- An Unsupervised Consensus Control Chart Pattern Recognition Framework.
- Creator
-
Haghtalab, Siavash, Xanthopoulos, Petros, Pazour, Jennifer, Rabelo, Luis, University of Central Florida
- Abstract / Description
-
Early identification and detection of abnormal time series patterns is vital for a number of manufacturing applications. Slight shifts and alterations of time series patterns might be indicative of some anomaly in the production process, such as machinery malfunction. Usually, due to the continuous flow of data, monitoring of manufacturing processes requires automated Control Chart Pattern Recognition (CCPR) algorithms. The majority of the CCPR literature consists of supervised classification algorithms. Fewer studies consider unsupervised versions of the problem. Despite the profound advantage of unsupervised methodologies in requiring less manual data labeling, their use is limited due to the fact that their performance is not robust enough for practical purposes. In this study we propose the use of a consensus clustering framework. Computational results show robust behavior compared to individual clustering algorithms.
- Date Issued
- 2014
- Identifier
- CFE0005178, ucf:50670
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005178
- Title
- Data-Driven Simulation Modeling of Construction and Infrastructure Operations Using Process Knowledge Discovery.
- Creator
-
Akhavian, Reza, Behzadan, Amir, Oloufa, Amr, Yun, Hae-Bum, Sukthankar, Gita, Zheng, Qipeng, University of Central Florida
- Abstract / Description
-
Within the architecture, engineering, and construction (AEC) domain, simulation modeling is mainly used to facilitate decision-making by enabling the assessment of different operational plans and resource arrangements that are otherwise difficult (if not impossible), expensive, or time consuming to evaluate in real-world settings. The accuracy of such models directly affects their reliability as a basis for important decisions such as project completion time estimation and resource allocation. Compared to other industries, this is particularly important in construction and infrastructure projects due to the high resource costs and the societal impacts of these projects. Discrete event simulation (DES) is a decision-making tool that can benefit the design, control, and management of construction operations. Despite recent advancements, most DES models used in construction are created during the early planning and design stage, when the lack of factual information from the project prohibits the use of realistic data in simulation modeling. The resulting models, therefore, are often built using rigid (subjective) assumptions and design parameters (e.g. precedence logic, activity durations). In all such cases, and in the absence of an inclusive methodology to incorporate real field data as the project evolves, modelers rely on information from previous projects (a.k.a. secondary data), expert judgments, and subjective assumptions to generate simulations that predict future performance. These and similar shortcomings have to a large extent limited the use of traditional DES tools to preliminary studies and long-term planning of construction projects. In the realm of business process management, process mining, a relatively new research domain, seeks to automatically discover a process model by observing activity records and extracting information about processes. The research presented in this Ph.D. dissertation was in part inspired by the prospect of construction process mining using sensory data collected from field agents. This enabled the extraction of the operational knowledge necessary to generate and maintain the fidelity of simulation models. A preliminary study was conducted to demonstrate the feasibility and applicability of data-driven, knowledge-based simulation modeling, with a focus on data collection using a wireless sensor network (WSN) and a rule-based taxonomy of activities. The resulting knowledge-based simulation models performed very well in predicting key performance measures of real construction systems. Next, a pervasive mobile data collection and mining technique was adopted and an activity recognition framework for construction equipment and worker tasks was developed. Data was collected from construction entities using smartphone accelerometers and gyroscopes to generate significant statistical time- and frequency-domain features. The extracted features served as the input to different types of machine learning algorithms applied to various construction activities. The trained predictive algorithms were then used to extract activity durations and calculate probability distributions to be fused into the corresponding DES models. Results indicated that the generated data-driven, knowledge-based simulation models outperform static models created from engineering assumptions and estimations with regard to the compatibility of performance measure outputs with reality.
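A simplified sketch of the feature-extraction and classification step described above: windows of accelerometer signal are reduced to time- and frequency-domain features and fed to a classifier. The window length, feature set, activity names, and synthetic signals are assumptions for illustration, not the thesis's actual configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def window_features(signal, window=128, step=64):
    """Segment a 1-D accelerometer signal into overlapping windows and
    compute simple time- and frequency-domain features per window."""
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        w = signal[start:start + window]
        spectrum = np.abs(np.fft.rfft(w))
        feats.append([
            w.mean(), w.std(), np.sqrt(np.mean(w ** 2)),  # time-domain features
            w.max() - w.min(),
            spectrum.argmax(),                            # dominant frequency bin
            spectrum.mean(),
        ])
    return np.array(feats)

# Toy usage: two synthetic "activities" with different vibration profiles.
rng = np.random.default_rng(1)
idle = rng.normal(0, 0.2, 20_000)                                     # low-amplitude noise
loading = np.sin(np.linspace(0, 400 * np.pi, 20_000)) + rng.normal(0, 0.2, 20_000)

X = np.vstack([window_features(idle), window_features(loading)])
y = np.array([0] * (len(X) // 2) + [1] * (len(X) - len(X) // 2))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
# Predicted activity segments can then be timed, and the resulting duration
# samples fitted to probability distributions that feed the DES model.
```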
- Date Issued
- 2015
- Identifier
- CFE0006023, ucf:51014
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0006023
- Title
- Applying Machine Learning Techniques to Analyze the Pedestrian and Bicycle Crashes at the Macroscopic Level.
- Creator
-
Rahman, Md Sharikur, Abdel-Aty, Mohamed, Eluru, Naveen, Hasan, Samiul, Yan, Xin, University of Central Florida
- Abstract / Description
-
This thesis presents different data mining/machine learning techniques to analyze vulnerable road user (i.e., pedestrian and bicycle) crashes by developing crash prediction models at the macro level. In this study, we developed a data mining approach (i.e., decision tree regression (DTR) models) for both pedestrian and bicycle crash counts. To the authors' knowledge, this is the first application of DTR models at the macro level in the growing traffic safety literature. The empirical analysis is based on Statewide Traffic Analysis Zone (STAZ) level crash count data for both pedestrians and bicycles from the state of Florida for the years 2010 to 2012. The model results highlight the most significant predictor variables for pedestrian and bicycle crash counts in terms of three broad categories: traffic, roadway, and sociodemographic characteristics. Furthermore, spatial predictor variables from neighboring STAZs were utilized along with the variables of the target STAZ in order to improve the prediction accuracy of both DTR models. The DTR model considering spatial predictor variables (spatial DTR model) was compared with the model without spatial predictor variables (aspatial DTR model), and the comparison clearly showed that the spatial DTR model is superior to the aspatial DTR model in terms of prediction accuracy. Finally, this study contributed to the safety literature by applying three ensemble techniques (bagging, random forest, and boosting) in order to improve the prediction accuracy of the weak learner (DTR models) for macro-level crash counts. The estimation results revealed that all the ensemble techniques performed better than the DTR model, and that the gradient boosting technique outperformed the other competing ensemble techniques in the macro-level crash prediction models.
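A minimal sketch of the kind of comparison described, a single decision tree regressor versus a gradient boosting ensemble on zone-level count data, using standard scikit-learn estimators. The synthetic data, feature count, and scoring choice are assumptions for illustration; the thesis's actual STAZ data and model settings are not reproduced here:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for zone-level data: columns play the role of traffic,
# roadway, and sociodemographic variables; a spatial variant would append the
# same variables aggregated over neighboring zones as extra columns.
rng = np.random.default_rng(0)
n_zones = 500
X = rng.normal(size=(n_zones, 8))
crash_counts = rng.poisson(np.exp(0.5 * X[:, 0] - 0.3 * X[:, 1] + 1.0))

dtr = DecisionTreeRegressor(max_depth=5, random_state=0)
gbt = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)

for name, model in [("single DTR", dtr), ("gradient boosting", gbt)]:
    # Negative MSE is scikit-learn's convention; values closer to zero are better.
    score = cross_val_score(model, X, crash_counts,
                            scoring="neg_mean_squared_error", cv=5).mean()
    print(f"{name}: mean CV neg-MSE = {score:.3f}")
```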
- Date Issued
- 2018
- Identifier
- CFE0007358, ucf:52103
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0007358
- Title
- Batch and Online Implicit Weighted Gaussian Processes for Robust Novelty Detection.
- Creator
-
Ramirez Padron, Ruben, Gonzalez, Avelino, Georgiopoulos, Michael, Stanley, Kenneth, Mederos, Boris, Wang, Chung-Ching, University of Central Florida
- Abstract / Description
-
This dissertation aims mainly at obtaining robust variants of Gaussian processes (GPs) that do not require using non-Gaussian likelihoods to compensate for outliers in the training data. Bayesian kernel methods, and in particular GPs, have been used to solve a variety of machine learning problems, equating or exceeding the performance of other successful techniques. That is the case of a recently proposed approach to GP-based novelty detection that uses standard GPs (i.e. GPs employing Gaussian likelihoods). However, standard GPs are sensitive to outliers in training data, and this limitation carries over to GP-based novelty detection. It has typically been addressed by using robust non-Gaussian likelihoods. However, non-Gaussian likelihoods lead to analytically intractable inferences, which require approximation techniques that are typically complex and computationally expensive. Inspired by the use of weights in quasi-robust statistics, this work introduces a particular type of weight function, called here data weighers, in order to obtain robust GPs that do not require approximation techniques and retain the simplicity of standard GPs. This work proposes implicit weighted variants of batch GP, online GP, and sparse online GP (SOGP) that employ weighted Gaussian likelihoods. Mathematical expressions for calculating the posterior implicit weighted GPs are derived in this work. In our experiments, novelty detection based on our weighted batch GPs consistently and significantly outperformed standard batch GP-based novelty detection whenever the data was contaminated with outliers. Additionally, our experiments show that novelty detection based on online GPs can perform similarly to batch GP-based novelty detection. Membership scores previously introduced by other authors are also compared in our experiments.
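A minimal sketch of how per-point weights can enter a Gaussian likelihood in GP regression: each training point i is given noise variance noise_var / weights[i], so down-weighted points pull the posterior less. This illustrates only the general weighted-likelihood mechanism; the dissertation's data weighers are defined implicitly rather than fixed in advance, and all names and values below are assumptions:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between row-vector sets A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def weighted_gp_posterior_mean(X, y, X_star, weights, noise_var=0.1):
    """GP regression posterior mean under a weighted Gaussian likelihood.

    Training point i gets likelihood variance noise_var / weights[i], so
    suspected outliers (small weights) influence the fit less.
    """
    K = rbf_kernel(X, X)
    K_star = rbf_kernel(X_star, X)
    # Effective noise covariance: sigma^2 * W^{-1} on the diagonal.
    noise = np.diag(noise_var / np.asarray(weights, dtype=float))
    alpha = np.linalg.solve(K + noise, y)
    return K_star @ alpha

# Toy usage: one gross outlier, strongly down-weighted.
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel()
y[10] += 5.0                           # inject an outlier
w = np.ones(len(y)); w[10] = 1e-3      # weight it down
X_star = np.linspace(0, 5, 100)[:, None]
mean_robust = weighted_gp_posterior_mean(X, y, X_star, w)
mean_plain = weighted_gp_posterior_mean(X, y, X_star, np.ones(len(y)))
```

Because the likelihood stays Gaussian, the posterior remains available in closed form, which is the simplicity the weighted approach is meant to preserve.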
- Date Issued
- 2015
- Identifier
- CFE0005869, ucf:50858
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0005869