Current Search: Distributed Data Mining (x)
View All Items
- Title
- AN ARCHITECTURE FOR HIGH-PERFORMANCE PRIVACY-PRESERVING AND DISTRIBUTED DATA MINING.
- Creator
-
Secretan, James, Georgiopoulos, Michael, University of Central Florida
- Abstract / Description
-
This dissertation discusses the development of an architecture and associated techniques to support Privacy Preserving and Distributed Data Mining. The field of Distributed Data Mining (DDM) attempts to solve the challenges inherent in coordinating data mining tasks with databases that are geographically distributed, through the application of parallel algorithms and grid computing concepts. The closely related field of Privacy Preserving Data Mining (PPDM) adds the dimension of privacy to...
Show moreThis dissertation discusses the development of an architecture and associated techniques to support Privacy Preserving and Distributed Data Mining. The field of Distributed Data Mining (DDM) attempts to solve the challenges inherent in coordinating data mining tasks with databases that are geographically distributed, through the application of parallel algorithms and grid computing concepts. The closely related field of Privacy Preserving Data Mining (PPDM) adds the dimension of privacy to the problem, trying to find ways that organizations can collaborate to mine their databases collectively, while at the same time preserving the privacy of their records. Developing data mining algorithms for DDM and PPDM environments can be difficult and there is little software to support it. In addition, because these tasks can be computationally demanding, taking hours of even days to complete data mining tasks, organizations should be able to take advantage of high-performance and parallel computing to accelerate these tasks. Unfortunately there is no such framework that is able to provide all of these services easily for a developer. In this dissertation such a framework is developed to support the creation and execution of DDM and PPDM applications, called APHID (Architecture for Private, High-performance Integrated Data mining). The architecture allows users to flexibly and seamlessly integrate cluster and grid resources into their DDM and PPDM applications. The architecture is scalable, and is split into highly de-coupled services to ensure flexibility and extensibility. This dissertation first develops a comprehensive example algorithm, a privacy-preserving Probabilistic Neural Network (PNN), which serves a basis for analysis of the difficulties of DDM/PPDM development. The privacy-preserving PNN is the first such PNN in the literature, and provides not only a practical algorithm ready for use in privacy-preserving applications, but also a template for other data intensive algorithms, and a starting point for analyzing APHID's architectural needs. After analyzing the difficulties in the PNN algorithm's development, as well as the shortcomings of researched systems, this dissertation presents the first concrete programming model joining high performance computing resources with a privacy preserving data mining process. Unlike many of the existing PPDM development models, the platform of services is language independent, allowing layers and algorithms to be implemented in popular languages (Java, C++, Python, etc.). An implementation of a PPDM algorithm is developed in Java utilizing the new framework. Performance results are presented, showing that APHID can enable highly simplified PPDM development while speeding up resource intensive parts of the algorithm.
Show less - Date Issued
- 2009
- Identifier
- CFE0002853, ucf:48076
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002853
- Title
- SCALABLE AND EFFICIENT OUTLIER DETECTION IN LARGE DISTRIBUTED DATA SETS WITH MIXED-TYPE ATTRIBUTES.
- Creator
-
Koufakou, Anna, Georgiopoulos, Michael, University of Central Florida
- Abstract / Description
-
An important problem that appears often when analyzing data involves identifying irregular or abnormal data points called outliers. This problem broadly arises under two scenarios: when outliers are to be removed from the data before analysis, and when useful information or knowledge can be extracted by the outliers themselves. Outlier Detection in the context of the second scenario is a research field that has attracted significant attention in a broad range of useful applications. For...
Show moreAn important problem that appears often when analyzing data involves identifying irregular or abnormal data points called outliers. This problem broadly arises under two scenarios: when outliers are to be removed from the data before analysis, and when useful information or knowledge can be extracted by the outliers themselves. Outlier Detection in the context of the second scenario is a research field that has attracted significant attention in a broad range of useful applications. For example, in credit card transaction data, outliers might indicate potential fraud; in network traffic data, outliers might represent potential intrusion attempts. The basis of deciding if a data point is an outlier is often some measure or notion of dissimilarity between the data point under consideration and the rest. Traditional outlier detection methods assume numerical or ordinal data, and compute pair-wise distances between data points. However, the notion of distance or similarity for categorical data is more difficult to define. Moreover, the size of currently available data sets dictates the need for fast and scalable outlier detection methods, thus precluding distance computations. Additionally, these methods must be applicable to data which might be distributed among different locations. In this work, we propose novel strategies to efficiently deal with large distributed data containing mixed-type attributes. Specifically, we first propose a fast and scalable algorithm for categorical data (AVF), and its parallel version based on MapReduce (MR-AVF). We extend AVF and introduce a fast outlier detection algorithm for large distributed data with mixed-type attributes (ODMAD). Finally, we modify ODMAD in order to deal with very high-dimensional categorical data. Experiments with large real-world and synthetic data show that the proposed methods exhibit large performance gains and high scalability compared to the state-of-the-art, while achieving similar accuracy detection rates.
Show less - Date Issued
- 2009
- Identifier
- CFE0002734, ucf:48161
- Format
- Document (PDF)
- PURL
- http://purl.flvc.org/ucf/fd/CFE0002734