Current Search: data clustering
 Title
 LEARNING TECHNIQUES FOR INFORMATION RETRIEVAL AND MINING IN HIGH-DIMENSIONAL DATABASES.
 Creator

Cheng, Hao, Hua, Kien A., University of Central Florida
 Abstract / Description

The main focus of my research is to design effective learning techniques for information retrieval and mining in high-dimensional databases. There are two main aspects in retrieval and mining research: accuracy and efficiency. The accuracy problem is how to return results that better match the ground truth, and the efficiency problem is how to evaluate users' requests and execute learning algorithms as fast as possible. However, these problems are nontrivial because of the complexity of high-level semantic concepts, the heterogeneous nature of the feature space, the high dimensionality of data representations, and the size of the databases. My dissertation is dedicated to addressing these issues. Specifically, my work has five main contributions, as follows. The first contribution is a novel manifold learning algorithm, Local and Global Structures Preserving Projection (LGSPP), which defines salient low-dimensional representations for high-dimensional data. A small number of projection directions are sought in order to properly preserve the local and global structures of the original data. Specifically, two groups of points are extracted for each individual point in the dataset: the first group contains the nearest neighbors of the point, and the second contains a few sampled points far away from the point. These two point sets respectively characterize the local and global structures with regard to the data point. The objective of the embedding is to minimize the distances between the points in each local neighborhood and also to disperse the points far away from their respective remote points in the original space. In this way, the relationships between the data in the original space are well preserved with little distortion. The second contribution is a new constrained clustering algorithm.
Conventionally, clustering is an unsupervised learning problem, which systematically partitions a dataset into a small set of clusters such that data in each cluster appear more similar to each other than to those in other clusters. In the proposed approach, partial human knowledge is exploited to find better clustering results. Two kinds of constraints are integrated into the clustering algorithm. One is the must-link constraint, indicating that the two points involved belong to the same cluster. The other is the cannot-link constraint, denoting that two points are not within the same cluster. Given the input constraints, data points are arranged into small groups and a graph is constructed to preserve the semantic relations between these groups. The assignment procedure makes a best effort to assign each group to a feasible cluster without violating the constraints. The theoretical analysis reveals that the probability of data points being assigned to their true clusters is much higher under the new proposal than under conventional methods. In general, the new scheme can produce clusters that better match the ground truth and respect the semantic relations between points inferred from the constraints. The third contribution is a unified framework for partition-based dimension reduction techniques, which allows efficient similarity retrieval in the high-dimensional data space. Recent similarity search techniques, such as Piecewise Aggregate Approximation (PAA), Segmented Means (SMEAN), and Mean-Standard deviation (MS), prove to be very effective in reducing data dimensionality by partitioning dimensions into subsets and extracting aggregate values from each dimension subset. These partition-based techniques have many advantages, including very efficient multi-phased pruning, while being simple to implement. They are not, however, adaptive to the different characteristics of data in diverse applications.
In this study, a unified framework for these partition-based techniques is proposed and the issue of dimension partitions is examined in this framework. An investigation of the relationship between query selectivity and dimension partition schemes reveals indicators that can predict the performance of a partitioning setting. Accordingly, a greedy algorithm is designed to effectively determine a good partitioning of data dimensions so that the performance of the reduction technique is robust with regard to different datasets. The fourth contribution is an effective similarity search technique for databases of point sets. In the conventional model, an object corresponds to a single vector. In the proposed study, an object is represented by a set of points. In general, this new representation can be used in many real-world applications and carries much more local information, but the retrieval and learning problems become very challenging. The Hausdorff distance is the common distance function for measuring the similarity between two point sets; however, this metric is sensitive to outliers in the data. To address this issue, a novel similarity function is defined to better capture the proximity of two objects, in which a one-to-one mapping is established between the vectors of the two objects. The optimal mapping minimizes the sum of distances between paired points. The overall distance of the optimal matching is robust and yields high retrieval accuracy. The computation of the new distance function is formulated as the classical assignment problem. Lower-bounding techniques and an early-stop mechanism are also proposed to significantly accelerate the expensive similarity search process. The classification problem over point-set data is called Multiple Instance Learning (MIL) in the machine learning community, in which a vector is an instance and an object is a bag of instances.
The fifth contribution is to convert the MIL problem into a standard supervised learning problem in the conventional vector space. Specifically, feature vectors of bags are grouped into clusters. Each object is then denoted as a bag of cluster labels, and common patterns of each category are discovered, each of which is further reconstructed into a bag of features. Accordingly, a bag is effectively mapped into a feature space defined by the distances from this bag to all the derived patterns. Standard supervised learning algorithms can then be applied to classify objects into predefined categories. The results demonstrate that the proposal has better classification accuracy than other state-of-the-art techniques. In the future, I will continue my research in large-scale data analysis algorithms, applications, and system development. In particular, I am interested in applications that analyze the massive volume of online data.
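Illustrative sketch (not taken from the dissertation): the abstract contrasts the Hausdorff distance with a distance based on an optimal one-to-one matching between the points of two objects. The minimal Python example below computes both for two tiny point sets. The function names, the sample data, and the restriction to equally sized sets are assumptions made for illustration, and the brute-force search over permutations merely stands in for a proper assignment-problem solver.

```python
from itertools import permutations
from math import dist


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite point sets."""
    def directed(p, q):
        # Largest distance from a point of p to its nearest point in q.
        return max(min(dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))


def matching_distance(a, b):
    """Minimum total distance over all one-to-one pairings of a and b.

    Brute force over permutations, so only suitable for tiny sets.
    Assumes len(a) == len(b); the abstract does not say how sets of
    different sizes are handled.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equally sized point sets")
    return min(
        sum(dist(x, y) for x, y in zip(a, perm))
        for perm in permutations(b)
    )


# Hypothetical objects: two points each, the second shifted upward by 3.
obj_a = [(0.0, 0.0), (4.0, 0.0)]
obj_b = [(0.0, 3.0), (4.0, 3.0)]

print(hausdorff(obj_a, obj_b))
print(matching_distance(obj_a, obj_b))
```

For these two sets the Hausdorff distance is 3.0 and the matching distance is 6.0 (the sum over the two straight-up pairings). For realistic set sizes the permutation search would be replaced by a polynomial-time assignment solver.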
 Date Issued
 2009
 Identifier
 CFE0002882, ucf:48022
 Format
 Document (PDF)
 PURL
 http://purl.flvc.org/ucf/fd/CFE0002882
 Title
 An Unsupervised Consensus Control Chart Pattern Recognition Framework.
 Creator

Haghtalab, Siavash, Xanthopoulos, Petros, Pazour, Jennifer, Rabelo, Luis, University of Central Florida
 Abstract / Description

Early identification and detection of abnormal time series patterns is vital for a number of manufacturing processes. Slight shifts and alterations of time series patterns might be indicative of some anomaly in the production process, such as machinery malfunction. Usually, due to the continuous flow of data, monitoring of manufacturing processes requires automated Control Chart Pattern Recognition (CCPR) algorithms. The majority of the CCPR literature consists of supervised classification algorithms. Fewer studies consider unsupervised versions of the problem. Despite the profound advantage of unsupervised methodology in requiring less manual data labeling, its use is limited because its performance is not robust enough for practical purposes. In this study we propose the use of a consensus clustering framework. Computational results show robust behavior compared to individual clustering algorithms.
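Illustrative sketch (not taken from the thesis): the abstract does not say which consensus scheme is used, so the Python example below shows one common approach, a co-association matrix, as an assumption. Items that most base clusterings place together are linked, and the connected components of those links form the consensus clusters. The function names, the 0.5 threshold, and the sample labelings are all hypothetical.

```python
def co_association(labelings):
    """Fraction of base clusterings that put each pair of items together."""
    n = len(labelings[0])
    m = len(labelings)
    return [
        [sum(lab[i] == lab[j] for lab in labelings) / m for j in range(n)]
        for i in range(n)
    ]


def consensus_labels(labelings, threshold=0.5):
    """Link items whose co-association exceeds threshold.

    Returns one label per item, numbering connected components in
    order of their first item.
    """
    co = co_association(labelings)
    n = len(co)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Depth-first traversal over the thresholded co-association graph.
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Three hypothetical base clusterings of six control-chart windows.
# Label ids need not agree across clusterings; only co-membership counts.
base = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],  # this clustering disagrees about item 2
    [0, 0, 0, 1, 1, 1],
]
print(consensus_labels(base))
```

Here the second base clustering groups item 2 with items 3 to 5, but it is outvoted by the other two, so the consensus is `[0, 0, 0, 1, 1, 1]`.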
 Date Issued
 2014
 Identifier
 CFE0005178, ucf:50670
 Format
 Document (PDF)
 PURL
 http://purl.flvc.org/ucf/fd/CFE0005178
 Title
 Sampling and Subspace Methods for Learning Sparse Group Structures in Computer Vision.
 Creator

Jaberi, Maryam, Foroosh, Hassan, Pensky, Marianna, Gong, Boqing, Qi, Guo-Jun, University of Central Florida
 Abstract / Description

The unprecedented growth of data in volume and dimension has led to an increased number of computationally demanding and data-driven decision-making methods in many disciplines, such as computer vision, genomics, and finance. Research on big data aims to understand and describe trends in massive volumes of high-dimensional data. High volume and dimension are the determining factors in both the computational and time complexity of algorithms. The challenge grows when the data are formed of the union of group structures of different dimensions embedded in a high-dimensional ambient space. To address the problem of high volume, we propose a sampling method referred to as the Sparse Withdrawal of Inliers in a First Trial (SWIFT), which determines the smallest sample size in one grab so that all group structures are adequately represented and discovered with high probability. The key features of SWIFT are: (i) sparsity, which is independent of the population size; (ii) no prior knowledge of the distribution of data or the number of underlying group structures; and (iii) robustness in the presence of an overwhelming number of outliers. We report a comprehensive study of the proposed sampling method in terms of accuracy, functionality, and effectiveness in reducing the computational cost in various applications of computer vision. In the second part of this dissertation, we study dimensionality reduction for multi-structural data. We propose a probabilistic subspace clustering method that unifies soft and hard clustering in a single framework. This is achieved by introducing a delayed association of uncertain points to subspaces of lower dimensions based on a confidence measure. Delayed association yields higher accuracy in clustering subspaces that have ambiguities, i.e., due to intersections and high levels of outliers/noise, and hence leads to more accurate self-representation of the underlying subspaces.
Altogether, this dissertation addresses the key theoretical and practical issues of size and dimension in big data analysis.
 Date Issued
 2018
 Identifier
 CFE0007017, ucf:52039
 Format
 Document (PDF)
 PURL
 http://purl.flvc.org/ucf/fd/CFE0007017
 Title
 PARTITIONING A GRAPH IN ALLIANCES AND ITS APPLICATION TO DATA CLUSTERING.
 Creator

Hassan-Shafique, Khurram, Dutton, Ronald, University of Central Florida
 Abstract / Description

Any reasonably large group of individuals, families, states, and parties exhibits the phenomenon of subgroup formation within the group, such that the members of each subgroup have a strong connection or bond with each other. The reasons for the formation of these subgroups, which we call alliances, differ in different situations, such as kinship and friendship (in the case of individuals), common economic interests (for both individuals and states), common political interests, and geographical proximity. This structure of alliances is not only prevalent in social networks, but is also an important characteristic of similarity networks of natural and unnatural objects. (A similarity network defines the links between two objects based on their similarities.) Discovery of such structure in a data set is called clustering or unsupervised learning, and the ability to do it automatically is desirable for many applications in the areas of pattern recognition, computer vision, artificial intelligence, behavioral and social sciences, life sciences, earth sciences, medicine, and information theory. In this dissertation, we study a graph-theoretical model of alliances, where an alliance of the vertices of a graph is a set of vertices in the graph such that every vertex in the set is adjacent to at least as many vertices inside the set as outside it. We study the problem of partitioning a graph into alliances and identify classes of graphs that have such a partition. We present results on the relationship between the existence of such a partition and other well-known graph parameters, such as connectivity, subgraph structure, and degrees of vertices. We also present results on the computational complexity of finding such a partition. An alliance cover set is a set of vertices in a graph that contains at least one vertex from every alliance of the graph.
The complement of an alliance cover set is an alliance-free set, that is, a set that does not contain any alliance as a subset. We study the properties of these sets and present tight bounds on their cardinalities. In addition, we characterize the graphs that can be partitioned into alliance-free and alliance cover sets. Finally, we present an approximate algorithm to discover alliances in a given graph. At each step, the algorithm finds a partition of the vertices into two alliances such that the alliances are strongest among all such partitions. (The strength of an alliance is defined as a real number p such that every vertex in the alliance has at least p times as many neighbors in the set as its total number of neighbors in the graph.) We evaluate the performance of the proposed algorithm on standard data sets.
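Illustrative sketch (not taken from the dissertation): the alliance definition and the strength measure stated in the abstract can be checked directly on a small graph. The Python example below follows the abstract's wording literally, counting only neighbors and not the vertex itself. The function names and the example graph are assumptions for illustration, as is treating a vertex with no neighbors as having ratio 1.0. It does not reproduce the partitioning algorithm itself.

```python
def inside_count(graph, members, v):
    """Number of neighbors of v that lie inside the candidate set."""
    return sum(1 for u in graph[v] if u in members)


def is_alliance(graph, members):
    """True if every member has at least as many neighbors inside as outside."""
    members = set(members)
    if not members:
        return False
    for v in members:
        inside = inside_count(graph, members, v)
        outside = len(graph[v]) - inside
        if inside < outside:
            return False
    return True


def strength(graph, members):
    """Largest p such that every member has at least p * degree neighbors inside."""
    members = set(members)
    ratios = []
    for v in members:
        degree = len(graph[v])
        # Assumption: an isolated vertex imposes no constraint.
        ratios.append(inside_count(graph, members, v) / degree if degree else 1.0)
    return min(ratios)


# Hypothetical graph: two triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
graph = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4, 5},
    4: {3, 5},
    5: {3, 4},
}

print(is_alliance(graph, {0, 1, 2}))  # each triangle is an alliance
print(is_alliance(graph, {2, 3}))     # the bridge endpoints alone are not
print(strength(graph, {0, 1, 2}))     # limited by vertex 2 (2 of 3 neighbors inside)
```

In this graph the two triangles form a partition into alliances, each of strength 2/3, which matches the intuition that tightly connected groups correspond to clusters.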
 Date Issued
 2004
 Identifier
 CFE0000263, ucf:46225
 Format
 Document (PDF)
 PURL
 http://purl.flvc.org/ucf/fd/CFE0000263