GRAPH BASED PARAMETER-LITE CLUSTERING ALGORITHM USING EUCLIDEAN AND MAHALANOBIS DISTANCE METRICS

B.H.V.S. Ramakrishnam Raju

Awarded 2014

Data mining technology has emerged as a means of identifying patterns and trends in large quantities of data. Clustering groups the most similar data items together and has a wide range of applications. Clustering algorithms are broadly classified into partition algorithms, hierarchical algorithms, and density- and grid-based algorithms. Partition algorithms are simple and efficient but are sensitive to noise in the dataset. A further problem accompanying the use of a partition algorithm is the choice of the number of desired output clusters. Hierarchical algorithms can detect clusters of different shapes, but demand high computational cost. Density-based and grid-based clustering algorithms handle arbitrarily shaped collections of points as well as clusters of different sizes, and can also efficiently separate noise. However, these algorithms are very sensitive to the initial setting of parameters such as the desired number of clusters. Hence, the main problem in clustering is deciding the optimal number of clusters that fits a data set.
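To make the parameter-setting problem concrete, the sketch below (illustrative only, not part of the thesis) implements a minimal K-means in Python/NumPy; note that the number of clusters k must be supplied by the user before the algorithm can even start, and a wrong choice of k yields a poor partition:

```python
import numpy as np

def init_centers(X, k):
    # Farthest-point initialization: take the first point, then repeatedly
    # the point farthest from all chosen centers (deterministic for the demo).
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def kmeans(X, k, n_iter=100):
    """Basic K-means: k, the number of clusters, is a mandatory input."""
    centers = init_centers(X, k)
    for _ in range(n_iter):
        # Assign each point to its nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs: only the correct choice k=2 recovers them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers = kmeans(X, k=2)
```

With k=2 the two blobs are recovered exactly; with k=3 one blob would be split arbitrarily, which is the kind of improper partition the abstract refers to.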

As a consequence, if the parameters are not set properly, the clustering method may not produce an optimal partitioning scheme and may lead to improper decisions. The problem of deciding the number of clusters that best fits a given dataset has been the subject of this research. To overcome the problems associated with the initial setting of parameters, a novel clustering algorithm named PLMST (Parameter-Lite Minimum Spanning Tree), based on minimum spanning tree (a connected acyclic graph) splitting and a fuzzy-similarity-based merging technique, is proposed in this thesis. The proposed algorithm does not require the number of clusters to be set initially. The algorithm starts by constructing the minimum spanning tree of the given dataset, and in each iteration an inconsistent edge is removed until the termination criterion is reached. This gives a rough estimate of the number of clusters present in the dataset. The number of clusters is then fine-tuned by eliminating outlier clusters, which are defined based on the number of data items they contain. Once these clusters are known, the optimal number of clusters is generated using fuzzy-similarity-based merging with minimal effort.
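The MST split stage can be sketched as follows. This is an illustrative reconstruction, not the thesis's exact procedure: the function names are invented, and "remove the longest edges" stands in for the thesis's inconsistent-edge criterion and termination test:

```python
import numpy as np

def mst_edges(X):
    """Prim's algorithm: return the MST edges (i, j, length) of points X."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()            # cheapest edge from the tree to each vertex
    parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        j = np.where(~in_tree, best, np.inf).argmin()
        edges.append((parent[j], j, best[j]))
        in_tree[j] = True
        closer = dist[j] < best      # update cheapest connections via j
        best[closer] = dist[j][closer]
        parent[closer] = j
    return edges

def mst_split(X, n_cuts):
    """Cut the n_cuts longest ('inconsistent') MST edges; label the
    resulting connected components as clusters."""
    edges = sorted(mst_edges(X), key=lambda e: e[2])
    kept = edges[:len(edges) - n_cuts]
    # Union-find over the kept edges yields the components.
    parent = list(range(len(X)))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i, j, _ in kept:
        parent[find(i)] = find(j)
    roots = [find(i) for i in range(len(X))]
    _, labels = np.unique(roots, return_inverse=True)
    return labels

# Two well-separated blobs: cutting the single longest MST edge
# (the only inter-blob edge) separates them into two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
labels = mst_split(X, n_cuts=1)
```

Each cut disconnects the tree into one more component, which is why iteratively removing inconsistent edges yields a rough estimate of the number of clusters.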

Artificial (synthetic) and real-life data sets are considered for testing the performance of the proposed algorithm. The basis for selecting the synthetic datasets is that they can be designed to contain a known number of clusters and are easy to manage. The 2-D synthetic datasets Data_A and Data_B and a benchmark dataset called Ruspini are used for testing the algorithm. The real datasets tested in our implementations are the Iris, SPECT (Heart), Statlog, Soybean, Breast Tissue and Wine datasets, obtained from the UCI Machine Learning Repository web site. The only parameters the proposed algorithm requires are the minimum number of data points in a cluster, used to identify outlier clusters, and the fuzzifier used for fuzzy clustering. It works on numerical data and assumes that the graph can be computed in a vector space.
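As background on the fuzzifier parameter mentioned above, the standard fuzzy c-means membership update (generic textbook formula, not the thesis's specific merging rule; the cluster centers are assumed to be given) is u_ij = 1 / Σ_k (d_ij / d_ik)^(2/(m−1)):

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """Standard fuzzy c-means membership matrix U (rows sum to 1).

    m > 1 is the fuzzifier: m close to 1 gives near-crisp (0/1)
    memberships, while larger m gives increasingly uniform (fuzzier) ones.
    """
    # Distances from every point to every center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)          # avoid division by zero at a center
    # ratio[i, j, k] = (d_ij / d_ik) ** (2 / (m - 1)); sum over k.
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centers = np.array([[0.0, 0.0], [10.0, 0.0]])
U = fcm_memberships(X, centers, m=2.0)
# Each row of U sums to 1; a point sitting on a center belongs to it
# almost entirely, while intermediate points get graded memberships.
```

The fuzzifier m is one of the two user-supplied parameters the abstract lists; the graded memberships it produces are what drive the similarity-based merging.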

Experiments on both the synthetic and real datasets demonstrated that the proposed algorithm can discover patterns automatically and can compete with traditional clustering algorithms while requiring minimal user intervention. The similarity measure used in the proposed algorithm is based on Euclidean distance. Euclidean distance, however, is very sensitive to the scales of the variables involved and ignores correlations between variables. To overcome these drawbacks, a hybrid clustering algorithm based on Mahalanobis distance is proposed. To evaluate the proposed methodology, the performance of the proposed algorithm is also compared with results obtained for the same datasets by several other clustering approaches: standard K-means, FCM, Gustafson–Kessel (GK) and Gath–Geva (GG) clustering algorithms.
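The contrast between the two metrics can be sketched as below (a generic illustration, not the thesis's implementation): for correlated data, two points at the same Euclidean distance from a reference point can lie at very different Mahalanobis distances, because d_M(x, y) = sqrt((x−y)ᵀ Σ⁻¹ (x−y)) accounts for the covariance Σ:

```python
import numpy as np

def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance between x and y for a given inverse covariance."""
    diff = x - y
    return float(np.sqrt(diff @ cov_inv @ diff))

# Strongly correlated 2-D data: the covariance couples the two axes.
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
cov_inv = np.linalg.inv(cov)

a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])    # lies along the correlation direction
c = np.array([1.0, -1.0])   # lies against the correlation direction

# Euclidean distance treats b and c identically...
assert np.isclose(np.linalg.norm(b - a), np.linalg.norm(c - a))
# ...while Mahalanobis distance counts the off-correlation point as farther.
d_b = mahalanobis(a, b, cov_inv)
d_c = mahalanobis(a, c, cov_inv)
```

Because Σ⁻¹ rescales each direction by the data's spread, Mahalanobis distance is also invariant to the scaling of individual variables, addressing both drawbacks noted above.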

An important direction for further study is the application of the PLMST algorithm to the more general case, where the number of clusters and the threshold value must also be determined automatically. The number of iterations must also be controlled.