Quantification Of The Variation Of Results Of The K-Means Clustering Algorithm Run Ten Times On The First Twenty-Five Primes – A Criterion For The Applicability Of K-Means Clustering
Chandini Giri, R Satya Ravindra Babu
K-Means Clustering, Clustering Uncertainty
Clustering is one of the most common exploratory data analysis techniques, used to gain an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean distance or correlation-based distance. The choice of similarity measure is application-specific. K-Means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem; it is a scheme for clustering continuous numeric data. Because the K-Means algorithm initializes its centroids randomly, each run can produce different or slightly different results, as it may converge to a local optimum. Quantifying this variation is of some importance, as it sheds light on the nature of the discrete K-Means objective function with regard to its maxima and minima; the K-Means clustering algorithm aims at minimizing this objective function. In this research investigation, the authors have attempted to quantify the variation of the results of the K-Means clustering algorithm run 10 times on the first 25 prime numbers. In addition, a notion of percentage uncertainty of the cluster assignment of each data point is computed for each run of the K-Means clustering algorithm, and a criterion is proposed for the applicability of the K-Means clustering algorithm to the given data set.
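The experiment described above can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: it assumes k = 3 clusters, Lloyd's algorithm with random centroid initialization, and defines the per-point percentage uncertainty as the fraction of the 10 runs in which a point's cluster assignment disagrees with its majority assignment (labels are made comparable across runs by sorting centroids, which is valid for 1-D data).

```python
import random

def first_n_primes(n):
    """Return the first n prime numbers by trial division."""
    primes, cand = [], 2
    while len(primes) < n:
        if all(cand % p for p in primes):
            primes.append(cand)
        cand += 1
    return primes

def kmeans_1d(data, k, seed, iters=100):
    """Lloyd's K-Means on 1-D data with random initial centroids.

    Returns per-point labels, canonicalized by sorting centroids so
    that label 0 is the leftmost cluster (possible only in 1-D).
    """
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # random initialization
    for _ in range(iters):
        # assignment step: nearest centroid for each point
        labels = [min(range(k), key=lambda j: abs(x - centroids[j]))
                  for x in data]
        # update step: centroid = mean of its members
        new = []
        for j in range(k):
            members = [x for x, l in zip(data, labels) if l == j]
            new.append(sum(members) / len(members) if members else centroids[j])
        if new == centroids:  # converged to a (local) optimum
            break
        centroids = new
    order = sorted(range(k), key=lambda j: centroids[j])
    relabel = {old: new_lbl for new_lbl, old in enumerate(order)}
    return [relabel[l] for l in labels]

data = first_n_primes(25)
runs = [kmeans_1d(data, k=3, seed=s) for s in range(10)]

# per-point percentage uncertainty: share of runs that disagree
# with that point's majority cluster assignment
uncertainty = []
for i in range(len(data)):
    votes = [run[i] for run in runs]
    majority = max(set(votes), key=votes.count)
    uncertainty.append(100.0 * sum(v != majority for v in votes) / len(runs))

print(dict(zip(data, uncertainty)))
```

Points with nonzero uncertainty are those whose cluster membership flips between runs, i.e. the points most sensitive to the random initialization.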