K-MEANS CLUSTERING – NEW VARIATIONS

Andrzej Sokolowski; Malgorzata Markowska; Maciej Laburda

Open Conference Systems, CLADAG2023

Last modified: 2023-07-07

Abstract

k-means is one of the most popular methods in cluster analysis. It can handle large data sets, since there is no need to store the distance matrix in memory, and the algorithm converges very quickly to a state in which no object should be relocated (each one is closer to the mean of its “own” cluster than to the mean of any other cluster). Two main drawbacks of the method are that the number of clusters must be properly specified in advance and that the final partition tends to consist of spherical clusters. In the literature, there are many variations, improvements, and new versions of k-means based on the original model.

In this contribution we discuss two new ideas. The first can be called n%-neighbors k-means. When deciding whether an object should be relocated to another cluster, we consider only a given percentage of the total set of objects, namely the points closest to the one currently under consideration. Partial cluster means are therefore calculated from these neighbors and used in the comparison. It is possible that some distant clusters will not be taken into account at all, if none of their members are among the n% nearest neighbors of this point. The second proposal can be called local standardization k-means. Standardization is performed separately for each cluster, using its mean and standard deviation, excluding the point which is being considered for relocation. Then this point is “standardized” using the means and standard deviations of consecutive clusters, and distances are calculated.
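
The two relocation rules described above could be sketched roughly as follows. This is only an illustration under several assumptions that the abstract does not state: Euclidean distance is used throughout, the point under consideration is excluded from its own neighborhood, the "standardized" distance is taken to be the norm of the point's z-scores relative to each cluster, and clusters with fewer than two remaining members are skipped. The function names `n_pct_neighbors_assignment` and `local_standardization_assignment` are hypothetical and are not taken from the authors' work.

```python
import numpy as np


def n_pct_neighbors_assignment(X, labels, i, pct):
    """Pick a cluster for point i using partial means over its pct% nearest neighbors.

    Assumption: the neighborhood excludes point i itself and uses Euclidean distance.
    """
    n = X.shape[0]
    m = max(1, int(np.ceil(pct / 100.0 * (n - 1))))  # neighborhood size
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                                    # never select the point itself
    nbrs = np.argsort(d)[:m]

    best, best_dist = labels[i], np.inf
    # Only clusters represented among the neighbors are candidates;
    # distant clusters with no member in the neighborhood are ignored.
    for c in np.unique(labels[nbrs]):
        partial_mean = X[nbrs[labels[nbrs] == c]].mean(axis=0)
        dist = np.linalg.norm(X[i] - partial_mean)
        if dist < best_dist:
            best, best_dist = c, dist
    return best


def local_standardization_assignment(X, labels, i, k):
    """Pick a cluster for point i after standardizing it separately per cluster.

    Assumption: the distance to cluster c is the Euclidean norm of the z-scores of
    point i, computed with the mean and standard deviation of cluster c without i.
    """
    others = np.ones(X.shape[0], dtype=bool)
    others[i] = False                                # exclude the point being relocated

    best, best_dist = labels[i], np.inf
    for c in range(k):
        members = others & (labels == c)
        if members.sum() < 2:                        # assumption: skip tiny clusters
            continue
        mu = X[members].mean(axis=0)
        sd = X[members].std(axis=0, ddof=1)
        sd = np.where(sd > 0, sd, 1.0)               # guard against zero variance
        dist = np.linalg.norm((X[i] - mu) / sd)
        if dist < best_dist:
            best, best_dist = c, dist
    return best
```

In a full procedure, either function would presumably be called for each object in turn, repeating the passes until no object changes its cluster; the stopping rule and the choice of the percentage are left open here, as the abstract does not specify them.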

Simulation analysis is the main tool used to evaluate the quality of the proposed approaches.