22 Mar 2020 - tsp
Last update 19 Aug 2020
4 mins
Since I’ve written a short summary about neural network basics, a few people and students have also shown interest in different clustering methods. One of the simplest vector quantization and clustering methods is the k-means algorithm.
k-Means is used really often for tasks like object recognition, recommender systems, etc. - anything that does categorical association. It’s one of the more important algorithms in data analysis and data mining. It’s also one of the simplest algorithms to implement - and a really powerful one. In many situations it can provide similar or better performance than neural networks. It’s also used heavily in image recognition, for example in bag-of-words models.
The basic idea behind k-means is to locate $k$ (hence the name) cluster centers, each located at the mean - one can also imagine the center of mass - of a group of data points. Every point is assigned to exactly one cluster center, and each center lies at the center of mass of all points assigned to it.
The k-means algorithm locates these centers of mass by minimizing the function

\[ E = \sum_{n \in \text{Clusters}} \sum_{i \in \text{Cluster}_n} (x_{i;n} - \bar{x}_n)^2 \]

I.e. it minimizes the summed squared distance between all points and their associated cluster centers. The easiest way to achieve this behavior is Lloyd’s algorithm:

1. Choose $k$ initial cluster centers (for example random data points).
2. Assign every data point to its nearest cluster center.
3. Move every cluster center to the mean - the center of mass - of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.
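To make this concrete, here is a minimal sketch of Lloyd’s algorithm in Python using NumPy. The function name `kmeans` and its parameters are chosen for illustration only:

```python
import numpy as np

def kmeans(points, k, iterations=100, rng=None):
    """Minimal Lloyd's algorithm sketch; points is an (n, d) array."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick k distinct data points as initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign every point to its nearest cluster center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean (center of mass) of its points;
        # keep the old center if a cluster happens to become empty
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop as soon as the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Recompute assignments so the labels match the final centers
    labels = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    return centers, labels
```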
Of course the choice of initial positions has a huge impact on the quality of the resulting cluster centers. In the most basic form of the algorithm one simply performs multiple runs with different random starting values for the cluster centers and keeps the best result.
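A sketch of this restart strategy, reusing the hypothetical `kmeans` function from above and keeping the run with the lowest error $E$:

```python
def kmeans_restarts(points, k, restarts=10):
    """Run k-means several times, keep the run with the lowest error E."""
    best_error, best = np.inf, None
    for _ in range(restarts):
        centers, labels = kmeans(points, k)
        # E: summed squared distance of every point to its cluster center
        error = ((points - centers[labels]) ** 2).sum()
        if error < best_error:
            best_error, best = error, (centers, labels)
    return best
```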
At the end one should calculate the variance of each cluster to get a measure for the quality of the clustering. When one then wants to do categorization, one can simply assign each new sample to its nearest cluster center.
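Both steps are straightforward; a small sketch building on the `kmeans` function above (names chosen for illustration):

```python
def cluster_variances(points, centers, labels):
    """Mean squared distance to the center, per cluster - a simple quality measure."""
    return np.array([
        ((points[labels == j] - centers[j]) ** 2).sum(axis=1).mean()
        for j in range(len(centers))
    ])

def classify(sample, centers):
    """Categorize a new sample by the index of its nearest cluster center."""
    return int(np.linalg.norm(centers - sample, axis=1).argmin())
```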
There are some modifications to this algorithm:
The first is k-means++. This is simply an extension of the initial center guessing / determination that selects starting centers which have a high probability of leading to a useful result.
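A minimal sketch of this seeding strategy, assuming the same NumPy setup as above: the first center is a uniformly random data point, and each further center is drawn with probability proportional to the squared distance to the nearest already chosen center.

```python
def kmeanspp_init(points, k, rng=None):
    """k-means++ seeding sketch; returns k initial centers."""
    rng = np.random.default_rng() if rng is None else rng
    # First center: a uniformly random data point
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center
        d2 = np.min(
            np.linalg.norm(points[:, None, :] - np.array(centers)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Draw the next center proportionally to that squared distance
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centers)
```

The resulting centers can then be fed into the Lloyd iteration in place of the purely random initialization.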
The second modification - often called bisecting k-means - is an algorithm that can be used when one doesn’t know in advance how many cluster centers to use. This method starts with $k=2$ clusters. After the initial run of the k-means or k-means++ algorithm one splits the largest cluster and repeats the same procedure, then splits again, and so on, until either the separation provided by the clusters (determined by their variance and overlap) is not usable anymore or the desired number of clusters has been found.
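A simplified sketch of this splitting scheme, reusing the `kmeans` function from above and stopping only at a desired cluster count for brevity (a full implementation would also check the variance / overlap criterion):

```python
def bisecting_kmeans(points, target_k):
    """Start with k = 2 and repeatedly split the largest cluster."""
    centers, labels = kmeans(points, 2)
    while len(centers) < target_k:
        # Locate the largest cluster and split it into two with k-means
        largest = np.bincount(labels).argmax()
        sub_centers, _ = kmeans(points[labels == largest], 2)
        # Replace the split center by the two new ones, re-assign all points
        centers = np.vstack([np.delete(centers, largest, axis=0), sub_centers])
        labels = np.linalg.norm(
            points[:, None, :] - centers[None, :, :], axis=2
        ).argmin(axis=1)
    return centers, labels
```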
There is an additional method called X-means that works the same way but without any limit on $k$; other criteria (typically an information criterion) decide when one should stop splitting, i.e. when there is no more information gain from the additional clusters.