K-means is arguably the most useful algorithm for clustering datasets. However, we need to prescribe the number of clusters k in advance. One way to circumvent this is to use penalty criteria (like AIC or BIC) or the MDL principle. A simpler solution is to run the 1D Anderson-Darling normality test on projected data, as advocated by Hamerly and Elkan (NIPS 2003) in Learning the k in k-means.
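To make the projection idea concrete, here is a minimal sketch in Python of testing one cluster for normality along a single direction. It is not the authors' implementation: the function names are made up for illustration, the direction is taken to be the first principal component (an assumption; any direction could be passed in), and the 1% significance level is an arbitrary choice. It relies on `scipy.stats.anderson`.

```python
import numpy as np
from scipy.stats import anderson


def principal_direction(points):
    """Unit vector of largest variance (assumed choice of projection axis)."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def gaussian_along_direction(points, direction):
    """Project onto `direction` and run the 1D Anderson-Darling test.

    Returns (statistic, critical_value, accepted), where `accepted` is True
    when normality is NOT rejected at the 1% level.
    """
    points = np.asarray(points, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    projected = points @ direction
    result = anderson(projected, dist="norm")
    # SciPy tabulates critical values for the 15%, 10%, 5%, 2.5% and 1%
    # levels; the last entry is the 1% level.
    critical = float(result.critical_values[-1])
    statistic = float(result.statistic)
    return statistic, critical, statistic < critical


rng = np.random.default_rng(0)
one_blob = rng.normal(size=(500, 2))
two_blobs = np.vstack([
    rng.normal(loc=(-5.0, 0.0), size=(250, 2)),
    rng.normal(loc=(5.0, 0.0), size=(250, 2)),
])

for name, data in [("one blob", one_blob), ("two blobs", two_blobs)]:
    stat, crit, ok = gaussian_along_direction(data, principal_direction(data))
    print(f"{name}: A2 = {stat:.2f}, 1% critical value = {crit:.2f}, Gaussian? {ok}")
```

A cluster whose projection fails the test is a candidate for being split in two, which is how the number of clusters can grow until every cluster looks Gaussian along its tested direction.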
However, this test is not fully multidimensional, since it only looks at a 1D projection of the data. Another interesting approach is based on the minimum spanning tree of the source dataset pooled with a reference sample drawn from a normal distribution (with parameters estimated from the sample mean and sample variance-covariance matrix):
A Test to Determine the Multivariate Normality of a Data Set (PAMI 1988).
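Here is a rough sketch of the idea in Python, under stated assumptions rather than as a reproduction of the paper. The data is pooled with an equally sized sample drawn from the fitted Gaussian, the MST of the pooled points is built, and the edges joining a data point to a reference point are counted; too few such edges suggests the data does not look like the fitted Gaussian. The name `mst_normality_test` is hypothetical, and the p-value comes from shuffling the labels on the fixed tree, which is a simple stand-in of mine and not the null distribution derived in the paper (it also ignores the fact that the parameters were estimated from the data).

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def mst_normality_test(data, n_permutations=999, seed=0):
    """Return (cross_edges, p_value) for an MST-based normality check."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.shape[0]

    # Reference sample from a Gaussian fitted to the data.
    mean = data.mean(axis=0)
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    reference = rng.multivariate_normal(mean, cov, size=n)

    pooled = np.vstack([data, reference])
    labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

    # All pairwise distances: this is the quadratic-time step.
    # Note: SciPy treats zero entries as missing edges, so exact duplicate
    # points would need a small jitter (not handled in this sketch).
    distances = squareform(pdist(pooled))
    tree = minimum_spanning_tree(distances).tocoo()
    rows, cols = tree.row, tree.col

    # Edges linking a data point to a reference point.
    observed = int(np.sum(labels[rows] != labels[cols]))

    # Permutation null: shuffle the labels, keep the tree fixed.
    at_most_observed = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(labels)
        if np.sum(shuffled[rows] != shuffled[cols]) <= observed:
            at_most_observed += 1
    p_value = (at_most_observed + 1) / (n_permutations + 1)
    return observed, p_value


rng = np.random.default_rng(1)
one_blob = rng.normal(size=(200, 2))
two_blobs = np.vstack([
    rng.normal(loc=(-6.0, 0.0), size=(100, 2)),
    rng.normal(loc=(6.0, 0.0), size=(100, 2)),
])

for name, data in [("one blob", one_blob), ("two blobs", two_blobs)]:
    edges, p = mst_normality_test(data)
    print(f"{name}: cross edges = {edges}, permutation p-value = {p:.3f}")
```

Unlike a test on a 1D projection, this check uses the full geometry of the points, at the cost of computing all pairwise distances.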
The null hypothesis testing algorithm runs in quadratic time. It is surprising that the paper has not been mentioned more in the literature (the closest works, using MSTs and entropy, are those of A. Hero). If you give it a try, let me know :-)
Frank.