Euclidean Sum of Squares Clustering Problems



Problem Description

Clustering (see, e.g., [Steinley 2006, Xu and Wunsch 2005]) is a fundamental problem in data analysis. Given a finite dataset

$$X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}, \qquad \mathbf{x}_i \in \mathbb{R}^p,$$

of n data points located in p-dimensional continuous space, together with a second set of k points (the cluster centers),

$$C = \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k\}, \qquad \mathbf{c}_j \in \mathbb{R}^p,$$

the clustering problem is to determine the positions of the cluster centers such that the mean of the squared Euclidean (L2-norm) distances between each data point and its nearest cluster center is minimized:

$$f(\mathbf{c}_1, \ldots, \mathbf{c}_k) = \frac{1}{n} \sum_{i=1}^{n} \min_{j \in \{1, \ldots, k\}} \lVert \mathbf{x}_i - \mathbf{c}_j \rVert^2,$$

where

$$\lVert \mathbf{x}_i - \mathbf{c}_j \rVert = \sqrt{\sum_{d=1}^{p} (x_{id} - c_{jd})^2}.$$
This is an unconstrained, continuous optimization problem of dimensionality kp. A candidate solution vector, $\mathbf{x}'$, can be represented by concatenating the p-dimensional coordinates of the cluster centers:

$$\mathbf{x}' = (c_{11}, \ldots, c_{1p}, c_{21}, \ldots, c_{2p}, \ldots, c_{k1}, \ldots, c_{kp}) \in \mathbb{R}^{kp}.$$
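As a concrete illustration of this representation, here is a minimal Matlab sketch of the objective function. The function name essclust_f and the argument layout are assumptions for illustration; this is not the code distributed with this page.

  function f = essclust_f(xprime, X)
  % ESSCLUST_F  Mean sum-of-squares clustering objective.
  %   xprime : 1-by-(k*p) candidate solution, the concatenated
  %            p-dimensional coordinates of the k cluster centres
  %   X      : n-by-p data matrix, one data point per row
  %   f      : mean squared Euclidean distance from each data point
  %            to its nearest cluster centre
  [n, p] = size(X);
  C = reshape(xprime, p, [])';    % k-by-p matrix; row j holds centre j
  k = size(C, 1);
  D = zeros(n, k);                % D(i,j): squared distance, point i to centre j
  for j = 1:k
      diff = X - repmat(C(j, :), n, 1);
      D(:, j) = sum(diff .^ 2, 2);
  end
  f = mean(min(D, [], 2));        % distance to nearest centre, averaged over points
  end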
An instance of this problem type is uniquely specified by a dataset together with a choice of k. For any problem instance,

$$0 \leq f < \infty,$$

f=0 being the degenerate case in which every data point is located exactly at one of the cluster centers. If the coordinates of each data point are finite, then a boundary (e.g. a square) can be defined containing the dataset. This boundary implements an automatic penalisation of f: since no data points occur outside the boundary, f must increase monotonically if one or more cluster centers are moved outside and away from the boundary, with

$$f \to \infty \quad \text{as} \quad \lVert \mathbf{c}_j \rVert \to \infty.$$
Note also that the value of f is unaffected by moving a cluster center that is not the nearest center to any data point, so long as the move does not change the "ownership" of any data point. This means that the solution vector can be changed without changing the value of f: the fitness landscape contains a very large number of flat regions, with transitions between them where the ownership of one or more data points changes. The scale of this structure, however, depends on the data and on k.
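A small check of this plateau property, assuming the essclust_f sketch above: a centre placed far enough outside the data owns no points, so perturbing it leaves f exactly unchanged.

  % Assumes X (n-by-p data matrix) and essclust_f from the sketch above.
  R = max(X(:)) - min(X(:));            % overall range of the data
  c1 = mean(X, 1);                      % a centre amongst the data points
  c2 = max(X, [], 1) + R + 1;           % a remote centre that owns no points
  f1 = essclust_f([c1, c2], X);
  f2 = essclust_f([c1, c2 + 0.5], X);   % move only the empty centre
  % f1 == f2: the solution vector changed but f did not (a flat region)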

For algorithm benchmarking, it is useful to specify an initial search region in which the global optimum is located. For clustering problems, we can define a simple symmetric square boundary as

$$\mathbf{x}' \in [\underline{x}, \overline{x}]^{kp}, \qquad \underline{x} = \min_{i,d} x_{id}, \quad \overline{x} = \max_{i,d} x_{id},$$

i.e. the same lower and upper bound in every dimension, taken over all data points and all variables.
It may be possible to define a tighter initial search region, e.g. a rectangular region using the range of the data points in each dimension. In any case, these values are part of the problem specification; a sketch of computing the square bounds appears below.
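The symmetric square bounds can be computed directly from an n-by-p data matrix X in Matlab; variable names here are illustrative.

  [n, p] = size(X);                % data dimensions
  k = 3;                           % illustrative number of clusters
  lb = min(X(:));                  % smallest value over all points and variables
  ub = max(X(:));                  % largest value over all points and variables
  lower = repmat(lb, 1, k * p);    % lower bounds of the kp-dimensional region
  upper = repmat(ub, 1, k * p);    % upper bounds of the kp-dimensional region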


Code

Some example Matlab code is provided here. No claims are made regarding the efficiency or elegance of this code!


Results

A first set of results on these problem instances (using CMA-ES, Nelder-Mead simplex, random search and k-means) can be found in this paper.


Problem Instance Families

Fisher's Iris Data

One of the most widely used benchmark datasets in clustering is Fisher's Iris dataset [Fisher, 1936]. This dataset has 150 observations of 4 variables, i.e. n=150, p=4. k can be varied: since the data are observations of 3 species/classes of flowers, k=3 is one obvious choice, specifying a 12-dimensional optimization problem. The range of data values over all variables is [0.1,7.9]; therefore, we can define the initial search region as [0.1,7.9]^(4k).

Dataset. In Matlab, the dataset is available via "load fisheriris". Note that this is NOT identical to the UCI Machine Learning Repository version of the dataset [Bezdek et al., 1999].
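For example, a random candidate in the initial search region can be generated and evaluated as follows, assuming the essclust_f sketch given above (load fisheriris creates the 150-by-4 matrix meas).

  load fisheriris                          % creates 'meas' (150-by-4) and 'species'
  X = meas;                                % n = 150, p = 4
  k = 3;                                   % one cluster per species
  x0 = 0.1 + (7.9 - 0.1) * rand(1, 4*k);   % uniform random point in [0.1,7.9]^(4k)
  f0 = essclust_f(x0, X)                   % objective value of this candidate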

Problem Family Specification: <essclust,iris,(feasible bounds),(kp)>

Instances

<essclust,iris,[0.1,7.9]^4,8> (k=2)

<essclust,iris,[0.1,7.9]^4,12> (k=3)

<essclust,iris,[0.1,7.9]^4,16> (k=4)

<essclust,iris,[0.1,7.9]^4,20> (k=5)

<essclust,iris,[0.1,7.9]^4,24> (k=6)

<essclust,iris,[0.1,7.9]^4,28> (k=7)

<essclust,iris,[0.1,7.9]^4,32> (k=8)

<essclust,iris,[0.1,7.9]^4,36> (k=9)

<essclust,iris,[0.1,7.9]^4,40> (k=10)


Ruspini's Data

This dataset has 75 observations of 2 variables, i.e. n=75, p=2. k can be varied. The range of data values over all variables is [4,156]; therefore, we can define the initial search region as [4,156]^(2k).

Dataset (downloaded from http://www.unc.edu/~rls/s754/data/ruspini.txt, 5/12/2013). The dataset is also available in R (cluster package).

Problem Family Specification: <essclust,ruspini,(feasible bounds),(kp)>

Instances

<essclust,ruspini,[4,156]^2,4> (k=2)

<essclust,ruspini,[4,156]^2,6> (k=3)

<essclust,ruspini,[4,156]^2,8> (k=4)

<essclust,ruspini,[4,156]^2,10> (k=5)

<essclust,ruspini,[4,156]^2,12> (k=6)

<essclust,ruspini,[4,156]^2,14> (k=7)

<essclust,ruspini,[4,156]^2,16> (k=8)

<essclust,ruspini,[4,156]^2,18> (k=9)

<essclust,ruspini,[4,156]^2,20> (k=10)


German Towns (Spath) Data

This dataset has 89 observations of 3 variables, i.e. n=89, p=3. k can be varied. The range of data values over all variables is [24.49,1306024]; therefore, we can define the initial search region as [24.49,1306024]^(3k).

Dataset (downloaded from http://people.sc.fsu.edu/~jburkardt/datasets/spaeth2/spaeth2_07.txt, 3/9/2014).

Problem Family Specification: <essclust,germantowns,(feasible bounds),(kp)>

Instances

<essclust,germantowns,[24.49,1306024]^3,6> (k=2)

<essclust,germantowns,[24.49,1306024]^3,9> (k=3)

<essclust,germantowns,[24.49,1306024]^3,12> (k=4)

<essclust,germantowns,[24.49,1306024]^3,15> (k=5)

<essclust,germantowns,[24.49,1306024]^3,18> (k=6)

<essclust,germantowns,[24.49,1306024]^3,21> (k=7)

<essclust,germantowns,[24.49,1306024]^3,24> (k=8)

<essclust,germantowns,[24.49,1306024]^3,27> (k=9)

<essclust,germantowns,[24.49,1306024]^3,30> (k=10)


References

Bagirov, A. M. (2008). Modified global k-means algorithm for minimum sum-of-squares clustering problems. Pattern Recognition, 41(10), 3192-3199.
Bezdek, J. C., Keller, J. M., Krishnapuram, R., Kuncheva, L. I., & Pal, N. R. (1999). Will the real iris data please stand up? IEEE Transactions on Fuzzy Systems, 7(3), 368-369.
Du Merle, O., Hansen, P., Jaumard, B., & Mladenovic, N. (2000). An interior point algorithm for minimum sum-of-squares clustering. SIAM Journal on Scientific Computing, 21(4), 1485-1505.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
Hansen, P., & Mladenovic, N. (2001). J-means: A new local search heuristic for minimum sum of squares clustering. Pattern Recognition, 34(2), 405-413.
Steinley, D. (2006). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59(1), 1-34.
Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3), 645-678.



Last modified: 30/01/17. Marcus Gallagher