Euclidean Sum of Squares Clustering Problems
Problem Description |
Code |
Results |
Fisher's Iris Dataset |
Ruspini's Dataset |
German Towns (Spath) Dataset
Problem Description
Clustering (see, e.g. [Steinley 2006, Xu and Wunsch 2005]) is a fundamental problem in data analysis. Given a finite dataset
of n data points located in p-dimensional continuous space together with a second set of k points (aka cluster centers),
the clustering problem is to determine the positions of the cluster centers such that the mean sum of Euclidean distances (L2 Norm) between each data point and its nearest cluster center is minimized:
where
This is an unconstrained, continuous optimization problem of dimensionality kp. A candidate solution vector, x' can be represented by concatenating the p-dimensional coordinates of the cluster centers:
An instance of this problem type is uniquely specified by a dataset and choosing a value of k. For any problem instance,
f=0 being the degenerate case when all data points are located exactly at one of the cluster centers. If the coordinates of each data point are finite, then a boundary (e.g. square) can be defined containing the dataset. This boundary implements an automatic penalisation on f; since no data points occur outside the boundary, f must increase monotonically if one or more cluster centers are moved outside and away from the boundary, with
Note also that the value of f will only change when the "ownership" of one or more data points changes due to a movement of one or more cluster centres. This means that the solution vector can be changed without changing the value of f. Therefore, the fitness landscape consists of a very large number of flat regions and discrete transitions between them. The scale of this structure however depends on the data and k.
For algorithm benchmarking, it is useful to specify an initial search region in which the global optimum is located. For clustering problems, we can define a simple symmetric square boundary as
It may be possible to define a tighter initial search region, e.g. as a rectangular region using the range of the data points in each dimension. In any case these values are part of the problem specification.
Code
Some example Matlab code is provided here. No claims are made regarding the efficiency or elegance of this code!
- fitnessclustsse.m - This function implements the Euclidean sum of squares objective function for the clustering problems. Given: a dataset; value for k (the number of clusters); candidate solution vector. Returns: the objective function value.
- Usage example (requires data file german_postal.mat) - calculate fitness value of the global optimum for k=2:
>> load german_postal;
>> ggg = [722.944520639560e+000 95.9721033505540e+003 165.042584998811e+000 657.585035229536e+000 995.903499354406e+003 2.32154979550654e+003]';
>> fitnessclustsse(ggg,data)
ans =
6.025472220938822e+11
- cmaes_demo.m - A simple script that applies the Matlab implementation of CMA-ES (available here) to a set of problem instances using the German Towns/Postal data.
Results
A first set of results on these problem instances (using CMA-ES, Nelder-Mead simplex, random search and k-means) can be found in this paper:
Problem Instance Families
Fisher's Iris Data
One of the most widely-used benchmark datasets in clustering is Fisher's Iris dataset [Fisher, 1936]. This dataset has 150 observations of 4 variables, i.e. n=150, p=4. k can be varied: since the data are observations of 3 species/classes of flowers, k=3 is one obvious choice, specifying a 12-dimensional optimization problem. The range of data values over all variables is [0.1,7.9] therefore we can define the initial search region as [0.1,7.9]^4k.
Dataset. In Matlab, the dataset is available ("load fisheriris"). Note that this is NOT identical to the UCI Machine Learning Repository version of the dataset [Bezdek et al., 1999].
Problem Family Specification: <essclust,iris,(feasible bounds),(2pk)>
Instances
<essclust,iris,[0.1,7.9]^4,8> (k=2)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 152.347.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 152.347951760358
- Solution vector:
5.00566037735849 3.36981132075472 1.56037735849057 0.290566037735849
6.30103092783505 2.88659793814433 4.95876288659794 1.69587628865979
<essclust,iris,[0.1,7.9]^4,12> (k=3)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 78.8514. This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 78.8514414261460
- Solution vector:
5.006000000000001 3.428000000000000 1.462000000000000 0.246000000000000
5.901612903225806 2.748387096774194 4.393548387096774 1.433870967741936
6.849999999999999 3.073684210526315 5.742105263157895 2.071052631578947
<essclust,iris,[0.1,7.9]^4,16> (k=4)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 57.2284.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 57.2284732142857
- Solution vector:
5.00600000000000 3.42800000000000 1.46200000000000 0.24600000000000
5.53214285714286 2.63571428571429 3.96071428571429 1.22857142857143
6.25250000000000 2.85500000000000 4.81500000000000 1.62500000000000
6.91250000000000 3.10000000000000 5.84687500000000 2.13125000000000
<essclust,iris,[0.1,7.9]^4,20> (k=5)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 46.4461.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 46.4461820512821
- Solution vector:
5.00600000000000 3.42800000000000 1.46200000000000 0.24600000000000
5.50800000000000 2.60000000000000 3.90800000000000 1.20400000000000
6.20769230769231 2.85384615384615 4.74615384615385 1.56410256410256
6.52916666666667 3.05833333333333 5.50833333333333 2.16250000000000
7.47500000000000 3.12500000000000 6.30000000000000 2.05000000000000
<essclust,iris,[0.1,7.9]^4,24> (k=6)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 39.0399.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 39.0399872460872
- Solution vector:
4.70454545454545 3.12272727272727 1.41363636363636 0.20000000000000
5.24285714285714 3.66785714285714 1.50000000000000 0.28214285714286
5.50800000000000 2.60000000000000 3.90800000000000 1.20400000000000
6.20769230769231 2.85384615384615 4.74615384615385 1.56410256410256
6.52916666666667 3.05833333333333 5.50833333333333 2.16250000000000
7.47500000000000 3.12500000000000 6.30000000000000 2.05000000000000
<essclust,iris,[0.1,7.9]^4,28> (k=7)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 34.2982.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 34.2982296650718
- Solution vector:
4.70454545454545 3.12272727272727 1.41363636363636 0.20000000000000
5.24285714285714 3.66785714285714 1.50000000000000 0.28214285714286
5.53214285714286 2.63571428571429 3.96071428571429 1.22857142857143
6.03684210526316 2.70526315789474 5.00000000000000 1.77894736842105
6.44210526315789 2.97894736842105 4.59473684210526 1.43157894736842
6.56818181818182 3.08636363636364 5.53636363636364 2.16363636363636
7.47500000000000 3.12500000000000 6.30000000000000 2.05000000000000
<essclust,iris,[0.1,7.9]^4,32> (k=8)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 29.9889.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 29.9889439507861
- Solution vector:
4.70454545454545 3.12272727272727 1.41363636363636 0.20000000000000
5.24285714285714 2.37142857142857 3.44285714285714 1.02857142857143
5.24285714285714 3.66785714285714 1.50000000000000 0.28214285714286
5.62857142857143 2.72380952380952 4.13333333333333 1.29523809523810
6.03684210526316 2.70526315789474 5.00000000000000 1.77894736842105
6.44210526315790 2.97894736842105 4.59473684210526 1.43157894736842
6.56818181818182 3.08636363636364 5.53636363636364 2.16363636363636
7.47500000000000 3.12500000000000 6.30000000000000 2.05000000000000
<essclust,iris,[0.1,7.9]^4,36> (k=9)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 27.7860.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 27.7860924173081
- Solution vector:
4.67894736842105 3.08421052631579 1.37894736842105 0.20000000000000
5.10000000000000 3.51304347826087 1.52608695652174 0.27391304347826
5.24285714285714 2.37142857142857 3.44285714285714 1.02857142857143
5.51250000000000 4.00000000000000 1.47500000000000 0.27500000000000
5.62857142857143 2.72380952380952 4.13333333333333 1.29523809523810
6.03684210526316 2.70526315789474 5.00000000000000 1.77894736842105
6.44210526315789 2.97894736842105 4.59473684210526 1.43157894736842
6.56818181818182 3.08636363636364 5.53636363636364 2.16363636363636
7.47500000000000 3.12500000000000 6.30000000000000 2.05000000000000
<essclust,iris,[0.1,7.9]^4,40> (k=10)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 25.8340.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 25.834054819972511
- Solution vector:
4.67894736842105 3.08421052631579 1.37894736842105 0.20000000000000
5.00000000000000 2.30000000000000 3.27500000000000 1.02500000000000
5.10000000000000 3.51304347826087 1.52608695652174 0.27391304347826
5.51250000000000 4.00000000000000 1.47500000000000 0.27500000000000
5.57647058823529 2.59411764705882 3.98235294117647 1.21764705882353
5.96875000000000 2.86250000000000 4.41250000000000 1.38125000000000
6.02777777777778 2.73333333333333 5.02777777777778 1.79444444444444
6.56818181818182 3.08636363636364 5.53636363636364 2.16363636363636
6.65454545454546 3.04545454545455 4.66363636363636 1.47272727272727
7.47500000000000 3.12500000000000 6.30000000000000 2.05000000000000
Ruspini's Data
This dataset has 75 observations of 2 variables, i.e. n=75, p=2. k can be varied. The range of data values over all variables is [4,156] therefore we can define the initial search region as [4,156]^2k.
Dataset (downloaded from the
http://www.unc.edu/~rls/s754/data/ruspini.txt, 5/12/2013). The dataset is also available in R (cluster package).
Problem Family Specification: <essclust,ruspini,(feasible bounds),(2pk)>
Instances
<essclust,ruspini,[4,156]^2,4> (k=2)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 89337.8.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 89337.832142857
- Solution vector:
41.0571428571429 45.4285714285714
66.9750000000000 132.800000000000
<essclust,ruspini,[4,156]^2,6> (k=3)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 51063.4.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 51063.4750456705
- Solution vector:
41.0571428571429 45.4285714285714
43.9130434782609 146.043478260870
98.1764705882353 114.882352941176
<essclust,ruspini,[4,156]^2,8> (k=4)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 12881.0.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 12881.0512361466
- Solution vector:
20.1500000000000 64.9500000000000
43.9130434782609 146.043478260870
68.9333333333333 19.4000000000000
98.1764705882353 114.882352941176
<essclust,ruspini,[4,156]^2,10> (k=5)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 10126.7.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 10126.7197881828
- Solution vector:
20.1500000000000 64.9500000000000
43.9130434782609 146.043478260870
68.9333333333333 19.4000000000000
80.5000000000000 100.250000000000
103.615384615385 119.384615384615
<essclust,ruspini,[4,156]^2,12> (k=6)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 8575.40.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 8575.40687645688
- Solution vector:
20.1500000000000 64.9500000000000
36.4166666666667 148.416666666667
52.0909090909091 143.454545454545
68.9333333333333 19.4000000000000
80.5000000000000 100.250000000000
103.615384615385 119.384615384615
<essclust,ruspini,[4,156]^2,14> (k=7)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 7126.19.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 7126.19854312354
- Solution vector:
20.0833333333333 58.0000000000000
20.2500000000000 75.3750000000000
36.4166666666667 148.416666666667
52.0909090909091 143.454545454545
68.9333333333333 19.4000000000000
80.5000000000000 100.250000000000
103.615384615385 119.384615384615
<essclust,ruspini,[4,156]^2,16> (k=8)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 6149.63.
This is likely to be the following solution (obtained with an implementation of CMA-ES):
- Objective function value: 6149.639019314019
- Solution vector:
9.99999999539073 57.0000000911560
18.1250000429247 74.5000000049416
29.7142856868073 59.7142857449624
36.4166666574633 148.416666664583
52.0909090563180 143.454545495861
68.9333333537336 19.3999999830322
80.5000000304624 100.249999853326
103.615384611450 119.384615391991
<essclust,ruspini,[4,156]^2,18> (k=9)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 5181.65.
This is likely to be the following solution (obtained with an implementation of CMA-ES):
- Objective function value: 5181.65183982684
- Solution vector:
9.99999995794373 57.0000000963804
18.1250000324401 74.4999999844186
29.7142856770504 59.7142857404662
36.4166666225643 148.416666701130
52.0909090595495 143.454545467191
68.9333332890974 19.3999999925601
79.0000000385553 95.3333332573537
95.3750000188838 121.375000071440
111.500000053734 116.000000093298
<essclust,ruspini,[4,156]^2,20> (k=10)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 4446.28.
This is likely to be the following solution (obtained with an implementation of CMA-ES):
- Objective function value: 4446.28214285714
- Solution vector:
9.99999998451028 57.0000001197382
18.1249999519795 74.4999999628034
48.3999999868827 148.400000017397
111.499999946052 115.999999979363
29.7142856779033 59.7142857268467
78.9999999811306 95.3333332546766
95.3750000125771 121.375000025982
34.8999999082685 147.600000013715
68.9333334001723 19.3999999926389
58.9999997698014 133.000000271693
German Towns (Spath) Data
This dataset has 89 observations of 3 variables, i.e. n=89, p=3. k can be varied. The range of data values over all variables is [24.49,1306024] therefore we can define the initial search region as [24.49,1306024]^3k.
Dataset (downloaded from
http://people.sc.fsu.edu/~jburkardt/datasets/spaeth2/spaeth2_07.txt, 3/9/2014).
Problem Family Specification: <essclust,germantowns,(feasible bounds),(2pk)>
Instances
<essclust,germantowns,[24.49,1306024]^3,6> (k=2)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 6.02546e11.
This is likely to be the following solution (obtained with an implementation of CMA-ES):
- Objective function value: 6.025472220938822e+11
- Solution vector:
657.585035229536e+000 995.903499354406e+003 2.32154979550654e+003
722.944520639560e+000 95.9721033505540e+003 165.042584998811e+000
<essclust,germantowns,[24.49,1306024]^3,9> (k=3)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 2.94506e11.
This is likely to be the following solution (obtained with an implementation of CMA-ES):
- Objective function value: 2.94506562778027e+11
- Solution vector:
330.929873001869e+000 1.30602400169001e+006 3.94649962550106e+003
713.780501415945e+000 91.6535581591426e+003 163.365137450846e+000
1.24763537265257e+003 576.575000521688e+003 502.949702896911e+000
<essclust,germantowns,[24.49,1306024]^3,12> (k=4)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 1.04474e11.
This is likely to be the following solution (obtained with an implementation of CMA-ES):
- Objective function value: 1.04474664100716e11
- Solution vector:
330.929743011785e+000 1.30602399951445e+006 3.94650004312708e+003
541.779866179642e+000 58.1246140071003e+003 156.899977893482e+000
1.05185106050775e+003 157.555275831868e+003 176.072345729327e+000
1.24763528729838e+003 576.574999890473e+003 502.949839416747e+000
<essclust,germantowns,[24.49,1306024]^3,15> (k=5)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 5.97615e10.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 5.97615267205269e10
- Solution vector:
330.929613274889e+000 1.30602399950604e+006 3.94649970137045e+003
495.780501388671e+000 51.4227550504051e+003 130.410264882195e+000
916.127773288099e+000 124.007071440605e+003 213.546517330764e+000
1.24763516914242e+003 576.575000286366e+003 502.950243522078e+000
1.27114572851323e+003 210.032555480127e+003 186.666483259584e+000
<essclust,germantowns,[24.49,1306024]^3,18> (k=6)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 3.59085e10.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 3.59085384380298e10
- Solution vector:
330.929200998046e+000 1.30602399948261e+006 3.94650038455577e+003
495.780409907402e+000 51.4227551216804e+003 130.410188360083e+000
916.127934385819e+000 124.007071422910e+003 213.546346185247e+000
984.240634772259e+000 685.782999989853e+003 696.598957973432e+000
1.27114552655582e+003 210.032555453881e+003 186.666559311675e+000
1.51102943023813e+003 467.367000019818e+003 309.300260613161e+000
<essclust,germantowns,[24.49,1306024]^3,21> (k=7)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 2.19832e10.
This is likely to be the following solution (obtained with an implementation of k-means):
- Objective function value: 2.19832076153985e10
- Solution vector:
330.930708580577e+000 1.30602399953804e+006 3.94649967136739e+003
371.017004507240e+000 34.4121739158693e+003 141.343474666540e+000
607.209020766809e+000 70.1958666486647e+003 175.816655078203e+000
966.464957678033e+000 128.939875029566e+003 160.166594427141e+000
984.239249144631e+000 685.782999690448e+003 696.599796202028e+000
1.27114552974574e+003 210.032555399712e+003 186.666779664602e+000
1.51103013762400e+003 467.367000014776e+003 309.300342958735e+000
<essclust,germantowns,[24.49,1306024]^3,24> (k=8)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 1.33854e10.
This is likely to be the following solution (obtained with an R clustering package (thanks Michael Banfield):
- Objective function value: 1.33854150525813e+10
- Solution vector:
330.930000000000e+000 1.30602400000000e+006 3.94650000000000e+003
363.336363636364e+000 33.6483636363636e+003 143.459090909091e+000
603.697777777778e+000 65.9055925925926e+003 119.777777777778e+000
861.246000000000e+000 107.830400000000e+003 235.840000000000e+000
984.240000000000e+000 685.783000000000e+003 696.600000000000e+000
1.04083400000000e+003 146.932666666667e+003 179.380000000000e+000
1.22295571428571e+003 220.149285714286e+003 204.428571428571e+000
1.51103000000000e+003 467.367000000000e+003 309.300000000000e+000
<essclust,germantowns,[24.49,1306024]^3,27> (k=9)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 7.80442e9. However subsequent papers in the literature (e.g. [Hansen et al., 2001; Bagirov 2008] use the value 8.42374e9. We assume this new value is the correct one (but if anyone can find a solution with the original value please let us know!)
This is likely to be the following solution (obtained with CMA-ES):
- Objective function value: 8.423750573163314e+09
- Solution vector:
330.930024284082e+000 1.30602399997975e+006 3.94650004300091e+003
363.336349369481e+000 33.6483636658907e+003 143.459087237367e+000
603.697721603636e+000 65.9055926171194e+003 119.777835290498e+000
861.246060122426e+000 107.830400005495e+003 235.839999706162e+000
979.453074850298e+000 142.672461498681e+003 187.823090154207e+000
984.239855488777e+000 685.783000096644e+003 696.599980272967e+000
1.24088206632887e+003 186.151799958002e+003 170.839920559014e+000
1.30897497157283e+003 239.883499934192e+003 206.450087966198e+000
1.51103000771089e+003 467.367000004925e+003 309.299928740850e+000
<essclust,germantowns,[24.49,1306024]^3,30> (k=10)
- Global Optimum: the value of the global optimum is reported in [Du Merle et al., 2000] as 6.44647e9.
This is likely to be the following solution (obtained with an R clustering package (thanks Michael Banfield):
- Objective function value: 6.44648364287013e+009
- Solution vector:
330.930000000000e+000 1.30602400000000e+006 3.94650000000000e+003
338.591000000000e+000 32.3351500000000e+003 150.105000000000e+000
534.996666666667e+000 56.2522666666667e+003 114.873333333333e+000
696.831875000000e+000 75.7002500000000e+003 119.662500000000e+000
866.603846153846e+000 110.421538461538e+003 252.915384615385e+000
979.453076923077e+000 142.672461538462e+003 187.823076923077e+000
984.240000000000e+000 685.783000000000e+003 696.600000000000e+000
1.24088200000000e+003 186.151800000000e+003 170.840000000000e+000
1.30897500000000e+003 239.883500000000e+003 206.450000000000e+000
1.51103000000000e+003 467.367000000000e+003 309.300000000000e+000
References
-
Bagirov, A.M. (2008). Modified global k-means algorithm for minimum sum-of-squares clustering problems. Pattern Recognition 41(10), 3192-3199.
-
Bezdek, J. C. and Keller, J. M. and Krishnapuram, R. and Kuncheva, L. I. and Pal, N. R.
Will the real iris data please stand up?,
IEEE Transactions on Fuzzy Systems 7(3):368-369, 1999.
-
Du Merle, O., Hansen, P., Jaumard, B., & Mladenovic, N. (2000). An interior point algorithm for minimum sum-of-squares clustering. SIAM Journal on Scientific Computing, 21(4), 1485-1505.
-
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2), 179-188.
-
Hansen, P., Mladenovic, N. (2001). J-means: a new local search heuristic for minimum sum of squares clustering. Pattern recognition 34(2), 405-413.
-
Steinley, D. (2006). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59(1), 1-34.
-
Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. Neural Networks, IEEE Transactions on, 16(3), 645-678.
Up: Continuous Real-World-Representative Problem Repository
Last modified: 30/01/17.
Marcus Gallagher