Yakmo implements robust, efficient k-means clustering with the triangle inequality [1] and smart initialization [2], while supporting alternative clustering outputs [3]. The use of the triangle inequality allows k-means to skip unnecessary distance calculations, while the smart initialization by randomized seeding (k-means++) not only improves solution accuracy but also accelerates the convergence of the algorithm. In addition, you can obtain alternative clusterings via orthogonalization [3].
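To make the seeding step concrete, here is a minimal, self-contained C++ sketch of k-means++ (D^2-weighted) seeding on small dense vectors. It is an illustration only, not yakmo's actual code (which works on the sparse feature vectors described below and combines this seeding with triangle-inequality pruning inside the Lloyd iterations); the file name, function names, and toy data are made up for this sketch.

// kmeans_pp_seed.cc -- illustrative sketch only; names and data are made up.
#include <cstdio>
#include <cstddef>
#include <vector>
#include <random>

typedef std::vector<double> point_t;

// squared Euclidean distance between two dense points of equal dimension
static double sq_dist (const point_t& a, const point_t& b) {
  double d = 0.0;
  for (size_t i = 0; i < a.size (); ++i)
    d += (a[i] - b[i]) * (a[i] - b[i]);
  return d;
}

// k-means++ seeding: pick the first centroid uniformly at random, then pick
// each remaining centroid with probability proportional to the squared
// distance to its nearest already-chosen centroid (D^2 weighting).
static std::vector<point_t> kmeanspp_seed (const std::vector<point_t>& x,
                                           int k, std::mt19937& gen) {
  std::vector<point_t> cent;
  std::uniform_int_distribution<size_t> uni (0, x.size () - 1);
  cent.push_back (x[uni (gen)]);
  std::vector<double> d2 (x.size (), 0.0); // squared distance to nearest centroid
  for (int j = 1; j < k; ++j) {
    for (size_t i = 0; i < x.size (); ++i) {
      const double d = sq_dist (x[i], cent.back ());
      if (j == 1 || d < d2[i]) d2[i] = d;
    }
    // D^2-weighted sampling of the next centroid
    std::discrete_distribution<size_t> pick (d2.begin (), d2.end ());
    cent.push_back (x[pick (gen)]);
  }
  return cent;
}

int main () {
  std::mt19937 gen (42);
  // three well-separated groups of 2-d points (toy data)
  const std::vector<point_t> x = { {0,0}, {0,1}, {10,10}, {10,11}, {20,0}, {21,0} };
  const std::vector<point_t> cent = kmeanspp_seed (x, 3, gen);
  for (size_t j = 0; j < cent.size (); ++j)
    std::printf ("centroid %zu: (%g, %g)\n", j, cent[j][0], cent[j][1]);
  return 0;
}

Compiling this with a C++11 compiler (e.g. g++ -std=c++11) and running it prints the three chosen seed centroids; with D^2 weighting they tend to land in different groups of the toy data, which is exactly why k-means++ converges faster and to better local optima than uniform random seeding.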
[-n]
: Normalizing the L2-norm of each input to 1 makes the squared Euclidean distance L_2 = |x-c|^2 (x: data point, c: centroid) close to the complement of cosine similarity: L_2 = |x-c|^2 = 1 + |c|^2 - 2|c| cos θ, where |c|^2 = 1/N + (N-1)/N * cos θ_c (N: # points assigned to c, cos θ_c: average cosine between points assigned to c); see the derivation just below.
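This relation can be checked with a two-line derivation, assuming (as the option implies) that every data point has unit L2-norm and that c is the mean of the N unit vectors x_1, ..., x_N assigned to it:

\[
\|x - c\|^2 = \|x\|^2 + \|c\|^2 - 2\, x \cdot c = 1 + \|c\|^2 - 2 \|c\| \cos\theta ,
\]
\[
\|c\|^2 = \Bigl\| \frac{1}{N} \sum_{i=1}^{N} x_i \Bigr\|^2
        = \frac{1}{N^2} \Bigl( N + \sum_{i \ne j} x_i \cdot x_j \Bigr)
        = \frac{1}{N} + \frac{N-1}{N}\, \overline{\cos\theta_c} .
\]

Hence, for unit-normalized data, minimizing the Euclidean objective roughly amounts to maximizing the cosine similarity between each data point and its centroid.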
[-O 0,1,2]
: You can get the clustering output either as a mapping from each cluster id to the labels of its data points ([-O 1]) or as a mapping from each data point (label) to its cluster id ([-O 2]). Even if you discard the clustering output ([-O 0]), you can recover it later from the saved centroids.

License (except test directory): GNU GPLv2 and LGPLv2.1; or e-mail me for other licenses you want.
> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/yakmo/yakmo-latest.tar.gz
> tar zxvf yakmo-latest.tar.gz
> cd yakmo-YYYY-MM-DD
> ./configure
> make install
For Mac OS X users: try port yakmo via MacPorts (special thanks to @hjym_u).
[-r]
: to have different local optima.

command-line options
Typing ./yakmo --help
shows the following usage information.
yakmo - yet another k-means via orthogonalization
Copyright (c) 2012 Naoki Yoshinaga, All rights reserved.
Usage: yakmo [options] train model test
train     train file     set '-' to skip clustering
model     model file     set '-' to train/test w/o saving centroids
test      test file      set '-' to output clustering results of train

Optional parameters in training and testing:
  -t, --dist-type=TYPE      select type of distance function
                            * 0 - Euclidean
  -c, --init-centroid=TYPE  select method of initializing centroids
                              0 - random
                            * 1 - k-means++
  -k, --num-cluster=NUM     k in k-means (3)
  -m, --num-result=NUM      number of alternative results (1)
  -i, --iteration=NUM       maximum number of iterations per clustering (100)
  -n, --normalize           normalize L2-norm of data points
  -O, --output=TYPE         select output type of testing
                            * 0 - no output
                              1 - report assignment per cluster
                              2 - report assignment per item
  -v, --verbose             verbosity level (1)
  -h, --help                show this help and exit
file format for train / test
(BNF-like notation)
<label> := string // without space
<index> := integer (>= 0)
<value> := real number
<line> := <label> <index>:<value> <index>:<value> ... <index>:<value>
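For instance, a toy training (or test) file with three labeled two-dimensional data points could look like the following; the labels a, b and c are arbitrary strings chosen for this illustration and are simply carried through to the clustering output:

a 1:0.5 2:1.0
b 1:0.4 2:0.9
c 1:10.0 2:-3.2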
example
# prepare test data; here covtype from the LIBSVM web site
> wget http://www.csie.ntu.edu.tw/\~cjlin/libsvmtools/datasets/multiclass/covtype.bz2
> bunzip2 covtype.bz2
# running k-means (k=2) and outputting a mapping from each data point to its cluster id
> yakmo -k 2 covtype - - -O 2 | head
mode: TRAIN
iter=1 k-means (k=2): ....................done; obj = 1.34189e+12.
5 0
5 0
2 1
2 1
5 0
2 0
5 0
5 0
5 0
5 0
# running k-means (k=2) with verbose messages, while saving centroids
> yakmo -k 2 covtype cents - -v 2
mode: TRAIN
iter=1 k-means (k=2):
1: obj = 1.450450e+12; #moved = 78664
2: obj = 1.384814e+12; #moved = 42378
3: obj = 1.358518e+12; #moved = 25677
4: obj = 1.347388e+12; #moved = 15959
5: obj = 1.343423e+12; #moved = 8875
6: obj = 1.342279e+12; #moved = 4597
7: obj = 1.341983e+12; #moved = 2298
8: obj = 1.341909e+12; #moved = 1145
9: obj = 1.341891e+12; #moved = 534
10: obj = 1.341887e+12; #moved = 277
11: obj = 1.341886e+12; #moved = 141
12: obj = 1.341886e+12; #moved = 63
13: obj = 1.341886e+12; #moved = 39
14: obj = 1.341886e+12; #moved = 22
15: obj = 1.341886e+12; #moved = 11
16: obj = 1.341886e+12; #moved = 7
17: obj = 1.341886e+12; #moved = 3
18: obj = 1.341886e+12; #moved = 2
19: obj = 1.341886e+12; #moved = 1
20: obj = 1.341886e+12; #moved = 0
done.
train : 1886.2985 ms.
# assigning data points to the closest centroids
> yakmo - cents covtype -O 2 | head
mode: TEST
5 0
5 0
2 1
2 1
5 0
2 0
5 0
5 0
5 0
5 0
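The alternative clusterings via orthogonalization mentioned at the beginning are requested with -m (number of alternative results). For example, an invocation along the following lines (not part of the transcripts above; output omitted) would compute three alternative clusterings of covtype and report the per-item assignments selected by -O 2:

> yakmo -k 2 -m 3 covtype - - -O 2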
I do not know which clustering library offers the fastest clustering speed.
We do not guarantee that the implemented algorithms, other than those proposed by us, are patent-free; we regarded them as patent-free simply because their implementations are available as existing open-source software (or after a simple patent look-up). Please be careful when using this software commercially.
I usually pronounce yakmo in the same way as Yakumo, a word meaning layered clouds in Japanese.