Yakmo implements robust, efficient k-means clustering with the triangle inequality [1] and smart initialization [2], while supporting alternative clustering outputs [3]. The use of the triangle inequality allows k-means to skip unnecessary distance calculations, while the smart initialization by randomized seeding (k-means++) not only improves solution accuracy but also accelerates the convergence of the algorithm. In addition, you can obtain alternative clusterings via orthogonalization [3].
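To make the seeding step concrete, here is a minimal, self-contained C++ sketch of k-means++ (D^2-weighted) seeding on small dense vectors. It is an illustration only, not yakmo's actual code (which works on the sparse feature vectors described below and combines this seeding with triangle-inequality pruning inside the Lloyd iterations); the file name, function names, and toy data are made up for this sketch.

// kmeans_pp_seed.cc -- illustrative sketch only; names and data are made up.
#include <cstdio>
#include <cstddef>
#include <vector>
#include <random>

typedef std::vector<double> point_t;

// squared Euclidean distance between two dense points of equal dimension
static double sq_dist (const point_t& a, const point_t& b) {
  double d = 0.0;
  for (size_t i = 0; i < a.size (); ++i)
    d += (a[i] - b[i]) * (a[i] - b[i]);
  return d;
}

// k-means++ seeding: pick the first centroid uniformly at random, then pick
// each remaining centroid with probability proportional to the squared
// distance to its nearest already-chosen centroid (D^2 weighting).
static std::vector<point_t> kmeanspp_seed (const std::vector<point_t>& x,
                                           int k, std::mt19937& gen) {
  std::vector<point_t> cent;
  std::uniform_int_distribution<size_t> uni (0, x.size () - 1);
  cent.push_back (x[uni (gen)]);
  std::vector<double> d2 (x.size (), 0.0); // squared distance to nearest centroid
  for (int j = 1; j < k; ++j) {
    for (size_t i = 0; i < x.size (); ++i) {
      const double d = sq_dist (x[i], cent.back ());
      if (j == 1 || d < d2[i]) d2[i] = d;
    }
    // D^2-weighted sampling of the next centroid
    std::discrete_distribution<size_t> pick (d2.begin (), d2.end ());
    cent.push_back (x[pick (gen)]);
  }
  return cent;
}

int main () {
  std::mt19937 gen (42);
  // three well-separated groups of 2-d points (toy data)
  const std::vector<point_t> x = { {0,0}, {0,1}, {10,10}, {10,11}, {20,0}, {21,0} };
  const std::vector<point_t> cent = kmeanspp_seed (x, 3, gen);
  for (size_t j = 0; j < cent.size (); ++j)
    std::printf ("centroid %zu: (%g, %g)\n", j, cent[j][0], cent[j][1]);
  return 0;
}

Compiling this with a C++11 compiler (e.g. g++ -std=c++11) and running it prints the three chosen seed centroids; with D^2 weighting they tend to land in different groups of the toy data, which is exactly why k-means++ converges faster and to better local optima than uniform random seeding.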
[-n]
: Normalizing the L2-norm of each input to 1 makes the squared Euclidean distance L_2 = |x-c|^2 (x: data point, c: centroid) close to the complement of cosine similarity: L_2 = |x-c|^2 = 1 + |c|^2 - 2|c| cos θ, where |c|^2 = 1/N + (N-1)/N * cos θ_c (N: # points assigned to c, cos θ_c: average cosine between points assigned to c); see the derivation just below.
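This relation can be checked with a two-line derivation, assuming (as the option implies) that every data point has unit L2-norm and that c is the mean of the N unit vectors x_1, ..., x_N assigned to it:

\[
\|x - c\|^2 = \|x\|^2 + \|c\|^2 - 2\, x \cdot c = 1 + \|c\|^2 - 2 \|c\| \cos\theta ,
\]
\[
\|c\|^2 = \Bigl\| \frac{1}{N} \sum_{i=1}^{N} x_i \Bigr\|^2
        = \frac{1}{N^2} \Bigl( N + \sum_{i \ne j} x_i \cdot x_j \Bigr)
        = \frac{1}{N} + \frac{N-1}{N}\, \overline{\cos\theta_c} .
\]

Hence, for unit-normalized data, minimizing the Euclidean objective roughly amounts to maximizing the cosine similarity between each data point and its centroid.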
[-O 0,1,2]
: You can get the clustering output either as a mapping from each cluster id to the labels of its data points ([-O 1]) or as a mapping from each data point (label) to its cluster id ([-O 2]). Even if you discard the clustering output ([-O 0]), you can recover it later from the saved centroids.

License (except test directory): GNU GPLv2 and LGPLv2.1; or e-mail me for other licenses you want.
> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/yakmo/yakmo-latest.tar.gz
> tar zxvf yakmo-latest.tar.gz
> cd yakmo-YYYY-MM-DD
> ./configure
> make install
For Mac OS X users: try port yakmo via MacPorts (special thanks to @hjym_u).
[-r]
: to have different local optima.

command-line options
Typing ./yakmo --help
shows the following usage information.
yakmo - yet another k-means via orthogonalization
Copyright (c) 2012 Naoki Yoshinaga, All rights reserved.
Usage: yakmo [options] train model test
train     train file     set '-' to skip clustering
model     model file     set '-' to train/test w/o saving centroids
test      test file      set '-' to output clustering results of train

Optional parameters in training and testing:
  -t, --dist-type=TYPE      select type of distance function
                            * 0 - Euclidean
  -c, --init-centroid=TYPE  select method of initializing centroids
                              0 - random
                            * 1 - k-means++
  -k, --num-cluster=NUM     k in k-means (3)
  -m, --num-result=NUM      number of alternative results (1)
  -i, --iteration=NUM       maximum number of iterations per clustering (100)
  -n, --normalize           normalize L2-norm of data points
  -O, --output=TYPE         select output type of testing
                            * 0 - no output
                              1 - report assignment per cluster
                              2 - report assignment per item
  -v, --verbose             verbosity level (1)
  -h, --help                show this help and exit
file format for train / test
(BNF-like notation)
<label> := string // without space
<index> := integer (>= 0)
<value> := real number
<line> := <label> <index>:<value> <index>:<value> ... <index>:<value>
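For instance, a toy training (or test) file with three labeled two-dimensional data points could look like the following; the labels a, b and c are arbitrary strings chosen for this illustration and are simply carried through to the clustering output:

a 1:0.5 2:1.0
b 1:0.4 2:0.9
c 1:10.0 2:-3.2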
example
# prepare test data; here covtype from the LIBSVM web site
> wget http://www.csie.ntu.edu.tw/\~cjlin/libsvmtools/datasets/multiclass/covtype.bz2
> bunzip2 covtype.bz2
# running k-means (k=2) and outputting a mapping from each data point to its cluster id
> yakmo -k 2 covtype - - -O 2 | head
mode: TRAIN
iter=1 k-means (k=2): ....................done; obj = 1.34189e+12.
5 0
5 0
2 1
2 1
5 0
2 0
5 0
5 0
5 0
5 0
# running k-means (k=2) with verbose messages, while saving centroids
> yakmo -k 2 covtype cents - -v 2
mode: TRAIN
iter=1 k-means (k=2):
1: obj = 1.450450e+12; #moved = 78664
2: obj = 1.384814e+12; #moved = 42378
3: obj = 1.358518e+12; #moved = 25677
4: obj = 1.347388e+12; #moved = 15959
5: obj = 1.343423e+12; #moved = 8875
6: obj = 1.342279e+12; #moved = 4597
7: obj = 1.341983e+12; #moved = 2298
8: obj = 1.341909e+12; #moved = 1145
9: obj = 1.341891e+12; #moved = 534
10: obj = 1.341887e+12; #moved = 277
11: obj = 1.341886e+12; #moved = 141
12: obj = 1.341886e+12; #moved = 63
13: obj = 1.341886e+12; #moved = 39
14: obj = 1.341886e+12; #moved = 22
15: obj = 1.341886e+12; #moved = 11
16: obj = 1.341886e+12; #moved = 7
17: obj = 1.341886e+12; #moved = 3
18: obj = 1.341886e+12; #moved = 2
19: obj = 1.341886e+12; #moved = 1
20: obj = 1.341886e+12; #moved = 0
done.
train : 1886.2985 ms.
# assigning data points to the closest centroids
> yakmo - cents covtype -O 2 | head
mode: TEST
5 0
5 0
2 1
2 1
5 0
2 0
5 0
5 0
5 0
5 0
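The alternative clusterings via orthogonalization mentioned at the beginning are requested with -m (number of alternative results). For example, an invocation along the following lines (not part of the transcripts above; output omitted) would compute three alternative clusterings of covtype and report the per-item assignments selected by -O 2:

> yakmo -k 2 -m 3 covtype - - -O 2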
I do not know which clustering library offers the fastest clustering speed.
We do not guarantee that the implemented algorithms, other than those proposed by us, are patent-free; we regarded them as patent-free simply because their implementations are available as existing open-source software (or after a simple patent look-up). Please be careful when using this software commercially.
I usually pronounce yakmo in the same way as Yakumo, a word meaning layered clouds in Japanese.