yakmo - C++ implementation of robust, efficient alternative k-means clustering

developed by Naoki Yoshinaga at Yoshinaga Lab., IIS, University of Tokyo
Skip to [ Features | Download | History | Usage | Performance | Tuning | References ]

About

Yakmo implements robust, efficient k-means clustering with triangular inequality [1] and smart initialization [2], while supporting alternative clustering outputs [3]. The use of the triangular inequality allows k-means to skip unnecessary distance calculations, while the smart initialization by randomized seeding (k-means++) not only improves solution accuracy but also accelerates the convergence of the algorithm. In addition, you can obtain alternative clusterings via orthogonalization [3].

Features

License (except test directory): GNU GPLv2 and LGPLv2.1; or e-mail me for other licenses you want.

Download & Setup

> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/yakmo/yakmo-latest.tar.gz
> tar zxvf yakmo-latest.tar.gz
> cd yakmo-YYYY-MM-DD
> configure
> make install

For Mac OS X users: try port yakmo via MacPorts (special thanks to @hjym_u).

Requirements

ToDo

History

Usage

command-line options

Typing ./yakmo --help shows the following usage information.

yakmo - yet another k-means via orthogonalization
Copyright (c) 2012 Naoki Yoshinaga, All rights reserved.

Usage: yakmo [options] train model test

train     train file        set '-' to skip clustering
model     model file        set '-' to train/test w/o saving centroids
test      test  file        set '-' to output clustering results of train

Optional parameters in training and testing:
  -t, --dist-type=TYPE      select type of distance function
                            * 0 - Euclidean
  -c, --init-centroid=TYPE  select type of distance function
                              0 - random
                            * 1 - k-means++
  -k, --num-cluster=NUM     k in k-means (3)
  -m, --num-result=NUM      number of alternative results (1)
  -i, --iteration=NUM       maximum number of iterations per clustering (100)
  -n, --normalize           normalize L2-norm of data points
  -O, --output=TYPE         select output type of testing
                            * 0 - no output
                              1 - report assignment per cluster
                              2 - report assignment per item
  -v, --verbose             verbosity level (1)
  -h, --help                show this help and exit

file format for train / test

(BNF-like notation)
<label> := string // without space
<index> := integer (>= 0)
<value> := real   // value will be ignored
<line>  := <label> <index>:<value> <index>:<value> ... <index>:<value>

example

> # prepare test data; here covtype from LIBSVM web site
> wget http://www.csie.ntu.edu.tw/\~cjlin/libsvmtools/datasets/multiclass/covtype.bz2
> bunzip2 covtype.bz2

# running k-means (k=3) and output a mapping from centroid to cluster
> yakmo -k 2 covtype - - -O 2 | head
mode: TRAIN
iter=1 k-means (k=2): ....................done; obj = 1.34189e+12.
5 0
5 0
2 1
2 1
5 0
2 0
5 0
5 0
5 0
5 0

# running k-means (k=2) with verbose messages, while saving centroids
> yakmo -k 2 covtype cents - -v 2
mode: TRAIN
iter=1 k-means (k=2): 
    1: obj = 1.450450e+12; #moved =  78664
    2: obj = 1.384814e+12; #moved =  42378
    3: obj = 1.358518e+12; #moved =  25677
    4: obj = 1.347388e+12; #moved =  15959
    5: obj = 1.343423e+12; #moved =   8875
    6: obj = 1.342279e+12; #moved =   4597
    7: obj = 1.341983e+12; #moved =   2298
    8: obj = 1.341909e+12; #moved =   1145
    9: obj = 1.341891e+12; #moved =    534
   10: obj = 1.341887e+12; #moved =    277
   11: obj = 1.341886e+12; #moved =    141
   12: obj = 1.341886e+12; #moved =     63
   13: obj = 1.341886e+12; #moved =     39
   14: obj = 1.341886e+12; #moved =     22
   15: obj = 1.341886e+12; #moved =     11
   16: obj = 1.341886e+12; #moved =      7
   17: obj = 1.341886e+12; #moved =      3
   18: obj = 1.341886e+12; #moved =      2
   19: obj = 1.341886e+12; #moved =      1
   20: obj = 1.341886e+12; #moved =      0
done.
train     : 1886.2985 ms.

# assigning data points to the closest centroids
> yakmo - cents covtype -O 2 | head
mode: TEST
5 0
5 0
2 1
2 1
5 0
2 0
5 0
5 0
5 0
5 0

Performance Comparison

I don't know which clustering library boasts the clustering speed.

Disclaimer

We do not guarantee that the implemented algorithms other than those proposed by us are patent-free; we regarded them to be patent-free simply because their implementations are available as (existing) open-source softwares (otherwise a simple patent look-up). Please be careful when you use this software for commercial use.

How to pronounce `yakmo'

I usually pronounce yakmo in the same way as Yakumo, a word meaning layered clouds in Japanese.

References

  1. G. Hamerly. Making k-means even faster. Proc. SDM. pp. 130--140. 2010.
  2. D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. Prof. SODA. pp. 1027--1035. 2007
  3. Y. Cui et al. Non-redundant multi-view clustering via orthogonalization. Proc. ICDM, pp. 133--142. 2007.
  4. B. Bahmani et al. Scalable k-means++. Proc. VLDB Endowment, pp. 622--633. 2012.

Copyright © 2012 Naoki Yoshinaga, All right Reserved.
XHTML 1.1 $ Last modified at Thu May 12 01:44:56 2016 $