opal - C++ implementation of online learning with kernel slicing

developed by Naoki Yoshinaga at Yoshinaga Lab., IIS, University of Tokyo
Skip to [ Features | Download | History | Usage | Performance | Tuning | References ]
Mikkeller Baghaven - Opal

About

Opal is a C++ library of online learning with kernel slicing [1], which enables large-scale training with binary conjunctive features (polynomial kernel). Opal is fast because it reuses temporal partial margins computed in past rounds and truncates margin computations that do not contribute to parameter updates.

Opal is x30-x250 faster than a kernel-based online learner, while retaining its space-efficiency in several NLP tasks. It supports training of linear and kernel-based model (polynomial kernel) with binary features.

If you make use of opal for research or commercial purposes, the reference will be:

N. Yoshinaga and M. Kitsuregawa. Kernel Slicing: Scalable Online Training with Conjunctive Features. Proc. COLING 2010 (oral), pp. 1245--1253. 2010.

Features

The following table summarizes the features of opal variants and pecco in training/testing.

featuresopal [-kp]opal [-k]opal [-p]opalopalkpecco
trainingkernel expansion/splittingyesyesyesyesnon/a
kernel slicingyesyesnononon/a
truncated marginsyesnoyesnonon/a
testingkernel expansion/splittingyesyesyesyesnoyes
kernel slicing (or fstrie)yesyesnononoyes
adaptive classificationyesyesnononono
truncated marginsyesnoyesnonoyes

License (except test directory): GNU GPLv2, LGPLv2.1, and BSD; or e-mail me for other licenses you want.

NOTE: The distribution includes in test directory the labeled examples for a transition-based dependency parser and hyponymy relation identification, all of which are used in the paper.

Download & Setup

> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/opal/opal-latest.tar.gz
> tar zxvf opal-latest.tar.gz
> cd opal-YYYY-MM-DD
> configure
> make install

For Mac OS X users: try port opal-ml via MacPorts (special thanks to @hjym_u).

In the paper, PA-I SL, PA-I SL*, PA-I SP and PA-I KERNEL refer to opal [-k -p], opal [-p], opal, and opalk, respectively.

Requirements

ToDo

History

Usage

command-line options

Typing ./opal --help shows the following usage information; test/ includes training and testing data used in the paper [1].

opal - online learning with kernel slicing
Copyright (c) 2009-2013 Naoki Yoshinaga, All rights reserved.

Usage: src/opal [options] train model test

train   training file       set '-' to skip training
model   model file          set '-' to training/test w/o saving a model
test    test file           set '-' to skip testing

Optional parameters in training and testing:
  -t, --kernel=TYPE         select type of kernel function
                            * 0 - linear     (w^T * x)
                              1 - polynomial (s^T * x + 1)^d
  -d, --kernel-degree=INT   parameter d in polynomial kernel (0)
      --kernel-splitN=INT   # common features in kernel splitting (0)
      --max-trie-size=INT   adjust # common features to fit the feature trie
                            within RAM size (MiB) (32)
  -p, --pruning-margin      terminate margin computation if unnecessarily
  -k, --kernel-slicing      perform kernel slicing
  -O, --output=TYPE         select output type of testing
                             * 0 - report accuracy
                               1 - report accuracy per iteration
                               2 - labels
                               3 - labels and margins
                               4 - labels and probabilities
  -o, --output!=TYPE        output examples with labels/margins/probabilities

Optional parameters in training:
  -l, --learner=TYPE        select learning algorithm
                              0 - Perceptron
                              1 - Passive Aggressive    (PA)
                            * 2 - Passive Aggressive I  (PA-I)
                              3 - Passive Aggressive II (PA-II)
  -c, --reg-cost=FLOAT      PA-I/-II aggressiveness parameter C (1.0)
                            ([i * avg. k(x,x)]^-1 if C is set to 0)
  -i, --iteration=INT       # iterations (10)
  -a, --averaging           average parameters
  -s, --shuffling           shuffle training examples on RAM
  -b, --buffer=TYPE         select type of buffer to read examples
                            * 0 - Use RAM to load examples
                              1 - Use DISK to cache examples
                              2 - Do not cache examples
      --model0=FILE         re-train a model trained w/ opal ('-')
      --feat-threshold=INT  thresholding features by frequency (1)
  -M, --max-examples=INT    max. # examples used in training (0: all)
  -P, --probability-output  perform sigmoid fitting to output probabilities

Misc.:
  -h, --help               show this help and exit

file format for train / test

(BNF-like notation)
<class>   := +1 | -1 (or any string starting w/o '#' in multi-class classification)
<feature> := integer // >=1
<value>   := string (w/o ' ' and '\t') // value is ignored (always assumed to be 1)
<line>    := <class> <feature>:<value> <feature>:<value> ... <feature>:<value> | # ...

Refer to test/*.{train,dev,test}. As in SVMlight, feature/value pairs must be ordered by increasing feature number (values are ignored and assumed to be 1).

If you want to use a string feature, configure --enable-string-feature then you can use any string (without spaces and ':') as a feature (and you can omit a feature value). Namely,

(BNF-like notation)
<class>   := +1 | -1 (or any string starting w/o '#' in multi-class classification)
<feature> := string (w/o ' ', '\t' and ':')
<value>   := string (w/o ' ' and '\t') // value is ignored (always assumed to be 1)
<line>    := <class> <feature>[:<value>] <feature>[:<value>] ... <feature>[:<value>] | # ...

example

# linear training w/ Averaged Perceptron [-l 0, -a] with shuffling examples
> opal -t 0 -l 0 -c 1.0 -i 5 -as train model -

# linear training and testing w/ Passive Aggressive-I w/o saving model
> opal -t 0 -c 0.005 -i 20 train - test

# kernel-based training and testing (polynomial kernel of d=2) with Passive Aggressive-I
#  (event on Disk; shuffling ignored)
> opal -t 1 -d 2 -i 20 -c 0.0005 -a -b 1 train model test

# kernel-based testing
#  (only combinations among 500 common features are explicitly considered)
> opal --kernel-splitN 500 - model test

NOTE:

Performance comparison

The following tables show a comparison among linear learners (opal; ver. in 2011, oll 0.2, AROW++ 0.05, libarow 0.2.0, Vowpal Wabbit 4.1, SVMlin 1.0 and LIBLINEAR 1.7) and a comparison among kernel-based learners (opal, LIBLINEAR-POLY2 1.7 ([-d 2] only) and TinySVM 0.09), and LIBSVM 3.0: for LIBLINEAR, results with the fastest algorithm and the algorithm that obtained the most accurate model are shown. pecco (FST [-t 1]) is also used to test the models obtained with opal ([-d 2,3]; fstrie is build from the training data). The experiments were conducted on Mac OS X 10.5 over Intel Xeon E5462 3.2Ghz CPU with 32GB main memory. Only oll uses 4-byte float as floating-point numbers (others use 8-byte double). AROW++ and libarow cannot output margins of examples (others (can) do). opal, LIBLINEAR-POLY2 (modified to handle a large number of conjunctive features; int->int64_t) and TinySVM (modified for the experiments in the paper) and LIBSVM (add -m64, to make speed and memory comparable to 64-bit TinySVM) adopt 64-bit addressing, while others adopt 32bit addressing. These differences may cause 10-20% difference in time, and at most 100% difference in space (32bit v.s. 64bit). opal with [-b 1] and Vowpal Wabbit do not load events on memory; this leads to slower but memory-efficient training (demand additional Disk space for caching data, though); in the experiments, examples for these learners were offline shuffled to make the accuracy of the obtained models comparable to the others.

We have tried parameters # iters = 1, 5, 10, 20 (online learners only), and c/r = 100.0, 50.0, ..., 0.000005, 0.000001; the table lists the model that achieved the best accuracy on the development set. Due to intricate parameters related to learning rate in SGD, we have only tried l= 1.0, 0.5, 0.2, .., 0.02, 0.01 for Vowpal Wabbit. The gettimeofday command is used to measure the run-time (namely, the whole training or testing time).

Dependency Parsing (DEP in the paper, opal/test/tl.*, # training / test examples = 296,776 / 113,716)

Linear training:

SoftwareAlgorithmd#
iters
c (r), lAcc.
(%)
Mem
[MiB]
Training
[s]
Test
[s]
#
SVs
opalP [-l 0 -as]-10-90.7649.91.10.10-
opalPA-I [-l 2 -s]-200.00500090.9046.11.70.10-
opalPA-I [-l 2 (-s) -b 1]-200.00500090.908.62.20.10-
opalPA-I [-l 2 -as]-200.01000091.2149.92.00.10-
ollAP-5-90.82126.817.36.08-
ollPA-I-200.00500091.06125.518.36.07-
ollCW-10.00005088.83126.817.26.13-
AROW++AROW-550.00000090.41186.716.95.33-
libarowAROW-10100.00000090.48149.24.9-
Vowpal WabbitSGD-200.50000091.0121.4161.64.11-
SVMlinL2-SVM-MFN [-A 1]-80.10000090.94132.3126.31.24-
LIBLINEARL1-reg L2-SVM [-s 5]-190.10000090.84219.14.50.65-
LIBLINEARL2-reg L1-SVM (dual) [-s 3]-2850.50000091.10114.412.70.6877700

Kernel-based training (polynomial kernel):

SoftwareAlgorithmd#
iters
cAcc.
(%)
Mem
[MiB]
Training
[s]
Test
[s]
#
SVs
opal (+ pecco)P [-l 0 --kernel-splitN 2000 -p -as]25-92.9472.13.50.72 (0.32)59427
opal (+ pecco)PA-I [-l 2 --kernel-splitN 2000 -p -s]2200.00010093.0379.713.50.90 (0.35)94820
opal (+ pecco)PA-I [-l 2 --kernel-splitN 4000 -p -as]2200.00050093.2084.614.11.19 (0.35)83867
LIBLINEAR-POLY2L2-reg L1-SVM (dual) [-s 3]2660.00500093.2416013.8422.8310.2766056
TinySVM (+ pecco)SVM24388470.00500093.24246.430040.31739.92 (0.33)65158
LIBSVM (+ pecco)SVM22619160.00500093.24266.723302.21694.98 (0.33)65104
opal (+ pecco)P [-l 0 --kernel-splitN 250 -kp -as]35-93.07151.216.12.16 (1.17)57295
opal (+ pecco)PA-I [-l 2 --kernel-splitN 250 -kp -s]3200.00000593.32164.184.62.89 (1.29)89651
opal (+ pecco)PA-I [-l 2 --kernel-splitN 250 -kp -as]3200.00001093.36178.980.72.83 (1.26)85863
TinySVM (+ pecco)SVM33062910.00010093.38244.826048.61896.05 (1.19)68530
LIBSVM (+ pecco)SVM32027410.00010093.38266.122457.01785.52 (1.20)68499

NOTE: the above accuracy is computed against the test examples, and is not directly comparable to those in the paper, which shows dependency accuracy.

Hyponymy Relation Identification (REL in the paper, opal/test/wiki.*, # training / test examples = 201,664 / 10,000)

Linear training:

SoftwareAlgorithmd#
iters
c (r), lAcc.
(%)
Mem
[MiB]
Training
[s]
Test
[s]
#
SVs
opalP [-l 0 -as]-10-91.7422.40.60.02-
opalPA-I [-l 2 -s]-200.010092.0820.71.00.02-
opalPA-I [-l 2 (-s) -b 1]-200.010092.084.41.20.02-
opalPA-I [-l 2 -as]-200.050091.7422.91.20.02-
ollAP-20-91.2040.57.50.34-
ollPA-I-200.010091.8939.47.50.35-
ollCW-10.500091.0240.66.60.35-
AROW++AROW-2050.000091.6059.99.30.30-
libarowAROW-1050.000091.6268.42.0-
Vowpal WabbitSGD-201.000091.6521.8110.80.38-
SVMlinL2-SVM-MFN [-A 1]-110.500091.8461.8104.20.17-
LIBLINEARL2-reg L2-SVM (dual) [-s 1]-210.100091.7249.72.80.25110084
LIBLINEARL2-reg L2-SVM (primal) [-s 2]-400.500091.7665.833.30.33-

Kernel-based training (polynomial kernel):

SoftwareAlgorithmd#
iters
cAcc.
(%)
Mem
[MiB]
Training
[s]
Test
[s]
#
SVs
opal (+ pecco)P [-l 0 --kernel-splitN 2000 -p -as]210-92.7856.54.00.30 (0.15)42261
opal (+ pecco)PA-I [-l 2 --kernel-splitN 2000 -p -s]2200.001092.8964.614.80.53 (0.23)81429
opal (+ pecco)PA-I [-l 2 --kernel-splitN 2000 -p -as]2200.005092.9362.211.90.46 (0.20)67919
TinySVM (+ pecco)SVM22580810.010092.85139.812378.359.60 (0.22)77151
LIBSVM (+ pecco)SVM22059930.010092.85175.016695.3110.14 (0.22)77240
opal (+ pecco)P [-l 0 --kernel-splitN 125 -kp -as]310-92.89113.56.20.46 (0.29)35902
opal (+ pecco)PA-I [-l 2 --kernel-splitN 250 -kp -s]3200.000592.99152.711.60.64 (0.35)56772
opal (+ pecco)PA-I [-l 2 --kernel-splitN 250 -kp -as]3200.000593.07153.111.80.65 (0.36)56772
TinySVM (+ pecco)SVM33517580.010093.09140.817403.379.88 (0.50)91873
LIBSVM (+ pecco)SVM32398810.010093.09174.718082.2129.64 (0.50)92009

NOTE: LIBLINEAR-POLY2 did not work in this dataset (it should require ~400G memory space).

Tuning hyper-parameters

You may want to consider the following optimization strategy for tuning parameter [-c] and the number of iterations [-i] in the default learner PA-I [-l 1]. Keeping other options fixed as default (other than parameters mentioned below), first tune [-c] on development set, and then halve [-c] and double [-i] to conduct a more cautious optimization. Following this simple procedure, You can easily obtain a model whose accuracy is comparable to those trained with batch learning.

Averaging [-a] is must for Perceptron [-l 0] while optional for variants of PA [-l 1,2,3]; however, I still recommend to set [-a] in PA because averaging allows us to reduce the number of iterations needed to reach the best performance. Shuffling [-s] is must if the training examples are sorted by some biased order, otherwise optional (generally, recommended). I usually unset [-a] or [-s] in a task where neighboring examples are closely related with each other (e.g., deterministic dependency parsing); in such a case, shuffling just slows down the training, and the accuracy of the resulting model could (slightly) drop.

Setting [-p] or [-kp] is highly recommended to speed up training with [-t 1 -d 2] or [-t 1 -d 3], respectively. Remember that when you use variants of PA [-l 1,2,3] as a learning algorithm, kernel slicing [-k] will introduce some `negligible' rounding errors (meaning the resulting model is not identical in practice though theoretically the same), while truncating margin computations [-p] is unlikely to suffer from rounding errors (You can usually get the identical model). Never mind, anyway.

Finally, you do not need to worry about tuning parameter [--kernel-splitN]; otherwise the memory footprint will be adjusted according to the value of [-T]. If you plan to perform training several times with the same dataset, you may want to set [--kernel-splitN] to the value finally reached in the first training; this will slightly speed up the training.

Disclaimer

We do not guarantee that the implemented algorithms other than those proposed by us are patent-free; we regarded them to be patent-free simply because their implementations are available as (existing) open-source softwares (otherwise a simple patent look-up). Please be careful when you use this software for commercial use.

How to pronounce `opal'?

Opal was originally meant for online passive-aggressive algorithms; the efficiency attributes to obsolete partial-margins in the active lifespan.

References

  1. N. Yoshinaga and M. Kitsuregawa. Kernel Slicing: Scalable Online Training with Conjunctive Features. Proc. COLING 2010 (oral), pp. 1245--1253. 2010.
  2. Y. Freund and R. E. Schapire. Large Margin Classification using the Perceptron Algorithm. Machine Learning 37(3):277--296, 1999.
  3. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online Passive-Aggressive Algorithms. JMLR 7(Mar):551--585. 2006.
  4. K. Crammer and Y. Singer. Ultraconservative Online Algorithms for Multiclass Problems. JMLR 3(Jan):951--991, 2003.
  5. S. Matsushima, N. Shimizu, K. Yoshida, T. Ninomiya, H. Nakagawa. Exact Passive-Aggressive Algorithm for Multiclass Classification using Support Class. Proc. SDM, pp. 301--314. 2010.
  6. H.-T. Lin, C.-J. Lin, and R. C. Weng. A Note on Platt's Probabilistic Outputs for Support Vector Machines. Journal of Machine Learning 68(3):267 - 276. 2007
  7. S. Yata and M. Tamura and K. Morita and M. Fuketa and J. Aoe. Sequential Insertions and Performance Evaluations for Double-arrays. Proc. the 71st National Convention of IPSJ, pp. 1263--1264. 2009.
  8. R. McDonald, K. Hall, and G. Mann. Distributed Training Strategies for the Structured Perceptron. Proc. NAACL, pp. 456--464. 2010.

Copyright © 2010-2012 Naoki Yoshinaga, All right Reserved.
XHTML 1.1 $ Last modified at Thu May 12 01:43:59 2016 $