- kernel-based online learning [-t 1] based on kernel slicing [-kp]: currently, Perceptron [2] [-l 0], passive-aggressive algorithms [PA, PA-I, PA-II] [3] [-l 1,2,3], and their multi-class extensions [4,5] are implemented.
- fast linear training [-t 0]: usually faster than LIBLINEAR, Vowpal Wabbit, oll, or other major OSS.
- adaptive classification ([-k]; optional): though it takes additional memory space (sub-)linear in the test data size, testing with opal (with the [-k] option) becomes faster as it processes more data (I recommend pecco for testing a kernel-based model).
- probability output [-O 4]: Platt's sigmoid fitting is used to convert a margin to a probability (binary classifier only) [6]. Remember to set [-P] to perform the fitting.
- re-training [-m]: you can use a model that has been trained w/ opal as an initial model.
- flexible buffering [-b 0,1,2]: you can buffer training examples in memory (time-efficient) [-b 0], buffer events on disk (space-efficient) [-b 1], or use no buffering (slowest, but no extra resources needed) [-b 2].
- averaging [-a] & shuffling [-s] stabilize the model accuracy; averaged PA trains a model with a larger c value (aggressive update), which requires a smaller number of iterations than PA to obtain the same accuracy.
- string features: configure --enable-string-feature will support string features.
- weight dumping: opal[-multi] - model - will dump weights of features; for a kernel model, it will mine effective conjunctive features.
- a fast (double-array) trie that outperforms std::unordered_map when the data access is skewed; this library is now released as cedar.
- language bindings under swig/.

The following table summarizes the features of opal variants and pecco in training/testing.
| | features | opal [-kp] | opal [-k] | opal [-p] | opal | opalk | pecco |
|---|---|---|---|---|---|---|---|
| training | kernel expansion/splitting | yes | yes | yes | yes | no | n/a |
| | kernel slicing | yes | yes | no | no | no | n/a |
| | truncated margins | yes | no | yes | no | no | n/a |
| testing | kernel expansion/splitting | yes | yes | yes | yes | no | yes |
| | kernel slicing (or fstrie) | yes | yes | no | no | no | yes |
| | adaptive classification | yes | yes | no | no | no | no |
| | truncated margins | yes | no | yes | no | no | yes |
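For reference, the PA-I update performed by the default learner can be sketched in a few lines. This is a simplified sketch with dict-based binary feature vectors, not opal's actual implementation:

```python
def pa1_update(w, x, y, C=1.0):
    """One PA-I step: w maps feature->weight, x maps feature->value, y in {-1,+1}."""
    margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
    loss = max(0.0, 1.0 - margin)           # hinge loss
    if loss > 0.0:
        sq_norm = sum(v * v for v in x.values())
        tau = min(C, loss / sq_norm)        # PA-I clips the step size at C
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + tau * y * v
    return w
```

PA and PA-II differ only in the step size: tau = loss / ||x||^2 for PA, and tau = loss / (||x||^2 + 1/(2C)) for PA-II.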
License (except test directory): GNU GPLv2, LGPLv2.1, and BSD; or e-mail me for other licenses you want.
NOTE: The distribution includes, in the test directory, the labeled examples for a transition-based dependency parser and for hyponymy-relation identification, all of which are used in the paper.
> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/opal/opal-latest.tar.gz
> tar zxvf opal-latest.tar.gz
> cd opal-YYYY-MM-DD
> ./configure
> make install
For Mac OS X users: try port opal-ml via MacPorts (special thanks to @hjym_u).
In the paper, PA-I SL, PA-I SL*, PA-I SP, and PA-I KERNEL refer to opal [-k -p], opal [-p], opal, and opalk, respectively.
- added rand_shuf, intended for users to shuffle training data on disk prior to training from it [-b 1,2].
- added configure options --disable-array-trie and --enable-float.
- a model with string features (configure --enable-string-feature) trained by a previous version is not compatible with this version (thanks: Dr. Shinzato). This is because the double-array trie library used to store string features has been updated not to accept empty strings. To make a model trained with the previous version compatible with this version, run 'sed -i.bak -e 's/^$/ /' model.features'.
- added option [-n] (configure --enable-openmp).
- support string features (configure --enable-string-feature).
- added option [-o] to output examples with labels/margins/probabilities (instead of [-O]).
- added configure options --with-trie-impl=map and --enable-temporal-pruning [-d 1,2].
- speed up linear training [-t 0] by implementing a raw strtol to read feature vectors.
- added options [-f] and [-N].
- speed up kernel-based training (esp. [-t 1 -d 3]) by using popcnt (implemented by SSE4.2 or a look-up table) to compute inner products.
- speed up training ([-t 1] and [-t 0 -s]) and reduce the memory footprint by revisiting the implementation.
- added option [-P] to support probability output (binary classifier only).
- added option [-O] to output labels/classifier_outputs/probabilities.
- support [-p] in testing when the exact margin is needed [-O 3,4].
- automatically set [-c] if 0 is set to it (experimental).
- support kernel slicing [-k] and terminating margin computations [-p] with kernel splitting (N = # features expanded).
- use std::strtol() and a double array to speed up reading examples.
- replaced random() with XorShift RNG (also set as the default RNG instead of TR1 mt19937) for shuffling examples (results updated).

command-line options
Typing ./opal --help shows the following usage information; test/ includes training and testing data used in the paper [1].
opal - online learning with kernel slicing
Copyright (c) 2009-2013 Naoki Yoshinaga, All rights reserved.
Usage: src/opal [options] train model test
train training file set '-' to skip training
model model file set '-' to training/test w/o saving a model
test test file set '-' to skip testing
Optional parameters in training and testing:
-t, --kernel=TYPE select type of kernel function
* 0 - linear (w^T * x)
1 - polynomial (s^T * x + 1)^d
-d, --kernel-degree=INT parameter d in polynomial kernel (0)
--kernel-splitN=INT # common features in kernel splitting (0)
--max-trie-size=INT adjust # common features to fit the feature trie
within RAM size (MiB) (32)
-p, --pruning-margin terminate margin computation if unnecessarily
-k, --kernel-slicing perform kernel slicing
-O, --output=TYPE select output type of testing
* 0 - report accuracy
1 - report accuracy per iteration
2 - labels
3 - labels and margins
4 - labels and probabilities
-o, --output!=TYPE output examples with labels/margins/probabilities
Optional parameters in training:
-l, --learner=TYPE select learning algorithm
0 - Perceptron
1 - Passive Aggressive (PA)
* 2 - Passive Aggressive I (PA-I)
3 - Passive Aggressive II (PA-II)
-c, --reg-cost=FLOAT PA-I/-II aggressiveness parameter C (1.0)
([i * avg. k(x,x)]^-1 if C is set to 0)
-i, --iteration=INT # iterations (10)
-a, --averaging average parameters
-s, --shuffling shuffle training examples on RAM
-b, --buffer=TYPE select type of buffer to read examples
* 0 - Use RAM to load examples
1 - Use DISK to cache examples
2 - Do not cache examples
--model0=FILE re-train a model trained w/ opal ('-')
--feat-threshold=INT thresholding features by frequency (1)
-M, --max-examples=INT max. # examples used in training (0: all)
-P, --probability-output perform sigmoid fitting to output probabilities
Misc.:
-h, --help show this help and exit
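For reference, the sigmoid fitting behind [-P] / [-O 4] maps a margin m to a probability p = 1 / (1 + exp(A*m + B)). The following is only an illustrative sketch that fits A and B by plain gradient descent on the negative log-likelihood; opal itself follows Platt's fitting procedure [6], not this simplistic version:

```python
import math

def fit_sigmoid(margins, labels, iters=2000, lr=0.01):
    """Fit A, B so that 1/(1+exp(A*m+B)) approximates P(y=+1); labels in {-1,+1}."""
    A, B = -1.0, 0.0
    for _ in range(iters):
        gA = gB = 0.0
        for m, y in zip(margins, labels):
            p = 1.0 / (1.0 + math.exp(A * m + B))
            t = 1.0 if y > 0 else 0.0
            gA += (p - t) * -m    # d(neg. log-likelihood)/dA
            gB += (p - t) * -1.0  # d(neg. log-likelihood)/dB
        A -= lr * gA
        B -= lr * gB
    return A, B
```

With the fitted (A, B), a large positive margin maps to a probability close to 1 and a large negative margin to a probability close to 0.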
file format for train / test
(BNF-like notation)
<class> := +1 | -1 (or any string starting w/o '#' in multi-class classification)
<feature> := integer // >=1
<value> := string (w/o ' ' and '\t') // value is ignored (always assumed to be 1)
<line> := <class> <feature>:<value> <feature>:<value> ... <feature>:<value> | # ...
Refer to test/*.{train,dev,test}. As in SVMlight, feature/value pairs must be ordered by increasing feature number (values are ignored and assumed to be 1).
If you want to use string features, run configure --enable-string-feature; then you can use any string (without spaces and ':') as a feature (and you can omit a feature value). Namely,
(BNF-like notation)
<class> := +1 | -1 (or any string starting w/o '#' in multi-class classification)
<feature> := string (w/o ' ', '\t' and ':')
<value> := string (w/o ' ' and '\t') // value is ignored (always assumed to be 1)
<line> := <class> <feature>[:<value>] <feature>[:<value>] ... <feature>[:<value>] | # ...
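Either format is easy to read programmatically. A minimal reader might look like this (a hypothetical helper for illustration, not part of opal):

```python
def parse_line(line):
    """Parse '<class> <feature>[:<value>] ...'; returns (label, [features]) or None."""
    if line.startswith('#'):
        return None                       # comment line
    fields = line.split()
    label, features = fields[0], []
    for field in fields[1:]:
        feature = field.split(':', 1)[0]  # value is ignored (always assumed to be 1)
        features.append(feature)
    return label, features
```

Both the integer-feature and string-feature variants parse the same way, since the value part is optional and ignored.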
example
# linear training w/ Averaged Perceptron [-l 0, -a] with shuffling examples
> opal -t 0 -l 0 -c 1.0 -i 5 -as train model -
# linear training and testing w/ Passive Aggressive-I w/o saving model
> opal -t 0 -c 0.005 -i 20 train - test
# kernel-based training and testing (polynomial kernel of d=2) with Passive Aggressive-I
# (events cached on disk; shuffling ignored)
> opal -t 1 -d 2 -i 20 -c 0.0005 -a -b 1 train model test
# kernel-based testing
# (only combinations among 500 common features are explicitly considered)
> opal --kernel-splitN 500 - model test
NOTE:
- When you train and test at once without saving a model (model='-') with PA-I [-l 2] or PA-II [-l 3] and a polynomial kernel [-t 1], the reported accuracy can be slightly different from the one obtained in non-instant mode; this is due to rounding errors.
- Re-training ([-m model0]) with averaging ignores the # examples processed in the prior training (e.g., with [-a], a model trained with [-i 20] will be different from a model retrained with [-i 10] from a model trained with [-i 10]).

The following tables show a comparison among linear learners (opal, ver. in 2011; oll 0.2; AROW++ 0.05; libarow 0.2.0; Vowpal Wabbit 4.1; SVMlin 1.0; and LIBLINEAR 1.7) and a comparison among kernel-based learners (opal, LIBLINEAR-POLY2 1.7 ([-d 2] only), TinySVM 0.09, and LIBSVM 3.0); for LIBLINEAR, results with the fastest algorithm and with the algorithm that obtained the most accurate model are shown. pecco (FST [-t 1]) is also used to test the models obtained with opal ([-d 2,3]; the fstrie is built from the training data). The experiments were conducted on Mac OS X 10.5 with an Intel Xeon E5462 3.2GHz CPU and 32GB main memory. Only oll uses 4-byte floats as floating-point numbers (the others use 8-byte doubles). AROW++ and libarow cannot output margins of examples (the others can). opal, LIBLINEAR-POLY2 (modified to handle a large number of conjunctive features; int -> int64_t), TinySVM (modified for the experiments in the paper), and LIBSVM (built with -m64 to make speed and memory comparable to 64-bit TinySVM) adopt 64-bit addressing, while the others adopt 32-bit addressing. These differences may cause a 10-20% difference in time and at most a 100% difference in space (32-bit vs. 64-bit). opal with [-b 1] and Vowpal Wabbit do not load events into memory; this leads to slower but memory-efficient training (it demands additional disk space for caching data, though); in the experiments, examples for these learners were shuffled offline to make the accuracy of the obtained models comparable to the others.
We have tried the parameters # iters = 1, 5, 10, 20 (online learners only) and c/r = 100.0, 50.0, ..., 0.000005, 0.000001; the table lists the model that achieved the best accuracy on the development set. Due to the intricate learning-rate parameters in SGD, we have only tried l = 1.0, 0.5, 0.2, ..., 0.02, 0.01 for Vowpal Wabbit. The gettimeofday function is used to measure the run-time (namely, the whole training or testing time).
Linear training:
Software | Algorithm | d | # iters | c (r), l | Acc. (%) | Mem [MiB] | Training [s] | Test [s] | # SVs |
---|---|---|---|---|---|---|---|---|---|
opal | P [-l 0 -as] | - | 10 | - | 90.76 | 49.9 | 1.1 | 0.10 | - |
opal | PA-I [-l 2 -s] | - | 20 | 0.005000 | 90.90 | 46.1 | 1.7 | 0.10 | - |
opal | PA-I [-l 2 (-s) -b 1] | - | 20 | 0.005000 | 90.90 | 8.6 | 2.2 | 0.10 | - |
opal | PA-I [-l 2 -as] | - | 20 | 0.010000 | 91.21 | 49.9 | 2.0 | 0.10 | - |
oll | AP | - | 5 | - | 90.82 | 126.8 | 17.3 | 6.08 | - |
oll | PA-I | - | 20 | 0.005000 | 91.06 | 125.5 | 18.3 | 6.07 | - |
oll | CW | - | 1 | 0.000050 | 88.83 | 126.8 | 17.2 | 6.13 | - |
AROW++ | AROW | - | 5 | 50.000000 | 90.41 | 186.7 | 16.9 | 5.33 | - |
libarow | AROW | - | 10 | 100.000000 | 90.48 | 149.2 | 4.9 | - | - |
Vowpal Wabbit | SGD | - | 20 | 0.500000 | 91.01 | 21.4 | 161.6 | 4.11 | - |
SVMlin | L2-SVM-MFN [-A 1] | - | 8 | 0.100000 | 90.94 | 132.3 | 126.3 | 1.24 | - |
LIBLINEAR | L1-reg L2-SVM [-s 5] | - | 19 | 0.100000 | 90.84 | 219.1 | 4.5 | 0.65 | - |
LIBLINEAR | L2-reg L1-SVM (dual) [-s 3] | - | 285 | 0.500000 | 91.10 | 114.4 | 12.7 | 0.68 | 77700 |
Kernel-based training (polynomial kernel):
Software | Algorithm | d | # iters | c | Acc. (%) | Mem [MiB] | Training [s] | Test [s] | # SVs |
---|---|---|---|---|---|---|---|---|---|
opal (+ pecco) | P [-l 0 --kernel-splitN 2000 -p -as] | 2 | 5 | - | 92.94 | 72.1 | 3.5 | 0.72 (0.32) | 59427 |
opal (+ pecco) | PA-I [-l 2 --kernel-splitN 2000 -p -s] | 2 | 20 | 0.000100 | 93.03 | 79.7 | 13.5 | 0.90 (0.35) | 94820 |
opal (+ pecco) | PA-I [-l 2 --kernel-splitN 4000 -p -as] | 2 | 20 | 0.000500 | 93.20 | 84.6 | 14.1 | 1.19 (0.35) | 83867 |
LIBLINEAR-POLY2 | L2-reg L1-SVM (dual) [-s 3] | 2 | 66 | 0.005000 | 93.24 | 16013.8 | 422.8 | 310.27 | 66056 |
TinySVM (+ pecco) | SVM | 2 | 438847 | 0.005000 | 93.24 | 246.4 | 30040.3 | 1739.92 (0.33) | 65158 |
LIBSVM (+ pecco) | SVM | 2 | 261916 | 0.005000 | 93.24 | 266.7 | 23302.2 | 1694.98 (0.33) | 65104 |
opal (+ pecco) | P [-l 0 --kernel-splitN 250 -kp -as] | 3 | 5 | - | 93.07 | 151.2 | 16.1 | 2.16 (1.17) | 57295 |
opal (+ pecco) | PA-I [-l 2 --kernel-splitN 250 -kp -s] | 3 | 20 | 0.000005 | 93.32 | 164.1 | 84.6 | 2.89 (1.29) | 89651 |
opal (+ pecco) | PA-I [-l 2 --kernel-splitN 250 -kp -as] | 3 | 20 | 0.000010 | 93.36 | 178.9 | 80.7 | 2.83 (1.26) | 85863 |
TinySVM (+ pecco) | SVM | 3 | 306291 | 0.000100 | 93.38 | 244.8 | 26048.6 | 1896.05 (1.19) | 68530 |
LIBSVM (+ pecco) | SVM | 3 | 202741 | 0.000100 | 93.38 | 266.1 | 22457.0 | 1785.52 (1.20) | 68499 |
NOTE: the above accuracy is computed against the test examples, and is not directly comparable to those in the paper, which shows dependency accuracy.
Linear training:
Software | Algorithm | d | # iters | c (r), l | Acc. (%) | Mem [MiB] | Training [s] | Test [s] | # SVs |
---|---|---|---|---|---|---|---|---|---|
opal | P [-l 0 -as] | - | 10 | - | 91.74 | 22.4 | 0.6 | 0.02 | - |
opal | PA-I [-l 2 -s] | - | 20 | 0.0100 | 92.08 | 20.7 | 1.0 | 0.02 | - |
opal | PA-I [-l 2 (-s) -b 1] | - | 20 | 0.0100 | 92.08 | 4.4 | 1.2 | 0.02 | - |
opal | PA-I [-l 2 -as] | - | 20 | 0.0500 | 91.74 | 22.9 | 1.2 | 0.02 | - |
oll | AP | - | 20 | - | 91.20 | 40.5 | 7.5 | 0.34 | - |
oll | PA-I | - | 20 | 0.0100 | 91.89 | 39.4 | 7.5 | 0.35 | - |
oll | CW | - | 1 | 0.5000 | 91.02 | 40.6 | 6.6 | 0.35 | - |
AROW++ | AROW | - | 20 | 50.0000 | 91.60 | 59.9 | 9.3 | 0.30 | - |
libarow | AROW | - | 10 | 50.0000 | 91.62 | 68.4 | 2.0 | - | - |
Vowpal Wabbit | SGD | - | 20 | 1.0000 | 91.65 | 21.8 | 110.8 | 0.38 | - |
SVMlin | L2-SVM-MFN [-A 1] | - | 11 | 0.5000 | 91.84 | 61.8 | 104.2 | 0.17 | - |
LIBLINEAR | L2-reg L2-SVM (dual) [-s 1] | - | 21 | 0.1000 | 91.72 | 49.7 | 2.8 | 0.25 | 110084 |
LIBLINEAR | L2-reg L2-SVM (primal) [-s 2] | - | 40 | 0.5000 | 91.76 | 65.8 | 33.3 | 0.33 | - |
Kernel-based training (polynomial kernel):
Software | Algorithm | d | # iters | c | Acc. (%) | Mem [MiB] | Training [s] | Test [s] | # SVs |
---|---|---|---|---|---|---|---|---|---|
opal (+ pecco) | P [-l 0 --kernel-splitN 2000 -p -as] | 2 | 10 | - | 92.78 | 56.5 | 4.0 | 0.30 (0.15) | 42261 |
opal (+ pecco) | PA-I [-l 2 --kernel-splitN 2000 -p -s] | 2 | 20 | 0.0010 | 92.89 | 64.6 | 14.8 | 0.53 (0.23) | 81429 |
opal (+ pecco) | PA-I [-l 2 --kernel-splitN 2000 -p -as] | 2 | 20 | 0.0050 | 92.93 | 62.2 | 11.9 | 0.46 (0.20) | 67919 |
TinySVM (+ pecco) | SVM | 2 | 258081 | 0.0100 | 92.85 | 139.8 | 12378.3 | 59.60 (0.22) | 77151 |
LIBSVM (+ pecco) | SVM | 2 | 205993 | 0.0100 | 92.85 | 175.0 | 16695.3 | 110.14 (0.22) | 77240 |
opal (+ pecco) | P [-l 0 --kernel-splitN 125 -kp -as] | 3 | 10 | - | 92.89 | 113.5 | 6.2 | 0.46 (0.29) | 35902 |
opal (+ pecco) | PA-I [-l 2 --kernel-splitN 250 -kp -s] | 3 | 20 | 0.0005 | 92.99 | 152.7 | 11.6 | 0.64 (0.35) | 56772 |
opal (+ pecco) | PA-I [-l 2 --kernel-splitN 250 -kp -as] | 3 | 20 | 0.0005 | 93.07 | 153.1 | 11.8 | 0.65 (0.36) | 56772 |
TinySVM (+ pecco) | SVM | 3 | 351758 | 0.0100 | 93.09 | 140.8 | 17403.3 | 79.88 (0.50) | 91873 |
LIBSVM (+ pecco) | SVM | 3 | 239881 | 0.0100 | 93.09 | 174.7 | 18082.2 | 129.64 (0.50) | 92009 |
NOTE: LIBLINEAR-POLY2 did not work on this dataset (it would require ~400GB of memory).
You may want to consider the following optimization strategy for tuning the parameter [-c] and the number of iterations [-i] in the default learner PA-I [-l 2]. Keeping the other options fixed at their defaults (apart from the parameters mentioned below), first tune [-c] on the development set, and then halve [-c] and double [-i] to conduct a more cautious optimization. Following this simple procedure, you can easily obtain a model whose accuracy is comparable to those trained with batch learning.
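The strategy above can be sketched as follows, where `train_and_eval(c, iters)` is a hypothetical wrapper (not part of opal) that trains with [-c c -i iters] and returns the accuracy on the development set:

```python
def tune(train_and_eval, cs=(100.0, 10.0, 1.0, 0.1, 0.01, 0.001), halvings=3):
    """Coarse grid over c, then repeatedly halve c and double the iterations."""
    best_c = max(cs, key=lambda c: train_and_eval(c, 10))   # step 1: tune c
    best_iters, best_acc = 10, train_and_eval(best_c, 10)
    c, iters = best_c, 10
    for _ in range(halvings):                               # step 2: halve c, double i
        c, iters = c / 2.0, iters * 2
        acc = train_and_eval(c, iters)
        if acc > best_acc:
            best_c, best_iters, best_acc = c, iters, acc
    return best_c, best_iters, best_acc
```

The grid of c values and the number of halvings here are arbitrary illustrations; adjust them to your task and time budget.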
Averaging [-a] is a must for Perceptron [-l 0], while it is optional for the variants of PA [-l 1,2,3]; however, I still recommend setting [-a] for PA, because averaging reduces the number of iterations needed to reach the best performance. Shuffling [-s] is a must if the training examples are sorted in some biased order, and otherwise optional (generally recommended). I usually unset [-a] or [-s] in a task where neighboring examples are closely related to each other (e.g., deterministic dependency parsing); in such a case, shuffling just slows down the training, and the accuracy of the resulting model could (slightly) drop.
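For intuition, averaging keeps a running sum of the weight vector after every step and predicts with its mean. A minimal averaged-perceptron sketch (dict-based, not opal's implementation, which accumulates the sum lazily rather than walking the whole weight vector per example):

```python
def train_averaged_perceptron(examples, iters=10):
    """examples: list of (feature_set, label) with label in {-1,+1}; returns averaged weights."""
    w, w_sum, n = {}, {}, 0
    for _ in range(iters):
        for x, y in examples:
            score = sum(w.get(f, 0.0) for f in x)
            if y * score <= 0:                  # mistake-driven update
                for f in x:
                    w[f] = w.get(f, 0.0) + y
            for f, v in w.items():              # accumulate for averaging
                w_sum[f] = w_sum.get(f, 0.0) + v
            n += 1
    return {f: v / n for f, v in w_sum.items()}
```

The averaged weights change more smoothly than the final weights, which is why averaging stabilizes accuracy across iterations.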
Setting [-p] or [-kp] is highly recommended to speed up training with [-t 1 -d 2] or [-t 1 -d 3], respectively. Remember that when you use a variant of PA [-l 1,2,3] as the learning algorithm, kernel slicing [-k] introduces some 'negligible' rounding errors (meaning the resulting model is not identical in practice, though it is theoretically the same), while truncating margin computations [-p] is unlikely to suffer from rounding errors (you usually get an identical model). Either way, the differences are negligible.
Finally, you do not need to worry about tuning the parameter [--kernel-splitN]; if you leave it unset, the number of common features is adjusted automatically according to the value of [-T] (--max-trie-size). If you plan to perform training several times with the same dataset, you may want to set [--kernel-splitN] to the value finally reached in the first training; this will slightly speed up the training.
We do not guarantee that the implemented algorithms, other than those proposed by us, are patent-free; we regarded them as patent-free simply because their implementations are available as existing open-source software (or because a simple patent look-up found nothing). Please be careful when you use this software for commercial purposes.
opal was originally meant for online passive-aggressive algorithms; its efficiency comes from reusing partial margins that would otherwise become obsolete during their active lifespan.