Pecco is a simple C++ implementation of a linear classifier based on feature sequence trie (fstrie) [1]. In a nutshell, pecco enumerates fundamental classification problems (partial feature vectors that commonly appear) in the target task, and computes their scores in advance (stored in fstrie). The classifier can efficiently calculate a score for an input problem (feature vector) by computing a margin from the pre-calculated score of a fundamental problem that is most similar to the input.
Pecco interprets a model trained with binary conjunctive features (or polynomial kernel), and performs a quick classification for a given binary feature vector. Experimental results on a dependency parsing task showed that pecco was faster than a common linear classifier and a kernel-based classifier by a factor of 2-13 and 30-500, respectively [1].
[-t 1], SPLIT [3] [-t 0 -r r (>0)], PKE [2] [-t 0 -s s (>0)], and PKI [2] [-t 2] are available for kernel-based models, while FST and PKE are available for LLMs.
If you need the fastest classifier allowing approximation, simply run FST with positive PKE sigma (set [-s] option); a larger sigma results in a faster but less accurate classifier. Or if you want to guarantee the exactness of classifier outputs, run FST with positive SPLIT threshold (set [-r] option); you can speed up some speed-up without losing the exactness, although the resulting classifier becomes slower when the threshold exceeds some optimal value. In both cases you can control the memory footprint for FST by setting fstrie compression ratio (set [-i n]; use 1/2n of fundamental classification problems).
The code is ready; the libraries are distributed under the GNU General Public License (e-mail me when you prefer some license other than GNU GPL). Note that this license does not apply to the model files in test/ directory, which are originally generated from the annotated data by the way described in [1]. The main program for building an fstrie and the fstrie-based classifier is in essence 100-200 C++ lines; C++ experts might replicate a more efficient as well as readable program than the C++ beginner (I did my best, though).
NOTE: The following package includes the labeled data for dependency parsing and LLMs and SVMs trained with them, both of which are identical to the ones used in our EMNLP paper [1].
> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/pecco/pecco-latest.tar.bz2 > tar jxvf pecco-latest.tar.bz2 > cd pecco > vi pecco_conf.h # set USE_DARTS_CLONE (unset USE_CEDAR) to reduce memory footprint (darts-clone required) > vi Makefile > make install
Alternatively, Mac OS X users can exploit macports to install pecco (special thanks to @hjym_u); darts-clone (darts.h) will be used to store weights of conjunctive features and feature sequences.
command-line options
Typing ./pecco --help shows the following usage information; pecco/test/ includes models used in [1].
pecco - please enjoy classification w/ conjunctive features
Copyright (c) 2008-2011 Naoki Yoshinaga, All rights reserved.
Usage: pecco [options] model [test]
model model file model
test test file test examples; read STDIN if omitted
Optional parameters:
-t, --classifier=TYPE select classifier type
* 0 - PKE | SPLIT
1 - FST
2 - PKI (kernel model only)
-e, --event=FILE examples to obtain feature count for reordering ()
-f, --fst-event=FILE examples to enumerate feature sequence in FST ()
-i, --fst-prune-factor=INT use FST with 2^-INT of feature sequences (0)
-b, --fst-build-verbose build FSTs with 2^-i (i > 0) of feature sequences
Optional parameters in kernel model:
-s, --pke-sigma=FLOAT threshold to feature weight (0)
-r, --split-ratio=FLOAT threshold to feature frequency ratio (0)
Misc.:
-v, --verbose=INT verbosity level (1)
-h, --help show this help and exit
file format for [-e] and [-f] options
(BNF-like notation) <class> := +1 | -1 <feature> := integer // >=1 <value> := 1 // value will be ignored <line> := <class> <feature>:<value> <feature>:<value> ... <feature>:<value>
example
# analyze examples in test using an PKE classifier
# (-s: threshold \sigma to prune insignificant features [2])
> pecco -t 0 -s 0.01 -v 1 model test
# analyze examples in test using a SPLIT classifier
> pecco -t 0 -r 0.01 -v 1 -e train model test
# analyze examples in test using an FST (+ SPLIT) classifier with event
# (we can guide ordering features by giving reference ([-e]) to obtain feature
# counts; this would be useful if train/test have different feature distribution
> pecco -t 1 -r 0.01 -v 1 -e reference -f event model test
# analyze examples in test using an FST classifier with a pruned fstrie
# (the pruning scheme in [1] reduces # feature sequences (to build fstrie) to 1/25)
> pecco -t 1 -i 5 -e train -f event -v 1 model test
See the performance comparison section in opal.
We do not guarantee that the implemented algorithms other than those proposed by us are patent-free; we regarded them to be patent-free simply because their implementations are available as (existing) open-source softwares (otherwise a simple patent look-up). Please be careful when you use this software for commercial use.
Read as you want; one may imagine a naming like ``Portable Efficient Classifier with COnjunctive features'', but I just think ``Please Enjoy Classification with COnjunctive features''. The naming comes into my mind when I felt hungry (hara-peco in Japanese), and you can call the software `pecco-peco' when you are really hungry.