pecco - simple C++ library for linear classification with conjunctive features

developed by Naoki Yoshinaga at Kitsuregawa, Toyoda Lab., IIS, University of Tokyo
Quick Link: [ What is this? | Download & Setup | Usage | Performance | References ]

What is this?

Pecco is a simple C++ implementation of a linear classifier based on feature sequence trie (fstrie) [1]. In a nutshell, pecco enumerates fundamental classification problems (partial feature vectors that commonly appear) in the target task, and computes their scores in advance (stored in fstrie). The classifier can efficiently calculate a score for an input problem (feature vector) by computing a margin from the pre-calculated score of a fundamental problem that is most similar to the input.

Pecco interprets a model trained with binary conjunctive features (or polynomial kernel), and performs a quick classification for a given binary feature vector. Experimental results on a dependency parsing task showed that pecco was faster than a common linear classifier and a kernel-based classifier by a factor of 2-13 and 30-500, respectively [1].

If you need the fastest classifier allowing approximation, simply run FST with positive PKE sigma (set [-s] option); a larger sigma results in a faster but less accurate classifier. Or if you want to guarantee the exactness of classifier outputs, run FST with positive SPLIT threshold (set [-r] option); you can speed up some speed-up without losing the exactness, although the resulting classifier becomes slower when the threshold exceeds some optimal value. In both cases you can control the memory footprint for FST by setting fstrie compression ratio (set [-i n]; use 1/2n of fundamental classification problems).

The code is ready; the libraries are distributed under the GNU General Public License (e-mail me when you prefer some license other than GNU GPL). Note that this license does not apply to the model files in test/ directory, which are originally generated from the annotated data by the way described in [1]. The main program for building an fstrie and the fstrie-based classifier is in essence 100-200 C++ lines; C++ experts might replicate a more efficient as well as readable program than the C++ beginner (I did my best, though).

NOTE: The following package includes the labeled data for dependency parsing and LLMs and SVMs trained with them, both of which are identical to the ones used in our EMNLP paper [1].

Download & Setup

> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/pecco/pecco-latest.tar.bz2
> tar jxvf pecco-latest.tar.bz2
> cd pecco
> vi pecco_conf.h # set USE_DARTS_CLONE (unset USE_CEDAR) to reduce memory footprint (darts-clone required)
> vi Makefile
> make install

Alternatively, Mac OS X users can exploit macports to install pecco (special thanks to @hjym_u); darts-clone (darts.h) will be used to store weights of conjunctive features and feature sequences.

ToDo

History

Requirements

Usage

command-line options

Typing ./pecco --help shows the following usage information; pecco/test/ includes models used in [1].

pecco - please enjoy classification w/ conjunctive features
Copyright (c) 2008-2011 Naoki Yoshinaga, All rights reserved.

Usage: pecco [options] model [test]

model   model file             model
test    test file              test examples; read STDIN if omitted

Optional parameters:
  -t, --classifier=TYPE        select classifier type
                               * 0 - PKE | SPLIT
                                 1 - FST
                                 2 - PKI (kernel model only)
  -e, --event=FILE             examples to obtain feature count for reordering ()
  -f, --fst-event=FILE         examples to enumerate feature sequence in FST ()
  -i, --fst-prune-factor=INT   use FST with 2^-INT of feature sequences (0)
  -b, --fst-build-verbose      build FSTs with 2^-i (i > 0) of feature sequences

Optional parameters in kernel model:
  -s, --pke-sigma=FLOAT        threshold to feature weight (0)
  -r, --split-ratio=FLOAT      threshold to feature frequency ratio (0)

Misc.:
  -v, --verbose=INT           verbosity level (1)
  -h, --help                  show this help and exit

file format for [-e] and [-f] options

(BNF-like notation)
<class>   := +1 | -1
<feature> := integer // >=1
<value>   := 1       // value will be ignored
<line>    := <class> <feature>:<value> <feature>:<value> ... <feature>:<value>

example

# analyze examples in test using an PKE classifier
#   (-s: threshold \sigma to prune insignificant features [2])
> pecco -t 0 -s 0.01 -v 1 model test

# analyze examples in test using a SPLIT classifier
> pecco -t 0 -r 0.01 -v 1 -e train model test

# analyze examples in test using an FST (+ SPLIT) classifier with event
# (we can guide ordering features by giving reference ([-e]) to obtain feature
#  counts; this would be useful if train/test have different feature distribution
> pecco -t 1 -r 0.01 -v 1 -e reference -f event model test

# analyze examples in test using an FST classifier with a pruned fstrie
# (the pruning scheme in [1] reduces # feature sequences (to build fstrie) to 1/25)
> pecco -t 1 -i 5 -e train -f event -v 1 model test

Performance Comparison

See the performance comparison section in opal.

Disclaimer

We do not guarantee that the implemented algorithms other than those proposed by us are patent-free; we regarded them to be patent-free simply because their implementations are available as (existing) open-source softwares (otherwise a simple patent look-up). Please be careful when you use this software for commercial use.

How to pronounce `pecco'

Read as you want; one may imagine a naming like ``Portable Efficient Classifier with COnjunctive features'', but I just think ``Please Enjoy Classification with COnjunctive features''. The naming comes into my mind when I felt hungry (hara-peco in Japanese), and you can call the software `pecco-peco' when you are really hungry.

References

  1. N. Yoshinaga and M. Kitsuregawa. Polynomial to Linear: Efficient Classification with Conjunctive Features. Proc. EMNLP, pp. 1542--1551. 2009. See a journal paper (draft; notice of use) for our classifier combined with [3].
  2. T. Kudo and Y. Matsumoto. Fast Methods for Kernel-Based Text Analysis. Proc. ACL, pages 24--31. 2003.
  3. Y. Goldberg and M. Elhadad. splitSVM: Fast, Space-Efficient, non-Heuristic, Polynomial Kernel Computation for NLP Applications. Proc. ACL, short papers, pages 237--240. 2008.

Copyright © 2009 - 2011 Naoki Yoshinaga, All right Reserved.
XHTML 1.1 $ Last modified at Sat Dec 31 18:52:09 2011 $