HMP: Hidden Markov Perceptron for Large-scale Sequence Labeling

HMP is an implementation of perceptron training and staggered decoding for labeling sequential data [Collins2002, Kaji2010]. It offers fast and exact training/decoding when dealing with a large number of labels. HMP is designed for general-purpose use: if you are looking for a POS tagger or chunker, you may find it useful. (All the libraries available here are distributed under the GNU General Public License.)

Install

Usage

Data format

Here is an example of training/testing data.

 % head test-conll
 U0:Rockwell      U1:R U2:Ro U3:Roc U4:l U5:ll U6:ell U7:CAP B0 B1:Rockwell/International NNP
 U0:International U1:I U2:In U3:Int U4:l U5:al U6:nal U7:CAP B0 B1:International/Corp.    NNP
 U0:Corp.         U1:C U2:Co U3:Cor U4:. U5:p. U6:rp. U7:CAP B0 B1:Corp./'s               NNP
 U0:'s            U1:' U2:'s *      U4:s U5:'s *      *      B0 B1:'s/Tulsa               POS
 U0:Tulsa         U1:T U2:Tu U3:Tul U4:a U5:sa U6:lsa U7:CAP B0 B1:Tulsa/unit             NNP
 U0:unit          U1:u U2:un U3:uni U4:t U5:it U6:nit *      B0 B1:unit/said              NN
 U0:said          U1:s U2:sa U3:sai U4:d U5:id U6:aid *      B0 B1:said/it                VBD
 U0:it            U1:i U2:it *      U4:t U5:it *      *      B0 B1:it/signed              PRP
 U0:signed        U1:s U2:si U3:sig U4:d U5:ed U6:ned *      B0 B1:signed/a               VBD
 U0:a             U1:a *     *      U4:a *    *       *      B0 B1:a/tentative            DT
 ...
 U0:jetliners     U1:j U2:je U3:jet U4:s U5:rs U6:ers *      B0 B1:jetliners/.            NNS
 U0:.             U1:. *     *      U4:. *     *      *      *  *                         .
 [\n]
 U0:Rockwell      U1:R U2:Ro U3:Roc U4:l U5:ll U6:ell U7:CAP B0 B1:Rockwell/said          NNP

Each line consists of attributes (e.g., U0:Rockwell and U1:R) followed by a label (e.g., NNP). Attributes appearing in the same column are of the same type (e.g., the first column is the word, and the second column is the prefix of length one). If an attribute of a specific type is not observed, it is written as *. The end of a sequence is indicated by an empty line.
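
For illustration, here is a minimal Python sketch of a reader for this format (read_sequences is a hypothetical helper, not part of HMP):

  def read_sequences(path):
      """Read sequences of (attributes, label) pairs from a file."""
      seqs, seq = [], []
      with open(path) as f:
          for line in f:
              line = line.strip()
              if not line:                  # empty line: end of a sequence
                  if seq:
                      seqs.append(seq)
                      seq = []
                  continue
              cols = line.split()
              # the last column is the label; '*' marks an unobserved attribute
              attrs = [a for a in cols[:-1] if a != '*']
              seq.append((attrs, cols[-1]))
      if seq:
          seqs.append(seq)
      return seqs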

Feature functions are automatically generated from the attributes observed in the training data. Attributes that begin with 'U' produce unigram feature functions, which are indicators of attribute-label pairs. Similarly, attributes beginning with 'B' produce bigram feature functions, which check the combination of the attribute, the current label, and the following label.
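
Concretely, each token's features can be thought of as being generated along these lines (an illustrative Python sketch; make_features is a hypothetical name, not HMP's API):

  def make_features(attrs, label, next_label=None):
      """Turn one token's attributes into feature functions (illustrative)."""
      feats = []
      for a in attrs:
          if a.startswith('U'):                 # unigram: attribute-label pair
              feats.append((a, label))
          elif a.startswith('B') and next_label is not None:
              # bigram: attribute + current label + following label
              feats.append((a, label, next_label))
      return feats

  # e.g., make_features(['U0:said', 'B0'], 'VBD', 'PRP')
  #  -> [('U0:said', 'VBD'), ('B0', 'VBD', 'PRP')]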

Training

The hmp_train command performs averaged perceptron training:

  % ./hmp_train train model

In the default setting, the model is averaged over 10 iterations. To change the number of iterations, use the -i option:

  % ./hmp_train -i 20 train model

For other options, just type hmp_train.
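
For reference, the following Python sketch shows the shape of averaged perceptron training for sequence labeling [Collins2002]; it reuses the hypothetical make_features sketch above and takes any decoder (such as the Viterbi sketch in the next section) as an argument. It is illustrative only, not HMP's actual code:

  from collections import defaultdict

  def train_averaged(data, decode, labels, iters=10):
      """Averaged perceptron sketch [Collins2002]; not HMP's actual code."""
      w = defaultdict(float)      # current weight vector
      total = defaultdict(float)  # per-iteration sums, for averaging
      for _ in range(iters):
          for seq in data:                          # seq: list of (attrs, label)
              gold = [y for _, y in seq]
              pred = decode([a for a, _ in seq], w, labels)
              if pred == gold:
                  continue
              for t, (attrs, y) in enumerate(seq):  # additive update on mistakes
                  y2 = gold[t + 1] if t + 1 < len(seq) else None
                  p2 = pred[t + 1] if t + 1 < len(seq) else None
                  for f in make_features(attrs, y, y2):
                      w[f] += 1.0                   # promote gold features
                  for f in make_features(attrs, pred[t], p2):
                      w[f] -= 1.0                   # demote predicted features
          for f in list(w):
              total[f] += w[f]                      # snapshot after each iteration
      return {f: v / iters for f, v in total.items()}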

Decoding

Use the hmp_test command to label test data:

 % ./hmp_test model test
 # Score=313.486145 IterNum=7     // The best score and #iterations
 UH                               // The most probable label for the first token.
 ,                                // The most probable label for the second token.
 PRP                              // ...
 VBD
 RB
 NNP
 NNP
                                  //  This is the end of the sentence.
 # Score=1922.747681 IterNum=6    //  The next sentence starts.
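
HMP implements staggered decoding [Kaji2010], which repeatedly runs Viterbi search over a growing set of candidate labels until the result is provably optimal; the IterNum field above reports the number of iterations. For reference, here is a Python sketch of the plain Viterbi baseline that staggered decoding accelerates (illustrative only, reusing the hypothetical make_features above; because the bigram features here link the current label to the following label, the dynamic program runs right to left):

  def viterbi(attr_seq, w, labels):
      """Plain Viterbi, O(n * |labels|^2). Not HMP's actual decoder."""
      n = len(attr_seq)
      def score(t, y, y_next):   # local score of label y at position t
          return sum(w.get(f, 0.0) for f in make_features(attr_seq[t], y, y_next))
      # best[y] = (score of the best suffix starting here with label y, path)
      best = {y: (score(n - 1, y, None), [y]) for y in labels}
      for t in range(n - 2, -1, -1):
          best = {y: max(((score(t, y, y2) + sc, [y] + path)
                          for y2, (sc, path) in best.items()),
                         key=lambda pair: pair[0])
                  for y in labels}
      return max(best.values(), key=lambda pair: pair[0])[1]

  # e.g., for a sentence seq from read_sequences:
  #   viterbi([a for a, _ in seq], w, labels) -> predicted label sequence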

Dumping

The hmp_dump command converts the binary model file into a human-readable format:

  % ./hmp_dump model model.dump

History

References

  1. Michael Collins.
    Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms.
    In Proc. of EMNLP, pp. 1--8. Philadelphia, Pennsylvania, USA. July, 2002.

  2. Roberto Esposito and Daniele P. Radicioni.
    CarpeDiem: an Algorithm for the Fast Evaluation of SSL Classifiers.
    In Proc. of ICML, pp. 257--264. Corvallis, Oregon, USA. June, 2007.

  3. Roberto Esposito and Daniele P. Radicioni.
    CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning.
    Journal of Machine Learning Research, Vol. 10, pp. 1851--1880. 2009.

  4. Nobuhiro Kaji, Yasuhiro Fujiwara, Naoki Yoshinaga, and Masaru Kitsuregawa.
    Efficient Staggered Decoding for Sequence Labeling.
    In Proc. of ACL, pp. 485--494. Uppsala, Sweden. July, 2010.