HMP is an implementation of perceptron training and staggered decoding for labeling sequential data [Collins2002, Kaji2010]. It offers fast and exact training/decoding when dealing with a large number of labels. HMP is designed for general purpose. If you are looking for POS tagger or chunker, you may find this useful (All the libraries available here are distributed under the GNU General Public License).
% tar -zxvf hmp-XXX.tgz % cd hmp-XXX % make
Here is an example of training/testig data.
% head test-conll U0:Rockwell U1:R U2:Ro U3:Roc U4:l U5:ll U6:ell U7:CAP B0 B1:Rockwell/International NNP U0:International U1:I U2:In U3:Int U4:l U5:al U6:nal U7:CAP B0 B1:International/Corp. NNP U0:Corp. U1:C U2:Co U3:Cor U4:. U5:p. U6:rp. U7:CAP B0 B1:Corp./'s NNP U0:'s U1:' U2:'s * U4:s U5:'s * * B0 B1:'s/Tulsa POS U0:Tulsa U1:T U2:Tu U3:Tul U4:a U5:sa U6:lsa U7:CAP B0 B1:Tulsa/unit NNP U0:unit U1:u U2:un U3:uni U4:t U5:it U6:nit * B0 B1:unit/said NN U0:said U1:s U2:sa U3:sai U4:d U5:id U6:aid * B0 B1:said/it VBD U0:it U1:i U2:it * U4:t U5:it * * B0 B1:it/signed PRP U0:signed U1:s U2:si U3:sig U4:d U5:ed U6:ned * B0 B1:signed/a VBD U0:a U1:a * * U4:a * * * B0 B1:a/tentative DT ... U0:jetliners U1:j U2:je U3:jet U4:s U5:rs U6:ers * B0 B1:jetliners/. NNS U0:. U1:. * * U4:. * * * * * . [\n] U0:Rockwell U1:R U2:Ro U3:Roc U4:l U5:ll U6:ell U7:CAP B0 B1:Rockwell/said NNP
Each line consists of attributes (e.g., U0:Rockewll and U1:R) followed by a label (e.g., NNP). Attributes appearing in the same column are of the same type (e.g., the first column is word, and the second column is prefix of length one). If an attribute of a specific type is not observed, it is specified as *. The end of a sequece is indicated by an empty line.
Feature functions are automatically generated from the attributes observed in training data. Attributes that begin with 'U' are used to produce unigram feature functions. Here, unigram feature functions are indicators of attribute-label pairs. Similarly, attributes beginning with 'B' produce bigram feature functions, which check the combination of the attribute, the current label, and the following label.
hmp_train command offers averaged perceptron training:
% ./hmp_train train model
In the defalut setting, model is averaged over 10 iterations. To specify the iteration number, use -i option:
% ./hmp_train -i 20 train model
For other options, just type hmp_train.
Use hmp_test command for labeling test data:
% ./hmp_test model test # Score=313.486145 IterNum=7 // The best score and #iterations UH // The most probable label for the first token. , // The most probable label for the second token. PRP // ... VBD RB NNP NNP // This is the end of the sentence. # Score=1922.747681 IterNum=6 // The next sentence starts.
hmp_dump command converts the binary model file into a readable format:
% ./hmp_dump model model.dump