J.DepP is a C++ implementation of Japanese dependency parsing algorithms [1,2,3,4]. The parser takes a raw sentence as input and performs word segmentation, POS tagging (thanks to JUMAN or MeCab), bunsetsu chunking and dependency parsing. J.DepP is meant for those who want to parse massive texts (e.g., entire blog feeds) efficiently without compromising state-of-the-art accuracy.
Main features of J.DepP are as follows:
- The shift-reduce parser [-p 0] using default parameters achieves state-of-the-art dependency accuracy (partial: ~91.7%; complete: ~56.5%, on the news domain).
- Powered by an efficient classifier, pecco, J.DepP processes ~10.4K (K = 1024) raw sentences per second in the news domain (~16.5K for blog feeds) on an 11-inch MacBook Air (Mid 2011) with a 1.8 GHz Intel Core i7 CPU (cf. MeCab 0.99pre1 analyzes ~26.6K raw sentences per second in the news domain, ~33.6K for blog feeds).
- Powered by an efficient learner, opal [-l 0], you can obtain this parser in a few minutes; training the chunker and the dependency parser on the training split of the Kyoto University Text Corpus takes just 19.0s and 87.0s, respectively, on the same MacBook Air.

NOTE: This software was originally implemented for me to practice C++ programming, so some parts of the current code remain messy. I nevertheless release it to demonstrate how to use the C++ APIs of opal and pecco [9].
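To put the throughput figures above in per-sentence terms, the following Python sketch converts them using K = 1024 as stated. The split between tagging and parsing time rests on an assumption that is not stated on this page: that the J.DepP figure includes a MeCab pass costing the same as the standalone MeCab measurement.

```python
K = 1024  # the page defines K = 1024


def per_sentence_us(k_sentences_per_sec):
    """Convert a throughput in K sentences/s into microseconds per sentence."""
    return 1e6 / (k_sentences_per_sec * K)


jdepp_news = per_sentence_us(10.4)  # full J.DepP pipeline, news domain
mecab_news = per_sentence_us(26.6)  # MeCab alone, news domain

# Assumption: J.DepP's figure contains a MeCab pass of the same cost, so the
# remainder approximates bunsetsu chunking plus dependency parsing.
chunk_parse_news = jdepp_news - mecab_news

print(f"J.DepP: {jdepp_news:.1f} us/sentence")
print(f"MeCab:  {mecab_news:.1f} us/sentence")
print(f"chunking + parsing (estimated): {chunk_parse_news:.1f} us/sentence")
```

Under that assumption, chunking and parsing together account for roughly 60% of the per-sentence time on news text.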
The code is ready; the libraries are distributed under the GNU General Public License (e-mail me if you prefer a license other than the GNU GPL).
> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/pecco/pecco-latest.tar.bz2
> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/opal/opal-latest.tar.bz2 # jdepp_pa
> wget http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/maxent/maxent-3.0.tar.gz # jdepp_me
> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/jdepp-latest.tar.bz2
> echo {pecco,opal,jdepp}-latest.tar.bz2 | xargs -n 1 tar jxvf
> tar zxvf maxent-3.0.tar.gz
> cd jdepp
> ln -sf ../{pecco,opal}/*.{h,cc} .
> ln -s ../maxent-3.0/*.{h,cpp} .
> vi pecco_conf.h # set USE_DARTS_CLONE (unset USE_CEDAR) to reduce memory footprint (darts-clone required)
# set USE_FLOAT and ABUSE_TRIE to reduce parsing time by 15% and the required memory by 25%
> vi opal_conf.h
> vi jdepp_conf.h # unset USE_AS_STANDALONE if you want to build J.DepP that takes POS-tagged sentences
> vi Makefile
> make install
replace_pos.py generates training data with auto POSs from data with gold POSs.

Typing ./jdepp_[pa|svm|me] -h shows the following usage information. The chunker/parser can process sentences in any encoding if both jdepp_conf.h and the training data are converted into the target encoding (default: EUC-JP; if you use Emacs, modify the first line of ./jdepp_conf.h).
J.DepP - Japanese Dependency Parser
Copyright (c) 2008-2011 Naoki Yoshinaga
Usage: jdepp [options] -- [learner options] -- [chunker classifier options] -- [parser classifier options] < test
test test file
Optional parameters in training / testing:
-t, --type=TYPE select running mode of J.DepP
0 - learn
* 1 - parse
2 - both
3 - cache
-c, --corpus=FILE training corpus in JDEPP format ('train.JDP')
-m, --model-dir=DIR model directory ('model')
-p, --parser=TYPE select parsing algorithm
* 0 - shift reduce
1 - cascaded chunking
2 - backward
3 - tournament
-I, --input-format=TYPE select type of input format
* 0 - RAW sentences
1 - + POS / BUNSETSU annotation
2 - + DEPENDENCY annotation
Optional parameters in training:
-l, --learner=TYPE select type of learning library
* 0 - OPAL
1 - SVM
2 - MaxEnt
-n, --max-sent=INT max. # processing sentences (0: all)
Optional parameters in testing:
-d, --mecab-dic=DIR use MeCab dict in DIR for POS tagging
Misc.:
-v, --verbose=INT verbosity level (1)
-h, --help show this help and exit
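The usage line above separates four option groups with '--': J.DepP's own options, then learner options, chunker classifier options, and parser classifier options. As an illustration only, the grouping can be pictured with the Python sketch below. The helper name is hypothetical and this is not J.DepP's actual C++ argument handling.

```python
GROUPS = ["jdepp", "learner", "chunker classifier", "parser classifier"]


def split_option_groups(argv):
    """Split arguments on '--' into the four groups named in the usage line."""
    groups = [[]]
    for arg in argv:
        if arg == "--" and len(groups) < len(GROUPS):
            groups.append([])  # each '--' opens the next group
        else:
            groups[-1].append(arg)
    # Pad groups that were not given on the command line.
    groups += [[] for _ in range(len(GROUPS) - len(groups))]
    return dict(zip(GROUPS, groups))


# Training: options after the first '--' go to the learner.
train = split_option_groups("-t 0 -c train.JDP -I 1 -- -t 0 -c 0.05 -i 40".split())
# Parsing: three '--' in a row leave the first three groups empty.
parse = split_option_groups("-- -- -- -t 1 -r 0.02 -i 8".split())
print(train)
print(parse)
```

This is why the same flag, such as -t or -c, can mean different things depending on which side of a '--' it appears.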
Training
# prepare the training data in the JDEPP format (morphological analyzer output + dependency annotation)
# JUMAN
> cat KyotoCorpus4.0/dat/syn/95{01<01-11>,<01-08>ED}.KNP | \
awk '!/^(#|\*|E)/ {$0 = $1" "$2" "($3 == "*" ? $1 : $3)" "$4" 0 "$5" 0 "$6" 0 "$7" 0 NIL"}; 1' > train.JDP
# MeCab/jumandic
> cat KyotoCorpus4.0/dat/syn/95{01<01-11>,<01-08>ED}.KNP | \
awk '!/^(#|\*|E)/ {$0 = $1"\t"$4","$5","$6","$7","($3 == "*" ? $1 : $3)","$2",*"}; 1' > train.JDP
# or you may want to train a parser with auto POSs given by JUMAN or MeCab, the front-end POS tagger
# (the resulting chunker/parser will exhibit a better performance when you input a raw sentence)
# JUMAN
> cat KyotoCorpus4.0/dat/syn/95{01<01-11>,<01-08>ED}.KNP | replace_pos.py juman > train.JDP
# MeCab/jumandic
> cat KyotoCorpus4.0/dat/syn/95{01<01-11>,<01-08>ED}.KNP | replace_pos.py mecab JUMAN -d JUMAN_DIC_DIR > train.JDP
# MeCab/ipadic
> cat KyotoCorpus4.0/dat/syn/95{01<01-11>,<01-08>ED}.KNP | replace_pos.py mecab IPA -d MeCab_IPA_DIC_DIR > train.JDP
# create a directory to save a model
> mkdir model
# train chunker/parser with opal [-l 0], TinySVM [-l 1], or Tsuruoka's MaxEnt [-l 2]
# you can configure [learner options] delegated to the learner
# to see the learner's options, pass -h as a learner option
> jdepp_pa -t 0 -c train.JDP -I 1 < test.JDP # chunker
> jdepp_pa -t 0 -c train.JDP -I 2 < test.JDP # parser
# typical model hyper-parameters for a chunker
# PA without kernel; extremely fast but less accurate
> jdepp_pa -t 0 -c train.JDP -I 1 -- -t 0 -c 0.05 -i 40
# default parameters; reasonably fast and accurate enough / recommended
> jdepp_pa -t 0 -c train.JDP -I 1 -- -t 1 -d 2 -c 0.001 -i 40
# typical model hyper-parameters for training a dependency parser
# pa_pl0: (PA with linear kernel; extremely fast but less accurate)
> jdepp_pa -t 0 -c train.JDP -I 2 -- -t 0 -c 0.001 -i 40
# pa_pl2: (default parameters; reasonably fast and accurate enough / highly recommended)
> jdepp_pa -t 0 -c train.JDP -I 2 -- -t 1 -d 2 -c 0.00005 -i 40
# pa_pt3: (PA1 with -p 3 and -d 3; extremely slow but fairly accurate)
> jdepp_pa -t 0 -p 3 -c train.JDP -I 2 -- -t 1 -d 3 -c 0.0000001 -i 40
NOTE: The default parameters are tuned for training a shift-reduce parser with opal [-l 0 -p 0]; if you want to use other parsing algorithms [-p 1|2|3] or estimators [-l 1|2], you should at least tune their regularization parameters [-c].
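The awk one-liners in the Training section rewrite each morpheme line of the corpus into the JDEPP format and pass comment, bunsetsu-header, and EOS lines through unchanged. The same field rearrangement can be sketched in Python. The function name and the sample line are made up for illustration, and the seven-column reading of the input (surface, reading, base form, POS, sub-POS, conjugation type, conjugation form) is an assumption suggested by the awk field references rather than something stated on this page.

```python
def knp_to_jdepp(line, tagger="juman"):
    """Mirror the awk one-liners: rewrite morpheme lines, keep other lines."""
    # Same test as the awk pattern /^(#|\*|E)/: such lines are left untouched.
    if line.startswith(("#", "*", "E")):
        return line
    fields = (line.split() + [""] * 7)[:7]  # awk yields "" for missing fields
    surface, reading, base, pos, subpos, ctype, cform = fields
    if base == "*":  # ($3 == "*" ? $1 : $3)
        base = surface
    if tagger == "juman":
        return " ".join([surface, reading, base, pos, "0", subpos, "0",
                         ctype, "0", cform, "0", "NIL"])
    # MeCab/jumandic variant: surface, a tab, then comma-separated features
    return surface + "\t" + ",".join([pos, subpos, ctype, cform, base, reading, "*"])


sample = "村山 むらやま * 名詞 人名 * *"  # hypothetical morpheme line
print(knp_to_jdepp(sample, "juman"))
print(knp_to_jdepp(sample, "mecab"))
print(knp_to_jdepp("* 0 2D"))  # bunsetsu header, passed through
```

Like the awk pattern, this sketch skips every line that begins with "E", not only EOS lines.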
Testing
# [optional] build a feature sequence trie [to speed up a parser with d≥2 model]
# note: you will gain a significant speed-up only with d>=3 models
# 1) apply J.DepP to a part of gigantic data you're going to analyze
#    [the same format as train.JDP; no correct dependency annotation needed]
# 2) pass the data via [-c] to enumerate common feature sequences
#    (set [classifier option] accordingly)
# example (you can first consider the use of the training data):
> jdepp_pa -t 3 -c train.JDP -I 1 # (for chunker)
> jdepp_pa -t 3 -c train.JDP -I 2 # (for parser)
> jdepp_pa -t 3 -c train.JDP -I 2 -- -- -- -t 0 -r 0.005 # (for parser)
# you can configure [chunker|parser classifier options] to chunk|parse a sentence
# [see web site of pecco for details; or provide -h]
# example:
# pa_pl0 (when you trained a chunker|parser with [-t 0], classifier options will be ignored)
> jdepp_pa < test.sent
# pa_pl2 (default PKE classifier; pecco [-t 0 -s 0.02])
> jdepp_pa < test.sent
# pa_pl2 (slightly slower SPLIT classifier)
> jdepp_pa -- -- -- -t 0 -r 0.005
# pa_pl2 (slightly faster FST classifier [-t 1])
> jdepp_pa -- -- -- -t 1 -r 0.02 -i 8
Profiling
# script `to_tree.py' helps you understand J.DepP's machine-friendly parser output.
> jdepp_pa < test.sent | nkf --utf8 | to_tree.py
# When you input the parser output for sentences w/ full annotations [-I 2],
# it emphasizes incorrect dependency arcs with red and the correct head id
> jdepp_pa -I 2 -v -1 < dev.JDP | nkf --utf8 | to_tree.py | less -R
# If you have an issue in rendering wide characters on Terminal of Mac OS X,
# try SIMBL plugin `TerminalEastAsianAmbiguousClearer'
NOTE: We highly recommend using the (default) passive-aggressive algorithm [-l 0] to train classifiers for parsers, since its training is an order of magnitude faster than SVM/MaxEnt and the accuracy of the resulting models is comparable to SVM.
The following table lists the statistics of the models referred to in the usage section. The experiments were conducted on Mac OS X 10.5 with an Intel Xeon E5462 3.2 GHz CPU and 32 GB of main memory.
The parser accuracy is measured on the standard data set (Kyoto University Text Corpus version 4.0; training: 9501<01-11>.KNP and 95<01-08>ED.KNP, testing: 9501<14-17>.KNP and 95<10-12>ED.KNP) [1]. Testing shows the average parsing speed (sentences per second) on POS-tagged/bunsetsu-segmented sentences in Mainichi news articles (EUC-JP) with J.DepP.
| Model ID | Algorithm [-p] | opal option | pecco option | Dep. Acc. (%) | Sent. Acc. (%) | Training [s] | Testing [sent./s] |
|---|---|---|---|---|---|---|---|
| pa_pl0 | linear [4] | -t 0 -l 2 -c 1.0e-3 -i 40 | n/a | 89.48 | 49.18 | 4.7 | 91468 |
| pa_pl2 | linear [4] | -t 1 -d 2 -c 5.0e-5 -i 40 | -t 0 -r 0.005 | 91.77 | 56.70 | 116.6 | 14950 |
| pa_pl2 (default) | linear [4] | (-t 1 -d 2 -c 5.0e-5 -i 40) | (-t 0 -s 0.02) | 91.74 | 56.53 | 116.6 | 26263 |
| pa_pt3 | tournament [5] | -t 1 -d 3 -c 1.0e-7 -i 40 | -t 0 -r 0.1 | 92.02 | 56.98 | 43684.8 | 113 |
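The two accuracy columns can be read as follows: dependency accuracy is the fraction of bunsetsus whose head is predicted correctly, and sentence accuracy is the fraction of sentences in which every head is correct. Below is a minimal Python sketch of that computation. It assumes heads are given as one index per bunsetsu and that the last bunsetsu of each sentence, which has no head, is excluded from the counts. The exact evaluation convention behind the table is not stated here, so treat this as an illustration.

```python
def accuracies(gold, pred):
    """gold, pred: lists of sentences, each a list of head indices per bunsetsu.

    Returns (dependency accuracy, sentence accuracy) as fractions.
    """
    arcs = correct = complete = 0
    for g, p in zip(gold, pred):
        g, p = g[:-1], p[:-1]  # assumption: skip the final (root) bunsetsu
        hits = sum(a == b for a, b in zip(g, p))
        arcs += len(g)
        correct += hits
        complete += hits == len(g)  # every scored head in the sentence is right
    dep_acc = correct / arcs if arcs else 0.0
    sent_acc = complete / len(gold) if gold else 0.0
    return dep_acc, sent_acc


# Two toy sentences of three bunsetsus each; -1 marks the root.
gold = [[1, 2, -1], [2, 2, -1]]
pred = [[1, 2, -1], [1, 2, -1]]
dep_acc, sent_acc = accuracies(gold, pred)
print(dep_acc, sent_acc)
```

A single wrong head lowers dependency accuracy only slightly but makes the whole sentence count as incorrect, which is why the sentence accuracy figures are much lower than the dependency accuracy figures.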
You can further speed up a parser with a classifier (d≥3) by building a larger feature sequence trie from possible feature vectors generated by the parser itself. The cited literature [8] might be helpful for tuning hyper-parameters when training models with SVM/MaxEnt (the models in [8] are trained with the USE_EMNLP_FEAT option, though).
We do not guarantee that the implemented algorithms other than those proposed by us are patent-free; we regard them as patent-free simply because their implementations are available as (existing) open-source software (or, otherwise, on the basis of a simple patent look-up). Please be careful when you use this software commercially.
Read it as you like; J. Depp refers to a different (and famous) figure outside Japan, and I hope that non-Japanese researchers will not stumble upon this parser, which is useless for them, by accident. If you like him, you can call the parser `Johnny'.