J.DepP is a C++ implementation of Japanese dependency parsing algorithms [1,2,3,4]. It takes a raw sentence as input and performs word segmentation, POS tagging (thanks to MeCab), bunsetsu chunking, and dependency parsing. Syntactic parsers have long been believed to be (significantly) slower than front-end part-of-speech taggers, and they are rarely used in industrial settings that need to handle massive texts (e.g., microblogs). The inefficiency of parsers is, however, largely because researchers have paid attention mostly to accuracy and have not seriously pursued efficient implementations. J.DepP is meant for those who want to parse massive texts (e.g., entire blog feeds or microblogs); it is even faster than most front-end morphological analyzers (parsing >10,000 sentences per second), while achieving state-of-the-art parsing accuracy.
If you make use of J.DepP for research or commercial purposes, the references (optional) are:
N. Yoshinaga and M. Kitsuregawa. A Self-adaptive Classifier for Efficient Text-stream Processing. Proc. COLING 2014, pp. 1091--1102. 2014. (used for testing a parser)
N. Yoshinaga and M. Kitsuregawa. Kernel Slicing: Scalable Online Training with Conjunctive Features. Proc. COLING 2010 (oral), pp. 1245--1253. 2010. (used for training a parser)
N. Yoshinaga and M. Kitsuregawa. Polynomial to Linear: Efficient Classification with Conjunctive Features. Proc. EMNLP 2009, pp. 1542--1551. 2009. A longer journal version is also available. (used for testing a parser)
With the default learner [-l 0], you can train an accurate parser in a minute (partial: ~92.1% and complete: ~57.7% on news articles); training the chunker and the dependency parser from the training split of Kyoto University Text Corpus takes just 11.5s and 35.9s, respectively, on the MacBook Air used for the benchmarks below.
NOTE: The accuracy of J.DepP (or any statistical parser) depends on the quality and quantity of the training corpus, so J.DepP achieves state-of-the-art accuracy when the compared parsers are trained with the same corpus. By default, J.DepP uses the freely available Kyoto-University and NTT Blog (KNB) Corpus to train a model, in order to isolate users from issues related to the corpus license. The resulting J.DepP is faster but less accurate than J.DepP trained with Kyoto University Text Corpus.
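For concreteness, the two accuracy figures above are conventionally computed as follows. This is a generic sketch (not J.DepP's own evaluation script): each sentence is given as a list of per-bunsetsu head indices, and the sentence-final bunsetsu (head -1) is excluded from the dependency count because its attachment is trivial.

```python
# Sketch of the standard metrics for Japanese dependency parsing:
# "partial" = fraction of correctly attached bunsetsu (excluding the last),
# "complete" = fraction of sentences whose bunsetsu are all attached correctly.

def evaluate(gold_sents, sys_sents):
    dep_correct = dep_total = sent_correct = 0
    for gold, sys in zip(gold_sents, sys_sents):
        # exclude the sentence-final bunsetsu (its head is fixed to the root)
        pairs = list(zip(gold[:-1], sys[:-1]))
        hits = sum(g == s for g, s in pairs)
        dep_total += len(pairs)
        dep_correct += hits
        sent_correct += hits == len(pairs)
    return dep_correct / dep_total, sent_correct / len(gold_sents)
```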
The best accuracy will be obtained by using a larger corpus and richer features for training:
> configure --with-corpus=kyoto+knbc --with-classifier=3rdPolyPMT
> make model
If you do not have Kyoto University Text Corpus but need this accurate model for research or personal purposes, e-mail me (see AUTHORS for the address).
Shift-reduce [3] [-p 0], cascaded chunking [2] [-p 1], backward [1] [-p 2], and tournament [4] [-p 3] algorithms are implemented. You may want to stick to the shift-reduce parser [-p 0] (default) for practical purposes, since it is not only the most efficient of the four (O(n)) but also the most accurate (partly because the features are tuned for this algorithm).
License: GNU GPLv2, LGPLv2.1, and BSD; or e-mail me for other licenses you want.
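The linear-time behavior of the shift-reduce mode can be seen in a stack-based sketch. This is only an illustration of the idea, not J.DepP's actual code; `attach_score` is a hypothetical stand-in for the trained classifier, and Japanese bunsetsu dependencies are assumed strictly head-final (every bunsetsu depends on a later one) and projective.

```python
# Minimal sketch of a stack-based shift-reduce parser for head-final
# (bunsetsu) dependencies. `attach_score(i, j) > 0` means "bunsetsu i
# depends on bunsetsu j". Each bunsetsu is pushed and popped at most
# once, so parsing runs in O(n) classifier calls.

def parse(n, attach_score):
    heads = [-1] * n              # head index per bunsetsu; -1 = root
    stack = [0]
    for i in range(1, n):
        # greedily attach stack tops to the incoming bunsetsu i
        while stack and attach_score(stack[-1], i) > 0:
            heads[stack.pop()] = i
        stack.append(i)
    for j in stack[:-1]:          # leftovers depend on the final bunsetsu
        heads[j] = n - 1
    return heads
```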
> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/jdepp-latest.tar.gz
> tar zxvf jdepp-latest.tar.gz
> cd jdepp-YYYY-MM-DD
# 1) train a parser with Kyoto-University and NTT Blog (KNB) Corpus (default)
> configure
# or train a parser with Kyoto University Text Corpus (KyotoCorpus4.0 required)
> configure --with-corpus=kyoto
> ln -s PATH_TO_KYOTO_CORPUS/KyotoCorpus4.0
# or train a parser with Kyoto University Text Corpus (KyotoCorpus4.0 required)
# and KNB Corpus (CaboCha seems to use these corpora for training)
> configure --with-corpus=kyoto+knbc
> ln -s PATH_TO_KYOTO_CORPUS/KyotoCorpus4.0
# or train a parser with Kyoto University Text Corpus w/o Mainichi news articles
# Caveats: this option changes the feature set, so models trained w/o this option
# are not compatible with the resulting jdepp binary
> configure --with-corpus=kyoto-partial --disable-autopos-train
# 2) make model using the entire corpus (for slightly better accuracy)
# Caveats: make calls scripts in tools/, which need python3 (releases 2022-03-18 or later)
> make model && make install
# or make model using a part (standard training split, if any) of the corpus,
# if you want to know the accuracy of the installed parser
> make model-benchmark && make install
# See usage:training for other configuration options in building a model.
## (optional) MaxEnt [-l 2] requires Tsuruoka's MaxEnt implementation.
> wget http://www.logos.ic.i.u-tokyo.ac.jp/~tsuruoka/maxent/maxent-3.0.tar.gz
> cd src && ln -s ../maxent-X.Y/*.{cpp,h} . && cd ..
> configure --enable-maxent
For Mac OS X users: try port jdepp via MacPorts (special thanks to @hjym_u); it builds a standalone parser (--enable-standalone) using the KNB corpus (--with-corpus=knbc) with auto POSs given by MeCab/jumandic (default).
MeCab and its dictionary are needed to obtain auto POSs in training, unless you run configure --disable-autopos-train. Darts (by Taku Kudo) or darts-clone (by Susumu Yata; recommended) can be used to store a feature dictionary. TinySVM or Tsuruoka's MaxEnt estimator can be used to train an SVM / MaxEnt classifier.
History (excerpts):
- Added parse_tostr () and parse_from_postagged_tostr () to the SWIG bindings (thanks to Dr. Shinzato).
- Added parse_from_postagged () and read_result () for SWIG bindings.
- Changed curl from -O to -LO to follow redirects in retrieving the KNB corpus (thanks: Ahmed Fasih).
- Fixed to_chunk.py with mecab-unidic (outputs include empty fields) (thanks: Ahmed Fasih); adjusted IOBUF_SIZE (thanks: Kay).
- Experimental support for Universal Dependencies (configure --with-corpus=universal).
- Sped up to_chunk.py (x1.5) and to_tree.py (x1.2).
- Added [-i STR] to jdepp, to_chunk.py and to_tree.py, to ignore lines starting with STR.
- make model now takes advantage of the entire corpus for training (slightly improving accuracy); make model-benchmark performs training and testing with the standard split (same as make model in previous versions).
- To patch an old release:
  > cd jdepp-2013-01-23
  > wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/jdepp.patch
  > patch -p1 < jdepp.patch
- Experimental support for MeCab/unidic (configure --with-mecab-dict=UNI).
- Refactored if-else chains and placed assertions to guarantee valid feature indices.
- Added [-q] to to_chunk.py and to_tree.py, to display only incorrectly chunked/parsed sentences.
- Prettier printing in to_chunk.py and to_tree.py.
- For [-l 0,1], the standard sigmoid function is used to normalize a margin.
- Support for MeCab/NAIST-jdic (configure --with-mecab-dict=NAIST-J).
- replace_pos.py generates training data with auto POSs from data with gold POSs.
- New learner options [-meopt=SGD|OWLQN|LBFGS] and [-oopt=P|PA1|PA2, -oave=0|1].
Typing ./jdepp -h shows the following usage information.
J.DepP - Japanese Dependency Parser
Copyright (c) 2008-2012 Naoki Yoshinaga
Usage: jdepp [options] -- [learner options] -- [chunker classifier options] -- [parser classifier options] < test
test test file
Optional parameters in training / testing:
-t, --type=TYPE select running mode of J.DepP
0 - learn
* 1 - parse
2 - both
3 - cache
-e, --encoding=TYPE select encoding of input
* 0 - UTF-8
1 - EUC-JP
-i, --ignore=STR ignore input line starting with STR
-c, --corpus=FILE training corpus in JDEPP format ('train.JDP')
-m, --model-dir=DIR model directory ('/Users/ynaga/local/lib/jdepp/model/kyoto')
-p, --parser=TYPE select parsing algorithm
* 0 - shift reduce
1 - cascaded chunking
2 - backward
3 - tournament
-I, --input-format=TYPE select type of input format
* 0 - POS-tagged sentences
1 - + BUNSETSU annotation
2 - + DEPENDENCY annotation
Optional parameters in training:
-l, --learner=TYPE select type of learning library
* 0 - OPAL
1 - SVM (disabled)
2 - MaxEnt (disabled)
-n, --max-sent=INT max. # processing sentences (0: all)
Misc.:
-v, --verbose=INT verbosity level (0)
-h, --help show this help and exit
Type make model to build a model for J.DepP; by modifying configuration parameters, you can build various models from KyotoCorpus4.0 or the KNB corpus.
# with Kyoto-University and NTT Blog corpus (will be automatically downloaded before training a model)
## training a parser with auto POSs given by MeCab/jumandic (this is the default parser configuration)
> configure (--with-corpus=knbc) (--with-postagger=mecab) (--with-mecab-dict=JUMAN)
## build a parser with auto POSs given by MeCab/NAIST-jdic
> configure (--with-corpus=knbc) (--with-postagger=mecab) --with-mecab-dict=NAIST-J
## build a parser with auto POSs given by MeCab/ipadic
> configure (--with-corpus=knbc) (--with-postagger=mecab) --with-mecab-dict=IPA
## build a standalone parser with a model trained using KNB corpus (requires MeCab)
> configure (--with-corpus=knbc) --enable-standalone
# with Kyoto University Text Corpus (put KyotoCorpus4.0 in the top of J.DepP source directory)
## training a parser with gold POSs (to measure the parsing accuracy)
> configure --with-corpus=kyoto --disable-autopos-train
## training a parser with auto POSs given by JUMAN
> configure --with-corpus=kyoto --with-postagger=juman
# bracketed configuration options are defaults (so you can omit them)
Alternatively, you can train a parser with your own corpus in the following way.
# prepare the training data in the JDEPP format (morphological analyzer output + dependency annotation)
# you can convert training data in the KyotoCorpus format into JDEPP format as follows
# to train a parser compatible with JUMAN
> cat train.KNP | \
awk '!/^(#|\*|E)/ {$0 = $1" "$2" "($3 == "*" ? $1 : $3)" "$4" 0 "$5" 0 "$6" 0 "$7" 0 NIL"}; 1' > train.JDP
# to train a parser compatible with MeCab/jumandic
> cat train.KNP | \
awk '!/^(#|\*|E)/ {$0 = $1"\t"$4","$5","$6","$7","($3 == "*" ? $1 : $3)","$2",*"}; 1' > train.JDP
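The first (JUMAN-compatible) awk one-liner above can be mirrored in Python if you need to post-process the corpus further. This sketch copies the field positions straight from the awk script rather than from a documented spec:

```python
# Python equivalent of the JUMAN-compatible awk one-liner: rewrite morpheme
# lines of train.KNP into the JDEPP format, passing comment (#), bunsetsu (*)
# and EOS (E...) lines through unchanged. Field order follows the awk script.

def knp_to_jdp(line):
    if line.startswith(('#', '*', 'E')):
        return line
    f = line.split()
    # '*' in the third field means the base form equals the surface form
    base = f[0] if f[2] == '*' else f[2]
    return ' '.join((f[0], f[1], base, f[3],
                     '0', f[4], '0', f[5], '0', f[6], '0', 'NIL'))
```

Apply it line by line to train.KNP to obtain train.JDP.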
# You may want to train J.DepP with auto POSs given by the front-end POS tagger
# to avoid the accuracy drop due to POS inconsistency; training with auto POSs
# builds a more accurate parser than training with gold POSs.
# JUMAN
> replace_pos.py juman -b < train.KNP > train.JDP
# MeCab/jumandic
> replace_pos.py mecab -d MECAB_DIC_DIR < train.KNP > train.JDP
# MeCab/ipadic or MeCab/naist-jdic
> replace_pos.py mecab -d MeCab_DIC_DIR < train.KNP > train.JDP
# create a directory to save a model
> mkdir model
# train chunker/parser with opal [-l 0], TinySVM [-l 1], or Tsuruoka's MaxEnt [-l 2]
# you can configure [learner options] delegated to the learner
# to see the learner's options, set -h to learner options
> jdepp -t 0 -c train.JDP -I 1 < test.JDP # chunker
> jdepp -t 0 -c train.JDP -I 2 < test.JDP # parser
# typical model hyper-parameters for a chunker
# PA without kernel; ultimately fast but less accurate
> jdepp -t 0 -c train.JDP -I 1 -- -t 0 -c 0.05 -i 40
# default parameters; reasonably fast and sufficiently accurate (recommended)
> jdepp -t 0 -c train.JDP -I 1 -- -t 1 -d 2 -c 0.00005 -i 40
# typical model hyper-parameters for training a dependency parser
# pa_pl0; PA with linear kernel; ultimately fast but less accurate
> jdepp -t 0 -c train.JDP -I 2 -- -t 0 -c 0.001 -i 40
# pa_pl2; default parameters; reasonably fast and sufficiently accurate (recommended)
> jdepp -t 0 -c train.JDP -I 2 -- -t 1 -d 2 -c 0.00005 -i 40
# pa_pl3; PA1 with -d 3; slow but most accurate
> jdepp -t 0 -c train.JDP -I 2 -- -t 1 -d 3 -c 0.000001 -i 40
NOTE: The default parameters are tuned for training a shift-reduce parser with opal [-l 0 -p 0]; if you want to use the other parsing algorithms [-p 1|2|3] or estimators [-l 1|2], you should at least tune the regularization parameter [-c].
# build a feature sequence trie [to speed up a parser with d≥2 model]
# note: you will gain a significant speed-up only with d>=3 models
# 1) apply J.DepP to a part of gigantic data (hopefully in the same domain you're going to analyze)
# [the same format as train.JDP; no correct dependency annotation needed]
# 2) pass the data via [-c] to enumerate common feature sequences
# (set [classifier option] accordingly)
# example (you can simply start with the training data):
> jdepp -t 3 -c train.JDP -I 1 -- -- -- -t 1 # (for chunker)
> jdepp -t 3 -c train.JDP -I 2 -- -- -- -t 1 -r 0.005 # (for parser)
> jdepp -t 3 -c train.JDP -I 2 -- -- -- -t 1 -s 0.015 # (for parser)
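The intuition behind cache mode [-t 3] can be shown with a toy feature-sequence trie. This is only a sketch of the idea, not J.DepP's actual data structure (the real classifier caches margins of conjunctive-feature expansions, not plain linear weights): partial scores of common sorted feature sequences are memoized so classification can resume from the longest cached prefix instead of summing every weight from scratch.

```python
# Toy feature-sequence trie: each trie node caches the partial score of
# its (sorted) feature-id prefix; classification follows the trie as far
# as possible and falls back to per-feature summation afterwards.

class Node:
    def __init__(self, score=0.0):
        self.score = score        # cached partial score of this prefix
        self.child = {}

def build_trie(common_seqs, weight):
    root = Node()
    for seq in common_seqs:       # each seq: a sorted tuple of feature ids
        node, total = root, 0.0
        for f in seq:
            total += weight(f)
            node = node.child.setdefault(f, Node(total))
    return root

def classify(root, feats, weight):
    node, total = root, 0.0
    for f in sorted(feats):
        if node is not None and f in node.child:
            node = node.child[f]
            total = node.score    # jump to the cached partial score
        else:
            node = None
            total += weight(f)    # plain summation for the uncached tail
    return total
```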
# you can configure [chunker|parser classifier options] to chunk|parse a sentence
# to see classifier's options, set -h to classifier options
# example:
# pa_pl0 (when you trained a chunker|parser using opal [-t 0],
# opal is also used for classification)
> jdepp < test.sent
# pa_pl2 (default PKE classifier; pecco [-t 1 -s 0.015])
> jdepp -- -- -- -t 1 -s 0.015 < test.sent
# pa_pl2 (slightly slower SPLIT classifier)
> jdepp -- -- -- -t 1 -r 0.005 < test.sent
# pa_pl2 (slightly faster FST classifier [-t 2])
> jdepp -- -- -- -t 2 -s 0.015 -i 8 < test.sent
# classifier options can be omitted in testing
Profiling
# script `to_tree.py' helps you understand J.DepP's machine-friendly parser output.
> jdepp < test.sent | to_tree.py
# `to_chunk.py' and `to_tree.py' compactly visualize the parser output
# when you input sentences w/ annotations [-I 1,2],
> jdepp -I 1 -v -1 < dev.JDP | to_chunk.py -p | less -R
> jdepp -I 2 -v -1 < dev.JDP | to_tree.py -p | less -R
# If you have an issue in rendering wide characters on Terminal of Mac OS X,
# try SIMBL plugin `TerminalEastAsianAmbiguousClearer'
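If the helper scripts are not at hand, the head indices in the machine-friendly output can be rendered with a few lines of Python. This is a toy sketch over bare head arrays, not to_tree.py (which parses jdepp's textual output and draws nicer trees):

```python
# Toy pretty-printer for a bunsetsu dependency tree: indent each bunsetsu
# by its distance from the sentence-final root bunsetsu (head index -1).

def draw(bunsetsu, heads):
    lines = []
    for i, b in enumerate(bunsetsu):
        depth, j = 0, i
        while heads[j] != -1:     # walk up to the root to find the depth
            j, depth = heads[j], depth + 1
        arrow = f' -> {heads[i]}' if heads[i] != -1 else ''
        lines.append('  ' * depth + b + arrow)
    return '\n'.join(lines)
```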
NOTE: We highly recommend using the (default) passive-aggressive algorithm [-l 0]
to train classifiers for parsers, since its training speed is orders of magnitude faster than SVM/MaxEnt while the accuracy of the resulting models is comparable to SVM.
The following table lists the statistics of the models referred to in the usage section. The experiments were conducted on a MacBook Air (Mid 2011), Mac OS X 10.7, with an Intel Core i7 1.8GHz CPU and 4GB main memory. Note that, for reference purposes, the parser is here configured with --disable-autopos-train
(training with gold POSs; this is disabled by default because it is not appropriate when you run the parser after a POS tagger).
The parser accuracy is measured on the standard data set (Kyoto University Text Corpus version 4.0; training: 9501<01-11>.KNP and 95<01-08>ED.KNP, testing: 9501<14-17>.KNP and 95<10-12>ED.KNP) [1]. The Testing column shows throughput (the number of POS-tagged/bunsetsu-segmented sentences per second in Mainichi news articles (EUC-JP) parsed by J.DepP).
Model ID | Algorithm [-p] | opal option | pecco option | Dep. Acc. (%) | Sent. Acc. (%) | Training [s] | Testing [sent./s] |
---|---|---|---|---|---|---|---|
pa_pl0 | linear [3] | -t 0 -c 1.0e-3 -i 40 -P | n/a | 89.61 | 49.76 | 3.2 | 111385 |
pa_pl2 | linear [3] | -t 1 -d 2 -c 5.0e-5 -i 40 -p | -t 1 -r 0.005 -i 7 | 92.12 | 57.87 | 35.0 | 33467 |
pa_pl2 (default) | linear [3] | -t 1 -d 2 -c 5.0e-5 -i 40 -p | -t 1 -s 0.015 -i 7 | 92.09 | 57.73 | 35.0 | 41289 |
pa_pl3 | linear [3] | -t 1 -d 3 -c 1.0e-6 -i 40 -kp | -t 1 -r 0.05 | 92.29 | 58.80 | 259.8 | 3434 |
pa_pl3 | linear [3] | -t 1 -d 3 -c 1.0e-6 -i 40 -kp | -t 1 -s 0.001 | 92.22 | 58.43 | 259.8 | 8051 |
pa_pl2 stacking KNP 4.16 with juman-7.0 | linear [3] | -t 1 -d 2 -c 5.0e-5 -i 40 -p | -t 1 -s 0.015 -i 7 | 92.92 | - | - | - |
pa_pl3 stacking KNP 2.0 with juman-5.1 | linear [3] | (sorry, I forgot exact ones) | (ditto) | 93.19 | - | - | - |
You can further speed up a parser with a classifier (d≥3) by building a larger feature sequence trie from possible feature vectors generated by using the parser itself. The pecco paper might be helpful to tune hyper-parameters in training models with SVM/MaxEnt (they are trained with a smaller number of features, though).
We do not guarantee that the implemented algorithms, other than those proposed by us, are patent-free; we regard them as patent-free simply because their implementations are available as open-source software (or after a simple patent look-up). Please be careful when you use this software for commercial purposes.
If you train J.DepP with Kyoto University Text Corpus (--with-corpus=kyoto, --with-corpus=kyoto-partial), you can use the trained model only for research purposes (refer to http://www.nichigai.co.jp/sales/mainichi/mainichi-data.html, in Japanese).
Read it as you like; hopefully you enjoy spelling out the J., wandering around J for Johnny, Juggling, Jeering, JIT, or whatever, before settling on the most boring `Japanese'.
The developer thanks Prof. Daisuke Kawahara for his guidance in converting KNB corpus to Kyoto University Text Corpus format.